Reservations: a race condition?
Jul. 16th, 2009 08:20 pm
I've encountered another interesting LoadLeveler oddity, again involving reservations, but this time due, I suspect, to a timing bug rather than a design effect.
Essentially, what appears to have happened is that a reservation was created with a start time of T+2 minutes and the request was granted by the first scheduler. Less than a minute later, the second scheduler allocated a queued job to the nodes that had just been reserved by the first scheduler. Then, when the reservation became active, the job associated with it was unable to run because it lacked the necessary resources.
My suspicion is that the problem is due to some sort of race between the two scheduler instances: the two daemons are not communicating frequently enough to ensure that they both have up-to-date copies of the current reservation list. I'm not entirely surprised there are problems. I'm deliberately creating a reservation as near to the current time as possible (not something anyone in their right mind would normally want to do) and I'm relying on LoadL to coordinate across multiple machines, which is asking for trouble.
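To make the suspected race concrete, here is a minimal toy simulation in Python. It is only a sketch of the hypothesis, not of how LoadLeveler actually works: the `Scheduler` class, the `simulate` function, the node names, and the assumption that the second scheduler learns about reservations only at fixed sync intervals are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Scheduler:
    """Hypothetical scheduler holding its own local view of reserved nodes."""
    name: str
    reserved: set = field(default_factory=set)

    def can_allocate(self, nodes):
        # Nodes look free if none of them appear in this scheduler's local list.
        return not (set(nodes) & self.reserved)


def simulate(sync_interval, reserve_at=0, dispatch_at=50):
    """Return True if scheduler B hands out nodes that A has already reserved.

    Times are in seconds. B only learns about A's reservations at multiples
    of sync_interval (an assumed periodic-sync model, not LoadLeveler's).
    """
    a = Scheduler("A")
    b = Scheduler("B")
    nodes = {"node1", "node2"}
    conflict = False
    for t in range(dispatch_at + 1):
        if t == reserve_at:
            a.reserved |= nodes            # A grants the reservation
        if t > 0 and t % sync_interval == 0:
            b.reserved = set(a.reserved)   # periodic sync: B copies A's list
        if t == dispatch_at:
            # If B still believes the nodes are free, it allocates the queued
            # job onto them and the reservation is later starved.
            conflict = b.can_allocate(nodes)
    return conflict


if __name__ == "__main__":
    print("sync every 120 s:", simulate(120))
    print("sync every 10 s:", simulate(10))
```

In this toy model, a job dispatched 50 seconds after the reservation lands on the reserved nodes when the schedulers sync only every 120 seconds, and does not when they sync every 10 seconds. That is the behaviour the race hypothesis would predict, nothing more.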
Time to dig through the manual: perhaps there's some parameter I can tweak to increase the scheduler synchronisation frequency...