Jul. 16th, 2009

sawyl: (Default)
I've encountered another interesting LoadLeveler oddity, again involving reservations, but this time due, I suspect, to a timing bug rather than a design effect.

Essentially, what appears to have happened is that a reservation was created with a start time of T+2 minutes and the request was granted by the first scheduler. Less than a minute later, the second scheduler allocated a queued job to the nodes that had just been reserved by the first scheduler. Then, when the reservation became active, the job associated with it was unable to run because it lacked the necessary resources.

My suspicion is that the problem is due to some sort of race between the two scheduler instances: that the two daemons are not communicating frequently enough to ensure that they both have up-to-date copies of the current reservation list. I'm not entirely surprised there are problems. I'm deliberately creating a reservation as near to the current time as possible (not something anyone in their right mind would normally want to do) and I'm relying on LoadL to coordinate across multiple machines, which is asking for trouble.
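
To make the suspected race concrete, here is a toy sketch in Python. It is only a model of the hypothesis, not a description of how LoadLeveler works internally: the `Scheduler` class, the `simulate` function, the `sync_interval` parameter and the node names are all invented for illustration, and the timings are arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class Scheduler:
    """Toy scheduler holding its own private copy of the reservation list."""
    name: str
    reserved: set = field(default_factory=set)  # nodes this instance believes are reserved

    def can_dispatch(self, nodes):
        # A queued job may start only on nodes this instance thinks are free.
        return not (set(nodes) & self.reserved)


def simulate(sync_interval, reserve_at=0, dispatch_at=50):
    """Return True if scheduler B dispatches onto nodes reserved via scheduler A.

    Times are in seconds. B refreshes its copy of A's reservation list
    every sync_interval seconds (hypothetical behaviour, assumed here).
    """
    a = Scheduler("A")
    b = Scheduler("B")
    nodes = {"node1", "node2"}
    conflict = False
    for t in range(dispatch_at + 1):
        if t > 0 and t % sync_interval == 0:
            b.reserved = set(a.reserved)  # B catches up with A
        if t == reserve_at:
            a.reserved |= nodes  # reservation granted by A
        if t == dispatch_at:
            # If B's copy is stale, it still sees the nodes as free.
            conflict = b.can_dispatch(nodes)
    return conflict


if __name__ == "__main__":
    for interval in (120, 30):
        print(f"sync every {interval}s: conflict = {simulate(interval)}")
```

In this model, a refresh interval longer than the gap between granting the reservation and dispatching the queued job leaves the second scheduler working from a stale list, which matches the behaviour described above; a shorter interval closes the window.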

Time to dig through the manual: perhaps there's some parameter I can tweak to increase the scheduler synchronisation frequency...
