sawyl: (Default)
[personal profile] sawyl
I had a classic bad sysadmin moment today, after making what should have been an innocuous change to LoadLeveler.

At the appointed hour, I tweaked the configuration to enable preemption and issued a reconfigure, as I'd done tens of times while preparing the change on the test system. Initially everything looked good. Then the nodes started to drop out of the status display. Then LoadLeveler spat out hundreds of emails, each complaining that the start daemons had failed on the nodes. Then, with a sinking feeling, I checked the list of running jobs, only to discover that they'd all died.

Examining the logs, I eventually managed to piece together what had happened. A known problem had caused the negotiator to crash while attempting to reconfigure under heavy load. Simultaneously, the start daemons reconfigured themselves and, finding themselves unable to contact the negotiator, went into a rapid restart loop. When that loop expired, all the daemons exited, which in turn caused all the running work to exit.

It's odd that we've never seen this problem before, despite having experienced a number of negotiator crashes. My suspicion is that because previous changes have been limited to tweaks of the scheduling parameters, the reconfigures were limited to the administrative nodes. The preemption change, by contrast, required a reconfigure of all the daemons, and that triggered the fatal timing problem which caused everything to collapse in a heap.

Fortunately, not all is doom and gloom: IBM believe they have a fix for the negotiator problem in the latest release of LoadLeveler. But in the meantime, it should be possible to work around the problem by reconfiguring the administrative nodes prior to signalling the compute nodes...
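
As a rough sketch of that workaround, something like the following would reconfigure the administrative nodes first, confirm the negotiator is still answering, and only then touch the compute nodes. Everything here is an assumption for illustration: the host names are made up, the `negotiator_ok` check is a hypothetical callable you would have to supply yourself, and the `llctl -h <host> reconfig` form should be checked against the documentation for your LoadLeveler release before trusting it.

```python
from typing import Callable, List, Sequence


def reconfig_command(host: str) -> List[str]:
    """Build the per-host reconfigure command.

    Assumption: `llctl -h <host> reconfig` is the right invocation for
    your release; verify this locally before using it.
    """
    return ["llctl", "-h", host, "reconfig"]


def run_staged_reconfig(
    admin_nodes: Sequence[str],
    compute_nodes: Sequence[str],
    run: Callable[[List[str]], None],
    negotiator_ok: Callable[[], bool],
) -> None:
    """Reconfigure admin nodes, check the negotiator, then compute nodes.

    `run` executes one command (for example a thin wrapper around
    subprocess.run). `negotiator_ok` is a hypothetical health check that
    returns True only if the negotiator is still responding.
    """
    # Stage 1: administrative nodes only.
    for host in admin_nodes:
        run(reconfig_command(host))

    # Gate: if the negotiator did not survive, stop before any start
    # daemon is asked to reconfigure.
    if not negotiator_ok():
        raise RuntimeError(
            "negotiator not responding; compute nodes left untouched"
        )

    # Stage 2: compute nodes, one at a time.
    for host in compute_nodes:
        run(reconfig_command(host))
```

The point of the gate in the middle is that a negotiator crash during the first stage leaves the start daemons on the compute nodes alone, so the running work is not dragged down with it.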