Down like dominos
Feb. 3rd, 2010 09:48 pm

I had a classic bad sysadmin moment today, after making what should have been an innocuous change to LoadLeveler.
At the appointed hour, I tweaked the configuration to enable preemption and issued a reconfigure, as I'd done dozens of times while preparing the change on the test system. Initially everything looked good. Then the nodes started to drop out of the status display. Then LoadLeveler spat out hundreds of emails, each complaining that the start daemons had failed on the nodes. Then, with a sinking feeling, I checked the list of running jobs, only to discover that they'd all died.
Examining the logs, I eventually managed to piece together what had happened. A known problem had caused the negotiator to crash while attempting to reconfigure under heavy load. Simultaneously, the start daemons reconfigured themselves and, finding themselves unable to contact the negotiator, went into a rapid restart loop. When that loop expired, all the daemons exited, which in turn killed all the running work.
It's odd that we've never seen this problem before, despite having experienced a number of negotiator crashes. My suspicion is that previous changes were limited to tweaks of the scheduling parameters, so the reconfigures only touched the administrative nodes. The preemption change, by contrast, required a reconfigure of all the daemons, and that triggered the fatal timing problem which caused everything to collapse in a heap.
Fortunately, not all is doom and gloom: IBM believe they have a fix for the negotiator problem in the latest release of LoadLeveler. But in the meantime, it should be possible to work around the problem by reconfiguring the administrative nodes prior to signalling the compute nodes...
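The two-phase workaround could be scripted along these lines. This is only a sketch: the host names, node list, and settle delay are all hypothetical, and I'm assuming the usual `llctl -h <host> reconfig` invocation; the `DRY_RUN` switch just prints the commands rather than running them, for illustration.

```shell
#!/bin/sh
# Sketch of the workaround: reconfigure the administrative nodes first,
# give the negotiator time to settle, then signal the compute nodes.
# Host names and the delay below are hypothetical examples.

DRY_RUN=${DRY_RUN:-1}   # default to printing commands, since llctl may not be installed

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reconfigure_all() {
    # Phase 1: administrative nodes only (the central manager / negotiator host).
    for h in admin01 admin02; do
        run llctl -h "$h" reconfig
    done

    # Let the negotiator finish its own reconfigure before the start
    # daemons try to contact it.
    run sleep 60

    # Phase 2: the compute nodes.
    for h in compute01 compute02 compute03; do
        run llctl -h "$h" reconfig
    done
}

reconfigure_all
```

Run with `DRY_RUN=0` to actually issue the commands; the point is simply that the admin-node reconfigure completes before any start daemon is told to reconfigure.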