sawyl: (Default)
[personal profile] sawyl
When you're way down on the callout roster and the phone still rings, you know it's going to be bad. And so it proved. After being sold a sorry tale about power problems and the struggle to get the systems started again, my interlocutor broke the really bad news: LoadLeveler wouldn't run on either system, making it impossible to run any production work. The others, having offered up their advice, were out of ideas, and it was up to me to defend the honour of the team.

A little bit of digging soon revealed a particularly nasty scheduler problem: when the main daemons were restarted, the negotiator came up and started responding to queries, while the scheduler went into a crash/restart loop until the restart threshold was exceeded and the rest of the daemons were shut down by the master process. Given the underlying cause of the outage, I suspected that the on-disc data structures had probably been corrupted by the power down. And when I checked the spool directories on the primary and alternate managers, I noticed that the time stamps on the database files matched the time when the system was brought down. Not conclusive, but I decided to take a gamble and force a cold-start by moving the current spool directories out of the way, creating new, empty spool directories with the right permissions, and restarting the daemons on the managers. And it worked: the schedulers came back up, albeit with empty queues, and started accepting and scheduling new jobs. Crisis averted.
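
For what it's worth, the spool shuffle can be sketched in a few lines of Python. This is a rough illustration rather than a record of what I actually typed: the function name `cold_start_spool` and any path you pass to it are hypothetical, it assumes a Unix-like system, and it assumes the daemons have already been stopped before the directory is touched and are restarted by hand afterwards.

```python
import os
import stat
import time
from pathlib import Path
from typing import Optional


def cold_start_spool(spool: Path, suffix: Optional[str] = None) -> Path:
    """Move `spool` aside and recreate it empty with the same mode and owner.

    Returns the path of the preserved copy. Hypothetical helper: it knows
    nothing about the scheduler and only manipulates the directory.
    """
    info = spool.stat()  # remember permissions and ownership before moving
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    backup = spool.with_name(f"{spool.name}.{suffix}")
    if backup.exists():
        raise FileExistsError(f"refusing to overwrite {backup}")

    spool.rename(backup)  # keep the suspect data rather than deleting it
    spool.mkdir()
    # mkdir's mode is filtered by the umask, so set the mode explicitly.
    os.chmod(spool, stat.S_IMODE(info.st_mode))
    try:
        os.chown(spool, info.st_uid, info.st_gid)
    except PermissionError:
        # Not privileged enough to restore ownership; fix it by hand.
        pass
    return backup
```

Renaming instead of deleting keeps the old spool around for a post-mortem, or for putting things back if the gamble doesn't pay off.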

So, today's lessons learnt: trashing the LoadLeveler scheduler spools isn't fatal, provided you don't mind losing all your jobs; when it's a choice between something, however minimal, and nothing, always choose the something; and when you're running the same level of software everywhere, you can expect to see the same bugs everywhere. Today's corollary: when asked how you solved the problem, always reply with a modified version of Gell-Mann's description of Feynman's problem solving method: "I wrote down the problem. I examined the logs. I thought very hard. I came up with the answer..."
