Another war story
May. 16th, 2007 08:06 pm

Another ancient war story from the Cray era, but this time, events were somewhat less self-induced than yesterday's nightmare.
We'd just completed a triumphantly successful upgrade of Unicos/mk and were looking forward to a few days of peace and quiet. Instead, our calm was disrupted by the news that PE 0x37 had failed to report in and various others had issued WaitOnCirculate warnings. No problem, we thought, we'll just use the newly introduced renumber command to switch the bad PE with a set of four higher up the machine.
Initially, the renumber appeared to have worked, but after a few minutes the renumbered PEs panicked with auto-prod errors. As if this wasn't bad enough, GRM — the global resource manager — dropped out with GpeCall failure errors, the machine stopped accepting new logins and NQS claimed that the entire batch system, including the queues, had vanished. It was definitely time to dump the machine.
The dump started off normally, but failed whilst attempting to dump LPE 0x37 — a classic sign of a serious hardware error. So, what to do? We needed a dump to be able to determine the cause of the initial error and the renumber failure, but if we couldn't dump 0x37, we couldn't dump the rest of the machine. Eventually, we configured out the PE with
t3ems and restarted the dump. This time, we got hundreds of errors about LPE to PWHO mismatches — probably because t3ems had reordered the LPE values so that they no longer matched the in-core PWHOs — but the dump eventually completed and we were able to hand the machine over to the engineers for maintenance.
After an hour of hammering and nailing, the engineers pronounced the machine good and we started to boot the machine up. We made it to single user without any problems but, when we attempted to
fsck the file systems, one completely refused to fix itself despite clearing thousands of inodes. We eventually decided that it was more important to get the machine up than to sort out a single file system, so we commented it out of fstab and went into multi-user. With the machine up, we were able to hack on the disc and get it into a suitable state for mounting, whereupon we discovered that the quota information was completely screwed and that we couldn't recover it while the file system was mounted but still commented out of the fstab. After much hard graft, we finally recovered the quotas, fixed the trashed files and got the file system back in production, 12 hours after the start of the problem.

There's an amusing corollary to the story. I was due to attend a routine user liaison meeting in the afternoon, to tell the users how wonderfully everything was going and how smoothly the upgrade had gone. In the end, we had to defer the meeting because no-one from our group was available to attend...
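For anyone who hasn't had to pull the "comment it out of fstab" trick, it amounts to something like the following sketch. The device names, mount point and filesystem type here are hypothetical stand-ins; the real Unicos/mk entries are long gone.

```shell
# Build a demo fstab (entries are made-up, not the real Cray config).
printf '%s\n' \
    '/dev/dsk/root /   nc1fs rw 1 1' \
    '/dev/dsk/u1   /u1 nc1fs rw 1 2' > /tmp/fstab.demo

# Disable the damaged /u1 entry so the multi-user boot skips it:
sed 's|^/dev/dsk/u1|#&|' /tmp/fstab.demo > /tmp/fstab.fixed
cat /tmp/fstab.fixed
```

With the entry commented out, the boot-time fsck and mount pass over the broken filesystem, and you can come back to it by hand once the machine is up.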