Jan. 13th, 2009

sawyl: (Default)
After someone semi-accidentally cut the power to one of our clusters yesterday, I passed a moderately interesting day trying to get everything back up and running.

Fortunately, there was no pressing need for the service to be restored, so we were able to go through the (deeply) complex cold-start instructions and shake out some of the bugs. Along the way we discovered a number of problems caused by timing and dependency issues between the different system components; a whole range of critical daemons that failed to restart correctly because they'd left stale lock files lying around all over the place; and a handful of hardware failures. I now think that we've got a decent handle on the order in which system components should be brought up; which components should be allowed to auto-boot and which should only boot under manual control; which daemons are critical to the system and should be added to the external monitoring software; how we might group together some of the components on the larger systems; and how we might be able to script up and automate the whole process at some later stage.
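As a rough illustration of what scripting the bring-up order might eventually look like, here is a minimal sketch that derives a valid start order from a table of dependencies and marks which components would be left for manual start. Everything in it is an assumption for the sake of the example: the component names, the dependency table and the manual-only set are all hypothetical and not a description of the real cluster. It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical components: each name maps to the set of components
# that must already be up before it can be started.
DEPENDS_ON = {
    "network": set(),
    "storage": {"network"},
    "scheduler": {"storage"},
    "compute-nodes": {"network", "storage"},
    "monitoring": {"network"},
}

# Hypothetical set of components that should only be started by hand.
MANUAL_ONLY = {"compute-nodes"}


def boot_plan(depends_on, manual_only):
    """Return (component, mode) pairs in an order that respects dependencies.

    TopologicalSorter raises graphlib.CycleError if the dependencies
    are circular, which is itself a useful check on the table.
    """
    order = TopologicalSorter(depends_on).static_order()
    return [(name, "manual" if name in manual_only else "auto") for name in order]


if __name__ == "__main__":
    for name, mode in boot_plan(DEPENDS_ON, MANUAL_ONLY):
        print(f"{mode:>6}  {name}")
```

Where several orders are valid, the sorter simply picks one of them; the only guarantee is that nothing is listed before the things it depends on.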

All in all, a rather fun way to pass what would otherwise have been a stultifyingly dull day.
