A war story
May. 15th, 2007 08:34 pm
There's something wonderfully compelling about other people's Unix horror stories. A feeling of there-but-for-the-grace-of-god and all that because, God knows, we've all been there. In that spirit, here's an account of one of my skirmishes.
More than a few years ago, I was asked to investigate a performance problem with one of the HiPPI channels that connected the T3Es together. I immediately thought of an old friend, vst, a simple utility that blasted data from one machine to another as fast as it possibly could, and pretty soon I'd fired up the client on Cray A and the sender on Cray B. My initial results were encouraging (I'd started to see problems with mbuf draining), so I began to increase the parallelism, running two, four and then eight streams in parallel.
Everything went perfectly, right up to the point where I attempted to run with 16 streams, whereupon I suddenly and simultaneously lost contact with both machines. Realising that something had gone Very Wrong, I examined the consoles only to find a slew of network errors on the Cray A console and the dreaded error, "0x27f - failed to report in", on the Cray B console.
After much prodding and poking, I was eventually able to coax the catatonic A back into life, but B turned out to be completely dead (not surprising given that 0x27f was the netdev actor) and we decided that we had to reboot. We managed to get the machine back up in time to start the afternoon schedule and, best of all, I'd managed to collect enough data to pinpoint the cause of the performance problem.