sawyl: (Default)
[personal profile] sawyl
There's something wonderfully compelling about other people's Unix horror stories. A feeling of there-but-for-the-grace-of-god and all that because, God knows, we've all been there. In that spirit, here's an account of one of my skirmishes.

More than a few years ago, I was asked to investigate a performance problem with one of the HiPPI channels that connected the T3Es together. I immediately thought of an old friend, vst, a simple utility that blasted data from one machine to another as fast as it possibly could, and pretty soon I'd fired up the client on Cray A and the sender on Cray B. My initial results were encouraging: I'd started to see problems with mbuf draining, so I began to increase the parallelism, running two, four and then eight streams at once.
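
For anyone who hasn't met this kind of tool, here is a rough sketch of the idea in Python. It is not vst, whose source and options aren't shown here: the names `run_test`, `blast` and `sink` are invented for illustration, and it pushes bytes over loopback TCP rather than a HiPPI channel. All it does is open N streams, send a fixed number of bytes down each as fast as it can, and report the total received and the time taken.

```python
import socket
import threading
import time

CHUNK = 64 * 1024  # bytes per send/recv call (an arbitrary choice)


def sink(server, n_streams, totals):
    """Accept n_streams connections and count the bytes received on each."""
    def drain(conn, idx):
        with conn:
            while True:
                data = conn.recv(CHUNK)
                if not data:  # sender closed its end of the stream
                    break
                totals[idx] += len(data)

    workers = []
    for idx in range(n_streams):
        conn, _ = server.accept()
        worker = threading.Thread(target=drain, args=(conn, idx))
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()


def blast(addr, nbytes):
    """Send nbytes of zeros to addr as fast as the socket will take them."""
    payload = b"\0" * CHUNK
    with socket.create_connection(addr) as sock:
        sent = 0
        while sent < nbytes:
            n = min(CHUNK, nbytes - sent)
            sock.sendall(payload[:n])
            sent += n


def run_test(n_streams, nbytes_per_stream, host="127.0.0.1"):
    """Run n_streams parallel senders; return (total bytes received, seconds)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the OS pick a free port
    server.listen(n_streams)
    addr = server.getsockname()

    totals = [0] * n_streams
    sink_thread = threading.Thread(target=sink, args=(server, n_streams, totals))
    sink_thread.start()

    start = time.perf_counter()
    senders = [
        threading.Thread(target=blast, args=(addr, nbytes_per_stream))
        for _ in range(n_streams)
    ]
    for sender in senders:
        sender.start()
    for sender in senders:
        sender.join()
    sink_thread.join()
    elapsed = time.perf_counter() - start

    server.close()
    return sum(totals), elapsed
```

Calling `run_test(2, 1 << 20)`, then `run_test(4, 1 << 20)`, then `run_test(8, 1 << 20)` mirrors the doubling of streams described above. Dividing the byte count by the elapsed time gives a crude aggregate throughput figure for each run.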

Everything went perfectly, right up to the point where I attempted to run with 16 streams, whereupon I suddenly and simultaneously lost contact with both machines. Realising that something had gone Very Wrong, I examined the consoles only to find a slew of network errors on the Cray A console and the dreaded error, "0x27f - failed to report in", on the Cray B console.

After much prodding and poking, I was eventually able to coax the catatonic A back into life but B turned out to be completely dead — not surprising given that 0x27f was the netdev actor — and we decided that we had to reboot. We managed to get the machine back up in time to start the afternoon schedule and, best of all, I'd managed to collect enough data to pinpoint the cause of the performance problem.
