Diagnostics and interconnects
Oct. 1st, 2009 07:09 pm
There's something to be said for supercomputers that use commodity interconnects, not least that they generally come with a set of decent online diagnostic tools.
The Cray T3E, with its custom interconnect, would crash with R-option errors at the hint of corrupt data passing through the torus, and the only method for diagnosing the source of the problem involved churning through a complete system dump using crashmk. The NEC, while substantially better in reliability terms and often able to ride out interconnect faults provided they occurred on the node side of the connection, also suffered from a complete lack of online interconnect diagnostics: it was impossible to tell whether errors were occurring or what the traffic flow was across the system.
Thus, it's rather a nice change to have a system based on a commodity network, InfiniBand, which comes complete with error and traffic counters on the switches and the ability to report both bad tokens being passed across the network and failed CRC checks on the data packets. It's also nice to have a system whose interconnect uses completely independent planes, allowing components of the system to fail without causing all the nodes to crash.
It's all most unsupercomputerish.