Meltdown over meltdown
Jan. 4th, 2018 09:03 pm
Listening to the BBC news this morning and hearing about Meltdown/Spectre, I felt a profound sinking feeling. Sure enough, I lost almost my entire day to dealing with it. The problem itself is pretty interesting, but the mitigation is far more intriguing: to what extent is the fix likely to impact the performance of large-scale parallel jobs? I guess the answer depends on how much overhead it adds to MPI calls — the driver layer typically runs in user space with some sort of OS bypass — and what impact it has on IO throughput.
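For the curious, here's a minimal sketch (not something from my actual day, and assuming a Linux box with a C compiler) of the kind of micro-benchmark one might use to get a feel for the extra cost: the kernel page-table isolation patch adds work on every kernel entry and exit, so timing a cheap system call in a loop on a patched versus unpatched kernel gives a rough per-syscall figure. Syscall-heavy IO paths would feel that overhead; user-space MPI with OS bypass should largely sidestep it.

```c
/* Rough, illustrative micro-benchmark: time a cheap system call in a tight
 * loop. The Meltdown mitigation (KPTI) adds a page-table switch on every
 * kernel entry/exit, so comparing the per-call cost on patched vs. unpatched
 * kernels hints at the overhead syscall-heavy workloads would see. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const long iterations = 1000000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        /* Use the raw syscall so glibc's cached getpid() result
         * doesn't short-circuit the kernel entry. */
        syscall(SYS_getpid);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                      + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per syscall\n", elapsed_ns / iterations);
    return 0;
}
```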
The afternoon was mostly spent talking to an endless parade of people — my visitor's chair doubling as a psychiatrist's couch. One of them, whom I've been working with to run some very high resolution global simulations, came by to tell me they'd found a bug in the model. Apparently the adaptive mesh software which calculates the routing table degrades pathologically when the resolution drops below 8 kilometres, so the jobs we were attempting to run wouldn't have worked even if we had been able to get them to schedule...