sawyl: (Default)
[personal profile] sawyl
Yesterday, some of the scientists uncovered an interesting performance problem in the variational assimilation model. They reported that, in most cases, a model iteration took a constant 8 seconds when running on ~768 CPUs. Occasionally, though, the iteration times would spike, sometimes by enough to cause the model to run out of time; yet when these models were rerun, the timings would almost always return to normal. Needless to say, they were rather keen to find a solution to the problem.

Provided with a series of test cases by one of the data assimilation gurus, I located a probable candidate and went over the systems it was running on to check for the usual symptoms of an OS problem (heavy swapping, high levels of kernel activity, logical/physical CPU reordering, etc.), only to find nothing obvious. Having ruled out an explicit OS problem, I then examined each of the test cases in turn and noticed that all of the problem cases shared a common node. Once I'd identified the potential source of the problem, I was able to split the model runs into two groups, based on whether I expected to see problems or not. I then passed these on to my scientific colleague, who was able to confirm that the additional cases I'd identified did indeed run slowly and that the other cases had all been OK.
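
The common-node check described above boils down to a set intersection. What follows is only a minimal sketch of the idea, not the procedure actually used: it assumes each test case is recorded as a run name, the set of nodes it ran on, and a flag saying whether it was slow. The run names, node names, and function names are all made up for illustration, and discarding any node that also appears in a normal run is an extra refinement of my own.

```python
def common_nodes(runs):
    """Return nodes present in every slow run and absent from every normal run.

    `runs` is a list of (name, nodes, is_slow) tuples (a hypothetical layout).
    """
    slow = [set(nodes) for _, nodes, is_slow in runs if is_slow]
    normal = [set(nodes) for _, nodes, is_slow in runs if not is_slow]
    if not slow:
        return set()
    # Nodes shared by all of the slow runs.
    suspects = set.intersection(*slow)
    # A node that also hosted a normal run is unlikely to be the culprit.
    for nodes in normal:
        suspects -= nodes
    return suspects


def split_runs(allocations, suspects):
    """Split runs into (expected_slow, expected_ok) by whether they touch a suspect node."""
    expected_slow = sorted(n for n, nodes in allocations.items() if set(nodes) & suspects)
    expected_ok = sorted(n for n, nodes in allocations.items() if not set(nodes) & suspects)
    return expected_slow, expected_ok


# Illustrative data only: none of these names come from the real system.
runs = [
    ("case_a", {"n001", "n002", "n017"}, True),
    ("case_b", {"n003", "n017", "n020"}, True),
    ("case_c", {"n001", "n003", "n020"}, False),
]
suspects = common_nodes(runs)

allocations = {"case_d": {"n017", "n030"}, "case_e": {"n030", "n031"}}
expected_slow, expected_ok = split_runs(allocations, suspects)
```

The two lists returned by `split_runs` correspond to the two groups of model runs handed over for confirmation: the first predicts which further cases should run slowly, the second which should be fine.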

With the focus of the problem finally clear in my mind, I decided that the most likely cause was the interconnect, given the known sensitivity of the variational assimilation models to communication problems. Sure enough, when I finally got around to querying the state of the InfiniBand switches, I found a whole load of symbol errors and retransmits against my problem node, which gave the final validation to my hypothesis.
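
The final check, looking for ports with unusual error counts, can be sketched in a few lines. This assumes the switch counters have already been collected into a dictionary keyed by port; how they are gathered is not shown, and the key names `symbol_errors` and `retransmits`, the port names, and the numbers are all hypothetical placeholders rather than real tool output.

```python
def flag_ports(counters, threshold=0):
    """Return ports whose symbol errors or retransmits exceed `threshold`, worst first.

    `counters` maps a port name to a dict with hypothetical keys
    "symbol_errors" and "retransmits".
    """
    totals = {
        port: c["symbol_errors"] + c["retransmits"]
        for port, c in counters.items()
        if c["symbol_errors"] > threshold or c["retransmits"] > threshold
    }
    # Sort by combined error count so the worst port comes first.
    return sorted(totals, key=totals.get, reverse=True)


# Made-up counter values for illustration only.
counters = {
    "n016": {"symbol_errors": 0, "retransmits": 0},
    "n017": {"symbol_errors": 1523, "retransmits": 88},
    "n018": {"symbol_errors": 0, "retransmits": 2},
}
flagged = flag_ports(counters)
```

A non-zero threshold is probably more useful in practice, since a handful of errors on a long-running link need not mean anything; the right cut-off depends on the fabric and on how long the counters have been accumulating.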

All in all, a rather fun little exercise in problem solving...
