A Lustre rabbit hunt
Jun. 21st, 2016 08:47 pm

Spent an interesting hour attacking a performance problem in an Iris application which ran faster on one system than another. Running the testcase through the profiler, I almost immediately identified a hotspot in biggus.save() and, intriguingly, noticed that it was taking 170 per cent longer on one machine than the other — enough to account for the disparity in performance.
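
A minimal sketch of that profiling step, assuming a hypothetical testcase function and filenames standing in for the real application (cProfile's cumulative listing is what surfaces a hotspot like this):

    # Profile the testcase and list the most expensive calls; the function
    # name, source and destination files are illustrative assumptions.
    import cProfile
    import pstats

    import iris

    def testcase(src="input.nc", dst="output.nc"):
        cubes = iris.load(src)
        iris.save(cubes, dst)  # the save path is where biggus.save() showed up hot

    cProfile.run("testcase()", "testcase.prof")
    pstats.Stats("testcase.prof").sort_stats("cumulative").print_stats(20)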

Tracing the IO calls I found that the mean and percentile times, which should have been identical, were higher on the machine where performance was worse. Wondering whether the problem might be exacerbated by the IO patterns of netCDF — the output was bimodal with peaks at 5KB and 1MB — I repeated the test with dd and a fixed buffer size; this showed the same signal and confirmed that the root cause of the problem lay with Lustre.
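
The dd invocation itself isn't reproduced here, but a rough Python stand-in for the same measurement — timing fixed-size writes to a file on the Lustre mount and reporting the mean and a crude 95th percentile — would look something like this; the target path, buffer size and write count are assumptions:

    # Time N fixed-size writes and report mean and 95th-percentile latency.
    # The path, buffer size and count below are placeholders.
    import os
    import time
    import statistics

    PATH = "/lustre/scratch/writetest.dat"   # a file on the Lustre mount
    BUF = b"\0" * (1 << 20)                  # 1 MiB, matching the larger peak
    N = 256

    latencies = []
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        for _ in range(N):
            t0 = time.perf_counter()
            os.write(fd, BUF)
            latencies.append(time.perf_counter() - t0)
    finally:
        os.close(fd)

    latencies.sort()
    print("mean %.5f s" % statistics.mean(latencies))
    print("p95  %.5f s" % latencies[int(0.95 * len(latencies))])  # crude percentile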

After confirming the OSTs were different in each case, I rechecked the Lustre parameters on the slow system; with hindsight, I should have done what my colleagues did and made this my first port of call. Unsurprisingly I found that both max_dirty_mb and max_rpcs_in_flight were set at their default values — 32 instead of 256 and 8 instead of 64 — and increasing these resolved the problem and brought the performance of the two machines into line with each other.
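
For the record, a small sketch of how those two tunables can be checked from a client; the /proc paths are an assumption and vary between Lustre versions (lctl get_param osc.*.max_dirty_mb and friends are the canonical interface), but the idea is the same:

    # Print max_dirty_mb and max_rpcs_in_flight for every OSC on this client.
    # The /proc location is an assumption; clients also expose the same
    # parameters through `lctl get_param osc.*.<name>`.
    import glob

    for name in ("max_dirty_mb", "max_rpcs_in_flight"):
        for path in sorted(glob.glob("/proc/fs/lustre/osc/*/" + name)):
            with open(path) as f:
                print(path, f.read().strip())

On the slow client these printed the defaults (32 and 8); raising them, for example with lctl set_param osc.*.max_dirty_mb=256 and osc.*.max_rpcs_in_flight=64, is the fix described above.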