[personal profile] sawyl
Spent an interesting hour attacking a performance problem in an Iris application which ran faster on one system than another. Running the test case through the profiler, I almost immediately identified a hotspot in biggus.save() and, intriguingly, noticed that it was taking 170 per cent longer on one machine than the other, enough to account for the disparity in performance.
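For anyone wanting to repeat the profiling step, here is a minimal sketch using the standard library's cProfile. I have not recorded which profiler was actually used, so treat this as one possible approach; slow_save() is a hypothetical stand-in for the real test case and profile_call() is just an illustrative helper name.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sorting by cumulative time makes an expensive call such as a save()
    # routine float to the top of the report.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_save(n):
    # Hypothetical stand-in for the real hotspot; it only burns CPU time.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    _, report = profile_call(slow_save, 200_000)
    print(report)
```

Running the same script on both machines and comparing the cumulative times for the suspect function gives a like-for-like figure.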

Tracing the IO calls, I found that the mean and percentile times, which should have been identical, were higher on the machine where performance was worse. Wondering whether the problem might be exacerbated by the IO patterns of netCDF (the output was bimodal, with peaks at 5KB and 1MB), I repeated the test with dd and a fixed buffer size; this showed the same signal and confirmed that the root cause of the problem lay with Lustre.
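The fixed-buffer test can be approximated in Python as well. This is a rough sketch rather than the dd invocation itself: time_writes() and summarise() are illustrative names, and the block size and count in the usage section are arbitrary assumptions, not the values from the original test.

```python
import os
import statistics
import tempfile
import time


def time_writes(path, block_size, count):
    """Write count blocks of block_size bytes; return the seconds taken per block."""
    block = b"\0" * block_size
    timings = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            written = 0
            # os.write may write fewer bytes than requested, so loop until done.
            while written < block_size:
                written += os.write(fd, block[written:])
            timings.append(time.perf_counter() - start)
        os.fsync(fd)  # flush to storage so the file is complete on disk
    finally:
        os.close(fd)
    return timings


def summarise(timings):
    """Return the mean and selected percentiles of the per-block timings."""
    cuts = statistics.quantiles(timings, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(timings),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


if __name__ == "__main__":
    # Point the directory at the filesystem under test; the default temporary
    # directory used here is only a placeholder.
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, "probe.bin")
        print(summarise(time_writes(target, 1024 * 1024, 64)))
```

Because the buffer size is fixed, any difference in the mean or percentiles between the two machines cannot be blamed on the application's write pattern.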

After confirming the OSTs were different in each case, I rechecked the Lustre parameters on the slow system; with hindsight, I should have done what my colleagues did and made this my first port of call. Unsurprisingly, I found that both max_dirty_mb and max_rpcs_in_flight were set at their default values (32 instead of 256 and 8 instead of 64), and increasing these resolved the problem and brought the performance of the two machines into line with each other.
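A small sketch of how the check might be automated. It assumes the settings have been dumped with something like lctl get_param osc.*.max_dirty_mb osc.*.max_rpcs_in_flight, which prints one name=value pair per line; the device names in the sample are made up, find_untuned() is an illustrative helper, and the targets are simply the values mentioned above.

```python
# Target values taken from the post; anything lower is reported as untuned.
TARGETS = {"max_dirty_mb": 256, "max_rpcs_in_flight": 64}


def find_untuned(lctl_output, targets=TARGETS):
    """Return (device, parameter, value) for each setting below its target."""
    untuned = []
    for line in lctl_output.splitlines():
        name, sep, value = line.strip().partition("=")
        if not sep:
            continue  # skip blank lines and anything that is not name=value
        device, _, param = name.rpartition(".")
        value = value.strip()
        # Only plain integer values are compared; other formats are ignored.
        if param in targets and value.isdigit() and int(value) < targets[param]:
            untuned.append((device, param, int(value)))
    return untuned


# Hypothetical sample output; real device names will differ.
SAMPLE = """\
osc.scratch-OST0000-osc-ffff8800aa1b2c00.max_dirty_mb=32
osc.scratch-OST0000-osc-ffff8800aa1b2c00.max_rpcs_in_flight=8
osc.scratch-OST0001-osc-ffff8800aa1b2c00.max_dirty_mb=256
osc.scratch-OST0001-osc-ffff8800aa1b2c00.max_rpcs_in_flight=64
"""

if __name__ == "__main__":
    for device, param, value in find_untuned(SAMPLE):
        print(f"{device}: {param}={value} (target {TARGETS[param]})")
```

Any device it reports can then be raised with lctl set_param, for example osc.*.max_dirty_mb=256, bearing in mind that a change made this way does not survive a remount unless it is also made persistent.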
