A Lustre rabbit hunt
Jun. 21st, 2016 08:47 pm

Spent an interesting hour attacking a performance problem in an Iris application which ran faster on one system than another. Running the testcase through the profiler, I almost immediately identified a hotspot in biggus.save() and, intriguingly, noticed that it was taking 170 per cent longer on one machine than the other — enough to account for the disparity in performance.
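
A minimal sketch of that profiling step, assuming a hypothetical testcase function and filenames standing in for the real application (cProfile's cumulative listing is what surfaces a hotspot like this):

    # Profile the testcase and list the most expensive calls; the function
    # name, source and destination files are illustrative assumptions.
    import cProfile
    import pstats

    import iris

    def testcase(src="input.nc", dst="output.nc"):
        cubes = iris.load(src)
        iris.save(cubes, dst)  # the save path is where biggus.save() showed up hot

    cProfile.run("testcase()", "testcase.prof")
    pstats.Stats("testcase.prof").sort_stats("cumulative").print_stats(20)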

Tracing the IO calls I found that the mean and percentile times, which should have been identical, were higher on the machine where performance was worse. Wondering whether the problem might be exacerbated by the IO patterns of netCDF — the output was bimodal with peaks at 5KB and 1MB — I repeated the test with dd and a fixed buffer size; this showed the same signal and confirmed that the root cause of the problem lay with Lustre.
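
The dd invocation itself isn't reproduced here, but a rough Python stand-in for the same measurement — timing fixed-size writes to a file on the Lustre mount and reporting the mean and a crude 95th percentile — would look something like this; the target path, buffer size and write count are assumptions:

    # Time N fixed-size writes and report mean and 95th-percentile latency.
    # The path, buffer size and count below are placeholders.
    import os
    import time
    import statistics

    PATH = "/lustre/scratch/writetest.dat"   # a file on the Lustre mount
    BUF = b"\0" * (1 << 20)                  # 1 MiB, matching the larger peak
    N = 256

    latencies = []
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        for _ in range(N):
            t0 = time.perf_counter()
            os.write(fd, BUF)
            latencies.append(time.perf_counter() - t0)
    finally:
        os.close(fd)

    latencies.sort()
    print("mean %.5f s" % statistics.mean(latencies))
    print("p95  %.5f s" % latencies[int(0.95 * len(latencies))])  # crude percentile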

After confirming the OSTs were different in each case, I rechecked the Lustre parameters on the slow system; with hindsight, I should have done what my colleagues did and made this my first port of call. Unsurprisingly I found that both max_dirty_mb and max_rpcs_in_flight were set at their default values — 32 instead of 256 and 8 instead of 64 — and increasing these resolved the problem and brought the performance of the two machines into line with each other.
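
For the record, a small sketch of how those two tunables can be checked from a client; the /proc paths are an assumption and vary between Lustre versions (lctl get_param osc.*.max_dirty_mb and friends are the canonical interface), but the idea is the same:

    # Print max_dirty_mb and max_rpcs_in_flight for every OSC on this client.
    # The /proc location is an assumption; clients also expose the same
    # parameters through `lctl get_param osc.*.<name>`.
    import glob

    for name in ("max_dirty_mb", "max_rpcs_in_flight"):
        for path in sorted(glob.glob("/proc/fs/lustre/osc/*/" + name)):
            with open(path) as f:
                print(path, f.read().strip())

On the slow client these printed the defaults (32 and 8); raising them, for example with lctl set_param osc.*.max_dirty_mb=256 and osc.*.max_rpcs_in_flight=64, is the fix described above.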