I've spent quite a lot of time over the last month or two investigating an interesting network performance problem involving three systems connected by a mixture of ethernet and infiniband.

Given a pair of machines connected together by a 10GE link, the network performance is perfectly acceptable. When I introduce a third machine connected to the second via a QDR infiniband link running with superpackets and transfer data from the first machine to the third, the available bandwidth appears to halve. When I enable path MTU discovery on the third machine to reduce the amount of fragmentation carried out by the machine in the middle, the bandwidth drops still further.
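
For anyone wanting to poke at something similar, here's a minimal sketch of a one-way throughput probe using plain Python sockets; the port and transfer sizes are placeholders rather than anything from the systems above, and a tool like iperf does the same job far more rigorously:

```python
# One-way TCP throughput probe: run receiver() on the destination node
# and sender("destination-host") on the source node. Port and sizes are
# illustrative only.
import socket
import time

PORT = 5201                  # hypothetical port (iperf3's default)
CHUNK = 64 * 1024            # application write size
TOTAL = 1 * 1024**3          # move 1 GiB per run

def receiver():
    with socket.create_server(("", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            # Drain the stream until the sender closes the connection.
            while conn.recv(CHUNK):
                pass

def sender(host):
    buf = b"\0" * CHUNK
    sent = 0
    start = time.monotonic()
    with socket.create_connection((host, PORT)) as s:
        while sent < TOTAL:
            s.sendall(buf)
            sent += CHUNK
    elapsed = time.monotonic() - start
    print(f"{sent * 8 / elapsed / 1e9:.2f} Gbit/s")
```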

My conclusions? I suspect that the difference in observed bandwidth between the two-node 10GE case and the three-node case with the QDR infiniband hop is caused by the overhead of fragmenting and forwarding the packets. I also think that the additional drop seen when the MTU is stepped down to the lowest common denominator - in this case 1500 bytes - is caused by infiniband's relatively poor performance when working with small packets.
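
A quick back-of-envelope check on the fragmentation side, assuming the middle machine is breaking maximal 64 KiB IPoIB superpackets down to a 1500-byte ethernet MTU (my guess at the sizes involved, not something confirmed against these particular systems):

```python
import math

def fragments(datagram_len, mtu, ip_header=20):
    """Count the IPv4 fragments needed to carry one datagram over a link
    with the given MTU: every fragment repeats the 20-byte IP header, and
    every fragment except the last carries a payload that is a multiple
    of eight bytes."""
    payload = datagram_len - ip_header       # original payload to carry
    per_frag = (mtu - ip_header) // 8 * 8    # usable payload per fragment
    return math.ceil(payload / per_frag)

# A maximal 65520-byte IPoIB superpacket over a 1500-byte MTU link:
n = fragments(65520, 1500)
print(n, "fragments,", f"~{n * 20 / 65520:.1%} extra header bytes")
# -> 45 fragments, ~1.4% extra header bytes
```

The extra header bytes are trivial, which suggests the cost is the per-fragment processing and forwarding work on the machine in the middle rather than the bytes themselves.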

Although this seems counter-intuitive — the default rule of network optimisation is to avoid packet fragmentation wherever possible — it seems to be backed up by IBM's documentation on superpackets, which states "[c]hanging the MTU value from the default settings can also have unexpected consequences so it is not recommended except for the most advanced user."
