Just finished an interesting and surprisingly simple bit of work to enable topology-aware scheduling on the Cray XC40. The system is supposed to be placement neutral, but experimental evidence has shown that certain real-world models suffer a 10-15 per cent degradation in performance when they span multiple electrical groups.
The Aries interconnect on the XC uses a hierarchical dragonfly topology. At the very lowest level are blades, which communicate with each other using the Aries ASIC. At the next level, every blade in a chassis can communicate with every other blade via the backplane. Further up the hierarchy, chassis are linked together in groups of six via copper cables which run in two dimensions to create an electrical group. Finally, at the highest level, every electrical group has multiple fibre-optic links to every other electrical group.
The dragonfly topology is designed so that any node can reach any other in a small number of hops. The routing algorithms, however, don't necessarily use the minimum-cost route: in the event of congestion in the network, the router will randomly select another node in the system to use as a virtual root and then re-route traffic via it to find a new route to the destination. This means that routing between runs of an application is non-deterministic, because it depends on traffic elsewhere in the interconnect, which offers potential for varied performance, especially when the individual tasks of an application are placed relatively distant from each other.
Fortunately, PBS Pro provides a means to address this: placement sets. These use arbitrary labels which can be added to each node (or vnode in the case of the Cray) and which can be used to apply best-fit guidelines to the allocation of nodes within the system. Best of all, the steps required to enable this are trivial:
- create custom resources for each required placement element, e.g. blade, chassis, electrical
- tag each node or vnode in the system with its resources, e.g.
"set node cray_100 resources_available.electrical = g0", "set node cray_400 resources_available.electrical = g1" and so on until everything is labelled appropriately (a scripted sketch of this step follows the list)
- define a node_group_key on the server, e.g.
"set server node_group_key = electrical". Multiple keys can be specified as a comma-separated list, e.g. "set server node_group_key = \"blade,chassis,electrical\"", with the smallest group first
- enable placement rules with
"set server node_group_enable = true"
With placement enabled, PBS will attempt to assign the application to the first set that matches; if a particular set cannot be matched, PBS will move on to the next constraint until it finally reaches the implicit universal set which matches the entire system. This means that placement is treated as a preference rather than a hard scheduling rule.
It is possible for users to require a particular resource using the -l place=group=<resource> option to qsub, e.g. qsub -l place=group=electrical myjob.sh. Unlike a server placement set, a user-specified placement imposes a hard constraint on the job, preventing it from being started until the requirement can be satisfied. Importantly, it doesn't require a specific resource value to be available, only that all nodes assigned to the job share the same value of that resource.
The effectiveness of placement sets can be assessed by post-processing the PBS accounting data. Simply extract the exec_host field (or exec_vnode in the case of the Cray) from each end-of-job record and match the allocated hosts to their corresponding hardware groups. Examine a sample of data from before the change and a similar sample from after. If everything is working as expected, and the workload is mixed relative to the size of the placement sets, there should be a clear decrease in the number of jobs spanning placement sets.
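As a rough illustration of that post-processing, the sketch below scans PBS accounting logs for end-of-job ("E") records, pulls the exec_vnode field out of each one, and reuses the same hypothetical topology.csv mapping as above to count how many jobs spanned more than one electrical group. The parsing is deliberately simple and the file names are assumptions, so treat it as a starting point rather than a finished tool.

```python
#!/usr/bin/env python3
"""Count jobs whose allocated vnodes span more than one electrical group.

Usage: span_report.py /var/spool/pbs/server_priv/accounting/2024*
(the accounting log location varies between installations).
"""
import csv
import re
import sys

# vnode name inside each "(vnode:ncpus=...)" element of exec_vnode
VNODE_RE = re.compile(r"\(([^:)]+)")


def load_groups(path="topology.csv"):
    """Map vnode name -> electrical group, reusing the tagging CSV."""
    with open(path, newline="") as f:
        return {row["vnode"]: row["electrical"] for row in csv.DictReader(f)}


def groups_for_record(text, vnode_to_group):
    """Return the set of electrical groups touched by one job record."""
    match = re.search(r"exec_vnode=(\S+)", text)
    if not match:
        return set()
    vnodes = VNODE_RE.findall(match.group(1))
    return {vnode_to_group.get(v, "unknown") for v in vnodes}


def main(log_paths):
    vnode_to_group = load_groups()
    spanning = total = 0
    for path in log_paths:
        with open(path) as f:
            for line in f:
                # Accounting records are ';'-separated; 'E' marks job end.
                fields = line.rstrip().split(";", 3)
                if len(fields) < 4 or fields[1] != "E":
                    continue
                groups = groups_for_record(fields[3], vnode_to_group)
                if not groups:
                    continue
                total += 1
                if len(groups) > 1:
                    spanning += 1
    print(f"{spanning} of {total} jobs spanned more than one electrical group")


if __name__ == "__main__":
    main(sys.argv[1:])
```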