Just finished an interesting and surprisingly simple bit of work to enable topology-aware scheduling on the Cray XC40. The system is supposed to be placement neutral, but experimental evidence has shown that certain real-world models suffer a 10-15 per cent degradation in performance when they span multiple electrical groups.
The Aries interconnect on the XC uses a hierarchical dragonfly topology. At the very lowest level are blades, which communicate with each other using the Aries ASIC. At the next level, every blade in a chassis can communicate with every other blade via the backplane. Further up the hierarchy, chassis are linked together in groups of six via copper cables which run in two dimensions to create an electrical group. Finally, at the highest level, every electrical group has multiple fibre-optic links to every other electrical group.
The dragonfly topology is designed so that any node can reach any other in a small number of hops. The routing algorithms, however, don't necessarily use the minimum-cost route: in the event of congestion in the network, the router will randomly select another node in the system to use as a virtual root and then re-route traffic via it to find a new route to the destination. This means that routing between runs of an application is non-deterministic, because it depends on traffic elsewhere in the interconnect, which offers potential for varied performance, especially when the individual tasks of an application are placed relatively distant from each other.
Fortunately, PBS Pro provides a means to address this: placement sets. These use arbitrary labels which can be added to each node (or vnode in the case of the Cray) and which can be used to apply best-fit guidelines to the allocation of nodes within the system. Best of all, the steps required to enable this are trivial:
- create custom resources for each required placement element, e.g. blade, chassis, electrical
- tag each node or vnode in the system with its resources, e.g.
"set node cray_100 resources_available.electrical = g0", "set node cray_400 resources_available.electrical = g1" and so on until everything is labelled appropriately (a scripted sketch of this step follows the list)
- define a node_group_key on the server, e.g.
"set server node_group_key = electrical". Multiple keys can be specified as a comma-separated list, e.g. "set server node_group_key = \"blade,chassis,electrical\"", with the smallest group first
- enable placement rules with
"set server node_group_enable = true"
With placement enabled, PBS will attempt to assign the application to the first set that matches; if a particular set cannot be matched, PBS will move on to the next constraint until it finally reaches the implicit universal set which matches the entire system. This means that placement is treated as a preference rather than a hard scheduling rule.
It is possible for users to require a particular resource using the -l place=group=<resource> option to qsub, e.g. qsub -l place=group=electrical myjob.sh. Unlike a server placement set, a user-specified placement imposes a hard constraint on the job, preventing it from being started until the requirement can be satisfied. Importantly, it doesn't require a specific resource value to be available, only that all nodes assigned to the job share the same value of that resource.
The effectiveness of placement sets can be assessed by post-processing the PBS accounting data. Simply extract the exec_host field (or exec_vnode in the case of the Cray) from each end-of-job record and match the allocated hosts to their corresponding hardware groups. Examine a sample of data from before the change and a similar sample from after. If everything is working as expected, and the workload is mixed relative to the size of the placement sets, there should be a clear decrease in the number of jobs spanning placement sets.
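As a rough illustration of that post-processing, the sketch below scans PBS accounting logs for end-of-job ("E") records, pulls the exec_vnode field out of each one, and reuses the same hypothetical topology.csv mapping as above to count how many jobs spanned more than one electrical group. The parsing is deliberately simple and the file names are assumptions, so treat it as a starting point rather than a finished tool.

```python
#!/usr/bin/env python3
"""Count jobs whose allocated vnodes span more than one electrical group.

Usage: span_report.py /var/spool/pbs/server_priv/accounting/2024*
(the accounting log location varies between installations).
"""
import csv
import re
import sys

# vnode name inside each "(vnode:ncpus=...)" element of exec_vnode
VNODE_RE = re.compile(r"\(([^:)]+)")


def load_groups(path="topology.csv"):
    """Map vnode name -> electrical group, reusing the tagging CSV."""
    with open(path, newline="") as f:
        return {row["vnode"]: row["electrical"] for row in csv.DictReader(f)}


def groups_for_record(text, vnode_to_group):
    """Return the set of electrical groups touched by one job record."""
    match = re.search(r"exec_vnode=(\S+)", text)
    if not match:
        return set()
    vnodes = VNODE_RE.findall(match.group(1))
    return {vnode_to_group.get(v, "unknown") for v in vnodes}


def main(log_paths):
    vnode_to_group = load_groups()
    spanning = total = 0
    for path in log_paths:
        with open(path) as f:
            for line in f:
                # Accounting records are ';'-separated; 'E' marks job end.
                fields = line.rstrip().split(";", 3)
                if len(fields) < 4 or fields[1] != "E":
                    continue
                groups = groups_for_record(fields[3], vnode_to_group)
                if not groups:
                    continue
                total += 1
                if len(groups) > 1:
                    spanning += 1
    print(f"{spanning} of {total} jobs spanned more than one electrical group")


if __name__ == "__main__":
    main(sys.argv[1:])
```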