sawyl: (A self portrait)
With a staycation coming up next week, I spent today trying to finish off as much work as possible before going away. Despite having had everything ready for quite some time, my attempts were hampered by the absence of the machine: the system had been taken down for hardware maintenance on Thursday and hadn't been brought back up. Thus most of the morning was spent waiting for the machine to bounce, a process which reinitialises the Aries links and loads the BIOS on the nodes. While the bounce process scales to some extent, it can take a long time on a large system, especially when, as inevitably seems to happen, some components fail to start, causing the command to time out and retry.

With the machine eventually up and the OS booted, I spent a frantic couple of hours applying my changes and updating my documentation to make it clear what I'd changed. I managed to get the PBS server configured and the queues defined before installing our multi-layered set of hooks. As far as my naive tests were concerned, everything seemed to work as expected, but the system really needs a full shakedown once the performance tests are complete and I can re-enable the hooks across the board.
sawyl: (A self portrait)
After an early night yesterday, as recommended by my medical professional, I felt a lot better despite being woken pre-dawn by the local herring gulls — presumably getting excited over abandoned kebabs. Fortunately, getting up and closing the window shut out the racket and I managed to fit some more sleep in before the alarm went off.

The morning was taken up with an interesting problem: determining the levels of backfill in PBS, matching both placed jobs and backfilled jobs with projects, and attempting to show whether one group was being backfilled more than another. Combining scheduler log entries with start records from the accounting data, I was able to assemble enough information for each project to determine the raw job counts, the number of nodes, and the number of node seconds for both placed and backfilled jobs. These confirmed that there had been a period when a small number of groups had benefited from backfill, but without more information about the queues it was hard to tell exactly why the situation changed partway through the data.
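
The aggregation itself is straightforward. The following is a rough sketch rather than the actual script: the accounting parsing is simplified, the project is assumed to live in the job's group field, and the set of backfilled job IDs is assumed to have been extracted from the scheduler logs separately. It works from end-of-job records, since those carry both the timestamps and the node count:

  from collections import defaultdict

  def end_records(path):
      """Yield (jobid, attrs) for each end-of-job 'E' accounting record."""
      with open(path) as f:
          for line in f:
              # accounting lines look like: timestamp;type;jobid;key=value ...
              parts = line.rstrip("\n").split(";", 3)
              if len(parts) != 4 or parts[1] != "E":
                  continue
              jobid, rest = parts[2], parts[3]
              attrs = dict(kv.split("=", 1) for kv in rest.split() if "=" in kv)
              yield jobid, attrs

  def tally(path, backfilled):
      """Aggregate job counts, nodes and node-seconds per (project, kind),
      where kind is 'backfilled' or 'placed'."""
      stats = defaultdict(lambda: {"jobs": 0, "nodes": 0, "node_secs": 0})
      for jobid, attrs in end_records(path):
          project = attrs.get("group", "unknown")
          nodes = int(attrs.get("Resource_List.nodect", 0))
          secs = int(attrs.get("end", 0)) - int(attrs.get("start", 0))
          s = stats[(project, "backfilled" if jobid in backfilled else "placed")]
          s["jobs"] += 1
          s["nodes"] += nodes
          s["node_secs"] += nodes * secs
      return stats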

I now wonder whether there might be some more interesting ways to use this data. I think it might be useful to maintain the order of jobs in the priority list, so when each backfilled job is correlated with a project, it is possible to determine whether the job has been run ahead of other, higher priority candidates. This might be enough to report on the level of FIFO-ness in the queue and thus the impact of priority ordering on the overall throughput of work for particular project groups.
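
As a first stab at quantifying this (purely a sketch, assuming the priority rank each job held at start time can be reconstructed from the scheduler logs), the level of FIFO-ness could be reduced to an inversion count:

  def fifo_ness(ranks):
      """Score how FIFO-like a sequence of job starts was.

      ranks holds the priority rank (0 = highest) each job had when it
      started, listed in actual start order. A strict FIFO scheduler
      produces an ascending sequence; every out-of-order pair is an
      inversion, typically a backfilled job jumping the queue. Returns
      1.0 for perfect FIFO, tending towards 0.0 as ordering breaks down.
      """
      n = len(ranks)
      if n < 2:
          return 1.0
      inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                       if ranks[i] > ranks[j])
      return 1.0 - inversions / (n * (n - 1) // 2)

Computing the score per project, rather than across the whole queue, would then show which groups see their work reordered most.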

With work over, I went down to the quay to meet up with L&A for a few routes. Having assured me that they were going to be on time because A was driving — something L said was code for only being 10 minutes late! — they were actually there well ahead of time and had to wait for me to get changed. We climbed some fun stuff, including a couple of difficult 6b+ TR routes which I comprehensively mis-read and consequently failed to flash.

After the others had tired a bit, we went into the boulder rooms and did a few problems. I successfully repeated the dyno problem I'd finished at the weekend — I'm now confident I've got it nailed — and the others tried the first couple of moves on the 40 degree stamina circuit. They did pretty well considering they were climbing front-on — I can only manage somewhere between 20 and 30 moves on the easiest circuit if I go full on with knee drops — but we soon called it a night, not least because having all that weight on rough, juggy holds does horrible things to your hands in next to no time.
sawyl: (A self portrait)
Just finished an interesting and surprisingly simple bit of work to enable topology-aware scheduling on the Cray XC40. The system is supposed to be placement neutral, but experimental evidence has shown that certain real-world models suffer a 10-15 per cent degradation in performance when they span multiple electrical groups.

The Aries interconnect on the XC uses a hierarchical dragonfly topology. At the lowest level are blades, which communicate with each other using the Aries ASIC. At the next level, every blade in a chassis can communicate with every other blade via the backplane. Further up the hierarchy, chassis are linked together in groups of six via copper cables which run in two dimensions to create an electrical group. Finally, at the highest level, every electrical group has multiple fibre optic links to every other electrical group.
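
To make the hierarchy concrete, here is a hypothetical sketch which assumes the standard cX-YcCsSnN Cray component naming and two-cabinet electrical groups; the placement labels for a node can then be derived mechanically from its cname:

  import re

  # cX-YcCsSnN: cabinet X in row Y, chassis C, slot (blade) S, node N;
  # assuming two cabinets (six chassis) per group, group = cabinet // 2.
  CNAME = re.compile(r"c(\d+)-(\d+)c(\d+)s(\d+)n(\d+)")

  def placement_labels(cname):
      """Return (blade, chassis, electrical) labels for a node's cname."""
      cab, row, chassis, slot, node = map(int, CNAME.match(cname).groups())
      return ("c%d-%dc%ds%d" % (cab, row, chassis, slot),   # blade
              "c%d-%dc%d" % (cab, row, chassis),            # chassis
              "g%d" % (cab // 2))                           # electrical group

  print(placement_labels("c3-0c1s11n2"))   # -> ('c3-0c1s11', 'c3-0c1', 'g1')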

The dragonfly topology is designed to keep every node within a small number of hops of every other. The routing algorithm, however, doesn't necessarily use the minimum-cost route: in the event of congestion in the network, it will randomly select another node in the system to use as a virtual root and re-route traffic via this node to the destination. This means that routing between runs of an application is non-deterministic, because it depends on traffic elsewhere in the interconnect, offering the potential for varied performance, especially when individual tasks in an application are placed relatively distant from each other.

Fortunately, PBS Pro provides a means to address this: placement sets. These use arbitrary labels which can be added to each node — or vnode in the case of the Cray — and which can be used to apply best-fit guidelines to the allocation of nodes within the system. Best of all, the steps required to enable this are trivial (a consolidated qmgr sketch appears after the list):

  1. create custom resources for each required placement element, e.g. blade, chassis, electrical
  2. tag each node or vnode in the system with its resource, e.g. "set node cray_100 resources_available.electrical = g0", "set node cray_400 resources_available.electrical = g1" and so on until everything is labelled appropriately
  3. define a node grouping key in the server, e.g. "set server node_group_key = electrical". Multiple keys can be specified using commas, e.g. "set server node_group_key = \"blade,chassis,electrical\"", with the smallest group first
  4. enable placement rules with "set server node_group_enable = true"
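
Pulled together, the whole configuration amounts to a handful of qmgr commands. This is a sketch only, using the hypothetical node names and group labels from the list above and assuming host-level string resources:

  qmgr -c 'create resource blade type=string, flag=h'
  qmgr -c 'create resource chassis type=string, flag=h'
  qmgr -c 'create resource electrical type=string, flag=h'
  qmgr -c 'set node cray_100 resources_available.electrical = g0'
  qmgr -c 'set node cray_400 resources_available.electrical = g1'
  qmgr -c 'set server node_group_key = "blade,chassis,electrical"'
  qmgr -c 'set server node_group_enable = true'

The blade and chassis tags are elided here for brevity; note too that "create resource" only appeared in relatively recent versions of PBS Pro, with older releases defining custom resources in the server's resourcedef file instead.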

With placement enabled, PBS will attempt to assign the application to the first set that matches; if a particular set cannot be matched, PBS will move on to the next constraint until it finally reaches the implicit universal set which matches the entire system. This means that placement is treated as a preference rather than a hard scheduling rule.

It is possible for users to require a particular resource using the -l place=group=<resource> option to qsub, e.g. qsub -l place=group=electrical myjob.sh. Unlike a server placement set, a user-specified placement imposes a hard constraint on the job, preventing it from being started until the resource requirement can be satisfied. Importantly, it doesn't require a specific resource value, only that all the nodes assigned to the job share the same value of the resource.

The performance of placement sets can be determined by post-processing the PBS accounting data. Simply extract the exec_host field — or exec_vnode in the case of the Cray — and match the allocated hosts to their corresponding hardware groups. Examine a sample of data from before the change and a similar sample from after. If everything is working as expected — and the workload is mixed relative to the size of the placement sets — it should be possible to see a clear decrease in the number of jobs spanning placement sets.
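
As a sketch of that post-processing, assuming a group_of mapping built from the tags applied earlier and the usual exec_vnode format of parenthesised chunks chained with plus signs:

  import re
  from collections import Counter

  def vnodes(exec_vnode):
      """Pull the vnode names out of an exec_vnode string, which chains
      chunks of the form (vnode:resource=value:...) with '+' signs."""
      return re.findall(r"\(([^:)+]+)", exec_vnode)

  def spanning_counts(exec_vnode_strings, group_of):
      """Histogram jobs by the number of electrical groups they touch.

      group_of maps each vnode name to its group label, i.e. the same
      tags applied with qmgr earlier.
      """
      counts = Counter()
      for ev in exec_vnode_strings:
          groups = {group_of[v] for v in vnodes(ev)}
          counts[len(groups)] += 1
      return counts

Run over the before and after samples, a fall in the proportion of jobs touching more than one group would suggest the placement sets are doing their job.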

sawyl: (A self portrait)
Breaking my holiday to go in to work this morning for a meeting with Altair. Very much a flying visit, both for them and for me, but it was useful to talk over some ideas with them — especially since they had suggestions which might let us optimise our configuration and improve the performance of some of the code we've added to allow us to preempt things.
sawyl: (A self portrait)
Still very much frazzled — I haven't done a fraction of the things I absolutely positively have to do this week, never mind the stuff that's discretionary — but I really got into the spirit of today's training and discussions once we'd got past the basic details, and had some very useful conversations with the people from Altair.
sawyl: (A self portrait)
Working with PBS and qmgr again for the first time in a while, I've found myself beset by a constant temptation to use show instead of print to display the server values — an atavistic trait that I've realised has its roots in the distant days of traditional Cray NQS; something that obviously had such an impact on me that it has been forever burnt into my unconscious muscle memory.
