sawyl: (Default)
Just as I was thinking about going home, my phone rang. Foolishly, I answered. As a result, I spent a good hour and a half trying to sort out a series of errors caused by an out-of-memory event on one of the cray_login nodes.

Fortunately, the circumstances of the problem were relatively clear and I was able to make a note of the affected jobs prior to purging them from the system. This meant that I was able to target the stale ALPS reservations left dangling by the job failures and immediately remove them with apmgr cancel. This wasn't entirely successful: around 40 ALPS reservations went into a pendCancel state and refused to clean up despite restarts of various daemons and an apmgr resync.

Eventually, I concluded that discretion was the better part of valour and simply placed all the affected nodes into admindown to prevent them from being used. When I resumed the work, things started to run without encountering any of the dreaded transient MPP errors we normally see when ALPS and PBS drift out of alignment, and I was finally able to go home, a couple of hours later than planned.
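For the record, the clean-up boils down to something like this (a sketch from memory, so check the exact apstat/apmgr/xtprocadmin syntax rather than trusting my recollection):

apstat -r                                # list the ALPS reservations and their state
apmgr cancel 12345                       # cancel a stale reservation by id
apmgr resync                             # push ALPS back into line with the batch system
xtprocadmin -k s admindown -n 40,41,42   # last resort: mark the affected nodes admindown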
sawyl: (A self portrait)
Up well before dawn to catch a taxi to Exeter airport in time to fly up to Edinburgh for today's meeting. Arriving at the airport too early to get through security — something I only discovered after I'd reached the front of the queue — I hung around in the lobby and waited for my flight to be called. I eventually gave up on the expected announcement and checked the screens, which told me to go to departures. After standing around in the massive queue of people waiting to go to Alicante, I realised I was going to miss my plane if I waited for the queue to go through. I dodged through the fast-lane and made it to the gate just as the boarding shuttle bus was about to depart.

After a short and uneventful journey, which I spent catching up on some much-needed sleep, I found myself in Manchester. I quickly located the gate for my onward connection to Edinburgh and joined the boarding queue. As I was waiting, I noticed DJ further down the line; perhaps I was in the right place at the right time on the right day after all. Despite living in Wigan, DJ had only just made it: something had happened on the motorway and, ten minutes from the airport, his taxi ground to a halt and sat in traffic as the minutes ticked away.

After another uneventful flight, we found ourselves in Edinburgh, where we met up with one of the Exeter Crayons — who'd flown up from Bristol — and their colleague from Reading. We located a large taxi and headed out to the university's Bush Estate campus, where the others pieced together a series of half-remembered landmarks in an attempt to get us closer to our destination. After a certain amount of random route testing — DJ remembered that the building was on its own somewhere and might have been close to the veterinary school, while PC said that he hadn't been since they replaced the single-track road — we eventually spotted an extremely discreet sign announcing the name of the building.

An anonymous research facility somewhere in Scotland...

The meeting was good, with some interesting presentations and a good discussion in the afternoon. During the lunch break we took a tour around the computer suite — very impressive, especially the elegant plant room, which came complete with a viewing window — which our hosts proudly told us was extremely energy efficient thanks to the use of free cooling.

Given that we could see the remnants of snow from the conference room, I wasn't particularly surprised that they were able to use ambient cooling for most of the year.

The break area had a series of components from retired systems, including a very familiar piece of equipment indeed: a node board from a liquid-cooled Cray T3E-900. They even had a selection of promotional mugs and what I assume is a bottle of Fluorinert.

It's amazing how quickly things have changed: in 1997 a board like this would have contained a single DEC Alpha processor and a few hundred megabytes of memory — if you were lucky; I seem to remember the limit was a couple of gigabytes, although we were limited to 128MB on the T3E-900 and 256MB on the 1200E — while a modern Broadwell node on the XC40 has 36 physical cores and 128GB of memory.

With the day wrapped up, I hopped in a taxi with DJ, the Crayons, and colleagues from Reading and headed to the airport, having failed to see anything more of Edinburgh than the bypass and the research lab. We arrived much too early — DJ was flying to Bristol an hour before the rest of us — and settled down at a cafe in departures while we waited to be called to the gate. We went our separate ways — the Reading group flying to Heathrow, my colleague from the South West to Bristol, and me back to Manchester.

A Flybe Dash 8 waiting to take me back to Manchester. Despite MB's pronouncements of gloom — he's always keen to point up how prone the Dash is to landing gear failure — I made it to Manchester without incident.

Despite a minor delay to the flight, I arrived with almost thirty minutes to spare and walked to my gate. After catching up on the messages that had appeared while I was offline, I boarded my last flight of the day where the person in the next seat recognised me from this morning! Nothing interesting happened on the way back, although the wind got up noticeably once we reached Exeter, making for a bumpy landing and a rather chilly homecoming. Needless to say my taxi booking hadn't been successful, so I asked the rep to put a journey on account. Annoyingly, although not particularly surprisingly — there is some sort of complicated tax ruling involved — the taxi could only take me to the office and I ended up springing for the rest of the journey myself.

I arrived home, around 17 hours after I'd left, after spending well over 10 hours in transit...
sawyl: (A self portrait)
Another intriguing Darshan gotcha, this time related to file locations and packaging problems. Trying to bundle the software up into an RPM, I encountered continual problems with one of the pkg-config files not being recognised by the package, despite appearing in the RPM spec file manifest.

After a certain amount of bafflement, I noticed the problem file wasn't being installed in the correct location within the RPM build-root; instead it was being installed in a subdirectory of the build-root whose path duplicated the build-root hierarchy a second time.

Examining the Makefile.am more closely, I quickly located the source of the problem:

bindir = $(DESTDIR)@bindir@
libdir = $(DESTDIR)@libdir@
pkgconfigdir = $(DESTDIR)$(libdir)/pkgconfig

Since $(DESTDIR), which is set to the build-root by RPM, has been added to the path of the libdir macro, including it for a second time in the pkgconfigdir macro has the effect of duplicating the build-root hierarchy, causing the manifest match to fail.

Fortunately all it took was a quick patch to snip out the extra reference, and everything built as expected — it's just annoying it took me as long as it did to notice the problem...
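The patch itself amounts to something like this (a sketch rather than the exact diff): since libdir already carries the build-root prefix, pkgconfigdir simply reuses it.

bindir = $(DESTDIR)@bindir@
libdir = $(DESTDIR)@libdir@
pkgconfigdir = $(libdir)/pkgconfig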

sawyl: (A self portrait)
While trying to build the latest version of the darshan IO profiling library on the Cray XC, I encountered a problem: following the addition of various bits of Lustre-specific code, the library no longer compiles with the vendor's compiler; instead it fails with an error message indicating that the code has attempted to use an invalid structure element.

Unwilling to believe the darshan code might be broken, I created a small testcase using the same ioctl calls and lustre/lustre_user.h macros. I was able to compile this successfully with the GNU C compiler but not with the Cray one. This eventually jogged a memory about a change in the C standard to add support for anonymous unions and, sure enough, a quick google confirmed that the feature is not present in C99 but was added in C11.

Once I was aware of this, I checked the man page for craycc only to discover the -h gnu option. This enables support for GCC-like extensions, including C11-style anonymous unions, and provided enough compatibility to compile the latest version of darshan with the Cray compiler.
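For reference, the feature at issue boils down to something like the following made-up fragment (not the actual Lustre structures), which fails on the compiler's default language level but builds with craycc -h gnu, or with any GNU-flavoured compiler:

/* anonymous unions: added in C11, long supported as a GNU extension */
#include <stdio.h>

struct info {
    int magic;
    union {             /* unnamed: fid and handle are accessed directly */
        int fid;
        long handle;
    };
};

int main(void)
{
    struct info i = { .magic = 1 };
    i.fid = 42;         /* a strict C99 compiler rejects this as an invalid member */
    printf("%d %d\n", i.magic, i.fid);
    return 0;
}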
sawyl: (A self portrait)
I wonder if this explains some of the disk performance problems we've been looking into...

sawyl: (A self portrait)
At CUG a few of us tried to determine the environmental impact of a supercomputer versus that of a cruise ship. After a quick bit of work with Wikipedia and the Top 500 list, we realised that the cited figure of 80+MW for a cruise liner dwarfs the roughly 17MW drawn by the most powerful machine in the world — most draw far less — making it an easy win for the supercomputer.

Today's Guardian, which features a piece on the environmental impact of cruise ships, cements the point. Citing the newly commissioned Harmony of the Seas — whose three main engines each have a peak power output exceeding that of Tianhe-2 — it notes:

According to its owners, Royal Caribbean, each of the Harmony’s three four-storey high 16-cylinder Wärtsilä engines will, at full power, burn 1,377 US gallons of fuel an hour, or about 96,000 gallons a day of some of the most polluting diesel fuel in the world.

...

[M]arine pollution analysts in Germany and Brussels said that such a large ship would probably burn at least 150 tonnes of fuel a day, and emit more sulphur than several million cars, more NO2 gas than all the traffic passing through a medium-sized town and more particulate emissions than thousands of London buses.

According to leading independent German pollution analyst Axel Friedrich, a single large cruise ship will emit over five tonnes of NOX emissions, and 450kg of ultra fine particles a day.
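As a rough back-of-envelope check, using standard ballpark figures for marine diesel: 1,377 US gallons an hour is about 5,200 litres an hour which, at roughly 38MJ per litre, works out at around 55MW of thermal input per engine. Three engines therefore burn fuel at something like 165MW, or 70-80MW of shaft power at typical engine efficiencies, which is consistent with the 80+MW figure above and four or five times the draw of Tianhe-2.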

So while it's true that supercomputers use a lot of electricity, it's nothing compared to a large cruise ship...

sawyl: (A self portrait)
Just finished an interesting and surprisingly simple bit of work to enable topology-aware scheduling on the Cray XC40. The system is supposed to be placement neutral, but experimental evidence has shown that certain real-world models suffer a 10-15 per cent degradation in performance when they span multiple electrical groups.

The Aries interconnect on the XC uses a hierarchical dragonfly topology. At the very lowest level are blades, which communicate with each other using the Aries ASIC. At the next level, every blade in a chassis can communicate with every other blade via the backplane. Further up the hierarchy, chassis are linked together in groups of six via copper cables, which run in two dimensions, to create an electrical group. Finally, at the highest level, every electrical group contains multiple fibre optic links to every other electrical group.

The dragonfly topology is designed to ensure that every node is within a small number of hops of every other. The routing algorithms, however, don't necessarily use the minimum-cost route: in the event of congestion in the network, the router will randomly select another node in the system to use as a virtual root and re-route traffic via it to find a new path to the destination. This means that routing between runs of an application is non-deterministic, because it depends on traffic elsewhere in the interconnect, offering the potential for variable performance, especially when the individual tasks of an application are placed relatively far apart.

Fortunately, PBS Pro provides a means to address this: placement sets. These use arbitrary labels, which can be added to each node — or vnode in the case of the Cray — and which can be used to apply best-fit guidelines to the allocation of nodes within the system. Best of all, the steps required to enable this are trivial:

  1. create custom resources for each required placement element, e.g. blade, chassis, electrical
  2. tag each node or vnode in system with its resource, e.g. "set node cray_100 resources_available.electrical = g0", "set node cray_400 resources_available.electrical = g1" and so on until everything is labelled appropriately
  3. define the node grouping key on the server, e.g. "set server node_group_key = electrical". Multiple keys can be specified using commas, e.g. "set server node_group_key = \"blade,chassis,electrical\"", with the smallest group first
  4. enable placement rules with "set server node_group_enable = true" (a consolidated qmgr sketch follows below)
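Pulling those steps together, the whole thing looks roughly like the following qmgr session (node names and group labels are illustrative, and on older PBS Pro releases the custom resources may need to be declared in resourcedef rather than created via qmgr):

# host-level custom resources for each placement element
qmgr -c "create resource blade type=string_array, flag=h"
qmgr -c "create resource chassis type=string_array, flag=h"
qmgr -c "create resource electrical type=string_array, flag=h"

# tag each vnode with the hardware it sits on
qmgr -c "set node cray_100 resources_available.electrical = g0"
qmgr -c "set node cray_400 resources_available.electrical = g1"

# smallest grouping first, then switch placement sets on
qmgr -c 'set server node_group_key = "blade,chassis,electrical"'
qmgr -c "set server node_group_enable = true"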

With placement enabled, PBS will attempt to assign the application to the first set that matches; if a particular set cannot be matched, PBS will move on to the next constraint until it finally reaches the implicit universal set which matches the entire system. This means that placement is treated as a preference rather than a hard scheduling rule.

It is possible for users to require a particular resource using the -l place=group=<resource> option to qsub, e.g. qsub -l place=group=electrical myjob.sh. Unlike a server placement set, a user-specified placement imposes a hard constraint on the job, preventing it from being started until the resource requirement can be satisfied — importantly, it doesn't require a specific resource value to be available, only that all the nodes assigned to the job share the same value.

The performance of placement sets can be determined by post-processing the PBS accounting data. Simply extract the exec_host field — or exec_vnode in the case of the Cray — and match the allocated hosts to their corresponding hardware groups. Examine a sample of data from before the change and a similar sample from after. If everything is working as expected — and the workload is mixed relative to the size of the placement sets — it should be possible to see a clear decrease in the number of jobs spanning placement sets.
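A rough sketch of the post-processing, assuming the usual PBS accounting record layout; the mapping file name and its "vnode group" format are purely illustrative:

# count how many completed jobs span more than one electrical group
import re, sys

# hypothetical mapping file with "vnode group" pairs, e.g. "cray_100 g0"
group_of = dict(line.split() for line in open("vnode_groups.txt"))

total = spanning = 0
for path in sys.argv[1:]:                      # PBS accounting files, e.g. 20160601
    for line in open(path):
        if ";E;" not in line:                  # only look at end-of-job records
            continue
        m = re.search(r"exec_vnode=(\S+)", line)
        if not m:
            continue
        vnodes = [c.split(":")[0].strip("()") for c in m.group(1).split("+")]
        groups = {group_of.get(v, "unknown") for v in vnodes}
        total += 1
        spanning += len(groups) > 1

print("%d of %d jobs span more than one group" % (spanning, total))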

sawyl: (A self portrait)
End of an era today, with the old supercomputers brought down for the last time. To mark the occasion — and the imminent departure of the vendor people — a group of us arranged to go out for supper at the Bay Leaf. The food was good, with even W, a true curry maven, pronouncing himself impressed with a garlic paneer dish which they augmented for him with a couple of naga chillies.

Suitably fed, we shed a few people and continued on to the nearest watering hole — George's Meeting House — where the others swilled back a few pints. I pushed off at about half-nine, W left shortly after when his girlfriend and two-month-old son came by to pick him up, with the others planning to round off the evening with a few whiskies...
sawyl: (A self portrait)
This is amusing: NOAA are buying a new supercomputer from IBM, but because IBM don't actually make things anymore, the provision of the hardware has been subcontracted to Cray...
sawyl: (A self portrait)
The news is finally out: the Met Office are spending 100 million pounds on a new Cray!

ETA: A handful of links:
Best of all, it's also made the Guardian's passnotes.
sawyl: (A self portrait)
At the supermarket this morning, I noticed the bid for the next supercomputer upgrade had made the front page of the local newspaper...
sawyl: (A self portrait)
I'm rather happy with the way my backfill model has come out. Using a very simple algorithm that does almost no forward planning, I've been able to come up with something that keeps my simulated system almost fully utilised and produces a very cute allocation map at the end of its run.

Allocation example )

A plot of the utilisation of each step — nothing more than the percentage of allocated nodes — shows that while there are sufficient jobs waiting in the input queue, the scheduler is able to keep the system over 98 per cent busy.

Utilisation example )

Obviously the situation is not realistic — job times are assumed to be accurate and all the jobs are known ahead of time, rather than appearing as the simulation evolves — but it provides a useful way of studying the effects of a few basic parameters on job allocation and scheduler efficiency, as well as pointing up some interesting areas for further study.
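For anyone curious about the general flavour, here is a toy sketch of a greedy step with almost no lookahead (purely an illustration of the idea, not the actual model):

# at each time step, start any queued job that fits in the nodes currently free
def simulate(jobs, total_nodes, max_steps=1000):
    queue = list(jobs)                 # (name, nodes, runtime), in submission order
    running = []                       # (name, nodes, finish_step)
    utilisation = []
    for step in range(max_steps):
        running = [j for j in running if j[2] > step]      # retire finished jobs
        free = total_nodes - sum(n for _, n, _ in running)
        for job in list(queue):                            # greedy first-fit pass
            name, nodes, runtime = job
            if nodes <= free:
                running.append((name, nodes, step + runtime))
                free -= nodes
                queue.remove(job)
        utilisation.append(100.0 * (total_nodes - free) / total_nodes)
        if not queue and not running:
            break
    return utilisation

jobs = [("a", 8, 4), ("b", 6, 2), ("c", 4, 3), ("d", 2, 5), ("e", 10, 1)]
for step, u in enumerate(simulate(jobs, total_nodes=12)):
    print("step %2d: %5.1f%% of nodes allocated" % (step, u))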
sawyl: (A self portrait)
Following on from our attempts to get parallel HDF5 to pass its regression suite I've found that if I replace Parallel Environment MPI with MPICH, all the weird data mismatch failures disappear and the library passes its tests. This seems to rule out the OS, compilers, and file system as sources of the problem, and to bring the focus back on to PE MPI and its IO routines...
sawyl: (A self portrait)
We've spent the last few weeks battling to get versions of HDF & netCDF built and through their regression suites on the Power 7 under AIX 7.1 using the IBM XL compilers. In the process we've discovered that:

  • the version of config.guess supplied with the software did not return the correct OS value for our platform, causing some of the compiler and linker tests to fail in non-obvious ways
  • configure won't even try to build shared versions of the libraries unless you add -Wl,-brtl to LDFLAGS
  • the header file supplied with GPFS v3.5 won't compile with the compiler in stdc99 mode; to get GPFS support to work, it is necessary to add -qlanglvl=extc99 to CFLAGS
  • the Fortran compiler chokes on fixed-format F77 unless you set the FC environment variable to xlf_r
  • the MPI-IO version of HDF5 always seems to fail its regression suite. Some of the failures appear to be caused by our version of MPI — we've seen a number of internal "not owner" messages after an MPI_Type_commit call — and others appear to indicate that the values being written out do not match the values being read back in — something that isn't exactly reassuring in an IO library.

We think we've finally got all the serial libraries built, in both static and shared forms, and the regression tests look OK, but it's been a long slog that required almost every nasty trick in the porting handbook...
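For the record, the build environment ended up looking roughly like this (paths and configure options are illustrative and from memory rather than copied from the real script):

export OBJECT_MODE=64                 # 64-bit objects throughout (an assumption)
export CC=xlc_r FC=xlf_r              # thread-safe XL front-ends
export CFLAGS="-qlanglvl=extc99"      # extended C99 so the GPFS header compiles
export LDFLAGS="-Wl,-brtl"            # allow runtime linking for the shared libraries
./configure --prefix=/usr/local/hdf5 --enable-shared --enable-fortran
make && make check && make install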

sawyl: (A self portrait)
A busy few days trying to trace the source of a substantial degradation in the performance of the batch subsystem. Examining the dispatch cycles — normally 10-20 seconds but now inflated to 130-180 seconds — I immediately suspected a bad job but was unable to trace the cause until I enabled full and performance debugging. With this switched on, I noticed that each dispatch cycle logged over 330,000 lines referring to a specific job which, when examined, proved to have requested a set of resources that could never be satisfied.

Knowing the likely cause of the problem, I slapped an admin hold on the job and the cycle times promptly dropped back to something like normal. My sense of timing was, as ever, impeccable: I fixed the problem in time for a couple of the others to push off for their Christmas lunch, while I hung around for as long as I could to make sure that everything had settled before pushing off to keep my appointment in town. Excelsior!
sawyl: (A self portrait)
My new workstation arrived today and proved an immediate success. It's made it painfully apparent that most of the performance problems I'd assumed were a feature of the gnome desktop were actually due to resource starvation, as were almost all the other little things that've been hampering my efficiency for most of the last six months.

In other news, I had an interesting discussion with Dr S about the problems of threading and performance on NUMA machines. He pointed out that threaded code doesn't scale much beyond 8 processors on the P7 unless you take steps to ensure the memory being used by each thread is physically close to the CPU where the thread is being run. It's obvious once you've bothered to think about it, but it does rather undercut the benefits of naive OpenMP parallelisation and the dynamic load balancing model that goes with it...
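The textbook illustration of the point is "first touch" placement: a minimal sketch, not anything from the actual codes in question, and note that some operating systems need explicit memory-affinity settings before first touch behaves this way.

/* initialise the array with the same parallel schedule as the compute loop, so
   each page is faulted in on the NUMA node of the thread that will later use it */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = (size_t)1 << 27;
    double *a = malloc(n * sizeof *a);

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;                    /* first touch: pages land near their threads */

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * i;                /* compute loop reuses thread-local pages */

    printf("%f\n", a[n - 1]);
    free(a);
    return 0;
}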
sawyl: (A self portrait)
Despite the cynicism of Osborne's announcement that civil servants will no longer get automatic pay rises — they went out with the ark — there was a certain amount of hope attached to the news that amid the cuts BIS have agreed to fund the next Met Office supercomputer upgrade. No numbers, obviously, but it's definitely a positive sign...
sawyl: (A self portrait)
Having considered recent difficulties, I've concluded that most of our problems have been brought on by a lack of data. So, as a good scientist, I've tried to come up with a way to remedy the situation. My solution? To run a series of tests that involve deliberately breaking the data mirror by taking half the logical disks out of service for a fixed period and timing the resynchronisation process.

According to my theory, this ought to show whether the resync time is proportional to the duration of the split — and, assuming a constant load on the file system, by implication to the amount of data written during the period — or whether it is dependent on the size of the metadata, which can be assumed to be relatively constant over the period.

If the initial experiment indicates that the time to resynchronise varies with the duration, then it ought to be possible to repeat the tests with gradually increasing intervals and to use the resulting data to create a model. If the resync time appears to be constant, at least within the time frame of the split, then we can be relatively confident that, for short outages at least, we can predict the duration and hence the impact of an outage of the mirror.

While I'm not sure anyone will actually go for this — the impact of the data gathering process may be too severe to tolerate — I think it's worth pitching...
sawyl: (A self portrait)
Via insideHPC, a Washington Post piece on the US National Weather Service's plans to increase their supercomputing capacity. But what I really found interesting was a link to a 2012 blog posting by Cliff Mass criticising the performance of NWS' model when compared with ECMWF and the Met Office.

Mass' comments remind me of a discussion I had with someone from another met service — not the NWS — who was talking earnestly & excitedly about their forthcoming procurement and what a huge difference it was going to make to their supercomputing capacity. The scientist I was with breezily dismissed this, saying "You don't need more computer capacity. It's not going to help because all your models are at least 20 years out of date..." It may have been true, but it left the other completely crushed.
sawyl: (Default)
Brief foray into work this morning, primarily to attend a seminar on the challenges of exascale given by someone from the optimisation group. The presentation was engaging and interesting — a rarity, in my experience — and bolstered by some good real-world examples from ENDGame.
  • extrapolating from current systems, exascale machines are likely to require ~2GW to power and cool — more than the total output of the UK's largest nuclear power station. But commodity CPUs are inefficient and are only used for HPC because they are cheap and plentiful; as chip manufacturers shift their focus to low power mobile computing, they may well solve the HPC power problem at the same time
  • main memory is power hungry, slow and getting slower relative to CPU clock speeds; systems are becoming increasingly NUMA, so task placement — often poorly supported by the OS and at odds with the single task per CPU model used by MPI — is increasingly critical and cache optimisation is more important than ever
  • developers need to understand where their algorithm allows them to sacrifice numerical precision for improved performance. It may be better to use single instead of double precision in an iterative solver in order to reduce the memory bandwidth requirements and to make more efficient use of the cache. The key is to understand what level of approximation or error is good enough for the current task
  • recalculation may be better than lookup. When main memory is a long way away and lookups are expensive, it may be cheaper to repeat pieces of work. Inlining may help reduce the complexity of this, but it has the potential to introduce bugs if the recalculation is not performed in the same way at every point in the program
  • GPUs are not the answer because:
    • the combination of CUDA and C/Fortran is unwieldy
    • memory layout is critical to good GPU performance, but the stride patterns are generally the exact opposite to those required for good CPU performance
    • PCI express introduces a serious bottleneck: values need to be copied from main memory to GPU memory and back again
    • very substantial amounts of work are required to get any sort of speedup and even then, the gains on a flops per watt basis aren't really worth it, e.g. codes on Titan show a 2x speedup with a 2x increase in power after months of work
    • the rumours suggest that Tianhe-1A is so hard to use that its only real success has been running HPL
  • one-sided communications via PGAS offer significant gains in scalability when compared to two-sided operations because they remove the need to interrupt computation on the target CPU. Good performance requires good support from the vendor, e.g. in UPC and co-array Fortran on the Cray XE6, while the current performance of one-sided MPI operations remains substantially worse than other offerings
  • IO is likely to be a huge problem at exascale, but there are reasons to hope that work being done by other big data organisations, e.g. Google, will feed into HPC and lead to improved data management models

Best of all, my own very minor contribution — that the complete HPC facility weighs in at 60 tonnes — got used in a throwaway comment!
