Brief foray into work this morning, primarily to attend a seminar on the challenges of exascale given by someone from the optimisation group. The presentation was engaging and interesting — a rarity, in my experience — and bolstered by some good real-world examples from ENDGame.
  • extrapolating from current systems, exascale machines are likely to require ~2GW to power and cool — more than the total output of the UK's largest nuclear power station. But commodity CPUs are inefficient and are only used for HPC because they are cheap and plentiful; as chip manufacturers shift their focus to low-power mobile computing, they may well solve the HPC power problem at the same time
  • main memory is power-hungry, slow, and getting slower relative to CPU clock speeds; systems are becoming increasingly NUMA, so task placement — often poorly supported by the OS and at odds with the single-task-per-CPU model used by MPI — is increasingly critical, and cache optimisation is more important than ever
  • developers need to understand where their algorithm allows them to sacrifice numerical precision for improved performance. It may be better to use single instead of double precision in an iterative solver to reduce memory bandwidth requirements and make more efficient use of cache (roughly sketched after the list). The key is to understand what level of approximation or error is good enough for the task at hand
  • recalculation may be better than lookup. When main memory is a long way away and lookups are expensive, it may be cheaper to repeat pieces of work (again, see the sketch below). Inlining may help reduce the complexity of this, but it has the potential to introduce bugs if the recalculation is not performed in the same way at every point in the program
  • GPUs are not the answer because:
    • the combination of CUDA and C/Fortran is unwieldy
    • memory layout is critical to good GPU performance, but the stride patterns are generally the exact opposite of those required for good CPU performance (illustrated after the list)
    • PCI express introduces a serious bottleneck: values need to be copied from main memory to GPU memory and back again
    • very substantial amounts of work are required to get any sort of speedup and even then, the gains on a flops per watt basis aren't really worth it, e.g. codes on Titan show a 2x speedup with a 2x increase in power after months of work
    • the rumours suggest that Tianhe-1A is so hard to use that its only real success has been running HPL
  • one-sided communications via PGAS offer significant gains in scalability compared to two-sided operations because they remove the need to interrupt computation on the target CPU (a minimal example follows the list). Good performance requires good support from the vendor, e.g. for UPC and Co-Array Fortran on the Cray XE6, while the current performance of one-sided MPI operations remains substantially worse than the alternatives
  • IO is likely to be a huge problem at exascale, but there are reasons to hope that work being done by other big data organisations, e.g. Google, will feed into HPC and lead to improved data management models
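
To make the precision trade-off concrete, here's a rough sketch of the sort of thing I took him to mean, rather than anything shown in the talk: a Jacobi-style sweep whose working arrays are held in single precision, halving the memory traffic and doubling the number of values per cache line, while the convergence check is accumulated in double:

```c
#include <stdio.h>
#include <stdlib.h>

#define N   1000000
#define TOL 1.0e-4

int main(void)
{
    /* Working arrays in single precision: half the memory traffic and
     * twice as many values per cache line compared with double. */
    float *u    = calloc(N, sizeof(float));
    float *unew = calloc(N, sizeof(float));
    float *f    = malloc(N * sizeof(float));

    for (int i = 0; i < N; i++)
        f[i] = 1.0f;                     /* arbitrary right-hand side */

    for (int iter = 0; iter < 1000; iter++) {
        double resid = 0.0;              /* convergence check in double */
        for (int i = 1; i < N - 1; i++) {
            unew[i] = 0.5f * (u[i-1] + u[i+1] + f[i]);
            double d = (double)unew[i] - u[i];
            resid += d * d;
        }
        float *tmp = u; u = unew; unew = tmp;   /* swap buffers */
        if (resid < TOL * TOL)
            break;
    }

    printf("u[N/2] = %f\n", u[N/2]);
    free(u); free(unew); free(f);
    return 0;
}
```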
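
And a made-up illustration of recalculation versus lookup: the table here is assumed to hold cos(i * dtheta), so both routines produce the same answer, and which is faster depends on whether the table has to be fetched from main memory:

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Lookup version: one extra load per element, which may miss all the way
 * out to DRAM if the table is large or has been evicted. */
double sum_lookup(const double *table, const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * table[i];
    return s;
}

/* Recompute version: more floating-point work but no extra memory traffic.
 * The catch mentioned above: if this expression gets inlined by hand in
 * several places, every copy has to stay identical or the results drift. */
double sum_recompute(const double *x, double dtheta, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * cos(i * dtheta);
    return s;
}

int main(void)
{
    const int n = 1 << 20;
    const double dtheta = 1.0e-3;
    double *x = malloc(n * sizeof(double));
    double *table = malloc(n * sizeof(double));

    for (int i = 0; i < n; i++) {
        x[i] = 1.0 / (i + 1);
        table[i] = cos(i * dtheta);   /* table holds the same values */
    }

    printf("lookup    %.12f\n", sum_lookup(table, x, n));
    printf("recompute %.12f\n", sum_recompute(x, dtheta, n));

    free(x); free(table);
    return 0;
}
```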
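
The stride conflict is essentially the old array-of-structures versus structure-of-arrays argument. Another hypothetical illustration, not from the seminar: a CPU loop is happy when each point's fields sit together on one cache line, while a GPU wants consecutive threads to touch consecutive addresses so that loads coalesce:

```c
#include <stdlib.h>

/* Array-of-structures: the natural CPU layout, where one thread walks
 * consecutive points and all of a point's fields share a cache line. */
struct point { double u, v, w; };

void scale_aos(struct point *p, double a, int n)
{
    for (int i = 0; i < n; i++) {        /* stride-1 over whole structs */
        p[i].u *= a; p[i].v *= a; p[i].w *= a;
    }
}

/* Structure-of-arrays: what a GPU wants, because thread i and thread i+1
 * then read adjacent addresses and the loads coalesce.  The same loop on
 * a CPU now walks three separate streams, so a point's fields no longer
 * share a cache line: a layout tuned for one side tends to hurt the other. */
void scale_soa(double *u, double *v, double *w, double a, int n)
{
    for (int i = 0; i < n; i++) {
        u[i] *= a; v[i] *= a; w[i] *= a;
    }
}

int main(void)
{
    int n = 1000;
    struct point *p = calloc(n, sizeof(struct point));
    double *u = calloc(n, sizeof(double));
    double *v = calloc(n, sizeof(double));
    double *w = calloc(n, sizeof(double));

    scale_aos(p, 2.0, n);
    scale_soa(u, v, w, 2.0, n);

    free(p); free(u); free(v); free(w);
    return 0;
}
```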
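
Finally, to show what "one-sided" means in practice, a minimal MPI RMA sketch (again mine, not the speaker's): rank 0 puts a value straight into a window exposed by rank 1, which never posts a matching receive:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    double buf = 0.0;        /* the memory each rank exposes */
    double val = 42.0;       /* the value rank 0 will push */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* every rank exposes one double as a remotely accessible window */
    MPI_Win_create(&buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0 && nprocs > 1)
        /* write val into rank 1's window; rank 1 posts no receive and
         * its computation is never interrupted */
        MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);   /* completes the transfer */

    if (rank == 1)
        printf("rank 1 has %g without ever calling MPI_Recv\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

PGAS languages like UPC and Co-Array Fortran express the same idea directly in the language rather than through library calls.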

Best of all, my own very minor contribution — that the complete HPC facility weighs in at 60 tonnes — got used in a throwaway comment!
