LoadLeveler: Day 1
Nov. 18th, 2008 09:57 pm

First day of a two, or possibly three, day course on LoadLeveler, IBM's batch processing software. After we got to grips with the rather strange terminology, which has more than a whiff of the Iron Age about it, we ran through which daemons did what and how everything hung together, looked at a few basic admin tasks, and I ran my first MPI job.
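For my own reference, the MPI job went through a LoadLeveler job command file along roughly these lines. This is a sketch from memory, so the class name, limits and executable are placeholders rather than our actual settings:

    #!/bin/sh
    # Minimal LoadLeveler job command file for a small MPI job.
    # (Sketch only: class, sizes and executable are placeholders.)
    # @ job_name         = first_mpi_test
    # @ job_type         = parallel
    # @ class            = parallel
    # @ node             = 2
    # @ tasks_per_node   = 4
    # @ wall_clock_limit = 00:10:00
    # @ output           = $(job_name).$(jobid).out
    # @ error            = $(job_name).$(jobid).err
    # @ queue

    # POE starts the MPI tasks described by the keywords above
    poe ./hello_mpi

Submitted with llsubmit and watched with llq until it ran.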
The most interesting point of contention concerned symmetric multi-threading (SMT) and its impact on performance. Naively, it seems as though a system with 32 real CPUs should outperform a system with 64 apparent CPUs courtesy of SMT, but there are good reasons why this might not be so. For example, if an application requires 64 processors and can either be run across two nodes on real CPUs or, with SMT, within a single node on 64 apparent CPUs, performance may actually improve in the latter case because running within a single node greatly reduces communication costs. But this assumes (a) that the application is not requesting 64 CPUs because it requires 128 GB of memory; and (b) that the application's memory access patterns allow two threads to make efficient use of a single physical CPU.
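In job command file terms, the two placements come down to something like the following resource requests (again sketched from memory, and assuming nodes with 32 physical CPUs):

    # 64 tasks spread across two nodes, one task per physical CPU:
    # @ node            = 2
    # @ tasks_per_node  = 32

    # 64 tasks packed into one node, two tasks per physical CPU via SMT:
    # @ node            = 1
    # @ tasks_per_node  = 64

Whether the second form actually wins is exactly what caveats (a) and (b) are about.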
My suspicion (and it can only be a suspicion, since I'm not really a hardcore applications person) is that SMT will benefit vector applications, given that it allows memory latencies to be hidden in a somewhat similar way. More particularly, given the Unified Model's historic sensitivity to interconnect latencies, I wonder whether it might not benefit from being run across a smaller number of nodes, if that turns out to improve communication performance.
It'll be interesting to see what sort of results the applications people turn up and, in turn, how they affect the way we tune the batch system.