sawyl: (Default)
Looking into the generally sluggish turnaround of the batch system, I decided that the problem was probably down to long cycle times in the negotiator daemon's dispatch code and that it was being exacerbated by full debug logging. To prove my hypothesis, I decided to write something to plot out the cycle times before and after changing the debug flags. However, while the script proved easy enough to write, the sizes of the log files meant that the performance simply wasn't good enough. So I sat down with the python profiler and timeit to see if I couldn't hurry things along.

The profiler showed that the bulk of the time was being spent in the log file parsing function. No great surprise there — the function was crunching through hundreds of gigs of data to pick out the cycle times. What was surprising was the relatively high cost of the endswith() string method; because it was being called for almost every line of input, it was dominating the performance of the program. So, optimisation number one: reorder the string comparisons to break out of the matching early and replace the endswith() with a less expensive in check.

With that done, dateutil.parser had become more prominent in the profiler listing. I tried a couple of easy tricks, removing some unnecessary rounding operations and trimming back the length of the date string to be parsed, but extracting the date of each log entry was still far too expensive. So, optimisation number two: replace the generalised dateutil.parser routine with a custom bit of code and incorporate caching to avoid re-parsing when the date hasn't changed beyond the precision being used.
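
The result looked something like the sketch below. It isn't the actual script, since the log format and the marker string here are invented, but it shows the shape of both changes: a cheap in test in place of endswith(), and a cached fixed-position timestamp parse in place of dateutil.parser.

_last_prefix = None
_last_stamp = None

def parse_stamp(line):
    """Parse a leading 'MM/DD HH:MM:SS' timestamp without dateutil,
    caching the result so an unchanged prefix is only parsed once."""
    global _last_prefix, _last_stamp
    prefix = line[:14]                  # assumed fixed-width timestamp field
    if prefix == _last_prefix:          # cache hit: nothing to re-parse
        return _last_stamp
    _last_prefix = prefix
    _last_stamp = (int(prefix[0:2]), int(prefix[3:5]),      # month, day
                   int(prefix[6:8]), int(prefix[9:11]),     # hour, minute
                   int(prefix[12:14]))                      # second
    return _last_stamp

def is_cycle_marker(line):
    # a membership test is cheaper than endswith(); the marker text is made up
    return "Completed negotiation cycle" in line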

As a result of the changes, the time required to run my testcase dropped to around an eighth of its original value, bringing the time taken to analyse a complete set of logs down to a mere 700 seconds. Then, when I finally plotted out the distilled data, I noticed that as soon as I disabled full debugging the cycle time dropped from around 20-30 seconds to 3-4 seconds and stayed stable. Excelsior!
sawyl: (Default)
After much consideration and many, many experiments I think I've finally cracked the annoying performance problem I've been investigating for most of the past couple of weeks. Essentially, the problem was that a series of standard builds which used to complete in around 900 seconds had suddenly started taking much longer than expected — up to three times longer in some cases.

With the times varying from routine to routine across runs, indicating that the problem was independent of the amount of compute work being performed, I immediately suspected that the root cause might lie somewhere in the IO subsystem. I ran a series of experiments: parallel builds with a range of different source, destination and compiler configurations, all bundled up in a resubmitting job chain to control for the usual diurnal fluctuations in load. Initial results were disappointing until I added a step to control for the location of the temporary directory. This generated a clear signal: changing $TMPDIR to point to something other than the default location — the root directory of a busy GPFS file system optimised for large block IO — gave noticeably superior performance.

A further set of experiments, this time using a trivial bit of code to time the performance of the mkstemp() call under a range of different conditions, confirmed the problem: the large block GPFS file systems offered suboptimal performance when working with small files, and the performance of the call fell off noticeably as the target directory filled up with files. This last test offered some interesting insights into the behaviour of GPFS metadata caching, with the performance falling off when all the inodes were created in quick succession but not if a few minutes were allowed to pass between file creation runs (presumably because this gives sufficient time for the metadata to synchronise).
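
The timing probe itself needs nothing more than mkstemp() in a loop. A minimal sketch, in which the target directories and the iteration count are assumptions:

import os
import tempfile
import time

def time_mkstemp(directory, count=1000):
    """Time a burst of temporary file creations in the given directory."""
    start = time.time()
    for _ in range(count):
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        os.unlink(path)
    return time.time() - start

# hypothetical mount points standing in for the real file systems
for target in ("/gpfs/largeblock/tmp", "/gpfs/smallblock/tmp", "/tmp"):
    print(target, time_mkstemp(target))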

All of which suggests that it ought to be possible to significantly improve compilation performance by changing the location of $TMPDIR to point to a user-specific empty directory on a GPFS file system optimised for small block IO.

ETA: Finally traced the root cause of the problem back to an application configuration file that was changed a couple of weeks ago, overriding the default value of $TMPDIR and changing it to point to the GPFS file system. Mystery solved!
sawyl: (Default)
A few months ago I wrote a python script to count the words, sentences, paragraphs, and word frequencies in my MA thesis. Having written it in something of a hurry, I discovered when I returned to it yesterday that it wasn't terribly efficient and didn't scale terribly well when run against larger samples of text. So I decided to take a look with the profiler to see if I couldn't improve things.

I took the raw program and profiled it using the largest chunk of text I had to hand, George Eliot's Middlemarch, as it happens. I noticed almost immediately that the code spent most of its time in the routine that updated the word frequency hash, and that this routine in turn spent most of its time running a series of regular expressions to clean off any non-alphanumeric characters prior to updating the frequency hash. Suspecting that this wasn't very efficient, I decided to benchmark a few alternatives.
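
The profiling step itself is nothing more elaborate than the standard library profiler. A minimal sketch, in which count_words() and the file name stand in for the real routine and input:

import cProfile
import pstats

# profile the word-frequency counter against a large text and list the
# twenty most expensive calls by cumulative time
cProfile.run("count_words(open('middlemarch.txt').read())", "wordfreq.prof")
pstats.Stats("wordfreq.prof").sort_stats("cumulative").print_stats(20)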

My original code looked something like the following:

import re

def case1(word):
    word = re.sub("^\\W*", "", word)
    word = re.sub("\\W*$", "", word)
    return word

So I decided to replace the pair of substitutions with a single one:

def case2(word):
    return re.sub("^\\W*(\\w+?)\\W*$", "\\1", word)

But I discovered that this decreased performance by 25 per cent. So I decided to try a simple regexp match instead:

def case3(word):
    m = re.match("^\\W*(\\w+?)\\W*$", word)
    if m:
        return m.group(1)

This gave a 160 per cent improvement on the original. But I realised I could do still better if I abandoned regexps altogether and replaced them with some plain string processing:

def case4(word):
    return word.strip(",.?!:;'\"")

Sure enough, this gave a 1200 per cent improvement over the original re.sub() code, which reduced the cost of the function to the point where a re.split() in the caller became one of the dominant costs. Once this was replaced with an almost equivalent set of string splits, I found that the elapsed time of the program had been more than halved and the problem had moved from being CPU bound to being IO bound.
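
A harness along the following lines is enough to produce the sort of comparisons quoted above. It assumes the case functions are defined in the same script, and the sample word and iteration count are arbitrary:

import timeit

sample = '"middlemarch,"'
for name in ("case1", "case2", "case3", "case4"):
    elapsed = timeit.timeit("%s(%r)" % (name, sample),
                            setup="from __main__ import %s" % name,
                            number=100000)
    print(name, elapsed)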

The moral of the story, then, is that although regular expressions are often useful, they can significantly limit scalability and should be avoided in codes where performance is likely to be important.
sawyl: (Default)
Short final day, which involved running through the basic performance tuning methodology one last time and working through a couple of example cases.

The first study, a synthetic example from IBM, involved running through PerfPMR output in an attempt to determine whether the system was overloaded. After going through the output and spotting an absurd amount of IO being done to the root file system, we quickly decided that the system would benefit from a reconfiguration of the disc devices to prevent data IO from contending with journal IO.

The second study was a real world example the instructor had picked up in a quid pro quo deal from a former course attendee, who'd asked why their application showed poor response during peak hours. We ran through a few spreadsheets of nmon output, guessing from the number of fabric attached discs that the machine was probably running some sort of RDBMS, and noticed substantial peaks in CPU usage during normal working hours. Looking closer, we saw that the system was clocking up large amounts of time waiting for IO, but that this failed to coincide with large amounts of file IO; when we looked at the physical disc activity, most of the traffic was going to the root volume group, suggesting heavy paging. This was confirmed by one of the later spreadsheets, which showed the system moving hundreds of pages to and from the backing store every second. We discussed some of the obvious tunings — buy more memory; change the amount of internal caching used by the database to force data to be reloaded from the, presumably faster, fabric discs rather than caching it on the relatively slow internal SCSI discs.

And with that, we called it a day, munched down our final round of course sandwiches and went home.
sawyl: (Default)
The day began with an attempt to solve the appalling network performance problems we encountered during yesterday's lab. Poking around, we quickly noticed that although the operating system reported the adapters as being gigabit ethernet, the media settings had actually been set to 100Mbit with incorrect duplexing settings, so we dropped the interfaces, chdev'ed the devices to auto-negotiate, brought the interfaces back up and found that the performance had improved by three orders of magnitude.

With that glorious triumph behind us, we looked at various IP layer tunings, how to manage fragmentation, and when to change the sizes and timeouts on the adapter queues. We then spent some time looking at NFS optimisation — surely the first optimisation is not to use NFS! — and some of the parameters that can be altered to give improved performance. After a brief lab which examined the differences in performance between NFS v2 and v3, we looked at some of the standard information gathering tools, such as PerfPMR — which, perversely, is no longer bundled into the base install — and the various system trace utilities.

All of which means that we're well on course to finish by midday tomorrow.
sawyl: (Default)
After yesterday's session on LVM performance, today naturally segued into a discussion of JFS tuning tips. Some of these were so obvious, such as not enabling dynamic compression, that they merely needed a single comment. Others, such as release-behind buffering or IO pacing, were covered in greater depth.

After a quick lab session, setting up and defragging file systems, we went on to look at TCP/IP and some of the factors that govern its performance. There were some interesting comments about socket buffer tuning, while both Nagle's algorithm and sliding socket windows were covered clearly and concisely.

The lab following the exercise, however, was a less than comprehensive success. We first found ourselves unable to measure a meaningful round trip time between the different systems we were using — because they were LPARs, perhaps? — before going on to discover that this lack of latency did not mean that the network connection was particularly reliable, if the dismal performance and the number of retransmits were anything to go by. In the end, we abandoned all hope and deferred the rest of the exercise until tomorrow morning.
sawyl: (Default)
Today's discussions focused on the virtual memory subsystem and the LVM, and their contributions to overall system performance.

We started out by running through some of the basic principles of VM and the slightly odd terminology used by AIX to refer to the different sections of memory. We went on to talk about the way that AIX merges classic swapping with disc caching, the way that heavy IO activity may cause computational pages to be erroneously dumped to the backing store, and the way that the VM manager can be tuned so that it preferentially drops file pages ahead of application data. We then looked at methods for tracking down memory-heavy applications and did a few labs to reinforce the points.

After lunch, we examined the various LVM parameters and configuration options, such as active versus passive, that might have some sort of impact on performance. We then discussed some of the standard tools available to monitor IO performance before doing a couple of labs where we investigated how application performance varied as we changed the settings of the LVM.

All rather enjoyable.
sawyl: (Default)
Despite warnings that the course material was rather dry, I found today's course rather enjoyable. But then I've always thought underlying theory to be more important than a superficial knowledge of the available command set.

We started off by discussing the nature of performance problems, where they come from, how to tell whether there's a genuine problem and how to know when to stop trying to improve things. We then moved on to talk about the differences between processes and threads, the nature of the AIX scheduler and the way that it implements affinity, before doing a couple of exercises in which we changed some of the scheduler parameters to see how they affected running processes.

One point that bothered me slightly was the contention that it wasn't necessarily a bad thing for processes to clock up large amounts of time running in system space. In my experience, the accumulation of large amounts of system CPU on a machine typically indicates some sort of pathological problem — usually with IO — caused by one of the running processes. This may be slightly naive, and my observation might only be valid when applied to large numerical models, but I would have thought that most normal applications would want to avoid spending all their time in kernel space, if only to allow them to undertake some useful computational work.
sawyl: (Default)
First day of a two or possibly three day course on LoadLeveler, IBM's batch processing software. After we got to grips with the rather strange terminology, which has more than a whiff of the Iron Age about it, we ran through which daemons did what and how everything hung together, looked at a few basic admin tasks, and I ran my first MPI job.

The most interesting point of contention concerned symmetric multi-threading and its impact on performance. Naively, it seems as though a system with 32 real CPUs should outperform one presenting 64 apparent CPUs courtesy of SMT, but there are good reasons why this might not be so. For example, if an application requires 64 processors and can either be run across two nodes on real CPUs or, with SMT, within a single node on 64 apparent CPUs, the single-node run may actually perform better because it greatly reduces communication costs. But this assumes that (a) the application is not requesting 64 CPUs because it requires 128 GB of memory; and (b) that the application's memory access patterns are such that they allow two threads to make efficient use of a single physical CPU.

My suspicion — which can only be a suspicion, given that I'm not really a hardcore applications person — is that I'd expect SMT to benefit vector applications, given that it allows memory latencies to be hidden in a somewhat similar way. But more particularly, given the Unified Model's historic sensitivity to interconnect latencies, I wonder whether it might not benefit from being run across a smaller number of nodes if this turns out to improve communication performance.

It'll be interesting to see what sort of results the applications people turn up and, in turn, how they affect the way we tune the batch system.
sawyl: (Default)
After playing around with my job accounting interface for a week or so, my test users came back with a few suggested improvements. Most were easy to implement, but as I was testing the ability to execute queries across multiple data sets, I noticed that the performance wasn't all that it might be, with ten-day queries taking 5 seconds to execute. Clearly something needed to be done.

The profiler confirmed my suspicions — the code was extremely IO bound and a lot of time was being lost to the re.split() used to split each line of input into fields. My initial workaround for the problem was simply to run the input data set through tr -s " " to remove duplicate spaces, allowing me to replace my slow RE with a much faster string split.

Then it occurred to me that, having committed myself to preprocessing the input data, I might as well push the idea to its logical conclusion. Thus, I reprocessed all the input data, stripping out the unwanted fields and replacing them with things that had previously been dynamically created at run-time by the interface. Having done this, I decided to optimise the IO path by dumping the data out using the python marshal module, removing the need to do any processing on the input beyond a simple load.
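
A minimal sketch of the preprocessing pass, assuming whitespace-separated accounting records with made-up field positions and file names:

import marshal

WANTED = (0, 1, 4, 7)      # hypothetical indices of the fields worth keeping

def preprocess(raw_path, out_path):
    """Strip each record down to the wanted fields and marshal the result,
    so the interface can load it without any further parsing. Assumes every
    record has at least eight whitespace-separated fields."""
    records = []
    with open(raw_path) as raw:
        for line in raw:
            fields = line.split()
            records.append([fields[i] for i in WANTED])
    with open(out_path, "wb") as out:
        marshal.dump(records, out)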

When I considered the access patterns of the data, another idea suggested itself. Most of the queries were being made against the same core data sets, either because multiple users were querying information for the same day or because a single user was sorting the results of their previous query into a different order. So, given that mod_python makes it possible for data structures to persist in memory — one of the principal reasons it out-performs simple CGI — it occurred to me that I might buffer recently used data sets in memory. So I added a brief bit of wrapper code around the marshal.load() call and, presto, the most common subset of queries ran through twenty times faster.
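
The buffering wrapper is little more than a module-level dictionary: because mod_python keeps the interpreter alive between requests, anything stored at module level persists. A rough sketch, with the cache size and eviction policy as crude assumptions:

import marshal

_cache = {}
_MAX_ENTRIES = 16

def load_dataset(path):
    """Return the marshalled data set at path, reusing a copy cached from an
    earlier request where possible."""
    try:
        return _cache[path]
    except KeyError:
        pass
    with open(path, "rb") as f:
        data = marshal.load(f)
    if len(_cache) >= _MAX_ENTRIES:
        _cache.popitem()            # drop an arbitrary entry to bound memory
    _cache[path] = data
    return data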

All in all, a most satisfying way to spend an otherwise idle afternoon...
sawyl: (Default)
Fed up with the poor performance of one of my python scripts, I decided to spend half an hour or so optimising it, just to see if I couldn't improve things.

Sure enough, when I profiled the base version using a fixed configuration, I found that it was losing most of its time to a series of deepcopy() calls. I changed the internal data structures to use a dictionary instead of a more complex set of arrays and, presto, the run time went down from 7.55s to 1.76s and the function call count went down from 398,359 to 53,678.

But when I re-profiled, I found a couple of other nasties: an N by N array that was being needlessly regenerated on every loop trip and a whole host of calls to has_key(). After adding caching to the main loop to prevent recalculation and replacing all the has_key() calls with try ... except KeyError ... constructs, I managed to get the time down to 0.97s and a mere 11,236 function calls.
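
In outline, with made-up names and data, the two fixes amounted to something like this:

def build_table(n):
    # stand-in for the N by N structure that was being rebuilt on every trip
    return [[i * j for j in range(n)] for i in range(n)]

items = ["a", "b", "a", "c", "b", "a"]

table = build_table(100)        # computed once, outside the loop
counts = {}
for item in items:
    try:
        counts[item] += 1       # fast path: the key is already present
    except KeyError:
        counts[item] = 1        # rare path: first sighting of the key

print(counts)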

And here's proof of my success: a plot showing how the raw and optimised versions of the program scale as the number of items increases:

So it looks like the big wins were, in order of effectiveness: (a) to use a more efficient algorithm and data structures; (b) to replace has_key() with exceptions, in order to avoid the overhead of the extra subroutine calls and to exploit the (relatively) fast performance of exception handling; (c) to avoid unnecessary work by caching results, trading memory for CPU. Of these, (a) and (c) were obvious, while (b) only became apparent after a little bit of experimentation with the profiler.

sawyl: (Default)
In his classic paper on benchmark skewing, Twelve Ways to Fool the Masses, Bailey cites as an example the dubious practice of presenting the timings from the innermost parts of a solver and reporting them as though they were the performance of the entire application.

It seems the lesson has been well learnt: one of the cell processor benchmarks has been caught playing with time in the same way. Thus, the application reports that performance scales linearly as more PEs are enabled, whilst the wall-clock time of the application remains relatively constant thanks to the dominance of start-up costs etc.

Very naughty.
sawyl: (Default)
My decision to reconfigure NIS has been comprehensively vindicated. I decided to up the number of slaves from a paltry one to six and to change the default bindings to distribute the load evenly across the servers.

When I moved the first cluster of systems off the master, I noticed that the parallel ypwhich I issued came back rather more swiftly than before, but I assumed that it was load related and thought no more of it. When the same thing happened when I moved the second cluster, I decided to run a couple of experiments to confirm my intuitive feeling that things were running with greater celerity.

To establish a baseline, I ran a parallel ypwhich across 23 systems connected to the original NIS master. The systems were not under load — all the work had been checkpointed to allow for maintenance. The command took some 16 seconds to complete.

I then ran the command against 34 systems, split to use two different NIS slaves. Each system was running a full workload. The command took 0.6 seconds to complete — 26 times faster than the baseline case!
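
The probe itself needs nothing more sophisticated than ssh and ypwhich run in parallel. A rough sketch of the sort of thing involved, with placeholder host names; the real test runs a longer series of commands:

import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

hosts = ["node%02d" % i for i in range(1, 24)]      # placeholder host names

def probe(host):
    # ask the remote host which NIS server it is currently bound to
    out = subprocess.check_output(["ssh", host, "ypwhich"])
    return host, out.decode().strip()

start = time.time()
with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
    for host, server in pool.map(probe, hosts):
        print(host, "is bound to", server)
print("elapsed:", time.time() - start)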

My test, which relies on logging in to systems in parallel and running a series of commands, is likely to be extremely sensitive to poor NIS latencies. But it also seems to me to be indicative of the sorts of delays a large scale parallel application might see when launching across a large number of nodes. In most cases, a 15–20 second delay would be negligible. In the case of a real time application, especially one that depends on running a number of large MPI executables within a single job, the cumulative start-up delays might well have a significant impact.
sawyl: (Default)
Ever wonder what happens if you get your compiler options wrong? Here's an example of a particularly pathological case on an Itanium (actually a 1GHz McKinley system):

itanic> efc check.f90 ; time ./a.out
check.f90(23) : (col. 0) remark: LOOP WAS SOFTWARE-PIPELINED.
check.f90(23) : (col. 0) remark: LOOP WAS SOFTWARE-PIPELINED.

real 0m2.042s
user 0m2.032s
sys 0m0.009s
itanic> efc -check bounds check.f90 ; time ./a.out

real 0m19.830s
user 0m19.049s
sys 0m0.010s
itanic> efc -O0 check.f90 ; time ./a.out

real 1m3.465s
user 1m3.364s
sys 0m0.008s
itanic>

For comparison, here are the results of the same set of tests on a modern 3.5GHz Xeon using a very similar compiler and set of compiler options:

xeon> ifort check.f90 ; time ./a.out

real 0m1.476s
user 0m1.474s
sys 0m0.001s
xeon> ifort -check bounds check.f90 ; time ./a.out

real 0m1.681s
user 0m1.676s
sys 0m0.003s
xeon> ifort -O0 check.f90 ; time ./a.out

real 0m5.016s
user 0m4.962s
sys 0m0.004s
xeon>

Just goes to show how heavily the Itanium, with its deep pipelines and oh-so-clever EPIC instruction set, depends on the compiler to do the Right Thing and just how painful life can be if you accidentally leave bounds checking enabled in a production binary.

sawyl: (Default)
Had an interesting discussion with one of the beakers today about static versus dynamic linking. He suggested configuring the compilers to statically link the Intel libraries, rather than picking them up via LD_LIBRARY_PATH as is the case at the moment. In line with my generally contrary nature and my belief in the principle of least surprise, I advocated linking everything in statically, on the grounds that once you've got a static binary, it's pretty much guaranteed to work regardless of what happens to the libraries.

Trouble is, least surprise cuts both ways: changing to static linking will suddenly increase the size of the executables, which will throw some people, but it would bring the behaviour of the native compilers into line with that of the cross compilers.

Hmm.
sawyl: (Default)
At a conference a few days ago, Dr S made a couple of throwaway remarks about the relatively poor performance of cp on gStoreFS file systems. During dinner, a beaker from DKRZ sidled up to him and suggested that everything could be fixed by twiddling a couple of system settings.

Sure enough, once I was informed of the results of my scientific colleague's schmooze-a-thon, I ran a couple of experiments and quickly discovered a setting that cut the time taken by a copy command down from 112 seconds to 2 seconds.

Yikes.
sawyl: (Default)
Today's fascinating discovery? The choice of ssh encryption scheme matters. I ran a couple of tests and discovered that, for a 100MB file, triple DES took something like 3–4 times as long as arcfour, which was consistently 20 per cent better than its nearest rival.
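
A comparison like this takes only a few lines. In this sketch the file, the host and the exact cipher names offered by the installed OpenSSH are all assumptions:

import subprocess
import time

# time the same copy with a handful of ciphers; names depend on the ssh build
for cipher in ("3des-cbc", "aes128-cbc", "blowfish-cbc", "arcfour"):
    start = time.time()
    subprocess.check_call(["scp", "-c", cipher, "bigfile",
                           "remotehost:/tmp/bigfile"])
    print(cipher, time.time() - start)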
sawyl: (Default)
Rerunning my file creation benchmark, I discovered that the performance of one of the file servers appears to have got ten times worse since my last test three days ago. Looks like I'm going to have a busy day on Monday...

NFS blues

Dec. 15th, 2005 07:43 pm
sawyl: (Default)
This afternoon, I was asked to look into some of the NFS performance problems and to come up with a plan to remove the worst of the bottlenecks. As soon as the subject came up I felt my heart sink and then later, after the discussion had finished and the full horror of the thing had sunk in, I found that I wanted to do nothing more than to lock myself in the lavatory and spend the rest of the day crying quietly...
sawyl: (Default)
In an attempt to put off writing my increasingly imminent conference presentation, I decided to tweak one of my bits of C++ to see if I could get it to vectorise. Thirty minutes of tweaks - unrolling a couple of loops, replacing whiles with fixed length loops and in-lining a dot product routine - and I'd got the vector ratio up from a big fat 0.00 to 97.8 and chopped 144 seconds off the run time. Hurrah!

The back story to all this is that doctor_squale and I were petitioned by some poor unfortunate who'd written his latest and greatest in C++, only to discover that the performance sucked, and who thus wanted some help tuning his code. Unfortunately I've got other things to do, like my conference presentation, and the Good Doctor is off on a week long drinking binge, so he can't help.

Now, if you'll excuse me, I've got more important things to do. Like finding a paper bag to hyperventilate into when the panic attacks really start to kick in...
