sawyl: (A self portrait)
Following on from Wednesday's problems, I decided to revisit my use of distutils to manage my standard set of scripts and libraries. Having decided that it would be much easier to maintain everything at the same level if I felt more confident about using setup.py to manage it all, I set about addressing a couple of long-standing problems with my distribution.

Essentially the problem boils down to one thing: I want to be able to place libraries in the python site-packages directory but, due to local conventions, I need to be able to put scripts into separate bin, sbin, and libexec directories in a non-standard directory tree entirely separate from the interpreter's.

To do this, I created subclasses of the standard distutils install and build commands for each of the various activities, added them to a custom subclass of Distribution, and used this in setup.py. This worked up to a point, but it failed to include the files listed in the sbin and libexec keywords in source distributions unless they were also added to MANIFEST.in, and the standard setup.py install offered no options to alter the target directories, even though I'd added new install commands for each of the targets.

I realised I needed to tackle each problem with a different subclass.

I altered the base sdist class with a trivial addition to the add_defaults() method, which checks whether there are libexec and/or sbin files and uses the relevant build command to add them to the internal FileList instance. This solved the first problem.
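
As a rough sketch of that change — assuming custom build_sbin and build_libexec commands and has_sbin()/has_libexec() helpers on the distribution, so the names here are illustrative rather than the original code — it looks something like this:

    from distutils.command.sdist import sdist

    class custom_sdist(sdist):
        def add_defaults(self):
            sdist.add_defaults(self)
            # mirror what sdist already does for scripts, but for the two
            # extra categories of file
            if self.distribution.has_sbin():
                build_sbin = self.get_finalized_command('build_sbin')
                self.filelist.extend(build_sbin.get_source_files())
            if self.distribution.has_libexec():
                build_libexec = self.get_finalized_command('build_libexec')
                self.filelist.extend(build_libexec.get_source_files())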

I also created a subclass of install to add my two additional command line options — install-sbin and install-libexec. I immediately found that I could specify the options on the command line and in setup.cfg, but I couldn't persuade my two install sub-commands to pick them up. After a certain amount of puzzling, I realised the answer was obvious: I needed to add each option to the set_undefined_options() call in the sub-command's finalize_options() method, to explicitly copy the value from install into the sub-command's own option if it is unset.
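
Roughly, and with the command names invented for the sake of the example (the defaults and sub-command wiring are omitted), the pattern is:

    from distutils.command.install import install
    from distutils.core import Command

    class custom_install(install):
        user_options = install.user_options + [
            ('install-sbin=', None, 'installation directory for sbin scripts'),
            ('install-libexec=', None, 'installation directory for libexec scripts'),
        ]

        def initialize_options(self):
            install.initialize_options(self)
            self.install_sbin = None
            self.install_libexec = None

    class install_sbin(Command):
        user_options = [('install-dir=', 'd', 'directory to install sbin scripts to')]

        def initialize_options(self):
            self.install_dir = None

        def finalize_options(self):
            # copy the value down from the top-level install command if unset
            self.set_undefined_options('install', ('install_sbin', 'install_dir'))

        def run(self):
            pass    # copy the built sbin scripts into self.install_dir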

Unfortunately none of this is terribly clear from the documentation, but there are plenty of examples in the interpreter's distutils directory, and distutils.cmd.Command has docstrings which explain how the various methods work and how they should be used.
sawyl: (A self portrait)
Spent an interesting hour attacking a performance problem in an iris application which ran faster on one system than another. Running the testcase through the profiler, I almost immediately identified a hotspot in biggus.save() and, intriguingly, noticed that it was taking 170 per cent longer on one machine than the other — enough to account for the disparity in performance.

Tracing the IO calls, I found that the mean and percentile times, which should have been identical, were higher on the machine where performance was worse. Wondering whether the problem might be exacerbated by the IO patterns of netCDF — the output was bimodal, with peaks at 5KB and 1MB — I repeated the test with dd and a fixed buffer size; this showed the same signal and confirmed that the root cause of the problem lay with Lustre.

After confirming the OSTs were different in each case, I rechecked the Lustre parameters on the slow system; with hindsight, I should have done what my colleagues did and made this my first port of call. Unsurprisingly, I found that both max_dirty_mb and max_rpcs_in_flight were set at their default values — 32 instead of 256 and 8 instead of 64 — and increasing them resolved the problem and brought the performance of the two machines into line with each other.
sawyl: (A self portrait)
Winding down ahead of the start of my Christmas leave, I finally found the time to dig into a little side project to extend python distutils to allow me to bundle up various tools which install in non-standard locations — principally sbin and libexec directories.

After reading my way through the various commands included in distutils, I came up with a series of modules to deal with the problem:

  • a build_sbin.py module that subclasses build_scripts, replacing the normal install location with a specific sbin version
  • an install_sbin.py module that duplicates much of the functionality of install_scripts.py, picking up the targets from the sbin build directory and installing them into either the sbin directory under the standard installation or the value set in setup.cfg
  • a dist.py module that subclasses distutils.dist.Distribution to create CustomDistribution, adding the attributes self.sbin and self.libexec to allow these keywords to be passed through from distutils.core.setup() and used by the custom build and install modules.

With all this in place, it was simply a matter of adding the keyword distclass=CustomDistribution to setup() and adding the appropriate command classes to cmdclass — I eventually moved this into CustomDistribution, even though I'm pretty sure it's not the right thing to do, because I knew I was going to want it every time.
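
A sketch of what the Distribution subclass and the setup() call end up looking like — the helper methods, package name and file lists below are made up for the example, and the cmdclass wiring is omitted:

    from distutils.core import setup
    from distutils.dist import Distribution

    class CustomDistribution(Distribution):
        def __init__(self, attrs=None):
            # declare the extra keywords before calling the parent __init__
            # so that Distribution recognises them when it processes attrs
            self.sbin = None
            self.libexec = None
            Distribution.__init__(self, attrs)

        def has_sbin(self):
            return bool(self.sbin)

        def has_libexec(self):
            return bool(self.libexec)

    setup(name='mytools',
          version='1.0',
          distclass=CustomDistribution,
          sbin=['scripts/mydaemon'],
          libexec=['scripts/myhelper'])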

This may all seem a bit baroque, but I've got two significant bits of code that need to be installed in this fashion, and I need something to reduce the complexity of the install process and to provide a single command that can be used to regress the current version to an older one in the event of problems. And finally, after a morning of hackery, I think I've got something that will fulfil my requirements.

sawyl: (A self portrait)
I've spent a big chunk of today trying to track down a strange deadlock in the parallelised version of my current python script. Eventually, after much digging, I noticed that the hang always occurred when there were 32,767 items in the multiprocessing.Queue, even though the queue had been declared with an unlimited size and the put attempt had not raised a Full exception as might be expected.

I was then able to confirm that a Queue created with q = Queue(2**15) failed with an exception, while one created with q = Queue(2**15-1) worked as expected and raised a Full condition when the limit was reached. Inspecting the source code of multiprocessing/queues.py, I noticed that maxsize defaults to _multiprocessing.SemLock.SEM_VALUE_MAX, which inevitably comes out at 32,767 and explains the whole problem — although it's unfortunate that the code simply hangs rather than raising an exception when the limit is reached.

So given that I know I need a work queue that can hold more than 32K items, it looks like I'm going to have to roll my own class to support a larger number of active semaphores — something that I hope will be as simple as subclassing Queue and overriding the semaphore routines...

ETA: I finally fixed this by creating a wrapper class that adds a local, producer-side buffer to hold any items that cannot immediately be added to the queue. By attempting to flush the buffer ahead of every put, and only accessing the Queue directly when there are no pending items in the buffer, I was able to cut right down on the number of Full exceptions I had to handle.
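
A minimal sketch of the wrapper — my own reconstruction of the idea rather than the actual class, with the producer-side put being the only interesting part:

    import collections
    from multiprocessing import Queue
    try:
        from Queue import Full         # python 2
    except ImportError:
        from queue import Full         # python 3

    class BufferedQueue(object):
        """Front a bounded multiprocessing.Queue with a producer-side buffer."""

        def __init__(self, maxsize=0):
            self.queue = Queue(maxsize)
            self.buffer = collections.deque()

        def put(self, item):
            # try to flush anything already buffered before the new item
            while self.buffer:
                try:
                    self.queue.put_nowait(self.buffer[0])
                except Full:
                    break
                self.buffer.popleft()
            # only touch the queue directly when nothing is pending
            if not self.buffer:
                try:
                    self.queue.put_nowait(item)
                    return
                except Full:
                    pass
            self.buffer.append(item)

        def get(self, *args, **kwargs):
            return self.queue.get(*args, **kwargs)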
sawyl: (A self portrait)
Today I finally bit the bullet and parallelised my file scanning program. I'd been hoping I could get away with a few tuning tweaks to the core code, but when I ran a test case with 30 million items, I found the execution time was completely dominated by the cost of the name query routine. Fortunately, this turned out to be relatively easy to divide up: all I had to do was replace the current recursive code with a queue and a series of worker tasks, and I was able to drop the run time to something like a fifth of the serial version. I also looked into parallelising the metadata matching routine — something that would have required a distributed hash to implement — but after realising that its cost varied with the size of the file system rather than the number of items, I decided not to bother.

In the process of parallelising the code with multiprocessing.Process and Cython, I discovered a weird problem with the argument values: when I called the new parallel routine directly, everything worked as expected; but when I called the routine from a higher level, I got an exception that appeared to suggest that the first item in my Queue was corrupt.

Digging into the problem, I realised that it was caused by my choice of argument in the function definition. In the parallel routine I'd chosen to use char *value to mirror the data type used by the serial version; something that worked when the routine was called directly with a string rather than a python variable. But when the calling value was replaced with a python object I found that the value was replaced by assert_spawning, causing the first worker thread to fail. Once I realised this I was able to fix the problem by changing the type to object value, but it took me a while to work out what was going on, not least because the routines worked when tested separately and only failed when used as a complete entity...
sawyl: (A self portrait)
I've discovered a couple of elegant spin-offs from my current work on GPFS. Playing around with the pool structures and calls in the API, I've found it to be much faster than mmdf, largely because the latter goes out to each NSD and queries it for its usage information — something that's not of much interest if you just want to know whether the pools are full or not.

I've also realised that if I combine a couple of other calls, I can both identify the fileset that owns a particular directory and pull out the three levels of quota — user, group, fileset — for it. This makes it trivially easy to summarise all the limits imposed on a particular user in a way that shows their current level of consumption, how much they have left, and the level at which the resource is being constrained.

And best of all, the code to do all this can be wrapped up in a handful of Cython calls. OK, so the fileset identification code seems to need an inode scan, which seems to mean it can't be run as a normal user, and I haven't yet found a way to dump out a complete list of filesets defined within the file system — although this information must be available, because mmlsfileset and mmrepquota can do it — but it's an extremely handy thing to get out of something originally intended for a completely different purpose...
sawyl: (A self portrait)
Thanks to a casual suggestion from someone at work, I've discovered the delights of Django. Having read through some of the documentation and done the first few tutorials, I've had my eyes opened to a world of possibilities. Not only is the process of implementing persistent objects in python pretty easy, but the admin interface is powerful and flexible enough to make it almost trivial to glue the objects into a useful whole.

I feel a hacking run coming on...
sawyl: (A self portrait)
I've been messing around with an attempt to glue python on to an API layer. I didn't think it'd be all that difficult: I had a couple of pieces of sample code, some reasonably good documentation on the API and the cython manual to hand. But when I tried to convert the API header file into cython definitions, I hit an interesting snag: some of the type definitions I needed to use were incompletely defined.

At first I was thrown by the incomplete definitions, assuming that I'd missed an #include somewhere. Once I realised this wasn't the case — I imagine that the structures are only ever defined in a private set of development headers — I tried to work out why the examples worked and why my code failed. Eventually I spotted the pattern: because the incomplete types were only ever used as pointers, their actual contents didn't matter — they were simply being used to reference chunks of memory — and, consequently, all I needed to do to get them working in cython was to define them as void pointers.

It's not a particularly attractive solution to the problem but it seems to work...
sawyl: (A self portrait)
Yesterday, someone at work posted a query about numpy structured arrays. And while the query itself wasn't that complicated, one of the responses suggested that the sorts of tasks they were hoping to carry out might be better suited to pandas than numpy. So, as a heavy user of structured records, I was intrigued.

Not only did I find that someone had bundled pandas into our standard version of python, but a few quick tests seemed to show that it offered a whole suite of data slicing and manipulation tools that make some of the things I've been working on over the last couple of weeks trivially quick to do. Best of all, via pandas I discovered a couple of optimised libraries, including the multi-threaded numexpr, which kicks the performance of array operations into orbit!
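
A tiny example of the sort of thing I mean — the column names and data here are made up:

    import numpy as np
    import pandas as pd
    import numexpr as ne

    # a structured array drops straight into a DataFrame, columns and all
    rec = np.array([(1, 2.0), (2, 3.5), (1, 1.5)],
                   dtype=[('jobs', 'i4'), ('hours', 'f8')])
    df = pd.DataFrame(rec)
    print(df.groupby('jobs')['hours'].sum())

    # numexpr evaluates whole array expressions across multiple threads
    a = np.random.rand(10**6)
    b = np.random.rand(10**6)
    c = ne.evaluate('2*a + 3*b')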

Time, I think, to update my iMac with the full range of scientific python bits and pieces. I'm especially keen on iPython notebooks: given all the data analysis I've been doing lately, I'd really love something like an interactive lab book for all my scribbles and calculations...
sawyl: (A self portrait)
Working on a bit of code to create a number of complex figures, I noticed that it seemed to leak memory like a sieve: around 7-10MB per figure, causing my already stressed desktop machine to descend into an epic bout of thrashing.

After scrutinising my code, I decided that the problem had to be with matplotlib and started skimming the documentation. I soon realised my mistake: although I'd been saving my figures and deleting the variable associated with them, I hadn't been running clf() and close() before doing so, and the memory associated with them wasn't being released. So I added a couple of lines of code, and the memory footprint dropped from multiple GB to a more sustainable 200MB and remained stable for the duration of the run.
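
The shape of the fix, with the figure-drawing details stripped out and the dataset names invented:

    import matplotlib.pyplot as plt

    for name in ['one', 'two', 'three']:    # stand-in for the real datasets
        fig = plt.figure()
        # ... draw the figure ...
        fig.savefig('%s.png' % name)
        fig.clf()                           # clear the figure's contents
        plt.close(fig)                      # release pyplot's reference to it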
sawyl: (A self portrait)
Sketching out a piece of python, I realised that I could replace some of the calls to subprocess.Popen() with a wrapper object I wrote a couple of weeks ago, so I took the object, dropped it into one of my existing library modules and tried it out. It all worked as expected, so I pulled in an option parser derived from argparse.ArgumentParser and got on with the rest of the code.

With the program complete, I added something to vary the level of the logging library according to the command line switches and got on with testing. At which point I noticed something unexpected: I didn't get any log output from my process object until I increased the verbose level to 3, even though I'd set the log level to debug at verbose = 2, and I didn't start getting debug information from my object until I went up to 4. Surprised, I went back and started taking the modules to pieces until I eventually realised that I'd fallen victim to my own cleverness.

Examining the modules, I realised that my custom ArgumentParser included a custom __init__() method that added some extra standard options to each of my programs, including a verbose flag, and that it also overrode parse_args() to increase the log level via logging.basicConfig(). The override also contained a bit of code that, either dubiously or cleverly, called logging.getLogger() using the name of the library containing the command object — I'd named my logging hierarchy after the package — and set it to logging.ERROR until the verbose level reached 3 or 4, at which point it dropped to logging.INFO or logging.DEBUG.
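
Something along these lines — the class, package and option names here are assumptions rather than the real ones:

    import argparse
    import logging

    class CommandParser(argparse.ArgumentParser):
        def __init__(self, *args, **kwargs):
            argparse.ArgumentParser.__init__(self, *args, **kwargs)
            self.add_argument('-v', '--verbose', action='count', default=0)

        def parse_args(self, *args, **kwargs):
            opts = argparse.ArgumentParser.parse_args(self, *args, **kwargs)
            # script-level logging: -v for info, -vv for debug
            script = [logging.WARNING, logging.INFO, logging.DEBUG, logging.DEBUG]
            logging.basicConfig(level=script[min(opts.verbose, 3)])
            # the command objects log under 'mypkg', so their stream only
            # opens up at the higher verbosity levels
            package = [logging.ERROR, logging.ERROR, logging.ERROR,
                       logging.INFO, logging.DEBUG]
            logging.getLogger('mypkg').setLevel(package[min(opts.verbose, 4)])
            return opts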

I'm not entirely sure whether this is entirely sensible, but I think it is acceptable in this context: both the argument parser and the command object are part of the same package hierarchy and are intended to play closely together. Putting them together makes it trivially easy to separate the debug stream from the core script from the debug output of the commands; something that is especially important when troubleshooting scripts that wrap around a child command, where, as a last resort, it is often useful to be able to see the output from the underlying commands, but which, if provided routinely, tends to swamp the useful output from the surrounding script.
sawyl: (A self portrait)
A little while ago, I tried to get SciPy to build on AIX using IBM's XL series of compilers and the ESSL library. The results were disappointing: I couldn't get SciPy to link correctly, and the only procedure I was able to find online was fiendishly complicated. After spending some days working on the problem, I gave up, put it aside and shifted my attention to more important things.

So I wasn't exactly delighted when SciPy's priority increased sufficiently to put it back on my radar. But, full of brio after the holidays and in search of an interesting puzzle to crack, I decided to give it another go. And what do you know? It only took me a few hours to come up with a solution, saved here for posterity and for any other AIX admin foolish enough to want to give this a try.

The gory details... )

Onwards to the horrible — and more important — mess that is netCDF...

XKCDify

Dec. 19th, 2012 10:12 pm
sawyl: (A self portrait)
Skimming through the documentation for IPython, I noticed a reference to a method called XKCDify(). Intrigued, I chased the code down to a post on Jake Vanderplas' blog which promised to turn all my plots into something worthy of the hand of the mighty Randall Munroe. And sure enough, when I tried it, it worked like a charm. Excelsior!
sawyl: (A self portrait)
I use matplotlib's AutoDateFormatter a lot. But it doesn't quite match my requirements. As well as scalable date formats I'd also like to be able to mark boundary ticks, e.g. first tick of a particular month or first month of a year, using a different format to the rest of the range markers.

Luckily, after a bit of thought, I realised that I could subclass the formatter, add an extra dictionary of boundary formats, and add some code that uses the locator passed to the object by the caller to determine the relative positions of each tick:

Gory details... )

This method works well when labelling the X-axis, but is slightly less successful when it comes to labelling the value of the cursor in interactive sessions. However, it has occurred to me that it ought to be possible to fix this by adding something to setFormat() to return yet another alternate format if the locations array is zero length, as is the case when displaying interactive values.
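
The idea boils down to something like the sketch below — a reconstruction of the approach rather than the code behind the cut, using "first tick of a new day" as the example boundary and made-up class and format names:

    import matplotlib.dates as mdates

    class BoundaryDateFormatter(mdates.AutoDateFormatter):
        def __init__(self, locator, boundary_fmt='%d %b %Y', **kwargs):
            mdates.AutoDateFormatter.__init__(self, locator, **kwargs)
            self.boundary_fmt = boundary_fmt

        def __call__(self, x, pos=None):
            locs = self._locator()              # tick positions chosen by the locator
            if pos is not None and 0 < pos < len(locs):
                here = mdates.num2date(x)
                prev = mdates.num2date(locs[pos - 1])
                if here.date() != prev.date():  # first tick of a new day
                    return here.strftime(self.boundary_fmt)
            # fall back to the normal scaled formats (also what happens for
            # the interactive cursor, where there are no tick positions)
            return mdates.AutoDateFormatter.__call__(self, x, pos)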
sawyl: (Default)
Looking into the generally sluggish turnaround of the batch system, I decided that the problem was probably down to long cycle times in the negotiator daemon's dispatch code and that it was being exacerbated by full debug logging. To prove my hypothesis, I decided to write something to plot out the cycle times before and after changing the debug flags. However, while the script proved easy enough to write, the sizes of the log files meant that the performance simply wasn't good enough. So I sat down with the python profiler and timeit to see if I couldn't hurry things along their way.

The profiler showed that the bulk of the time was being spent in the log file parsing function. No great surprise there — the function was crunching through hundreds of gigs of data to pick out the cycle times. What was surprising was the relatively high cost of the endswith() string method; because it was being called for almost every line of input, it was dominating the performance of the program. So, optimisation number one: reorder the string comparisons to try to break out of the matching early and replace the endswith() with a less expensive in test. With that done, dateutil.parser had become more prominent in the profiler listing. I tried a couple of easy tricks, removing some unnecessary rounding operations and trimming back the length of the date string to be parsed, but extracting the date of each log entry was still far too expensive. So, optimisation number two: replace the generalised dateutil.parser routine with a custom bit of code and incorporate caching to avoid re-parsing the date when it has changed by less than the precision I cared about.
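
The caching idea is simple enough to sketch — the fixed-width log timestamp format here is an assumption, not the real one:

    from datetime import datetime

    _last_prefix = None
    _last_stamp = None

    def parse_stamp(line, width=19, fmt='%Y-%m-%d %H:%M:%S'):
        """Parse the timestamp at the start of a log line, reusing the
        previous result when the prefix hasn't changed."""
        global _last_prefix, _last_stamp
        prefix = line[:width]
        if prefix != _last_prefix:
            _last_prefix = prefix
            _last_stamp = datetime.strptime(prefix, fmt)
        return _last_stamp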

As a result of the changes, the time required to run my testcase dropped to around an eighth of its original value, bringing the time taken to analyse a complete set of logs down to a mere 700 seconds. Then, when I finally plotted out the distilled data, I noticed that as soon as I disabled full debugging the cycle time dropped from around 20-30 seconds to 3-4 seconds and stayed stable. Excelsior!
sawyl: (Default)
I'm increasingly impressed with python's excellent logging module, especially the powerful way it combines with python's method of packaging modules.

By splitting generation from handling, logging makes it trivial to log messages in a module and delegate the process of actually printing them to a destination — file, console, network, whatever — to the calling application. By using a logger hierarchy that mirrors the standard python package naming conventions, per-module loggers can be initialised using __name__ and their output managed by adding a handler to the name of their parent module.
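
A small illustration of the pattern, with the package and module names made up:

    # in mypkg/worker.py
    import logging
    log = logging.getLogger(__name__)       # the logger is named 'mypkg.worker'

    def do_work():
        log.debug("starting work")

    # in the calling application
    import logging
    logging.getLogger('mypkg').addHandler(logging.StreamHandler())
    logging.getLogger('mypkg').setLevel(logging.DEBUG)   # covers all of mypkg.*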

The only minor wrinkle I've encountered so far is the need to prevent a module's logger instance being exported when a higher level package does an import * on the package. Having originally thought about using a private __logger variable, I've realised that it's probably neater if I add an __all__ = ['func1', 'func2', 'var1', 'var2'] to the module. Not only does this prevent the logger from polluting the caller's namespace, but it also makes it easy to prevent other methods in the module from escaping into the wild.
sawyl: (Default)
After a good few days of being mojo-less, I finally feel as though I'm starting to cycle back up again. I've managed to crack an annoying Pyrex problem that has been bothering me for a while: it turned out all I needed to do was change a couple of my ctypedef lines to include the member names of the structure components I wanted to access. Once I'd got that done, and once I'd fixed a leaking file descriptor problem by replacing a simple recursion with tail recursion, everything fell into place.
sawyl: (Default)
Having been asked to re-run an extremely slow piece of data analysis, I realised that I couldn't face waiting a week for my data and decided instead to parallelise my program. After investigating and rejecting python threads on the grounds of GIL contention, I eventually settled on the multiprocessing module to distribute the work over a number of different OS processes, using a pair of Queue objects to assign work to each of the workers and to combine the results in the main process.

The structure of the final program was something a bit like this:

Source code... )

Along the way I discovered a couple of potential gotchas. I found that unless I emptied both Queue data structures before I attempted to join() the worker tasks, the program would deadlock. I also found that I needed to explicitly count the number of remaining results in order to determine whether the output queue was empty, because otherwise I'd have needed to use a timeout longer than the longest possible elapsed time of each unit of work — something that would have had a significant impact on the overall run time of the program.
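
To give a flavour of the structure — this is a minimal reconstruction rather than the source behind the cut, with a trivial stand-in for the real analysis routine:

    from multiprocessing import Process, Queue

    def do_analysis(item):
        return item * item                   # stand-in for the real work

    def worker(tasks, results):
        for item in iter(tasks.get, None):   # None acts as a sentinel
            results.put(do_analysis(item))

    def run(work_items, nworkers=8):
        tasks, results = Queue(), Queue()
        procs = [Process(target=worker, args=(tasks, results))
                 for _ in range(nworkers)]
        for p in procs:
            p.start()
        for item in work_items:
            tasks.put(item)
        for _ in procs:                      # one sentinel per worker
            tasks.put(None)
        # count the results explicitly rather than relying on a queue timeout
        output = [results.get() for _ in range(len(work_items))]
        for p in procs:
            p.join()                         # safe once both queues are drained
        return output

    if __name__ == '__main__':
        print(run(range(100)))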

The eventual performance of the end program was extremely satisfactory. I managed to parallelise the code in about a quarter of the elapsed time of the shortest serial analysis and, because the scalability of the program was almost linear, I got a 60x improvement on a single run and a 200x improvement on the reanalysis of all my data sets.

sawyl: (Default)
This may have been fixed in later versions of Python — I've been using 2.6.6 — but I've noticed that the standard distutils don't seem to work properly when (a) the interpreter has been built using the XL compilers; and (b) the extension being built needs to be linked with the C++ compiler, e.g. in SciPy. When both these conditions are met, distutils generates an invalid command line that features both the C++ compiler as the linker (correct) and the C compiler as the second argument (horribly incorrect), as this example from an existing problem ticket shows:
error: Command "xlC xlC -qlanglvl=extc89 ..." failed with exit status 252

Not being willing (or able) to build each component by hand, I came up with a rough and ready fix — the if block that follows the platform test for 'aix5' — for distutils.unixccompiler.link() that seems to work around the problem:

Source code... )

It's a long way from perfect — the test shouldn't really be for AIX but for something like self.compiler_cxx.startswith("xl"), and the code should search through the list rather than simply removing element 1 — but it's a reasonable first cut at a solution to a deeply annoying problem.
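
For what it's worth, the change described above amounts to something like the following — a sketch of the idea rather than the exact patch behind the cut:

    # inside UnixCCompiler.link(), after the existing target_lang == "c++"
    # handling that swaps the C++ compiler in as the linker
    if sys.platform == 'aix5' and target_lang == 'c++' and len(linker) > 1:
        # linker[0] is already the C++ compiler; drop the stray C compiler
        # that ends up as the second element of the command
        del linker[1]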

sawyl: (Default)
I've been doing quite a lot of work with python and numpy and matplotlib of late and I've made a couple of useful discoveries:

  • numpy structured arrays are an almost perfect replacement for tables in R, provided that you don't try to use the dtype parameter to explicitly request a field type of |O4, e.g. to accommodate datetime objects, because (a) this seems to cause the current version of numpy to complain; and (b) it seems to be unnecessary.
  • using matplotlib.ticker.FixedLocator to override the standard X-axis ticks provides cleaner labelling than most of the other methods when working with time/date sequences.
  • doing a yaxis.get_major_ticks()[0].label1On = False switches off the first label of the Y-axis, avoiding ugly collisions between labels on the two different axes.

I've used these discoveries to put together a script that plots out nmon and topas data for one or more machines, making it easy to compare and contrast the performance of nodes that share the provision of a service, e.g. GPFS, LoadLeveler etc.
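
As a quick illustration of the second and third points above — the data is made up, and label1On is the attribute name in the matplotlib version I'm using:

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.ticker import FixedLocator

    x = np.arange(48)                       # e.g. half-hourly samples
    y = np.random.rand(48)

    fig, ax = plt.subplots()
    ax.plot(x, y)
    ax.xaxis.set_major_locator(FixedLocator(x[::12]))    # a tick every 12 samples
    ax.yaxis.get_major_ticks()[0].label1On = False        # hide the first Y label
    plt.show()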
