Tuning a text analyser
Feb. 12th, 2010 03:44 pm

I took the raw program and profiled it using the largest chunk of text I had to hand, George Eliot's Middlemarch, as it happens. I noticed almost immediately that the program spent most of its time in the routine that updated the word frequency hash, and that this routine in turn spent most of its time running a series of regular expressions to clean off any non-alphanumeric characters before updating the hash. Suspecting that this wasn't very efficient, I decided to benchmark a few alternatives.
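One way to get this sort of breakdown is a cProfile run like the following; the analyse() entry point and the file name are placeholders rather than the actual program:

import cProfile
import pstats

# analyse() and 'middlemarch.txt' stand in for the real entry point and corpus
cProfile.run("analyse('middlemarch.txt')", "analyse.prof")
stats = pstats.Stats("analyse.prof")
stats.sort_stats("cumulative").print_stats(10)   # show the ten most expensive calls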
My original code looked something like the following:
import re

def case1(word):
    # strip any leading and trailing non-word characters
    word = re.sub("^\\W*", "", word)
    word = re.sub("\\W*$", "", word)
    return word
So I decided to replace the pair of substitutions with a single one:
def case2(word):
    return re.sub("^\\W*(\\w+?)\\W*$", "\\1", word)
But I discovered that this decreased performance by 25 per cent. So I decided to try a simple regexp match instead:
def case3(word):
    m = re.match("^\\W*(\\w+?)\\W*$", word)
    if m:
        return m.group(1)
    return word  # fall back to the original word if the pattern doesn't match
This gave a 160 per cent improvement on the original. But I realised I could do still better if I abandoned regexps altogether and replaced them with some plain string processing:
def case4(word):
    return word.strip(",.?!:;'\"")
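A rough timeit harness along the following lines is one way to compare the four variants; the word list and repeat count here are only illustrative:

import timeit

words = ['"Middlemarch,"', "study", "of", "provincial", "life..."]

for name in ("case1", "case2", "case3", "case4"):
    elapsed = timeit.timeit(
        stmt="for w in words: %s(w)" % name,
        setup="from __main__ import words, case1, case2, case3, case4",
        number=100000,
    )
    print("%s: %.3fs" % (name, elapsed))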
Sure enough, the plain string version gave a 1200 per cent improvement over the original re.sub() code, reducing the cost of the function to the point where a re.split() in the caller became one of the dominant costs. Once that was replaced with an almost equivalent set of plain string splits (sketched at the end of this post), I found that the elapsed time of the program had more than halved and the problem had moved from being CPU bound to being IO bound.

The moral of the story, then, is that although regular expressions are often useful, they carry a significant performance cost and should be avoided in code where performance is likely to be important.
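As an aside, the caller's code doesn't appear above, so the re.split() to str.split() change can only be reconstructed; assuming the original split each line on runs of whitespace, it might have looked roughly like this:

import re

line = "It was the best of times, it was the worst of times"

words_regex = re.split("\\s+", line)   # original regex-based split
words_plain = line.split()             # str.split() with no argument also splits on
                                       # runs of whitespace: an almost equivalent
                                       # plain-string replacement
assert words_regex == words_plain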