Tuning a mod_python script
Aug. 12th, 2008 09:11 pmAfter playing around with my job accounting interface for a week or so, my test users came back with a few suggested improvements. Most were easy to implement, but as I was testing the ability to execute queries across multiple data sets, I noticed that the performance wasn't all that it might be, with ten day queries taking 5 seconds to execute. Clearly something needed to be done.
The profiler confirmed my suspicions — the code was extremely IO bound and a lot time was being lost to the
Then it occurred to me that, having committed myself to preprocessing the input data, I might as well push the idea to its logical conclusion. Thus, I reprocessed all the input data, stripping out the unwanted fields and replacing them with things that had previously been dynamically created at run-time by the interface. Having done this, I decided to optimise the IO path by dumping the data out using the python marshal module, removing the need to do any processing on the input beyond a simple load.
When I considered the access patterns of the data, another idea suggested itself. Most of the queries were being made against the same core data sets, either because multiple users were querying information for the same day or because a single user was sorting the results of their previous query into a different order. So, given that mod_python makes it possible for data structures to persist in memory — one of the principle reasons it out-performs simple CGI — it occurred to me that I might buffer recently used data sets in memory. So I added a brief bit of wrapper code around the
All in all, a most satisfying way to spend an otherwise idle afternoon...
The profiler confirmed my suspicions — the code was extremely IO bound and a lot time was being lost to the
re.split() used to split each line of input into fields. My initial workaround for the problem was simply to run the input data set through tr -s " " to remove duplicate spaces, allowing me to replace my slow RE with a much faster string split.Then it occurred to me that, having committed myself to preprocessing the input data, I might as well push the idea to its logical conclusion. Thus, I reprocessed all the input data, stripping out the unwanted fields and replacing them with things that had previously been dynamically created at run-time by the interface. Having done this, I decided to optimise the IO path by dumping the data out using the python marshal module, removing the need to do any processing on the input beyond a simple load.
When I considered the access patterns of the data, another idea suggested itself. Most of the queries were being made against the same core data sets, either because multiple users were querying information for the same day or because a single user was sorting the results of their previous query into a different order. So, given that mod_python makes it possible for data structures to persist in memory — one of the principle reasons it out-performs simple CGI — it occurred to me that I might buffer recently used data sets in memory. So I added a brief bit of wrapper code around the
marshal.load() call and, presto, the performance of the most common subset of queries ran through twenty times faster.All in all, a most satisfying way to spend an otherwise idle afternoon...
no subject
Date: 2008-08-13 11:39 am (UTC)Hopefully, the world does not care.
no subject
Date: 2008-08-13 08:07 pm (UTC)