Parallel Python and Cython
Mar. 28th, 2014 01:06 pm
Today I finally bit the bullet and parallelised my file scanning program. I'd been hoping I could get away with a few tuning tweaks to the core code, but when I ran a test case with 30 million items, I found the execution time was completely dominated by the cost of the name query routine. Fortunately, this turned out to be relatively easy to divide up: all I had to do was replace the current recursive code with a queue and a series of worker tasks, and I was able to drop the run-time to something like a fifth of the serial version. I also looked into parallelising the metadata matching routine (something that would have required a distributed hash to implement), but after realising that its run-time varied with the size of the file system rather than the number of items, I decided not to bother.
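For anyone wanting to try the same trick, here is a rough sketch of the queue-and-workers shape in plain Python. It is not my actual scanner: the names worker, parallel_scan and n_workers are made up for illustration, os.scandir stands in for the real name query routine, and it assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp
import os


def worker(work, results):
    """Pull directories off the shared queue until a None sentinel arrives."""
    found = []
    while True:
        path = work.get()
        if path is None:              # sentinel: the parent says we are done
            work.task_done()
            break
        try:
            with os.scandir(path) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        work.put(entry.path)   # queue the subdirectory instead of recursing
                    else:
                        found.append(entry.path)
        except OSError:
            pass                      # unreadable directory: skip it
        finally:
            work.task_done()          # mark this directory as processed
    results.put(found)                # hand back everything this worker saw


def parallel_scan(root, n_workers=4):
    """Return a sorted list of the non-directory entries found below root."""
    ctx = mp.get_context("fork")      # assumption: POSIX, fork start method
    work = ctx.JoinableQueue()
    results = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(work, results))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    work.put(root)
    work.join()                       # returns once every queued directory is done
    for _ in procs:
        work.put(None)                # one sentinel per worker
    files = []
    for _ in procs:
        files.extend(results.get())   # drain results before joining the workers
    for p in procs:
        p.join()
    return sorted(files)
```

Each worker puts the subdirectories it finds back onto the queue before marking its own directory as done, so the outstanding-task count cannot reach zero while there is still work left. Calling parallel_scan("/some/path") should return the same files as a serial os.walk over that path.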
In the process of parallelising the code with multiprocessing.Process and Cython, I discovered a weird problem with the argument values: when I called the new parallel routine directly, everything worked as expected; but when I called the routine from a higher level, I got an exception that appeared to suggest that the first item in my Queue was corrupt.
Digging into the problem, I realised that it was caused by my choice of argument type in the function definition. In the parallel routine I'd chosen to use char *value to mirror the data type used by the serial version; something that worked when the routine was called directly with a string rather than a Python variable. But when the calling value was replaced with a Python object, I found that the value was replaced by assert_spawning, causing the first worker process to fail. Once I realised this I was able to fix the problem by changing the type to object value, but it took me a while to work out what was going on, not least because the routines worked when tested separately and only failed when used as a complete entity...