[personal profile] sawyl
Today I finally bit the bullet and parallelised my file scanning program. I'd been hoping I could get away with a few tuning tucks to the core code, but when I ran a test case with 30 million items, I found the execution time was completely dominated by the cost of the name query routine. Fortunately, this turned out to be relatively easy to divide up: all I had to do was replace the current recursive code with a queue and a series of worker tasks, and I was able to drop the run-time to something like a fifth of the serial version. I also looked into parallelising the metadata matching routine (something that would have required a distributed hash to implement), but after realising that the cost of that routine varied with the size of the file system rather than the number of items, I decided not to bother.
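
For anyone curious about the shape of the change, here is a minimal sketch in plain Python rather than the real Cython code. It is an illustration under assumptions, not the actual program: the names serial_scan, parallel_scan and _worker are made up for the example, os.scandir stands in for the real name query routine, and it assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp
import os


def serial_scan(path):
    """Recursive version: walk the tree depth-first in a single process."""
    found = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                found.extend(serial_scan(entry.path))
            else:
                found.append(entry.path)
    return sorted(found)


def _worker(jobs, results):
    """Pull directories off the queue, push subdirectories back on."""
    found = []
    while True:
        path = jobs.get()
        if path is None:  # sentinel: no more work
            jobs.task_done()
            break
        try:
            with os.scandir(path) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        # New work is queued before this task is marked
                        # done, so the pending count never hits zero early.
                        jobs.put(entry.path)
                    else:
                        found.append(entry.path)
        except OSError:
            pass  # unreadable directory: skip it in this sketch
        finally:
            jobs.task_done()
    # Send everything this worker found back in one message.
    results.put(found)


def parallel_scan(root, n_workers=4):
    """Queue-based version: the queue replaces the recursion."""
    ctx = mp.get_context("fork")  # assumption: POSIX only
    jobs = ctx.JoinableQueue()
    results = ctx.Queue()
    workers = [ctx.Process(target=_worker, args=(jobs, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    jobs.put(root)
    jobs.join()  # returns once every queued directory has been scanned
    for _ in workers:
        jobs.put(None)
    found = []
    for _ in workers:  # drain results before joining the workers
        found.extend(results.get())
    for w in workers:
        w.join()
    return sorted(found)
```

Calling parallel_scan("/some/path") should return the same sorted list of file paths as serial_scan("/some/path"). The one design point worth noting is that each worker returns its results in a single batch at the end, so the parent knows exactly how many messages to read before joining the processes.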

In the process of parallelising the code with multiprocessing.Process and Cython, I discovered a weird problem with the argument values: when I called the new parallel routine directly, everything worked as expected; but when I called the routine from a higher level, I got an exception that appeared to suggest that the first item in my Queue was corrupt.

Digging into the problem, I realised that it was caused by my choice of argument type in the function definition. In the parallel routine I'd chosen to use char *value to mirror the data type used by the serial version, something that worked when the routine was called directly with a string literal rather than a Python variable. But when the argument was passed as a Python object, I found that the value had been replaced by assert_spawning, causing the first worker process to fail. Once I realised this I was able to fix the problem by changing the type to object value, but it took me a while to work out what was going on, not least because the routines worked when tested separately and only failed when used together...
