Lazy data parallelism
Sep. 7th, 2012 04:58 pm
Fortunately, Python's multiprocessing module makes this trivial. Essentially, all I needed to do was the following:
import os
from multiprocessing import Pool
from subprocess import call

def compress(target):
    call(["bzip2", target])

if __name__ == "__main__":
    Pool(4).map(compress, os.listdir("."))
This creates a four-process pool of worker tasks, generates a list of files in the current directory and passes them via the map() call to the compress() function, allowing the Pool object to handle the assignment of work to the tasks. Obviously, my final code wasn't really as simple as the fragment above. I added a couple of extra features to allow me to exclude already compressed files from my list of targets and to limit the maximum number of files processed in any one run to a reasonable number.
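In outline, the target-selection logic was something like the sketch below; the targets() helper, the .bz2 extension check and the MAX_FILES value are illustrative rather than the exact code I ran:

import os
from multiprocessing import Pool
from subprocess import call

MAX_FILES = 1000  # illustrative cap on the number of files handled per run

def compress(target):
    call(["bzip2", target])

def targets(directory="."):
    # Skip anything that is already compressed...
    candidates = [os.path.join(directory, f)
                  for f in os.listdir(directory)
                  if not f.endswith(".bz2")]
    # ...ignore anything that isn't a regular file, and cap the batch size.
    candidates = [f for f in candidates if os.path.isfile(f)]
    return candidates[:MAX_FILES]

if __name__ == "__main__":
    Pool(4).map(compress, targets())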
Once I'd done this, I set up a multi-step LoadLeveler job to traverse a subdirectory and process a thousand files before resubmitting itself to handle the next chunk. Once the compression program is no longer able to find files to compress, the job advances to the next directory and resubmits itself. (My first guess at a base case, stopping when no more uncompressed files remained in the directory, turned out to be too naive: having decided not to use the --force option to bzip2, my script was unable to replace the partially compressed files left over by early batch jobs that had been killed off after hitting their wall-clock limit, causing it to go into a resubmission loop. In addition to adding a force option, I changed the job to compare the contents of the current directory before and after running the compression script, advancing to the next directory whenever the two listings were identical.)
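The termination check itself is simple; here's a rough sketch of the idea in Python (in practice it lives in the LoadLeveler job script, and the function names here are made up):

import os

def nothing_left_to_compress(directory, run_compression):
    # Snapshot the directory listing, run one compression pass,
    # then take a second snapshot and compare.
    before = sorted(os.listdir(directory))
    run_compression(directory)
    after = sorted(os.listdir(directory))
    # Compressing a file renames it (foo -> foo.bz2), so an unchanged
    # listing means nothing was compressed and it's time to move on
    # to the next directory instead of resubmitting for this one.
    return before == after

If this check comes back true, the job steps on to the next directory; otherwise it resubmits itself to chew through another thousand files.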
Fortunately, the space savings seem to have more than justified the investment of my time: I've managed to shave 4TB off my total disk usage in the last 24 hours.