Having accumulated a staggering 22TB of LoadLeveler data over the last few months, I decided that the time had come to do a bit of housekeeping. After a couple of tests showed that bzip2 seemed to be able to manage something like 95 per cent compression, I decided to find a way to exploit the obvious data parallelism inherent in my problem.

Fortunately, Python's multiprocessing module makes this trivial. Essentially, all I needed to do was the following:

import os
from multiprocessing import Pool
from subprocess import call

def compress(target):
    # hand the file off to bzip2, which replaces it with a .bz2 copy
    call(["bzip2", target])

if __name__ == "__main__":
    # four workers, each picking up the next filename from the list
    Pool(4).map(compress, os.listdir("."))

This creates a four-process pool of worker tasks, generates a list of files in the current directory, and passes them via the map() call to the compress() function, allowing the Pool object to handle the assignment of work to the tasks. Obviously, my final code wasn't really as simple as the fragment above: I added a couple of extra features to exclude already-compressed files from my list of targets and to limit the maximum number of files processed in any one run to a reasonable number.
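In outline, that extra filtering looked something like the sketch below; the .bz2 suffix test and the MAX_FILES cap here are only illustrative stand-ins for the checks my script actually used:

import os
from multiprocessing import Pool
from subprocess import call

MAX_FILES = 1000    # cap on the number of files handled in a single run

def compress(target):
    call(["bzip2", target])

def targets(directory="."):
    # skip anything that already carries a .bz2 suffix
    files = [f for f in os.listdir(directory) if not f.endswith(".bz2")]
    return files[:MAX_FILES]

if __name__ == "__main__":
    Pool(4).map(compress, targets())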

Once I'd done this, I set up a multi-step LoadLeveler job to traverse a subdirectory and process a thousand files before resubmitting itself to handle the next chunk. Once the compression program can no longer find files to compress, the job advances to the next directory and resubmits itself. (My first guess at a base case, stopping when no more uncompressed files remained in the directory, turned out to be too naive: having decided not to use the --force option to bzip2, my script was unable to replace the partially compressed files left over by early batch jobs that had been killed off after hitting their wall-clock limit, causing it to go into a resubmission loop. In addition to adding a force option, I changed the job to compare the contents of the current directory before and after running the compression script, advancing to the next directory whenever the two listings were identical.)
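In rough Python terms, that before-and-after check amounts to something like the following sketch, with the exit status standing in for however the wrapper job actually decides whether to advance:

import os
import sys
from multiprocessing import Pool
from subprocess import call

def compress(target):
    # --force lets bzip2 overwrite partial .bz2 files left by killed jobs
    call(["bzip2", "--force", target])

def snapshot(directory):
    # sorted listing so two snapshots can be compared directly
    return sorted(os.listdir(directory))

if __name__ == "__main__":
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    before = snapshot(directory)
    targets = [os.path.join(directory, f)
               for f in before if not f.endswith(".bz2")]
    Pool(4).map(compress, targets[:1000])
    after = snapshot(directory)
    # identical listings mean nothing was compressed, so signal the
    # wrapper job (via the exit status) to advance to the next directory
    sys.exit(1 if before == after else 0)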

Fortunately, the space savings seem to have more than justified the investment of my time: I've managed to shave 4TB off my total disk usage in the last 24 hours.
