[personal profile] sawyl
Working on a big bit of data analysis, distilling something like 600TB down into a summary, it occurred to me that instead of using multiprocessing.Pool to run in parallel on a single node, I could really benefit from something more scalable.
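
For reference, the single-node pattern looks roughly like the sketch below. It assumes the data can be cut into independent chunks and that the summary is something mergeable; the count/total/max summary and the names summarise_chunk, merge and summarise_all are made up for illustration, not taken from the real analysis.

```python
import functools
from multiprocessing import Pool


def summarise_chunk(values):
    """Reduce one chunk of raw values to a small, mergeable summary."""
    return {"count": len(values), "total": sum(values), "max": max(values)}


def merge(a, b):
    """Combine two partial summaries into one."""
    return {
        "count": a["count"] + b["count"],
        "total": a["total"] + b["total"],
        "max": max(a["max"], b["max"]),
    }


def summarise_all(chunks, processes=4):
    """Summarise every chunk in a local worker pool, then merge the results."""
    with Pool(processes=processes) as pool:
        # Order does not matter here because merge() is commutative.
        partials = pool.imap_unordered(summarise_chunk, chunks)
        return functools.reduce(merge, partials)
```

Something like summarise_all([[1, 5, 2], [9, 3], [4]]) would be called from under an if __name__ == "__main__": guard. The catch is that every worker lives on the same node, which is exactly the limit being run into here.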

While I'd like to be able to use one of the Python MPI libraries, just because it's what I know, I suspect I might be better off using multiprocessing.managers to distribute the work, relying on the environment LoadLeveler sets up for POE to trigger the right number of tasks on each of the nodes in the job. Which makes me wonder if I mightn't be better off investigating Hadoop, with a view to seeing (a) whether it might not be of use to other people and (b) whether we can come up with a way to get it to play nicely with our regular batch system.
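
A minimal sketch of the multiprocessing.managers idea, under some loud assumptions: one task serves a pair of queues over the network and the others connect to it, pull chunks and push back summaries. The address, the authkey, summarise and the queue names are all hypothetical, and the workers here are threads in one process purely so the sketch runs anywhere; in a real job they would be separate tasks started by the batch system, with the server address taken from the job environment.

```python
import queue
import threading
from multiprocessing.managers import BaseManager

AUTHKEY = b"not-a-real-secret"  # hypothetical shared key


class ClientManager(BaseManager):
    """Connects to the shared queues; used by every worker."""


# Clients only need the names; the server binds them to real objects.
ClientManager.register("get_tasks")
ClientManager.register("get_results")


def start_server(chunks, address=("127.0.0.1", 0)):
    """Queue up the work and serve the queues from a background thread."""
    tasks, results = queue.Queue(), queue.Queue()
    for chunk in chunks:
        tasks.put(chunk)

    class ServerManager(BaseManager):
        """Fresh subclass per call so registrations never collide."""

    ServerManager.register("get_tasks", callable=lambda: tasks)
    ServerManager.register("get_results", callable=lambda: results)
    server = ServerManager(address=address, authkey=AUTHKEY).get_server()
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Port 0 asks the OS for a free port; server.address is the real one.
    return server.address, results


def summarise(chunk):
    """Placeholder reduction standing in for the real analysis."""
    return sum(chunk)


def worker(address):
    """Pull chunks until the task queue is empty, pushing back summaries."""
    manager = ClientManager(address=address, authkey=AUTHKEY)
    manager.connect()
    tasks, results = manager.get_tasks(), manager.get_results()
    while True:
        try:
            chunk = tasks.get_nowait()
        except queue.Empty:
            return
        results.put(summarise(chunk))


def run_demo(chunks, n_workers=3):
    """Serve the chunks, run the workers, and collect the sorted summaries."""
    address, results = start_server(chunks)
    threads = [threading.Thread(target=worker, args=(address,))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results.get() for _ in chunks)
```

The appeal over MPI is that nothing beyond the standard library is needed; the cost is that task placement and start-up are left entirely to the batch system.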

Fortunately I don't think I really need to worry about any of this: my deadline isn't until Tuesday and if my analysis requires more than 2500 CPU hours to complete, then I need to completely rethink my approach...

ETA: Unwilling to risk not having my results in time, I added an extra level of decomposition, re-ran my analysis over a much larger number of CPUs, and got my answers back in around half an hour. FTW!
