A case for MapReduce
Aug. 23rd, 2013 01:28 pm
Working on a big bit of data analysis — distilling something like 600TB down into a summary — it occurred to me that, instead of using multiprocessing.Pool to run in parallel on a single node, I could really benefit from something more scalable.
While I'd like to be able to use one of the Python MPI libraries, just because it's what I know, I suspect I might be better off using multiprocessing.managers to distribute the work, and use the environment LoadLeveler sets up for POE to trigger the right number of tasks on each of the nodes in the job. Which makes me wonder if I mightn't be better off investigating Hadoop, with a view to seeing (a) whether it might be of use to other people and (b) whether we can come up with a way to get it to play nicely with our regular batch system.

Fortunately I don't think I really need to worry about any of this: my deadline isn't until Tuesday, and if my analysis requires more than 2500 CPU hours to complete then I need to completely rethink my approach...
ETA: Unwilling to risk not having my results in time, I added an extra level of decomposition, re-ran my analysis over a much larger number of CPUs, and got my answers back in around half an hour. FTW!