Jun. 17th, 2009

sawyl: (Default)
Still feeling tired from yesterday, I decided to go for a short swim, concentrating on speed rather than distance.

Having done a K or so, I'd paused to adjust my goggles when one of the other people in the lane turned to me and said, "You made that look so easy..." My reply? "...and I thought that all my standing around and panting would have given the game away!"
sawyl: (Default)
I've spent the last week or two working on a script to create instant LoadLeveler reservations, even in situations where resource shortages would normally prevent them from being created. In the process, I've uncovered a couple of LoadLeveler gotchas.

I've found that a job that cannot write to its stdout or stderr files when it starts will go into user hold. Unless the initial directory is set with "# @ initialdir", it defaults to the current working directory of the llsubmit used to submit the job. If that directory is not writable by the job owner and the error and output paths are not explicitly routed to another directory, the job will go into user hold instead of running.
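One way to catch this before submission is to check where stdout and stderr would land. The following is only a minimal sketch in Python: the function name and default file names are hypothetical, it is not part of LoadLeveler, and it assumes that the user running the check is also the job owner and that relative output and error paths resolve against the initial directory.

```python
import os


def unwritable_job_paths(initialdir, output="job.out", error="job.err"):
    """Return the stdout/stderr paths the current user could not write.

    Hypothetical helper, not part of LoadLeveler. Relative paths are
    resolved against initialdir; absolute paths are used as given.
    """
    problems = []
    for path in (output, error):
        full = os.path.normpath(os.path.join(initialdir, path))
        target_dir = os.path.dirname(full)
        if os.path.exists(full):
            # An existing file must itself be writable.
            ok = os.access(full, os.W_OK)
        else:
            # Creating a file needs write and search permission on its directory.
            ok = os.path.isdir(target_dir) and os.access(target_dir, os.W_OK | os.X_OK)
        if not ok:
            problems.append(full)
    return problems
```

Calling it with os.getcwd() as the initial directory mimics the default case where no "# @ initialdir" is given; a non-empty result suggests the job would end up in user hold.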

This unpleasant fact caused us much worry when we discovered that jobs submitted interactively using a script went into user hold, because the script resided in a directory owned by a user other than the one submitting the job. Meanwhile, near-identical jobs submitted as part of a job chain from a running batch job, whose working directory was set to $HOME, ran flawlessly.

Also, if an existing job is cancelled to make room for a new reservation, it takes a few seconds to release its resources. Once the resources are released, the next idle job will jump into the gap, causing the attempt to create the new reservation to fail, unless scheduling is temporarily paused. If scheduling is paused by draining the schedulers, no new work will be started and, and here's the real gotcha, no new jobs will be accepted for submission into the batch system until the schedulers are resumed.

There may be a solution to this dilemma. Perhaps, if all jobs are allocated floating resources at submit time, the resource limit could be reduced slightly prior to the attempt to create the reservation in order to stop any new jobs from starting. Then, once the reservation has grabbed the newly released resources, the limit could be brought back up to allow normal work to continue to schedule.
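The ordering of those steps might be sketched as follows. This is an untested idea, not the script I've been writing: every argument is a hypothetical caller-supplied callable wrapping whatever site tooling does the real work, nothing here calls LoadLeveler itself, and it assumes that reducing the limit by a small amount is enough to stop idle jobs from starting.

```python
from contextlib import contextmanager


@contextmanager
def lowered_limit(get_limit, set_limit, reduce_by=1):
    """Temporarily reduce a floating resource limit, restoring it on exit."""
    original = get_limit()
    set_limit(original - reduce_by)
    try:
        yield original
    finally:
        # Restore even if the reservation attempt raised, so normal
        # scheduling is never left throttled.
        set_limit(original)


def reserve_with_preemption(get_limit, set_limit, cancel_job, resources_free,
                            create_reservation, wait, attempts=10):
    """Cancel a job and claim its resources before an idle job can.

    All arguments are hypothetical hooks supplied by the caller.
    """
    with lowered_limit(get_limit, set_limit):
        cancel_job()
        for _ in range(attempts):
            if resources_free():
                return create_reservation()
            wait()  # e.g. sleep for a second or two between polls
    raise RuntimeError("resources were not released in time")
```

Using a context manager means the limit goes back up whether the reservation succeeds, fails, or times out, so normal work is never left throttled for longer than the attempt itself.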

Still, these minor problems aside, it looks as though I might be on course to have something flawed but functional working by the end of the week — which precisely matches my initial estimate that it would take two weeks to get sorted out.
