
I've spent the last week or two working on a script to create instant LoadLeveler reservations, even in situations where resource shortages would normally prevent them from being created. In the process, I've uncovered a couple of LoadLeveler gotchas.
I've found that a job that starts and finds that it cannot write to its stdout or stderr files will go into user hold. Unless the initial directory is set with "# @ initialdir", it defaults to the current working directory of the llsubmit command used to submit the job. If that directory is not writable by the job owner and the error and output paths are not explicitly routed to another directory, the job will be held instead of running.
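To make that concrete, here is a minimal sketch of a job command file that avoids the problem by pinning the initial directory. The job name is a placeholder, and I'm assuming the `$(home)`, `$(job_name)` and `$(jobid)` job command file variables are available in your LoadLeveler version; substitute literal paths if they are not.

```shell
#!/bin/sh
# Sketch only: the job name is a placeholder, not taken from a real job.
# @ job_name   = demo
# Pin the initial directory to somewhere the job owner can write,
# rather than inheriting the working directory of llsubmit.
# @ initialdir = $(home)
# Relative output and error paths are resolved against initialdir.
# @ output     = $(job_name).$(jobid).out
# @ error      = $(job_name).$(jobid).err
# @ queue

echo "Running in $(pwd)"
```

Either half of the fix should be enough on its own: set the initial directory to a writable location, or give the output and error files absolute paths in a writable location.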
This unpleasant fact caused us much worry when we discovered that jobs submitted interactively using a script that resided in a directory owned by a user other than the one submitting the job went into user hold, while near-identical jobs submitted as part of a job chain from a running batch job, whose working directory was set to $HOME, ran flawlessly.
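A cheap guard against this is to check the submit directory before calling llsubmit. The following is a rough sketch rather than anything from my actual script; the function name `can_submit_from` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical pre-submit check: the current directory becomes the job's
# default initial directory, so warn if the submitting user cannot write to it.
can_submit_from() {
    dir=${1:-$PWD}
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        return 0
    fi
    echo "warning: $dir is not writable; set initialdir or output/error paths" >&2
    return 1
}
```

It could be used as `can_submit_from && llsubmit job.cmd`, where `job.cmd` stands in for whatever job command file is being submitted. Note that this only tests the user running the check, which is the job owner in the interactive case described above.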
Also, if an existing job is cancelled to make room for a new reservation, it takes a few seconds to release its resources. Once the resources are released, the next idle job will jump into the gap, causing the attempt to create the new reservation to fail, unless scheduling is temporarily paused. If scheduling is paused by draining the schedulers, no new work will be started and (here's the real gotcha) no new jobs will be accepted for submission into the batch system until the schedulers are resumed.
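If draining is used anyway, the window during which submissions are refused should at least be kept as short as possible and always closed again. Below is a rough sketch of a wrapper that does this. The function name is hypothetical, and the `llctl -g drain schedd` and `llctl -g resume schedd` invocations are my assumption of how the drain and resume would be issued; check them against your LoadLeveler version before relying on them.

```shell
#!/bin/sh
# Hypothetical wrapper: pause scheduling only for as long as it takes to
# run one command (for example the reservation request), then resume.
LLCTL=${LLCTL:-llctl}   # overridable, so the sketch can be dry-run with LLCTL=echo

with_paused_scheduling() {
    # Assumed invocation: drain the schedd daemons across the cluster.
    "$LLCTL" -g drain schedd || return 1
    # Run whatever the caller passed in, e.g. the reservation command.
    "$@"
    status=$?
    # Always resume, even if the command failed, so that new
    # submissions are accepted again.
    "$LLCTL" -g resume schedd
    return "$status"
}
```

The reservation command itself is passed in by the caller, so the wrapper makes no assumptions about how the reservation is created. It does not remove the gotcha: anyone who tries to submit while the wrapped command is running will still be turned away.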
There may be a solution to this dilemma. Perhaps, if all jobs are allocated floating resources at submit time, the resource limit could be reduced slightly prior to the attempt to create the reservation in order to stop any new jobs from starting. Then, once the reservation has grabbed the newly released resources, the limit could be brought back up to allow normal work to continue to schedule.
Still, these minor problems aside, it looks as though I might be on course to have something flawed but functional working by the end of the week — which precisely matches my initial estimate that it would take two weeks to get sorted out.