sawyl: (A self portrait)
Yet another bad day following yet another LoadLeveler meltdown. This time I think I've traced the problem back to the NEGOTIATOR_RESCAN_QUEUE setting, if only because the frequency of short dispatch cycles versus long ones seems to match the setting and to change when the value is cranked up. Looking at the rest of the tunings, I think we could probably benefit from making some of them less aggressive. In particular, I think we ought to increase the startd polling period and set UPDATE_ON_POLL_INTERVAL_ONLY to true, given the vast number of jobs we run on a daily basis.

The nearest thing I got to a relaxing moment all afternoon occurred in the dentist's chair. There were no sinister surprises — not that I was really expecting any — and I got a perfect score for everything, proving that my recent efforts to keep up with my flossing have paid off. Afterwards, I went back to work and spent the remains of the day writing things up for the holidays, then headed into town to get some shopping in before going to the quay for night climbing.

Spots of light in the gloom...

The event was good fun and although it wasn't pitch dark — despite the evidence of my appalling photo — it was gloomy enough and the head torches were bright enough that it was surprisingly tricky to see anything outside your spotlight. We took things gently, much to the relief of my bruised up hand, and didn't do as many routes as usual because R was slightly late escaping from work. Her arrival was, as ever, well worth waiting for thanks to a pair of illuminated fairy wings which, to my regret, I failed to photograph for posterity — although I hope I'll be able to mooch a pic from somewhere at some point...
sawyl: (A self portrait)
A busy few days trying to trace the source of a substantial degradation in the performance of the batch subsystem. Examining the dispatch cycles — normally 10-20 seconds but now inflated to 130-180 seconds — I immediately suspected a bad job but was unable to trace the cause until I enabled full and performance debugging. With this switched on I noticed that each dispatch cycle contained over 330,000 lines referring to a specific job which, when examined, proved to have requested a set of resources that could never be satisfied.
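
For the record, the triage amounted to little more than counting which job cropped up most often in a cycle's worth of negotiator log output. A rough sketch of the idea in Python, with the step-matching pattern a placeholder since the real debug lines are a good deal messier:

  import re
  import sys
  from collections import Counter

  # Hypothetical pattern for a job step id, e.g. host123.456.0 -- adjust to
  # whatever the negotiator debug lines actually contain.
  STEP_RE = re.compile(r'\b\S+\.\d+\.\d+\b')

  def count_steps(logfile):
      counts = Counter()
      with open(logfile) as fh:
          for line in fh:
              counts.update(STEP_RE.findall(line))
      return counts

  if __name__ == '__main__':
      for step, n in count_steps(sys.argv[1]).most_common(10):
          print(f"{n:10d}  {step}")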

Knowing the likely cause of the problem, I slapped an admin hold on the job and the cycle times promptly dropped back to something like normal. My sense of timing was, as ever, impeccable: I fixed the problem just in time for a couple of the others to push off for their Christmas lunch, while I hung around for as long as I could to make sure that everything had settled before heading off to keep my appointment in town. Excelsior!
sawyl: (Default)
Looking into the generally sluggish turnaround of the batch system, I decided that the problem was probably down to long cycle times in the negotiator daemon's dispatch code and that it was being exacerbated by full debug logging. To prove my hypothesis, I decided to write something to plot out the cycle times before and after changing the debug flags. However, while the script proved easy enough to write, the sizes of the log files meant that the performance simply wasn't good enough. So I sat down with the Python profiler and timeit to see if I couldn't hurry things along.
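
The profiling itself was the easy bit. Something along these lines, with parse_log standing in for the real parsing function and the sample log name invented for the purposes of the sketch:

  import cProfile
  import pstats
  import timeit

  # parse_log stands in for the real log-parsing function; swap in your own.
  def parse_log(path):
      cycles = []
      with open(path) as fh:
          for line in fh:
              if "dispatch cycle" in line:   # placeholder match
                  cycles.append(line)
      return cycles

  # profile one run over a sample file and show the twenty most expensive calls
  cProfile.run('parse_log("NegotiatorLog.sample")', 'parse.prof')
  pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(20)

  # time the two candidate string tests in isolation
  setup = 'line = "negotiator: dispatch cycle completed"'
  print(timeit.timeit('line.endswith("completed")', setup=setup, number=1_000_000))
  print(timeit.timeit('"completed" in line', setup=setup, number=1_000_000))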

The profiler showed that the bulk of the time was being spent in the log file parsing function. No great surprise there — the function was crunching through hundreds of gigs of data to pick out the cycle times. What was surprising was the relatively high cost of the endswith() string method. Because it was being called for almost every line of input, it was dominating the performance of the program. So, optimisation number one: reorder the string comparisons to try to break out of the matching early and replace the endswith() with a less expensive in check. With that done, dateutil.parser became more prominent in the profiler listing. I tried a couple of easy tricks, removing some unnecessary rounding operations and trimming back the length of the date string to be parsed, but extracting the date of each log entry was still far too expensive. So, optimisation number two: replace the generalised dateutil.parser routine with a custom bit of code and incorporate caching to avoid re-parsing the date when it hasn't changed within the number of significant figures we care about.
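
Roughly speaking, the parsing ended up looking something like the sketch below. The timestamp layout, the marker text and the function names are all illustrative rather than verbatim, but the shape is right: a cheap containment test to reject most lines, then a hand-rolled parse with a one-entry cache:

  from datetime import datetime

  # The timestamp is assumed to sit at the start of each line in an
  # "MM/DD HH:MM:SS" layout; the real format, and the marker text below, should
  # be checked against the actual logs. The one-entry cache means consecutive
  # lines with the same timestamp (to the precision we keep) cost nothing to parse.
  _last_raw = None
  _last_parsed = None

  def parse_stamp(line, year=2011):
      global _last_raw, _last_parsed
      raw = line[:14]                      # trim to the significant figures
      if raw == _last_raw:
          return _last_parsed
      month, day = int(raw[0:2]), int(raw[3:5])
      hour, minute, sec = int(raw[6:8]), int(raw[9:11]), int(raw[12:14])
      _last_raw, _last_parsed = raw, datetime(year, month, day, hour, minute, sec)
      return _last_parsed

  def cycle_stamps(fh):
      # cheap containment test first, so the vast majority of lines bail out early
      for line in fh:
          if "dispatch cycle" in line:     # placeholder marker text
              yield parse_stamp(line)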

As a result of the changes, the time required to run my testcase dropped to around an eighth of its original value, bringing the time taken to analyse a complete set of logs down to a mere 700 seconds. Then, when I finally plotted out the distilled data, I noticed that as soon as I disabled full debugging the cycle time dropped from around 20-30 seconds to 3-4 seconds and stayed stable. Excelsior!
sawyl: (Default)
And so, after finishing the Work That Must Be Finished and doing a global stop and start, LoadLeveler has settled itself back down and started scheduling again as normal. I'm more than slightly relieved — not only has my vision of days of instability followed by a cold start vanished, but my original hypothesis about the cause of the problem seems to have been justified.
sawyl: (Default)
Contact with my muse successfully re-established and much accomplished in consequence. I was particularly pleased to get the basics of multi-cluster LoadLeveler sorted, but the filtering threw me slightly: the filter seems to get run both locally and remotely, but only the remote filter is able to change the contents of the job.

I'm also not entirely sure how to design a cluster metric that correctly reflects the load of a particular system. Is it enough simply to look at the number of idle steps? Or should this be scaled to either the size of the cluster or, possibly, the number of currently unreserved machines in the cluster? And should jobs pending reservations be included in the idle count? Surely not if they're waiting for a reservation that starts next week. But what if they're waiting for a reservation that starts in ten minutes? Maybe the decision to include or exclude reserved steps should depend on whether the reservation will go active within a time window that matches the expected wall clock of the job?
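
I don't have an answer yet, but the shape of the thing might be something like the sketch below, with the attribute names and the scaling entirely up for debate:

  from datetime import datetime, timedelta

  def counts_as_idle(step, now=None):
      """Treat a reservation-bound step as idle only if its reservation is due
      to go active within the step's expected wall clock time. The attribute
      names here are invented for the sake of the sketch."""
      now = now or datetime.utcnow()
      if step.reservation_start is None:
          return True
      return step.reservation_start - now <= timedelta(seconds=step.wall_clock)

  def cluster_load(idle_steps, unreserved_machines):
      """One possible metric: plausibly runnable idle steps, scaled by the
      number of machines actually free to run them."""
      runnable = [s for s in idle_steps if counts_as_idle(s)]
      return len(runnable) / max(unreserved_machines, 1)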
sawyl: (Default)
Despite being told that the latest LoadLeveler documentation had been released, I was still thrown by the title of the installation guide: Tivoli Workload Scheduler LoadLeveler for AIX V5.1 Installation Guide, as in the AIX version of LoadLeveler 5.1 and not, as I originally interpreted it, the AIX 5.1-specific version of LoadLeveler...
sawyl: (Default)
When you're way down on the callout roster and the phone still rings, you know it's going to be bad. And so it proved. After being sold a sorry tale about power problems and the struggle to get the systems started again, my interlocutor broke the really bad news: LoadLeveler wouldn't run on either system, making it impossible to run any production work, and the others, having offered up their advice, were out of ideas, so it was up to me to defend the honour of the team.

A little bit of digging soon revealed a particularly nasty scheduler problem: when the main daemons were restarted, the negotiator came up and started responding to queries, while the scheduler went into a crash/restart loop until the restart threshold was exceeded and the rest of the daemons were shut down by the master process. Given the underlying cause of the problem, I suspected that the on-disc data structures had probably been corrupted by the power down. And when I checked the spool directories on the primary and alternate managers, I noticed that the time stamps on the database files matched the time when the system was brought down. Not conclusive, but I decided to take a gamble and force a cold start by moving the current spool directories out of the way, creating new, empty spools with the right permissions, and restarting the daemons on the managers. And it worked: the schedulers came back up, albeit with empty queues, and started accepting and scheduling new jobs. Crisis averted.
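
For my own future reference, the recovery boiled down to something like the following. The paths, ownership and permissions are illustrative rather than exact, so anyone tempted to copy it should check the local LoadL_config first:

  import shutil
  import subprocess
  import time
  from pathlib import Path

  # Illustrative spool location and ownership: the real values come from the
  # local LoadL_config, so check before doing anything like this.
  SPOOL = Path("/var/loadl/spool")
  OWNER = "loadl"

  stamp = time.strftime("%Y%m%d-%H%M%S")
  shutil.move(str(SPOOL), f"{SPOOL}.corrupt.{stamp}")   # keep the old spool for the post-mortem
  SPOOL.mkdir(mode=0o700)
  shutil.chown(str(SPOOL), user=OWNER, group=OWNER)

  # bring the daemons back up; llctl -g acts on every machine in the cluster
  subprocess.run(["llctl", "-g", "start"], check=True)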

So today's lessons learnt. Trashing the LoadLeveler scheduler spools isn't fatal, provided you don't mind losing all your jobs; when it's a choice between something, however minimal, and nothing, always choose the something; and when you're running the same level of software everywhere, you can expect to see the same bugs everywhere. Today's corollary. When asked how you solved the problem, always always reply with a modified version of Gell-Mann's description of Feynman's problem solving method: "I wrote down the problem. I examined the logs. I thought very hard. I came up with the answer..."
sawyl: (Default)
I've recently spent some time churning through LoadLeveler's job history file, trying to tease out some trends from the mass of data. What I really want is a way to quantify the relationship between project priority and step queue time, in the hope that this will show that higher priority projects spend less time queuing. Unfortunately the picture is greatly complicated by the fact that step sizes and durations differ between projects, and that the backfill scheduling algorithm is more likely to select smaller, shorter jobs regardless of their priority.
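
So far my analysis is no more sophisticated than grouping the history records by project and priority and comparing medians, along the lines of the sketch below, with the field names invented for illustration. It does nothing to correct for step size or backfill, which is rather the point of the complaint:

  import statistics
  from collections import defaultdict

  def queue_times_by_project(records):
      """Group step queue times (dispatch minus submit, in seconds) by project
      and priority. The field names are invented for the sketch; the real
      history records carry rather more than this."""
      grouped = defaultdict(list)
      for rec in records:
          wait = (rec.dispatch_time - rec.submit_time).total_seconds()
          grouped[(rec.project, rec.priority)].append(wait)
      return grouped

  def summarise(grouped):
      for (project, priority), waits in sorted(grouped.items()):
          print(f"{project:<15} pri={priority:<5} n={len(waits):<6} "
                f"median wait={statistics.median(waits):.0f}s")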

If only I had a better grasp of statistical methods...
sawyl: (Default)
Having made a few changes to LoadLeveler to enable preemption for reservations, I ran a few naive tests. These apparently confirmed my expectations: running jobs were bumped and put back into the queue in order to allow a new reservation to be forced in. But when others started to use preemption in earnest, I was surprised to discover that their llmkres commands had failed with insufficient resource errors.

Doing some digging, I finally isolated the cause: I could create preempting reservations because I had LoadLeveler admin rights, while all normal users simply got an error indicating that the reservation could not be created. Sure enough, when I checked the documentation, I found the answer staring me in the face:

# RESERVATION_PRIORITY to define whether LoadLeveler administrators may reserve nodes on which running jobs are expected to end after the start time for the reservation.

...

This keyword value applies only for LoadLeveler administrators; other reservation owners do not have this capability.

Most unfortunate.

sawyl: (Default)
I've had a surprisingly productive day testing out some of IBM's suggestions for how we might improve our LoadLeveler configuration. Of these, the most promising seems to be the recommendation that we enable a limited form of preemption by:

  1. Setting PREEMPTION_SUPPORT = full;
  2. Setting DEFAULT_PREEMPT_METHOD = vc;
  3. Setting RESERVATION_PRIORITY = high;

This gives the reservations a higher priority than the running work and, if there are insufficient resources for the reservation when it goes into the SETUP state, the jobs running on the nodes assigned to the reservation will be cleared out using the default preempt method. In this case I opted for vacate, which kills the job and requeues it, rather than hold or suspend, because we don't have enough paging space to allow two full memory-sized jobs to coexist on a node.

Although I think that preemption may well fix most of our problems, I don't think it is a magic bullet. For one thing it doesn't, by itself, grant us any control over which jobs are preempted — something our current script does for us. But it might be possible to deal with this using another of their suggestions: that we split the system into two separate pools and use dummy jobs to limit the reservations to a single pool, while allowing the rest of the work to run in either pool.

I think we could probably improve on this by allowing all lower priority work to run in either pool, while restricting high priority work to the non-reservation pool. Something we could probably do by checking the priority of the job in the filter as it is submitted and adding a # @ pool = parameter as it passes through.
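
A rough sketch of what such a filter might look like is below. As I understand it, the filter receives the job command file on stdin and writes the amended version to stdout; how a high priority job is recognised here is a placeholder, since ours would come from the project accounting rather than a hypothetical "urgent" class:

  #!/usr/bin/env python3
  # Sketch of a submit filter that pins high priority work to the
  # non-reservation pool by injecting a pool directive before the queue
  # statement. The pool number and the priority test are illustrative only.
  import sys

  NON_RESERVATION_POOL = 1          # illustrative pool number

  lines = sys.stdin.readlines()
  high = any(l.strip().lower().startswith("# @ class = urgent") for l in lines)

  out = []
  for line in lines:
      # slip the pool directive in just before the queue statement
      if high and line.strip().lower().startswith("# @ queue"):
          out.append(f"# @ pool = {NON_RESERVATION_POOL}\n")
      out.append(line)

  sys.stdout.write("".join(out))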

sawyl: (Default)
I've encountered another interesting LoadLeveler oddity, again involving reservations, but this time due, I suspect, to a timing bug rather than a design effect.

Essentially, what appears to have happened is that a reservation was created with a start time of T+2 minutes and the request was granted by the first scheduler. Less than a minute later, the second scheduler allocated a queued job to the nodes that had just been reserved by the first scheduler. Then, when the reservation became active, the job associated with it was unable to run because it lacked the necessary resources.

My suspicion is that the problem is due to some sort of race between the two scheduler instances: that the two daemons are not communicating frequently enough to ensure that they both have up-to-date copies of the current reservation list. I'm not entirely surprised there are problems. I'm deliberately creating a reservation as near to the current time as possible — not something anyone in their right mind would normally want to do — and I'm relying on LoadL to coordinate across multiple machines, which is asking for trouble.

Time to dig through the manual: perhaps there's some parameter I can tweak to increase the scheduler synchronisation frequency...
sawyl: (Default)
I've discovered a nasty little problem with LoadLeveler reservations. A problem which means that sometimes, just sometimes, the reservation actually prevents the job from running, but which is rather hard to describe.

Imagine a pair of reservations. Both reservations have at least one node in common between them, but because the first reservation is due to finish before the second is due to start, this is not a problem.

Now, consider what happens when RESERVATION_CAN_BE_EXCEEDED is set to true in the LoadL_config file and a job is submitted into the first reservation. If the job has a wallclock time that is longer than the reservation, but which means that it will finish before the second reservation becomes active, there is no problem. But if the wallclock time of the job means that it will end after the second reservation is due to start — i.e. if the current time plus the wallclock time of the job is later than the start time of the second reservation — the two resource requests will clash and, consequently, the job will not run.
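
Spelt out, the failing case is simply the one where the job's projected end time crosses the start of the second reservation. Trivial, but it helps to see it written down:

  from datetime import datetime, timedelta

  def clashes_with_second_reservation(now, wall_clock_secs, second_start):
      """True when a job started now, allowed by RESERVATION_CAN_BE_EXCEEDED to
      run past its own reservation, would still be running when the second
      reservation is due to start."""
      return now + timedelta(seconds=wall_clock_secs) > second_start

  # e.g. a three hour job started two hours before the second reservation begins
  now = datetime(2011, 6, 1, 9, 0)
  print(clashes_with_second_reservation(now, 3 * 3600, datetime(2011, 6, 1, 11, 0)))  # True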

Thus, under certain circumstances, running a job in a reservation may actually hamper its execution rather than assist it. And, worse still, because the placement of reservations is dependent on the state of the system, the problem does not necessarily occur repeatably...
sawyl: (Default)
After months of talking about it, I've finally implemented a set of LoadLeveler prolog and epilog programs. In the process I've discovered:

  • any output generated by the prolog gets stamped on by the job itself
  • the epilog has to be run as the job user in order to update the output files
  • information about prolog or epilog problems is written to the StarterLog of the node where the master task executed
  • an easy solution to the epilog output problem is to write the program in perl or C and dup stdout and stderr to $LOADL_STEP_OUT and $LOADL_STEP_ERR, because this also redirects the output of any child processes (see the sketch below)
  • not all LOADL variables are available on all nodes of a multi-node job (the important ones are usually only set on the master)
  • the programs are run by the job starter daemon and are unaffected by environment changes in the job script
  • the precise details of writing prologs and epilogs don't seem to be terribly well documented

And, most importantly, that it doesn't matter if the epilog programs don't work but if the prolog doesn't work, the system will start to trash the jobs until the problems are fixed...
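
For completeness, the dup trick from the list above looks something like this. The post talks about perl or C; Python is used here purely for illustration, with the environment variable names taken from the list above:

  #!/usr/bin/env python3
  # Epilog sketch: dup this process's stdout/stderr onto the job step's own
  # output and error files so that anything the epilog, or its children, prints
  # ends up alongside the job's output. Remember it has to run as the job user
  # to be able to write to those files.
  import os
  import subprocess
  import sys

  out = os.open(os.environ["LOADL_STEP_OUT"], os.O_WRONLY | os.O_APPEND | os.O_CREAT)
  err = os.open(os.environ["LOADL_STEP_ERR"], os.O_WRONLY | os.O_APPEND | os.O_CREAT)
  os.dup2(out, sys.stdout.fileno())    # file descriptor 1
  os.dup2(err, sys.stderr.fileno())    # file descriptor 2

  print("epilog: tidying up scratch space")    # lands in the job's output file
  subprocess.run(["df", "-h", "/tmp"])         # child process output is redirected too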

sawyl: (Default)
I've spent the last week or two working on a script to create instant LoadLeveler reservations, even in situations where resource shortages would normally prevent them from being created. In the process, I've uncovered a couple of LoadLeveler gotchas.

I've found that a job that starts and finds that it cannot write to its stdout or stderr files will go into user hold. Unless the initial directory is set with "# @ initialdir", it defaults to the current working directory of the llsubmit used to submit the job; and if that directory is not writable by the job owner and the error and output paths are not explicitly routed elsewhere, the job will not start to run.

This unpleasant fact caused us much worry when we discovered that jobs submitted using a script — a script that resided in a directory owned by a user other than the one submitting the job — went into user hold when submitted interactively, while near-identical jobs submitted as part of a job chain from a running batch job, whose working directory was set to $HOME, ran flawlessly.

Also, if an existing job is cancelled to make room for a new reservation, it takes a few seconds to release its resources. Once the resources are released, the next idle job will jump into the gap unless scheduling is temporarily paused, causing the attempt to create the new reservation to fail. But if scheduling is paused by draining the schedulers, no new work will be started and, here's the real gotcha, no new jobs will be accepted for submission into the batch system until the schedulers are resumed.

There may be a solution to this dilemma. Perhaps, if all jobs are allocated floating resources at submit time, the resource limit could be reduced slightly prior to the attempt to create the reservation in order to stop any new jobs from starting. Then, once the reservation has grabbed the newly released resources, the limit could be brought back up to allow normal work to continue to schedule.
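
In outline, the dance might go something like the sketch below. The limit-setting helper is a pure stand-in since I haven't worked out how the floating resource limit would actually be adjusted, and the thirty second wait is a guess:

  import subprocess
  import time

  def set_floating_resource_limit(value):
      # Stand-in only: the real mechanism for adjusting the limit is still to
      # be worked out, so this just says what it would do.
      print(f"would set the floating resource limit to {value}")

  def make_room_for_reservation(victim_step, llmkres_cmd, normal_limit, reduced_limit):
      set_floating_resource_limit(reduced_limit)        # stop new jobs grabbing the freed nodes
      try:
          subprocess.run(["llcancel", victim_step], check=True)
          time.sleep(30)                                # crude: wait for the resources to drain
          subprocess.run(llmkres_cmd, check=True)       # the prepared llmkres invocation
      finally:
          set_floating_resource_limit(normal_limit)     # let normal work schedule again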

Still, these minor problems aside, it looks as though I might be on course to have something flawed but functional working by the end of the week — which precisely matches my initial estimate that it would take two weeks to get sorted out.
