[personal profile] sawyl
Today was marred by a total nightmare of an NQS problem.

While attempting to remap the queues to a new set of execution nodes, I uncovered a heretofore unseen problem whereby NQS continued to send jobs to the unbound nodes, only for the scheduler to immediately lose visibility of them. Then, inevitably, as soon as one job disappeared over the event horizon, the scheduler decided that the node was idle because it couldn't see any jobs there, and dumped yet another job into the black hole. Meanwhile, the jobs already on the node never ran: the scheduler had posted them to the node but, unable to see them, never told them to start...
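
For anyone who wants to see the shape of the loop, here is a toy model in Python. It is only a sketch of the behaviour described above, not of how NQS or its scheduler actually work internally: the `Node` class, the `visible` flag, and the `schedule` and `simulate` functions are all invented for illustration, and jobs in this model never finish.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical execution node; not a real NQS structure."""
    name: str
    visible: bool  # can the scheduler see jobs on this node?
    queued: list = field(default_factory=list)   # posted but not started
    running: list = field(default_factory=list)  # told to start


def visible_jobs(node):
    """Jobs the scheduler believes are on the node."""
    return node.queued + node.running if node.visible else []


def schedule(nodes, pending):
    """One scheduling pass over all nodes."""
    for node in nodes:
        # A node with no visible jobs looks idle, so it gets another job.
        if pending and not visible_jobs(node):
            node.queued.append(pending.pop(0))
        # Only jobs the scheduler can see are ever told to start.
        for job in visible_jobs(node):
            if job in node.queued:
                node.queued.remove(job)
                node.running.append(job)


def simulate(passes=5, jobs=10):
    """Run a few passes with one healthy node and one black hole."""
    nodes = [Node("bound", visible=True), Node("unbound", visible=False)]
    pending = [f"job{i}" for i in range(jobs)]
    for _ in range(passes):
        schedule(nodes, pending)
    return nodes, pending


if __name__ == "__main__":
    nodes, pending = simulate()
    for node in nodes:
        print(node.name, "queued:", node.queued, "running:", node.running)
    print("still pending:", pending)
```

In this model the visible node takes one job and keeps it, while the invisible node swallows a fresh job on every pass and never starts any of them, which is the black hole in miniature.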

The upshot of this cat's cradle of a crawling horror? Half the workload trashed, and a change that should have taken 30 minutes ended up taking 5 hours. I swear, I'm never, ever, going to have anything to do with the batch system ever again.