NQS nightmare
Sep. 6th, 2005 07:32 pm
Today was marred by a total nightmare of an NQS problem.
While attempting to remap the queues to a new set of execution nodes, I uncovered a heretofore unseen problem whereby NQS continued to send jobs to the unbound nodes, only for the scheduler to immediately lose visibility of them. Then inevitably, as soon as one job disappeared over the event horizon, the scheduler decided that the node was running idle because it couldn't see any jobs and dumped yet another job into the black hole. Meanwhile the jobs on the node weren't being run because although the scheduler had posted them to a node, it hadn't told them to start running because it couldn't see them...
The upshot of this cat's cradle of a crawling horror? Half the workload trashed, and a change that should have taken 30 minutes ended up taking 5 hours. I swear, I'm never, ever going to have anything to do with the batch system ever again.