Mopping up after PBS and ALPS
Jun. 6th, 2017 08:25 pm

Just as I was thinking about going home, my phone rang. Foolishly, I answered. As a result, I spent a good hour and a half trying to sort out a series of errors caused by an out-of-memory event on one of the
cray_login nodes.

Fortunately, the circumstances of the problem were relatively clear and I was able to make a note of the affected jobs prior to purging them from the system. This meant that I was able to target the stale ALPS reservations left dangling by the job failures and immediately remove them with apmgr cancel. This wasn't entirely successful: around 40 ALPS reservations went into a pendCancel state and refused to clean up despite restarts of various daemons and an apmgr resync.
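
The bookkeeping here, matching the noted job IDs against the dangling reservations, can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the procedure actually used on the night: the record layout, the field names (resid, batch_id, state), the example job IDs and the idea that apmgr cancel takes a single reservation ID per call are all illustrative. It only builds the command lines and does not run anything.

```python
# Hypothetical record layout: each dict stands in for one reservation parsed
# from whatever listing your site uses. The field names are assumptions.

def stale_reservation_ids(reservations, purged_job_ids):
    """Return the IDs of reservations that belong to jobs already purged."""
    purged = set(purged_job_ids)
    return sorted(r["resid"] for r in reservations if r["batch_id"] in purged)


def build_cancel_commands(resids):
    """Build one 'apmgr cancel' command line per reservation ID.

    Assumption: the command accepts a single reservation ID as its argument.
    """
    return [["apmgr", "cancel", str(resid)] for resid in resids]


def stuck_in_pend_cancel(reservations):
    """Return the IDs of reservations still reported as pendCancel."""
    return sorted(r["resid"] for r in reservations if r.get("state") == "pendCancel")


if __name__ == "__main__":
    # Made-up example data, purely for illustration.
    listing = [
        {"resid": 101, "batch_id": "5001.sdb", "state": "conf"},
        {"resid": 102, "batch_id": "5002.sdb", "state": "conf"},
        {"resid": 103, "batch_id": "5003.sdb", "state": "pendCancel"},
    ]
    noted_jobs = ["5001.sdb", "5003.sdb"]

    for cmd in build_cancel_commands(stale_reservation_ids(listing, noted_jobs)):
        print(" ".join(cmd))  # review before running anything by hand

    print("still pendCancel:", stuck_in_pend_cancel(listing))
```

Keeping the noted job IDs as the only input means nothing belonging to a healthy job gets cancelled by accident, and the pendCancel check afterwards shows which reservations need further attention.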
Eventually, I concluded that discretion was the better part of valour and simply placed all the affected nodes into admindown to prevent them from being used. When I resumed the work, things started to run without encountering any of the dreaded transient MPP errors we normally see when ALPS and PBS drift out of alignment, and I was finally able to go home, a couple of hours later than planned.