Jun. 6th, 2017

Just as I was thinking about going home, my phone rang. Foolishly, I answered. As a result, I spent a good hour and a half trying to sort out a series of errors caused by an out-of-memory event on one of the cray_login nodes.

Fortunately, the circumstances of the problem were relatively clear and I was able to make a note of the affected jobs prior to purging them from the system. This meant that I was able to target the stale ALPS reservations left dangling by the job failures and immediately remove them with apmgr cancel. This wasn't entirely successful: around 40 ALPS reservations went into a pendCancel state and refused to clean up despite restarts of various daemons and an apmgr resync.

Eventually, I concluded that discretion was the better part of valour and simply placed all the affected nodes into admindown to prevent them from being used. When I resumed the work, things started to run without encountering any of the dreaded transient MPP errors we normally see when ALPS and PBS drift out of alignment, and I was finally able to go home, a couple of hours later than planned.


