Jun. 6th, 2017

sawyl: (Default)
Just as I was thinking about going home, my phone rang. Foolishly, I answered. As a result, I spent a good hour and a half trying to sort out a series of errors caused by an out-of-memory event on one of the cray_login nodes.

Fortunately, the circumstances of the problem were relatively clear and I was able to make a note of the affected jobs prior to purging them from the system. This meant that I was able to target the stale ALPS reservations left dangling by the job failures and immediately remove them with apmgr cancel. This wasn't entirely successful: around 40 ALPS reservations went into a pendCancel state and refused to clean up despite restarts of various daemons and an apmgr resync

Eventually, I concluded that discretion was the better part of valour and simple placed all the affected nodes into admindown to prevent them from being used. When I resumed the work, things started to run without encountering the any of the dreaded transient MPP errors we normally see when ALPS and PBS drift out of alignment, and I was finally able to go home, a couple of hours later than planned.


sawyl: (Default)

September 2017

      1 2
3 456 78 9

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Sep. 26th, 2017 07:47 pm
Powered by Dreamwidth Studios