sawyl: (Default)
[personal profile] sawyl
Just as I was thinking about going home, my phone rang. Foolishly, I answered. As a result, I spent a good hour and a half trying to sort out a series of errors caused by an out-of-memory event on one of the cray_login nodes.

Fortunately, the circumstances of the problem were relatively clear and I was able to make a note of the affected jobs prior to purging them from the system. This meant that I was able to target the stale ALPS reservations left dangling by the job failures and immediately remove them with apmgr cancel. This wasn't entirely successful: around 40 ALPS reservations went into a pendCancel state and refused to clean up, despite restarts of various daemons and an apmgr resync.
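The targeting step can be sketched roughly as below. This is only an illustration, not the procedure used on the night: it assumes the reservation list has already been collected into a mapping of reservation ID to batch job ID, and the function name, job IDs and reservation IDs are all hypothetical. It stops at producing the list of reservation IDs and leaves the actual cancellation to be done by hand.

```python
def stale_reservations(reservations, affected_jobs):
    """Return the IDs of reservations that belong to one of the affected jobs.

    reservations  -- dict mapping reservation ID -> batch job ID (assumed format)
    affected_jobs -- iterable of batch job IDs noted before the purge
    """
    affected = set(affected_jobs)
    return sorted(resid for resid, job in reservations.items() if job in affected)


# Hypothetical data: two of the three reservations belong to purged jobs.
reservations = {1001: "501.sdb", 1002: "502.sdb", 1003: "777.sdb"}
affected_jobs = ["501.sdb", "777.sdb"]

for resid in stale_reservations(reservations, affected_jobs):
    # Each ID printed here is a candidate for manual cancellation.
    print(resid)
```

Keeping the selection separate from the cancellation makes it easy to eyeball the list before removing anything.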

Eventually, I concluded that discretion was the better part of valour and simply placed all the affected nodes into admindown to prevent them from being used. When I resumed the work, things started to run without encountering any of the dreaded transient MPP errors we normally see when ALPS and PBS drift out of alignment, and I was finally able to go home, a couple of hours later than planned.

Page generated Jul. 27th, 2017 06:45 pm