Jun. 6th, 2017

sawyl: (Default)
Just as I was thinking about going home, my phone rang. Foolishly, I answered. As a result, I spent a good hour and a half trying to sort out a series of errors caused by an out-of-memory event on one of the cray_login nodes.

Fortunately, the circumstances of the problem were relatively clear and I was able to make a note of the affected jobs prior to purging them from the system. This meant that I was able to target the stale ALPS reservations left dangling by the job failures and immediately remove them with apmgr cancel. This wasn't entirely successful: around 40 ALPS reservations went into a pendCancel state and refused to clean up despite restarts of various daemons and an apmgr resync

Eventually, I concluded that discretion was the better part of valour and simple placed all the affected nodes into admindown to prevent them from being used. When I resumed the work, things started to run without encountering the any of the dreaded transient MPP errors we normally see when ALPS and PBS drift out of alignment, and I was finally able to go home, a couple of hours later than planned.

Profile

sawyl: (Default)
sawyl

August 2018

S M T W T F S
   123 4
5 6 7 8910 11
12131415161718
192021222324 25
262728293031 

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Jul. 6th, 2025 07:35 am
Powered by Dreamwidth Studios