Tricky AIX problems
Jan. 22nd, 2009 07:45 pm
Following an unexpected loss of power last week, one of the AIX systems has consistently failed to recognise its InfiniBand adapter. After the usual tricks to get the OS to reconfigure the card (removing and reinstalling fixes, deleting and re-creating the devices in the ODM) had failed, the problem was passed over to the hardware guys, who replaced the card and found that it made no difference.
Today, the engineering bods finally found the source of the problem: the hardware configuration on the HMC was inconsistent. The duff partition had simply been defined to use all available resources, so they created a new partition with an explicit hardware configuration, booted AIX on it, and were able to confirm that the system could talk to the card. Knowing this, they simply deleted and recreated the broken partition definition and, presto-chango, the system was back with all interfaces present and correct.
With the system back up, we then uncovered a whole series of problems with the routing configuration. Essentially, the static routes were correctly defined in the inet0 entries of CuAt but they had not been installed at boot time. We suspect that the problem may be due to a race condition in the network startup scripts: the routes are defined against a multi-link pseudo device, but if the route commands are run before the creation of the ml0 interface has completed, the routes aren't created. Interestingly, we don't see the same problem on an otherwise identical system configured with a much larger number of IB cards. Although the problem is trivial to fix (simply running cfgmgr -l inet0 at the end of the boot does the trick), it's a nasty little gotcha in an area that shouldn't really require manual intervention to get going.
But, grousing aside, I think I've learnt more about AIX and GPFS troubleshooting in the last week than I picked up on a month's worth of courses, which is surely a good thing...
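For anyone hitting the same routing gotcha, here is a minimal sketch of the workaround: wait until the ml0 interface exists, then run cfgmgr -l inet0 so the static routes get installed. It is only an illustration under some assumptions, not what the startup scripts actually do. The function names, the retry count and the delay are all made up for the example, and it assumes that ifconfig returns a non-zero exit status for an interface that has not been created yet. The race condition itself is still only a suspicion.

```python
import subprocess
import time


def interface_exists(name):
    """Return True if `ifconfig <name>` succeeds.

    Assumption: ifconfig exits non-zero while the interface does not exist yet.
    """
    result = subprocess.run(
        ["ifconfig", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def reconfigure_inet0():
    """Run the workaround from the post: cfgmgr -l inet0."""
    subprocess.run(["cfgmgr", "-l", "inet0"], check=True)


def install_routes_when_ready(interface="ml0", attempts=30, delay=1.0,
                              exists=interface_exists,
                              configure=reconfigure_inet0,
                              sleep=time.sleep):
    """Poll for the interface, then reconfigure inet0.

    Returns True if the reconfiguration was run, or False if the interface
    never appeared within the given number of attempts. The exists, configure
    and sleep hooks are hypothetical and only there so the logic can be
    exercised without an AIX box.
    """
    for attempt in range(attempts):
        if exists(interface):
            configure()
            return True
        # Do not sleep after the final failed check.
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Something like install_routes_when_ready("ml0") could be called from whatever runs last in the boot sequence. Polling rather than sleeping for a fixed time means a slow ml0 creation is tolerated without delaying a fast one.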