Debugging network problems
Jun. 28th, 2010 09:17 pm

I've been looking into an annoying network problem which caused connections to a particular system to fail, apparently at random. After initial investigations with
tcpdump showed that large amounts of traffic were being lost, I suspected a problem with the EtherChannel configuration. But I later discovered that the device had been configured in failover mode, the backup adaptor had never been connected, and dropping the backup out of the configuration made no difference.

So I went off in a different direction, combing through netstat output in search of errors. Although I found a suspiciously high number of duplicate packets and a large number of drops due to a full listener queue in the statistics output (netstat -s), I failed to find anything conclusive. In desperation, I tried listing the number of packets dropped at each layer of the communications subsystem using netstat -D (which is, I think, AIX specific). This showed that almost all the outbound packets were being dropped at the interface level, which made me suspect an IP configuration problem.

Sure enough, when I checked the routing table, I discovered that the system had added a second default route to an old and invalid gateway, probably during a recent reboot. When I removed this, the network abruptly returned to its old reliable self. I'm not quite sure where this route came from. I suspect that when it was removed after the gateway became invalid, it was deleted using
route delete rather than chdev -l inet0. This disabled the route on the running system but left it in the ODM, from which it was reloaded when the system was rebooted.

The lessons here? Always check the routing table before checking anything else. Always, always use chdev to change the routing on AIX, and if you're not sure of the options to use, use smitty rmroute.
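
The routing table check that finally found the problem is easy to automate. What follows is only a rough sketch in Python, assuming the output of netstat -rn has already been captured as text. The function names, the sample table and its addresses are invented for illustration (they are not from the affected system), and the column layout is an approximation of what AIX prints.

```python
# Hypothetical sample of `netstat -rn` output; addresses use the
# documentation range 192.0.2.0/24 and are not from a real system.
SAMPLE = """\
Routing tables
Destination        Gateway           Flags   Refs     Use  If   Exp  Groups

Route tree for Protocol Family 2 (Internet):
default            192.0.2.1         UG        4   123456 en0      -      -
default            192.0.2.254       UG        0      789 en0      -      -
127/8              127.0.0.1         U         7     1234 lo0      -      -
192.0.2.0          192.0.2.10        UHSb      0        0 en0      -      -
"""


def default_routes(netstat_rn_output):
    """Return the gateway of every default route found in the text.

    Assumes the destination is the first column and the gateway the
    second, and that a default route is shown as "default" (or as
    "0.0.0.0" on systems that print it numerically).
    """
    gateways = []
    for line in netstat_rn_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("default", "0.0.0.0"):
            gateways.append(fields[1])
    return gateways


def has_duplicate_default(netstat_rn_output):
    """True if more than one default route is present."""
    return len(default_routes(netstat_rn_output)) > 1


if has_duplicate_default(SAMPLE):
    print("More than one default route:", default_routes(SAMPLE))
```

Run against the sample, this reports both 192.0.2.1 and 192.0.2.254 as default gateways. More than one default route is not always wrong, so a hit is a prompt to go and look at the table rather than proof of a fault.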