Handling service problems
May. 18th, 2007 04:47 pm
A thought has occurred. I reckon that with a couple of nips and tucks to the configuration, we should be able to improve the reliability of the event handlers triggered by Nagios to the point where we can be sure that if a service event occurs, it occurs because of a genuine problem on the host and not because of a timeout or because the entire host is down.
Thus, I suspect that if we reconfigure some of the service checks to depend on the ping checks, we should be able to prevent checks and notifications from triggering when the host is down. If we then add an escalation that only triggers on the second alert, and attach a handler to it that submits a Remedy ticket, then presto chango, we should have a way to automatically report bona fide service problems.
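For what it's worth, the ticket-submitting end of this could start out as something like the rough Python sketch below. It is only a guess at the shape of the thing: it assumes Nagios passes the host name, service description, host state, service state and notification number on the command line (for example via macros such as $HOSTSTATE$ and $SERVICESTATE$), and `submit_ticket` is a purely hypothetical stand-in for whatever actually talks to Remedy. Filing on exactly the second notification and ignoring UNKNOWN states are my assumptions too.

```python
import sys

# Assumption: only these states count as a bona fide service problem.
# UNKNOWN is deliberately left out of this sketch.
PROBLEM_STATES = {"WARNING", "CRITICAL"}

# Assumption: file the ticket on exactly the second notification, so that
# a one-off blip is ignored and repeat alerts do not open duplicates.
TICKET_ON_NOTIFICATION = 2


def should_file_ticket(host_state, service_state, notification_number):
    """Return True only when the host is up and the service is still
    in a problem state on the second alert."""
    if host_state != "UP":
        # Host is down or unreachable: the service alert is a side effect.
        return False
    if service_state not in PROBLEM_STATES:
        return False
    return notification_number == TICKET_ON_NOTIFICATION


def submit_ticket(host_name, service_desc, service_state):
    """Hypothetical placeholder: the real version would call Remedy."""
    print("TICKET: %s/%s is %s" % (host_name, service_desc, service_state))


def main(argv):
    if len(argv) != 6:
        print("usage: handler HOST SERVICE HOSTSTATE SERVICESTATE NUMBER")
        return 2
    host_name, service_desc, host_state, service_state, number = argv[1:6]
    if should_file_ticket(host_state, service_state, int(number)):
        submit_ticket(host_name, service_desc, service_state)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

The host-state check is belt and braces: if the dependencies on the ping checks do their job, Nagios should never get as far as calling the handler while the host is down, but the guard costs nothing.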