Currently I’m working in a high-availability environment where unscheduled downtime can’t be tolerated. This requires a strong monitoring infrastructure that keeps track of both the system’s availability and its performance.
Basically, application monitoring can be divided into two categories:
- Availability Monitoring: This covers whether the system is accessible and providing the functionality it’s supposed to, which is exactly what a good monitoring system should do. However, this approach only covers what can be called reactive monitoring, as you only detect a problem once it occurs. Furthermore, with only two or at most three states (ok, warning, critical), it can’t provide the trends needed for capacity management or for preemptive action to avert possible incidents.
- Performance Monitoring: This covers the performance of the system and is mainly used to read the values of preset KPIs for each application or system. This approach is more granular and can detect trends or possible future incidents. In general it is harder to implement than availability monitoring, but in return you get what can be called proactive monitoring, as you can stop incidents before the system’s users notice them.
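To make the difference between the two approaches concrete, here is a minimal Python sketch. It is only an illustration and not how Nagios or Cacti work internally: the function names (`availability_state`, `samples_until_threshold`), the thresholds, and the disk usage readings are all made up for this example. The first function is the reactive style, mapping a single reading to one of three states. The second is the proactive style, fitting a straight line through a series of KPI readings to estimate how long until a threshold is crossed.

```python
from typing import List, Optional


def availability_state(value: float, warn: float, crit: float) -> str:
    """Reactive check: map one reading to one of three states."""
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


def samples_until_threshold(samples: List[float], threshold: float) -> Optional[float]:
    """Proactive check: fit a least-squares line through equally spaced
    samples and estimate how many more sampling intervals remain before
    the threshold is crossed. Returns None if the trend is flat or
    falling, or if there are fewer than two samples."""
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_x = (threshold - intercept) / slope
    # Count intervals from the most recent sample (index n - 1).
    return max(0.0, crossing_x - (n - 1))


# Hypothetical disk usage readings in percent, one per sampling interval.
disk_usage = [70.0, 72.0, 74.0, 76.0, 78.0]

print(availability_state(disk_usage[-1], warn=80.0, crit=90.0))  # ok
print(samples_until_threshold(disk_usage, threshold=90.0))       # 6.0
```

With these sample readings the state check still reports ok, while the trend estimate says the 90% mark is about six intervals away. That gap is exactly the early warning that a state-based check alone can’t give you.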
In the following two entries I’m going to cover both monitoring approaches using open source monitoring tools (Nagios and Cacti).