Monitoring Hosts on Nagios Without NRPE

I recently had to monitor a set of highly critical hosts running a minimal RHEL installation, with no wget, yum or even gcc* packages. Installing NRPE on these machines wasn’t possible, and because they were so critical I wasn’t allowed to roll up my sleeves and install all of the prerequisites on the way to NRPE. In this post I’ll describe how I managed to monitor these machines with minimal changes to their setup.

First of all, I created the user nagios on all of the hosts. Then, while logged on to the Nagios machine, I copied my SSH public key to each of them, making sure that I could log in to every one of them without having to type a password.

ssh-copy-id -i ~/.ssh/id_rsa.pub nagios@Host1
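
Once the key has been copied, it helps to confirm that every host really accepts it without a prompt. The following is a minimal sketch, not part of the original setup: the function name verify_key_login is hypothetical, and it assumes the remote user is nagios as in the command above.

```shell
# verify_key_login HOST...: report whether key-based login works on each host.
# BatchMode=yes makes ssh fail instead of asking for a password, which is
# how check_by_ssh will behave when Nagios runs it unattended.
verify_key_login() {
    local host status=0
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "nagios@$host" true 2>/dev/null; then
            echo "$host: key login OK"
        else
            echo "$host: key login FAILED"
            status=1
        fi
    done
    return $status
}
```

It could be called as verify_key_login Host1 Host2 Host3; a non-zero return value means at least one host still asks for a password or is unreachable.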

I needed to monitor the disk space on / and /boot, as well as the load average, on each of these machines, so I wrote two simple scripts to work with Nagios’s “check_by_ssh” plugin. Each script queries a value, compares it against the given thresholds and exits with the appropriate code (0: OK, 1: warning, 2: critical).

disk.sh
#!/bin/bash
## Checks the used disk space of a mount point for Nagios.
## Usage: disk.sh mountpoint critical_used% warning_used%
# df -P keeps each filesystem on a single line; column 5 is the use percentage
size=$(df -Ph "$1" | tail -1 | awk '{print $5}')
# strip the trailing % sign so the value can be compared as an integer
size=${size%\%}

if [ "$size" -gt "$2" ]
then
    echo "Critical $1 size exceeded $2 % current size $size %"
    exit 2
fi

if [ "$size" -gt "$3" ]
then
    echo "Warning $1 size exceeded $3 % current size $size %"
    exit 1
fi

echo "OK $1 current size $size %"
exit 0

and

load.sh
#!/bin/bash
## Checks the 5-minute load average: 3 or more is critical, 2 or more is warning.
## Usage: load.sh
# Field 2 of /proc/loadavg is the 5-minute load average. Picking a fixed
# field from the output of uptime is fragile, because the number of fields
# changes with how long the machine has been up.
loadavg=$(awk '{print $2}' /proc/loadavg)
# bash doesn't understand floating point,
# so keep only the integer part of the number
thisloadavg=${loadavg%%.*}
if [ "$thisloadavg" -ge 3 ]; then
 echo "Critical - Load Average $loadavg ($thisloadavg) "
 exit 2
elif [ "$thisloadavg" -ge 2 ]; then
 echo "Warning - Load Average $loadavg ($thisloadavg) "
 exit 1
else
 echo "Okay - Load Average $loadavg ($thisloadavg) "
 exit 0
fi
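
Truncating the load average to an integer means thresholds such as 2.5 cannot be expressed. If finer thresholds are ever needed, awk can do the floating point comparison. This is only a sketch of that alternative, assuming awk is available on the minimal installation (the scripts above already rely on it); the function names float_ge and load_status are hypothetical.

```shell
# float_ge A B: succeed (exit status 0) when A >= B, compared as floating point.
float_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 >= b + 0) }'
}

# load_status LOAD WARN CRIT: print a status line and return the Nagios code
# (0 = OK, 1 = warning, 2 = critical).
load_status() {
    if float_ge "$1" "$3"; then
        echo "Critical - Load Average $1"
        return 2
    elif float_ge "$1" "$2"; then
        echo "Warning - Load Average $1"
        return 1
    fi
    echo "Okay - Load Average $1"
    return 0
}
```

For example, load_status 2.50 2 3 reports a warning, whereas load_status 1.99 2 3 reports OK.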

I then deployed these scripts to all of the target machines with scp, making sure that the files were executable and reachable. I chose to place them in /usr/share/nagios_scripts to make life easier for the other administrators.
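
The deployment step could be scripted along these lines. This is a sketch under assumptions: the function name deploy_scripts is hypothetical, disk.sh and load.sh are expected in the current directory, and /usr/share/nagios_scripts must already exist on each host and be writable by the nagios user (on a real system a privileged account may be needed to create it).

```shell
REMOTE_DIR=/usr/share/nagios_scripts

# deploy_scripts HOST...: copy both check scripts to each host and make
# them executable, reporting the result per host.
deploy_scripts() {
    local host
    for host in "$@"; do
        if ! scp -p disk.sh load.sh "nagios@$host:$REMOTE_DIR/"; then
            echo "$host: copy failed"
            continue
        fi
        if ! ssh "nagios@$host" "chmod 755 $REMOTE_DIR/disk.sh $REMOTE_DIR/load.sh"; then
            echo "$host: chmod failed"
            continue
        fi
        echo "$host: deployed"
    done
}
```

It would be run from the Nagios machine as deploy_scripts host1 host2 host3.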

On my Nagios machine I added a new configuration directory to the nagios.cfg file and placed a new hosts.cfg in it that included all of the hosts. I made sure to add my personal touch (an icon for each machine, which appears on the hosts list as well as on the map).
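
The post does not show these files, so the following is only an illustration of what they might contain. The directory path, address, alias and image file names are placeholders, the linux-server template is assumed to come from the sample configuration, and icon_image and statusmap_image in a host definition assume Nagios 3 or later.

```
# In nagios.cfg (the directory path is an assumption)
cfg_dir=/etc/nagios/critical_hosts

# In /etc/nagios/critical_hosts/hosts.cfg
define host{
        use                             linux-server    ; assumed sample template
        host_name                       host1
        alias                           Critical host 1
        address                         192.0.2.11      ; placeholder address
        icon_image                      redhat.png      ; hosts list icon (hypothetical file)
        statusmap_image                 redhat.gd2      ; map icon (hypothetical file)
        }
```

With cfg_dir, Nagios reads every .cfg file in that directory, so hosts.cfg and services.cfg can live side by side there.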

Finally, I created a services.cfg file and added the service definitions.

define service{
        use                             local-service
        host_name                       host1,host2,host3
        service_description             Load_Avg_(5mins)
        check_command                   check_by_ssh!nagios!'/usr/share/nagios_scripts/load.sh'
        notifications_enabled           0
        }

define service{
        use                             local-service
        host_name                       host1,host2,host3
        service_description             /boot_Disk_Space
        check_command                   check_by_ssh!nagios!'/usr/share/nagios_scripts/disk.sh /boot 90 75'
        notifications_enabled           0
        }

define service{
        use                             local-service
        host_name                       host1,host2,host3
        service_description             /_Disk_Space
        check_command                   check_by_ssh!nagios!'/usr/share/nagios_scripts/disk.sh / 90 75'
        notifications_enabled           0
        }
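
The services above pass two arguments to a check_by_ssh command: the remote user and the command to run. The command definition itself is not shown in the post, so this is one plausible form, assuming the standard check_by_ssh plugin is installed in the directory that $USER1$ points to.

```
define command{
        command_name                    check_by_ssh
        command_line                    $USER1$/check_by_ssh -H $HOSTADDRESS$ -l $ARG1$ -C $ARG2$
        }
```

Here $ARG1$ receives nagios and $ARG2$ receives the quoted script path with its arguments, so the quotes in the service definitions keep the remote command together as a single -C value.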

A quick $ /etc/init.d/nagios reload and everything was working. All in all it took a little under 20 minutes.
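
A reload with a broken configuration can take monitoring down, so it may be worth validating the files first with the nagios -v option. This is a sketch only: the function name safe_reload is hypothetical, and the binary and configuration paths are assumptions that differ between installations.

```shell
# Paths are assumptions; override them through the environment if needed.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/sbin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/etc/nagios/nagios.cfg}
NAGIOS_INIT=${NAGIOS_INIT:-/etc/init.d/nagios}

# safe_reload: reload Nagios only if the configuration check passes.
safe_reload() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null; then
        "$NAGIOS_INIT" reload
    else
        echo "Configuration check failed; not reloading" >&2
        return 1
    fi
}
```

Running safe_reload after editing hosts.cfg or services.cfg leaves the running instance untouched when the check reports errors.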