Monitoring Solutions for Our Servers
Proper Server Monitoring
I need proper server monitoring for our servers. I haven't done this in the past because it was always obvious when there was a problem. I now have production systems that should be monitored properly so that I can respond accordingly.
Choosing the Stack
There are a lot of options when it comes to monitoring. I am not even going to start listing them because there are so many. I think there are a couple of types that we will implement here to get a well-rounded solution with some redundancy.
- Server monitoring: This will be a monitor for the actual infrastructure, services, disk, networking, and anything else we need. This should include logging for historical info as well as real-time data.
- HTTP monitoring: We need to be doing checks on our hosted domain from an external source.
- Alerting: What good is monitoring if there are no alerts to go with it? I assume this will be integral to the monitoring, but we need to make sure it's set up correctly.
System Monitoring
I am going to start with the system monitor. It's critical to monitor system processes and stats: things like CPU, memory, disk, and networking.
Icinga2
Currently Icinga2 is one of the services used at work. This made it a natural option for monitoring. I was familiar with it to an extent, and it would help me learn a service we use at work. I spun up a VM and started getting it set up.
It was a tangled mess of services. It took a while to even get my head around the terminology for the stack. I did eventually get it installed and working to an extent, but I ran into some issues and found the documentation to be lacking. I decided that this was just way too much for my needs. I didn't like the frustration and I abandoned the idea of using this service.
I found out about four days later that we would be moving away from Icinga2 at work and moving to checkmk. This is another Nagios-based solution, and they even say it has a wide array of options, modules, etc. This could be just as bad. Let's find out!
checkmk
Oh my goodness! This looks sooo much easier than Icinga2! This is what I was hoping for when I started with this project. It took me very little time to get the initial install done and have a working configuration. No convoluted jargon to wrap my head around, no crazy configuration files to mess with. Install a couple of deb packages. The documentation seems great so far.
I have finished a couple of videos that help with the very basic navigation and am about to work my way through the beginner's guide. The next section looks like it's configuring the actual monitors, so this will be the moment of truth.
checkmk
No doubt that this will be a viable solution for me. It's easy to install and still Nagios-based, so it feels familiar from what I have worked with in Icinga2. I am moving forward with getting this going to monitor the farm.
Setup
After getting the server set up, which was a single install instead of 3 or 4 binaries like Icinga2, it was a matter of getting the agent installed on the hosts. This process was very easy too. For me it's just a bunch of Debian and Ubuntu machines that need a .deb install. The install handles creating a user and all the setup that is needed in order to monitor the host.
I was a bit confused for a while about what ports needed to be opened and why. It wasn't a big deal at first because the focus was getting this set up inside our LAN, which is very insecure. I don't run firewalls on these machines. Security is maintained with the assumption that if something has made it onto the LAN, it's trusted. However, our production KVM is public facing. This machine is considerably more hardened and as such required an understanding of what was going on.
Once the agent is installed on the host you want to monitor, that host needs to get registered on the server. I didn't quite understand that two ports are needed on the server and a single port on the monitored host. My understanding of what needs to be open and why is:
Server: 80, 443, 8000
- 80, 443: Web interface and monitored-node registration. The web interface part is self-explanatory. Access for node registration isn't strictly necessary: it's only required if you don't specify port 8000 in the registration command, because the monitored node only uses 80/443 to discover that port. This is one of the things I didn't realize initially, and it matters for that remote production node. I don't want or need to expose 80/443 at my home; I use a VPN to access the farm when I am remote. I thought I was going to have to work around this, but it turns out I didn't need to. I can leave the HTTP ports unexposed at home and access them via the VPN. During node registration I just specify port 8000, which is easy enough.
- 8000: The port used during the registration process. It's the port that can be specified at setup to bypass discovery of the needed port on 80/443. I needed to expose this port to the monitored host. Since this is a production machine with a static IP, this was pretty easy. I created a destination NAT rule in my shiny new MikroTik router that forwards all traffic from the IP of my prod server to the checkmk server. 8000 is a somewhat common port, so we will see if this becomes an issue, but it's fine for now. The reason you might want to use 80/443 for discovery is that if you have multiple checkmk processes running on the server, that port increments for each one. Each checkmk "site" gets its own port (8000, 8001, etc.). So if you have multiple sites, you need to either specify the correct port with the site, or allow access from the monitored node to 80 or 443.
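The per-site port rule described above can be sketched in a few lines of shell. This is only an illustration: `site_port` and `server_arg` are hypothetical helpers, and they assume sites get consecutive ports starting at 8000 in the order they were created. On a real server it is safer to confirm the port than to compute it.

```shell
#!/bin/sh
# Hypothetical helper: guess the registration port of the Nth site (0-based),
# assuming ports are handed out consecutively from 8000 as described above.
site_port() {
    echo $((8000 + $1))
}

# Hypothetical helper: build the host:port value passed to --server
# for a given server name and site index.
server_arg() {
    printf '%s:%s\n' "$1" "$(site_port "$2")"
}

server_arg checkmk.example.com 0   # first site
server_arg checkmk.example.com 1   # second site
```

With a single site, as in my setup, this just comes out to port 8000.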
It took me a little while to get my head around all this. It sounds simple enough when I write it out now, but when you are unfamiliar with a service these are the kinds of details and ideas that can cause hours of frustration.
Once I got a rhythm for what needed to happen, it was really easy to get my 18 hosts set up. For all the internal stuff I was able to copy and paste a few lines. You need the monitoring server info, the password for your login, and that is about it. I used the following:
sudo cmk-agent-ctl register --server checkmk.example.com:8000 --site monitoring --user admin --hostname $HOSTNAME.example.com
Assuming the agent can connect to the server, it presents the server's certificate for confirmation and then asks for a password. That's all that is needed to get an agent registered on the server.
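For a batch of hosts, the copy-and-paste step can be made a bit less error-prone by generating the command per host. The sketch below only prints the commands so they can be reviewed and pasted onto each machine. The server, site, and user values are the ones from the command above; `register_cmd` and the host names `web01` and `db01` are made up for illustration.

```shell
#!/bin/sh
# Values taken from the registration command above; adjust for your environment.
CMK_SERVER="checkmk.example.com:8000"
CMK_SITE="monitoring"
CMK_USER="admin"
DOMAIN="example.com"

# Hypothetical helper: print the registration command for one host
# instead of running it, so it can be checked before use.
register_cmd() {
    printf 'sudo cmk-agent-ctl register --server %s --site %s --user %s --hostname %s.%s\n' \
        "$CMK_SERVER" "$CMK_SITE" "$CMK_USER" "$1" "$DOMAIN"
}

# Hypothetical host names, for illustration only.
for host in web01 db01; do
    register_cmd "$host"
done
```

Each printed line still has to be run on the host it names, since registration is initiated by the agent on the monitored machine.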
Monitoring
Once nodes are registered the process of monitoring is very easy. The documentation for checkmk is great and I had no problems testing the connections and adding monitoring for each host.
Profit!
I got all of our nodes set up for monitoring within a few hours, start to finish. I immediately saw the benefits when I was alerted to a disk on our Syncthing server that was 97% full. I was able to grow the ZFS volume via Proxmox in a couple of minutes and resolve the issue. I had no idea this volume had grown and needed to be resized, and I wouldn't have known until the Syncthing process started having problems. Then I would have spent time troubleshooting it. Having a centralized monitor that alerts me to these types of issues will allow me to spend less time monitoring and more time tinkering!
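To make the disk example concrete, here is a rough sketch of the kind of threshold check a monitor runs for you. It is not how checkmk does it internally; `over_threshold` is a hypothetical helper, and the 90% limit is an arbitrary value picked for illustration. It reads `df -P` style output and lists mount points at or above the limit.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` style output on stdin and print each
# mount point whose usage is at or above the given percentage.
# Note: mount points containing spaces are not handled by this sketch.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the trailing percent sign
        if (use + 0 >= limit) print $6 " " use "%"
    }'
}

# Check the local machine against an assumed 90% limit.
df -P | over_threshold 90
```

The point of a central monitor is that nobody has to remember to run something like this on 18 machines.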