I’ve spent a lot of time with a lot of different monitoring systems over the years, both commercial and open source, and anyone who has worked with me knows all too well how much I hate every single one of them. So a couple messages I saw in ##infra-talk last Thursday night really piqued my interest:
<lusis> ##monitoringsucks
<lusis> whack, yeah I thought I'd throw together a irc chat about some next-gen monitoring shit
Soon after joining, I was made aware of something Dan DeLeo (of Chef/Opscode fame) has been working on called critical. As soon as I saw “Monitoring As Code,” I immediately thought “fuck yes, it’s about time!” The thing is, I can’t even count how many times I’ve sat down, tried to implement this thing, and quickly given up, because I didn’t quite know what I was trying to implement. But now that I know other smart DevOps folks are thinking some of the same thoughts I am, I figured now would be a good time to at least write down some of my ideas.
Code Instead of Configuration Files
I’ve been saying for a long time now that I basically want a Chef equivalent for monitoring (even if I didn’t know exactly what that meant). To me, the real genius of Chef is that everything is configured via pure Ruby code. Not only is this extremely powerful, it’s also extremely flexible, because you’re not really constrained by anyone’s preconceived ideas about how to do things. Chef never assumes that it knows how to configure my environment better than I do, and creates a situation where I’m only limited by my own programming abilities. The importance of this cannot be overstated in the monitoring world, where there’s a long-running joke about how every monitoring system sucks (and always will), simply because everyone has their own ideas about what the perfect monitoring system actually does. The solution to this problem is to stop thinking in terms of monolithic monitoring applications and start thinking in terms of modular monitoring frameworks that allow the user to design the perfect monitoring application for their particular environment. I think Chef is a great example of this idea in practice as it relates to configuration management. So how can we apply these ideas to infrastructure monitoring?
I’ve been convinced for a long time now that a client-based architecture is the only way to go. My first reason is that monitoring through firewalls is a huge pain in the ass. Having your checks originate from a central server means you need to configure all kinds of VPN and/or ACL bullshit before your server can even see the hosts you want to monitor. But you’re not done there. You still almost definitely need to do things to the hosts themselves in order to expose the appropriate metrics to your monitoring server(s) (you don’t want every host on your LAN to have access to all this stuff, right?).
The second reason a client-based architecture is important is for scalability and performance. Everyone knows what happens when you have too many active checks in Nagios. You eventually end up in a situation where Nagios can’t complete all of its checks within a reasonable interval, and the inevitable slow host(s) will make the problem even worse.
So in a nutshell, my ideal architecture consists of:
- Intelligent clients that are responsible for data collection, automatic problem remediation, etc.
- Dumb servers that simply aggregate the data collected by clients, and possibly serve as a central location for client configuration.
A Simple, Chef-like DSL
We need a concept similar to Chef resources, which would be analogous to Nagios plugins. Once you have a resource defined, you simply plug in the appropriate parameters for your check. So for example, a simple HTTP check resource might look something like this:
check_http "My Website" do
  url "http://example.com/path/to/something"
  expect_http_codes
  on_failure :restart, resources(:service => "apache")
end
A generic command testing resource would also be great. This is a very simple feature that has been sorely missing from Monit for several years now:
check_command "Check Something" do
  command "/usr/local/bin/check_something.sh"
  expect_return_codes
  on_failure :run, resources(:execute => "fix_something.sh")
end
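To make the DSL idea a bit more concrete, here is a minimal sketch of how a check_command resource could be evaluated in plain Ruby. Everything in it is hypothetical and not the API of Chef, critical, or Monit: the CommandCheck class, the assumed default of "exit status 0 is healthy," and especially the block form of on_failure, which stands in for the resources(...) notification shown above.

```ruby
require "open3"

# Hypothetical sketch: these names are illustrative and do not come from
# Chef, critical, or any other existing library.
class CommandCheck
  attr_reader :name

  def initialize(name, &block)
    @name = name
    @expected = [0]        # assumed default: only exit status 0 is healthy
    @on_failure = nil
    instance_eval(&block)  # evaluate the DSL block with this check as self
  end

  # DSL setters, called from inside the block.
  def command(cmd)
    @command = cmd
  end

  def expect_return_codes(*codes)
    @expected = codes.flatten
  end

  # Assumption: a plain block replaces the resource notification syntax.
  def on_failure(&handler)
    @on_failure = handler
  end

  # Run the command once; call the failure handler on an unexpected status.
  def run
    _output, status = Open3.capture2e(@command)
    ok = @expected.include?(status.exitstatus)
    @on_failure.call(self, status.exitstatus) if !ok && @on_failure
    ok
  end
end

def check_command(name, &block)
  CommandCheck.new(name, &block)
end

# Usage: a command that exits with 3 while only 0 and 1 are accepted.
failures = []
check = check_command "Check Something" do
  command "sh -c 'exit 3'"
  expect_return_codes 0, 1
  on_failure { |c, code| failures << [c.name, code] }
end
check.run
```

The only real trick is instance_eval, which is what lets the block call command and expect_return_codes without a receiver. A real framework would also need scheduling, timeouts, and handling for commands that cannot be executed at all.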
Automatic Problem Remediation
Of course, no monitoring system would be complete without a way to do automatic problem remediation. I imagine something like Chef’s resource notifications would work pretty well here (see the DSL examples above).
Trending and Alerting
It has never made any sense to me why our trending and alerting systems should duplicate the work of retrieving host metrics. Why not store them all in one place, and have your trending and alerting systems work off the same data?
When it comes to actually collecting the data, I think collectd is on the right track here, because it’s such a brain-dead simple idea that provides a ton of power and flexibility. By itself, the daemon essentially does nothing except read a configuration file. The rest is handled by input/output plugins that you write yourself. So for example, you might write a simple input plugin that polls your load average every 10 seconds. Then you might have output plugins that write these metrics to local CSV/RRD files, remote databases, and/or POST them to a central server with a RESTful interface. The point is that you’re not only responsible for what metrics you collect, but also what you do with the data.
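The input/output split is easy to sketch in a few lines of Ruby. This is not collectd's actual plugin API (collectd has its own plugin interfaces); the Collector class, the metric name, and the CSV path below are all assumptions made for illustration.

```ruby
# Hypothetical sketch of an input/output plugin split; this is NOT
# collectd's real plugin API.
class Collector
  def initialize
    @inputs = {}   # metric name => callable returning a number
    @outputs = []  # callables receiving (name, value, timestamp)
  end

  # Register an input plugin that produces one value per poll.
  def input(name, &reader)
    @inputs[name] = reader
  end

  # Register an output plugin that decides what to do with each sample.
  def output(&writer)
    @outputs << writer
  end

  # Poll every input once and hand each sample to every output.
  def tick(now = Time.now)
    @inputs.each do |name, reader|
      value = reader.call
      @outputs.each { |writer| writer.call(name, value, now) }
    end
  end
end

collector = Collector.new

# Input plugin: 1-minute load average (Linux-only, reads /proc/loadavg).
collector.input("load.1min") { File.read("/proc/loadavg").split.first.to_f }

# Output plugin: append CSV rows to a local file (the path is an assumption).
collector.output do |name, value, at|
  File.open("/tmp/metrics.csv", "a") do |f|
    f.puts([at.to_i, name, value].join(","))
  end
end

# The daemon itself is little more than a timer around tick:
#   loop { collector.tick; sleep 10 }
```

The core knows nothing about load averages or CSV files, which is the whole point: swapping the output for one that POSTs to a central server would not touch the inputs at all.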
Ohai is probably the closest analog for this in Chef’s world. For monitoring, I can imagine something similar that would poll system metrics at regular intervals, buffer them to local files (so that we can query them efficiently from within the monitoring “recipes”), and occasionally push them up to a server, where we can visualize everything within a Ganglia-like console.
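Here is a rough sketch of that buffer-locally, push-occasionally idea. The MetricBuffer class, the JSON-lines file format, and the "uploader is a block that returns true on success" convention are all assumptions of mine, not anything Ohai or Ganglia provides.

```ruby
require "json"

# Hypothetical sketch: buffer samples in a local JSON-lines file, let
# checks query them, and hand them to an uploader in batches.
class MetricBuffer
  def initialize(path)
    @path = path
  end

  # Append one sample to the local buffer file.
  def record(name, value, at = Time.now)
    File.open(@path, "a") do |f|
      f.puts(JSON.generate("name" => name, "value" => value, "at" => at.to_i))
    end
  end

  # Local query for use inside checks: most recent value of a metric, or nil.
  def latest(name)
    samples.reverse.find { |s| s["name"] == name }&.fetch("value")
  end

  # Pass all buffered samples to the block. The buffer is cleared only if
  # the block returns a truthy value, so a failed upload is retried on the
  # next flush. Returns the number of samples flushed.
  def flush
    batch = samples
    return 0 if batch.empty?
    return 0 unless yield(batch)
    File.write(@path, "")
    batch.size
  end

  private

  # Read every buffered sample back from disk.
  def samples
    return [] unless File.exist?(@path)
    File.readlines(@path, chomp: true)
        .reject(&:empty?)
        .map { |line| JSON.parse(line) }
  end
end
```

A check could call latest("load.1min") without touching the network, while a background timer calls flush with a block that POSTs the batch to the server. This version ignores concurrent writers and samples recorded mid-flush, both of which a real client would have to handle.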
Static configuration files are obsolete. I want to do everything in code.