Thursday, June 25, 2015

Deploying monitoring tools with ansible

At my previous job, I used Chef for configuration management. New job, new tools, so I decided to use ansible, which I had played with before. Part of that was that I got sick of tools based on Ruby. Managing all the gems dependencies and migrating from one Ruby version to another was a nightmare that I didn't want to go through again. That's one reason why at my new job we settled on Go as the main language we use for our backend API layer.

Back to ansible. Since it's written in Python, it's already got good marks in my book. Plus it doesn't need a server and it's fairly easy to wrap your head around. I've been very happy with it so far.

For external monitoring, we use Pingdom because it works and it's cheap. We also use New Relic for application performance monitoring, but it's very expensive, so I've been looking at ways to supplement it with Open Source tools.

An announcement about telegraf drew my attention the other day: here was a metrics collection tool written in Go and sending its data to InfluxDB, which is a scalable database also written in Go and designed to receive time-series data. Seemed like a perfect fit. I just needed a way to display the data from InfluxDB. Cool, it looked like Grafana supports InfluxDB! It turns out however that Grafana support for the latest InfluxDB version 0.9.0 is experimental, i.e. doesn't really work. Plus telegraf itself has some rough edges in the way it tags the data it sends to InfluxDB. Long story short, after a whole day of banging my head against the telegraf/InfluxDB/Grafana wall, I decided to abandon this toolset.

Instead, I reached again to trusty old Graphite and its loyal companion statsd. I had problems with Graphite not scaling well before, but for now we're not sending it such a huge amount of metrics, so it will do. I also settled on collectd as the OS metric collector. It's small, easy to configure, and very stable. The final piece of the puzzle was a process monitoring and alerting tool. I chose monit for this purpose. Again: simple, serverless, small footprint, widely used, stable, easy to configure.

This seems like a lot of tools, but it's not really that bad if you have a good solid configuration management system in place -- ansible in my case.

Here are some tips and tricks specific to ansible for dealing with multiple monitoring tools that need to be deployed across various systems.

Use roles extensively

This is of course recommended no matter what configuration management system you use. With ansible, it's easy to use the commmand 'ansible-galaxy init rolename' to create the directory structure for a new role. My approach is to create a new role for each major application or tool that I want to deploy. Here are some of the roles I created:

  • a base role that adds users, deals with ssh keys and sudoers.d files, creates directory structures common to all servers, etc.
  • a tuning role that mostly configures TCP-related parameters in sysctl.conf
  • a postfix role that installs and configures postfix to use Amazon SES
  •  a go role that installs golang from source and configures GOPATH and GOROOT
  • an nginx role that installs nginx and deploys self-signed SSL certs for development/testing purposes
  • a collectd role that installs collectd and deploys (as an ansible template) a collectd.conf configuration file common to all systems, which sends data to graphite (the system name is customized as {{inventory_hostname}} in the write_graphite plugin)
  • a monit role that installs monit and deploys (again as an ansible template) a monitrc file that monitors resource metrics such as CPU, memory, disk etc. common to all systems
  • an api role that does the heavy lifting for the installation and configuration of the packages that are required by our API layer
Use an umbrella 'monitoring' role

At first I was configuring each ansible playbook to use both the monit role and the collectd role. I realized that it's a bit more clear and also easier to maintain if instead playbooks use a more generic monitoring role, which does nothing but list monit and collectd as dependencies in its meta/main.yml file:

  - { role: monit }
  - { role: collectd }

Customize monitoring-related configuration files in other roles

A nice thing about both monit and collectd, and a main reason I chose them, is that they read configuration files from a directory called /etc/monit/conf.d for monit and /etc/collectd/collectd.conf.d for collectd. This makes it easy for each role to add its own configuration files. For example, the api role adds 2 files as custom checks in  /etc/monit/conf.d: check_api and check_nginx. It also adds 2 files as custom metric collectors in /etc/collectd/collectd.conf.d: nginx.conf and memcached.conf. The api role does this via a file called tasks/monitoring.yml file which gets included in tasks/main.yml.

As another example, the nginx role also adds its own check_nginx configuration file to /etc/monit/conf.d via a tasks/monitoring.yml file.

The rule of thumb I arrived at is this: each low-level role such as monit and collectd installs the common configuration files needed by all other roles, whereas each higher-level role such as api installs its own custom checks and metric collectors via a monitoring.yml task file. This way, it's easy to see at a glance what each high-level role does for monitoring: just look in its monitoring.yml task file.

To wrap this post up, here is an example of a playbook I use to build API servers:

$ cat api-servers.yml

- hosts: api-servers
  sudo: yes
    - base
    - tuning
    - postfix
    - monitoring
    - nginx
    - api

I call this playbook with:

$ ansible-playbook -i hosts/myhosts api-servers.yml

To make this work, the hosts/myhosts file has a section similar to this:


1 comment:

Miquel said...

Hi Grig,

so funny, so many years have passed since we worked together in open source stuff, and today I could say exactly the same sentence as you "At my previous job, I used Chef for configuration management. New job, new tools, so I decided to use ansible". And so on ;-)

(Though to be fair they were already using Ansible).

I will have a look at InfluxDB for sure.

Keep on posting!

Modifying EC2 security groups via AWS Lambda functions

One task that comes up again and again is adding, removing or updating source CIDR blocks in various security groups in an EC2 infrastructur...