Tuesday, September 21, 2010

Quick note on installing and configuring Ganglia

I decided to give Ganglia a try to see if I like its metric visualizations and its plugins better than Munin's. I am still in the very early stages of evaluating it. However, I already banged my head against the wall trying to understand how to configure it properly. Here are some quick notes:

1) You can split your servers into clusters for ease of metric aggregation.

2) Each node in a cluster needs to run gmond. In Ubuntu, you can do 'apt-get install ganglia-monitor' to install it. The config file is /etc/ganglia/gmond.conf. More on the config file in a minute.

3) Each node in a cluster can send its metrics to a designated node via UDP.

4) One server in your infrastructure can be configured as both the overall metric collection server, and as the web front-end. This server needs to run gmetad, which in Ubuntu can be installed via 'apt-get install gmetad'. Its config file is /etc/gmetad.conf.

Note that you can have a tree of gmetad nodes, with the root of the tree configured to actually display the metric graphs. I wanted to keep it simple, so I am running both gmetad and the Web interface on the same node.
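
If you do build such a tree, my understanding is that a higher-level gmetad simply lists a lower-level gmetad as one of its data sources, polling the XML port that gmetad itself serves (8651 by default) instead of a gmond port. A hypothetical entry in the parent's /etc/gmetad.conf might look like this (the host name is made up):

data_source "child grid" 60 child-gmetad-host:8651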

5) The gmetad server periodically polls one or more nodes in each cluster and retrieves the metrics for that cluster. It displays them via a PHP web interface which can be found in the source distribution.
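
From what I can tell, deploying the front-end is basically a matter of copying the web directory from the source tarball into your Apache document root; the paths below are examples, not necessarily what your setup will look like:

# copy the PHP front-end from the Ganglia source tree into the Apache docroot
cp -r ganglia-3.1.x/web /var/www/ganglia
# the graphs should then be visible at http://<gmetad-host>/ganglia/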

That's about it in a nutshell in terms of the architecture of Ganglia. The nice thing is that it's scalable. You split nodes into clusters, you designate one or more nodes in each cluster to gather metrics from all the other nodes, and you have one or more gmetad nodes collecting the metrics from the designated nodes.

Now for the actual configuration. I have a cluster of DB servers, each running gmond. I also have another server called bak01 that I keep around for backup purposes. I configured each DB server to be part of a cluster called 'db'. I also configured each DB server to send the metrics collected by gmond to bak01 (via UDP on the non-default port of 8650). To do this, I have these entries in /etc/ganglia/gmond.conf on each DB server:


cluster {
  name = "db"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}

udp_send_channel {
  host = bak01
  port = 8650
}

On host bak01, I also defined a udp_recv_channel and a tcp_accept_channel:

udp_recv_channel {
  port = 8650
}

/* You can specify as many tcp_accept_channels as you like to share 
   an xml description of the state of the cluster */ 
tcp_accept_channel {
  port = 8649
}

The udp_recv_channel is necessary so that bak01 can receive the metrics from the gmond nodes. The tcp_accept_channel is necessary so that bak01 can be contacted by the gmetad node.

That's it in terms of configuring gmond.
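
A quick way to sanity-check the gmond side is to query bak01's tcp_accept_channel and look for one <HOST> element per DB server in the XML dump. Something like this (the init script name is what the Ubuntu package uses, if I remember correctly):

# on each DB node, restart gmond after editing gmond.conf
/etc/init.d/ganglia-monitor restart
# then, from any host that can reach bak01, dump the cluster state XML
nc bak01 8649 | grep '<HOST '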

On the gmetad node, I made one modification to the default /etc/gmetad.conf file, specifying the cluster I want to collect metrics for and the node I want to collect them from:

data_source "db" 60 bak01
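
One thing worth knowing about data_source lines: the number is the polling interval in seconds, and you can list more than one host, in which case gmetad treats the extra hosts as failover sources and tries them in order. A hypothetical entry with one of the DB nodes (here called db01) acting as a backup collector would be:

data_source "db" 60 bak01 db01:8649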

I then restarted gmetad via '/etc/init.d/gmetad restart'.
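
Before even looking at the web interface, you can check that gmetad is actually pulling data. If I remember the defaults correctly, gmetad serves its own aggregated XML on port 8651 and writes its round-robin databases under /var/lib/ganglia/rrds, so on the gmetad node something like this should show activity:

# aggregated XML served by gmetad itself
nc localhost 8651 | grep '<CLUSTER '
# RRD files get created per cluster and per host
ls /var/lib/ganglia/rrds/db/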

Ideally, these instructions get you to a state where you can see the graphs for all the nodes in the cluster in the web interface.

I automated the process of installing and configuring gmond on all the nodes via fabric. Maybe it all happened too fast for the collecting node (bak01), because it wasn't collecting metrics correctly for some of the nodes. I noticed that if I did 'telnet localhost 8649' on bak01, some of the nodes had no metrics associated with them. My solution was to stop and start gmond on those nodes, and that kicked things off. Strange though...
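
For reference, here is a minimal sketch of what such a fabric task might look like; the host names are placeholders and this is not my exact script, but it captures the idea of installing the package, pushing the customized gmond.conf shown above, and restarting gmond:

from fabric.api import env, put, sudo

# hypothetical DB node names
env.hosts = ['db01', 'db02', 'db03']

def install_gmond():
    # install the gmond package on the node
    sudo('apt-get -y install ganglia-monitor')
    # push the customized gmond.conf (cluster name + udp_send_channel pointing at bak01)
    put('gmond.conf', '/tmp/gmond.conf')
    sudo('mv /tmp/gmond.conf /etc/ganglia/gmond.conf')
    # restart gmond so it picks up the new config
    sudo('/etc/init.d/ganglia-monitor restart')

You would then run it with 'fab install_gmond' against all the DB nodes.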

In any case, my next step is to install all kinds of Ganglia plugins, especially related to MySQL, but also for more in-depth disk I/O metrics.

5 comments:

Mark Seger said...

hey greg - I'm guessing you're the one who posted a tweet about collectl the other day (can't hide from google ;)). Anyhow, are you aware that you can use collectl to feed data to ganglia? Collect the details on each system and pass a summary to ganglia.

Ganglia is pretty good, with 3 or 4 exceptions:
- if you lose the network, you lose the data when you may most need it
- the data cannot be gathered frequently enough to deal with subtle problems (I take 5 to 10 second samples)
- if you try plotting the data, rrd lies to you and normalizes the data under the covers. I want to see it all every time!

-mark

Grig Gheorghiu said...

Hi Mark -- yes I did post a tweet on collectl. And my intention is to use collectl in conjunction with ganglia, for disk stats as a start. So expect another blog post on that soon ;-)

Grig

Mark Seger said...

great...

Since you're obviously interested in disk stats, here's an experiment you may find useful: log some data locally, then generate a plot file (play back the data with -P -f) and use colplot from collectl-utils to plot the data for your individual disks. Or you could always load it into a spreadsheet, but depending on the number of disks you could end up with a lot of columns.

If you have a lot of data you can't do this with rrd because of the normalization effect I mentioned earlier, but gnuplot (which colplot uses) dutifully plots every bit of data with no modification.

-mark

Grig Gheorghiu said...

Mark - thanks for the tip, I'll definitely try that.

0.3E9m/s said...

Could you say which documentation you used to learn how to set up your Ganglia configuration? The old IBM wiki, the ganglia.info wiki, or the stuff on github? Did you have to do any troubleshooting, or did it all "just work" out of the box?
thanks
