1) You can split your servers into clusters for ease of metric aggregation.
2) Each node in a cluster needs to run gmond. On Ubuntu, you can run 'apt-get install ganglia-monitor' to install it. The config file is /etc/ganglia/gmond.conf. More on the config file in a minute.
3) Each node in a cluster can send its metrics to a designated node via UDP.
4) One server in your infrastructure can be configured as both the overall metric collection server, and as the web front-end. This server needs to run gmetad, which in Ubuntu can be installed via 'apt-get install gmetad'. Its config file is /etc/gmetad.conf.
Note that you can have a tree of gmetad nodes, with the root of the tree configured to actually display the metric graphs. I wanted to keep it simple, so I am running both gmetad and the Web interface on the same node.
5) The gmetad server periodically polls one or more nodes in each cluster and retrieves the metrics for that cluster. It displays them via a PHP web interface which can be found in the source distribution.
That's the architecture of Ganglia in a nutshell. The nice thing is that it's scalable. You split nodes into clusters, you designate one or more nodes in each cluster to gather metrics from all the other nodes, and you have one or more gmetad node(s) collecting the metrics from the designated nodes.
Now for the actual configuration. I have a cluster of DB servers, each running gmond. I also have another server called bak01 that I keep around for backup purposes. I configured each DB server to be part of a cluster called 'db'. I also configured each DB server to send the metrics collected by gmond to bak01 (via UDP on the non-default port of 8650). To do this, I have these entries in /etc/ganglia/gmond.conf on each DB server:
cluster {
  name = "db"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}
udp_send_channel {
  host = bak01
  port = 8650
}
On host bak01, I also defined a udp_recv_channel and a tcp_accept_channel:
udp_recv_channel {
  port = 8650
}
/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
}
The udp_recv_channel is necessary so bak01 can receive the metrics from the gmond nodes. The tcp_accept_channel is necessary so that bak01 can be contacted by the gmetad node.
That's it in terms of configuring gmond.
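If you push the same settings to many nodes, it can help to generate the gmond.conf fragments from a few parameters instead of editing them by hand. Here is a minimal Python sketch of that idea. The function names are hypothetical, and the output only covers the sections discussed above; the rest of the stock gmond.conf (globals, collection groups and so on) is assumed to stay as shipped.

```python
def render_sender_conf(cluster_name, collector_host, port=8650):
    """Return the gmond.conf sections for a node that reports to a collector.

    Hypothetical helper: it only renders the 'cluster' and
    'udp_send_channel' sections shown in the text.
    """
    return (
        "cluster {\n"
        f'  name = "{cluster_name}"\n'
        '  owner = "unspecified"\n'
        '  latlong = "unspecified"\n'
        '  url = "unspecified"\n'
        "}\n"
        "udp_send_channel {\n"
        f"  host = {collector_host}\n"
        f"  port = {port}\n"
        "}\n"
    )


def render_collector_conf(udp_port=8650, tcp_port=8649):
    """Return the extra gmond.conf sections needed on the collecting node."""
    return (
        "udp_recv_channel {\n"
        f"  port = {udp_port}\n"
        "}\n"
        "tcp_accept_channel {\n"
        f"  port = {tcp_port}\n"
        "}\n"
    )
```

Calling render_sender_conf("db", "bak01") gives the text for the DB servers, and render_collector_conf() gives the two extra channels for bak01.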
On the gmetad node, I made one modification to the default /etc/gmetad.conf file by specifying the cluster I want to collect metrics for, and the node where I want to collect the metrics from:
data_source "db" 60 bak01
I then restarted gmetad via '/etc/init.d/gmetad restart'.
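The data_source line has the form: a quoted name, an optional polling interval in seconds, then one or more hosts (each optionally followed by :port, with 8649 as the default). Listing several hosts for one source gives gmetad alternatives to fall back on if the first one does not answer. A small sketch of a helper that builds such a line follows; the function name is hypothetical and so is the second host, bak02.

```python
def data_source_line(name, hosts, interval=None):
    """Build a gmetad.conf data_source line.

    name     -- label for the data source (quoted in the output)
    hosts    -- list of "host" or "host:port" strings, tried in order
    interval -- optional polling interval in seconds
    """
    if not hosts:
        raise ValueError("at least one host is required")
    parts = ["data_source", f'"{name}"']
    if interval is not None:
        parts.append(str(interval))
    parts.extend(hosts)
    return " ".join(parts)


# The single-collector line for the setup in this post:
single = data_source_line("db", ["bak01"], interval=60)

# A variant with a second, hypothetical collector (bak02) as a fallback:
redundant = data_source_line("db", ["bak01:8649", "bak02:8649"], interval=60)
```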
Ideally, these instructions would get you to a state where you would be able to see the graphs for all the nodes in the cluster.
I automated the process of installing and configuring gmond on all the nodes via Fabric. Maybe it all happened too fast for the collecting node (bak01), because it wasn't collecting metrics correctly for some of the nodes. When I ran 'telnet localhost 8649' on bak01, the XML it returned listed some of the nodes with no metrics associated with them. My solution was to stop and start gmond on those nodes, and that kicked things off. Strange though...
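The telnet check can be scripted. gmond writes an XML dump on its tcp_accept_channel and then closes the connection; in that dump each node is a HOST element containing METRIC elements. The sketch below, with hypothetical function names, reads the dump and lists the hosts that have no METRIC children. It assumes it runs on the collecting node with the port used above (8649).

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the full XML dump that gmond writes on its tcp_accept_channel."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection once the dump is sent
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_without_metrics(xml_bytes):
    """Return the sorted names of HOST elements that contain no METRIC."""
    root = ET.fromstring(xml_bytes)
    return sorted(
        host.get("NAME", "")
        for host in root.iter("HOST")
        if host.find("METRIC") is None
    )
```

Running hosts_without_metrics(fetch_gmond_xml()) on bak01 would return the nodes whose gmond may need a restart; an empty list means every node is reporting at least one metric.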
In any case, my next step is to install all kinds of Ganglia plugins, especially related to MySQL, but also for more in-depth disk I/O metrics.