Prometheus is a pull system: the monitoring server pulls data from its clients by hitting a special HTTP handler exposed by each client ("/metrics" by default) and retrieving a list of metrics from that handler. The output of /metrics is plain text, which makes it fairly easy for humans to parse as well and helps with troubleshooting.
Here's a subset of the OS-level metrics that are exposed by a client running the node_exporter Prometheus binary (and available when you hit http://client_ip_or_name:9100/metrics):
# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="guest"} 0
node_cpu{cpu="cpu0",mode="idle"} 2803.93
node_cpu{cpu="cpu0",mode="iowait"} 31.38
node_cpu{cpu="cpu0",mode="irq"} 0
node_cpu{cpu="cpu0",mode="nice"} 2.26
node_cpu{cpu="cpu0",mode="softirq"} 0.23
node_cpu{cpu="cpu0",mode="steal"} 21.16
node_cpu{cpu="cpu0",mode="system"} 25.84
node_cpu{cpu="cpu0",mode="user"} 79.94
# HELP node_disk_io_now The number of I/Os currently in progress.
# TYPE node_disk_io_now gauge
node_disk_io_now{device="xvda"} 0
# HELP node_disk_io_time_ms Milliseconds spent doing I/Os.
# TYPE node_disk_io_time_ms counter
node_disk_io_time_ms{device="xvda"} 44608
# HELP node_disk_io_time_weighted The weighted # of milliseconds spent doing I/Os. See https://www.kernel.org/doc/Documentation/iostats.txt.
# TYPE node_disk_io_time_weighted counter
node_disk_io_time_weighted{device="xvda"} 959264
There are many such "exporters" available for Prometheus, exposing metrics in the format expected by the Prometheus server from systems such as Apache, MySQL, PostgreSQL, HAProxy and many others (see a list here).
What drew me to Prometheus though was the fact that it allows for easy instrumentation of code by providing client libraries for many languages: Go, Java/Scala, Python, Ruby and others.
One of the main advantages of Prometheus over alternative systems such as Graphite is the rich query language that it provides. You can associate labels (arbitrary key/value pairs) with any metric, and you can then query the system by label. I'll show examples in this post. Here's a more in-depth comparison between Prometheus and Graphite.
Installation (on Ubuntu 14.04)
I put together an ansible role that is loosely based on Brian Brazil's demo_prometheus_ansible repo.
Check out my ansible-prometheus repo for this ansible role, which installs Prometheus, node_exporter and PromDash (a ruby-based dashboard builder). For people not familiar with ansible, most of the installation commands are in the install.yml task file. Here is the sequence of installation actions, in broad strokes.
For the Prometheus server:
- download prometheus-0.16.1.linux-amd64.tar.gz from https://github.com/prometheus/prometheus/releases/download
- extract tar.gz into /opt/prometheus/dist and link /opt/prometheus/prometheus-server to /opt/prometheus/dist/prometheus-0.16.1.linux-amd64
- create Prometheus configuration file from ansible template and drop it in /etc/prometheus/prometheus.yml (more on the config file later)
- create Prometheus default command-line options file from ansible template and drop it in /etc/default/prometheus
- create Upstart script for Prometheus in /etc/init/prometheus.conf:
start on startup
chdir /opt/prometheus/prometheus-server
script
./prometheus -config.file /etc/prometheus/prometheus.yml
end script
For node_exporter:
- download node_exporter-0.12.0rc1.linux-amd64.tar.gz from https://github.com/prometheus/node_exporter/releases/download
- extract tar.gz into /opt/prometheus/dist and move node_exporter binary to /opt/prometheus/bin/node_exporter
- create Upstart script for node_exporter in /etc/init/prometheus_node_exporter.conf:
# Run prometheus node_exporter
start on startup
script
/opt/prometheus/bin/node_exporter
end script
For PromDash:
- git clone from https://github.com/prometheus/promdash
- follow instructions in the Prometheus tutorial from Digital Ocean (can't stop myself from repeating that D.O. publishes the best technical tutorials out there!)
Here is a minimal Prometheus configuration file (/etc/prometheus/prometheus.yml):
global:
  scrape_interval: 30s
  evaluation_interval: 5s
scrape_configs:
  - job_name: 'prometheus'
    target_groups:
      - targets:
          - prometheus.example.com:9090
  - job_name: 'node'
    target_groups:
      - targets:
          - prometheus.example.com:9100
          - api01.example.com:9100
          - api02.example.com:9100
          - test-api01.example.com:9100
          - test-api02.example.com:9100
The configuration file format for Prometheus is well documented in the official docs. My example shows that the Prometheus server itself is monitored (or "scraped" in Prometheus parlance) on port 9090, and that OS metrics are also scraped from 5 clients which are running the node_exporter binary on port 9100, including the Prometheus server.
At this point, you can start Prometheus and node_exporter on your Prometheus server via Upstart:
# start prometheus
# start prometheus_node_exporter
Then you should be able to hit http://prometheus.example.com:9100 to see the metrics exposed by node_exporter, and more importantly http://prometheus.example.com:9090 to see the default Web console included in the Prometheus server. A demo page available from Robust Perception can be examined here.
Note that Prometheus also provides default Web consoles for node_exporter OS-level metrics. They are available at http://prometheus.example.com:9090/consoles/node.html (the ansible-prometheus role installs nginx and redirects http://prometheus.example.com:80 to the previous URL). The node consoles show CPU, Disk I/O and Memory graphs and also network traffic metrics for each client running node_exporter.
Working with the MySQL exporter
I installed the mysqld_exporter binary on my Prometheus server box.
# cd /opt/prometheus/dist
# git clone https://github.com/prometheus/mysqld_exporter.git
# cd mysqld_exporter
# make
Then I created a wrapper script I called run_mysqld_exporter.sh:
# cat run_mysqld_exporter.sh
#!/bin/bash
export DATA_SOURCE_NAME="dbuser:dbpassword@tcp(dbserver:3306)/dbname"; ./mysqld_exporter
Two important notes here:
1) Note the somewhat awkward format of the DATA_SOURCE_NAME environment variable. I tried many other formats, but only this one worked for me. The wrapper script's main purpose is to define this variable properly. With some of my other attempts, I got this error message:
INFO[0089] Error scraping global state: Default addr for network 'dbserver:3306' unknown file=mysqld_exporter.go line=697
You could also define this variable in ~/.bashrc but in that case it may clash with other Prometheus exporters (the one for PostgreSQL for example) which also need to define this variable.
2) Note that the dbuser specified in the DATA_SOURCE_NAME variable needs to have either SUPER or REPLICATION CLIENT privileges on the MySQL server you want to monitor. I ran a SQL statement to that effect.
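The exact statement will depend on your setup; something along these lines, with placeholder user, host and password values, does the job:
-- placeholder user, host and password; adjust to your environment
GRANT REPLICATION CLIENT ON *.* TO 'dbuser'@'%' IDENTIFIED BY 'dbpassword';
FLUSH PRIVILEGES;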
I created an Upstart init script I called /etc/init/prometheus_mysqld_exporter.conf:
# cat /etc/init/prometheus_mysqld_exporter.conf
# Run prometheus mysqld exporter
start on startup
chdir /opt/prometheus/dist/mysqld_exporter
script
./run_mysqld_exporter.sh
end script
I modified the Prometheus server configuration file (/etc/prometheus/prometheus.yml) and added a scrape job for the MySQL metrics:
- job_name: 'mysql'
  honor_labels: true
  target_groups:
    - targets:
        - prometheus.example.com:9104
I restarted the Prometheus server:
# stop prometheus
# start prometheus
Then I started up mysqld_exporter via Upstart:
# start prometheus_mysqld_exporter
If everything goes well, the metrics scraped from MySQL will be available at http://prometheus.example.com:9104/metrics.
Here are some of the available metrics:
# HELP mysql_global_status_innodb_data_reads Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_innodb_data_reads untyped
mysql_global_status_innodb_data_reads 12660
# HELP mysql_global_status_innodb_data_writes Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_innodb_data_writes untyped
mysql_global_status_innodb_data_writes 528790
# HELP mysql_global_status_innodb_data_written Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_innodb_data_written untyped
mysql_global_status_innodb_data_written 9.879318016e+09
# HELP mysql_global_status_innodb_dblwr_pages_written Generic metric from SHOW GLOBAL STATUS.
# TYPE mysql_global_status_innodb_dblwr_pages_written untyped
mysql_global_status_innodb_dblwr_pages_written 285184
# HELP mysql_global_status_innodb_row_ops_total Total number of MySQL InnoDB row operations.
# TYPE mysql_global_status_innodb_row_ops_total counter
mysql_global_status_innodb_row_ops_total{operation="deleted"} 14580
mysql_global_status_innodb_row_ops_total{operation="inserted"} 847656
mysql_global_status_innodb_row_ops_total{operation="read"} 8.1021419e+07
mysql_global_status_innodb_row_ops_total{operation="updated"} 35305
Most of the metrics exposed by mysqld_exporter are of type Counter, which means they always increase. A meaningful number to graph then is not their absolute value, but their rate of change. For example, for the mysql_global_status_innodb_row_ops_total metric, the rate of change of reads for the last 5 minutes (reads/sec) can be expressed as:
rate(mysql_global_status_innodb_row_ops_total{operation="read"}[5m])
This is also an example of a Prometheus query which filters by a specific label (in this case {operation="read"}).
A good way to get a feel for the metrics available to the Prometheus server is to go to the Web console and graphing tool available at http://prometheus.example.com:9090/graph. You can copy and paste the line above into the Expression edit box and click Execute. You should see the resulting graph in the Graph tab.
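Label-based aggregation works in the same expression browser too. For example, building on the node_cpu metric shown earlier, this query adds up the per-core rates and breaks the result down by CPU mode:
sum(rate(node_cpu[5m])) by (mode)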
It's important to familiarize yourself with the 4 types of metrics handled by Prometheus: Counter, Gauge, Histogram and Summary.
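To make the differences concrete, here is a minimal Go sketch (the metric names are made up purely for this example) that declares and updates one metric of each type with the client_golang library:

package main

import "github.com/prometheus/client_golang/prometheus"

// One metric of each Prometheus type; the names are made up for illustration only.
var (
    jobsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "myapp_jobs_total",
        Help: "Total jobs processed; a Counter only ever goes up.",
    })
    queueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "myapp_queue_depth",
        Help: "Current queue depth; a Gauge can go up and down.",
    })
    jobDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "myapp_job_duration_seconds",
        Help:    "Job durations counted into configurable buckets.",
        Buckets: []float64{0.1, 0.5, 1, 5},
    })
    jobDurationQuantiles = prometheus.NewSummary(prometheus.SummaryOpts{
        Name: "myapp_job_duration_quantiles",
        Help: "Job durations tracked as client-side quantiles.",
    })
)

func main() {
    prometheus.MustRegister(jobsTotal)
    prometheus.MustRegister(queueDepth)
    prometheus.MustRegister(jobDuration)
    prometheus.MustRegister(jobDurationQuantiles)

    // Typical updates for each type.
    jobsTotal.Inc()
    queueDepth.Set(3)
    jobDuration.Observe(0.42)
    jobDurationQuantiles.Observe(0.42)
}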
Working with the Postgres exporter
Although not an official Prometheus package, the Postgres exporter has worked just fine for me.
I installed the postgres_exporter binary on my Prometheus server box.
# cd /opt/prometheus/dist
# git clone https://github.com/wrouesnel/postgres_exporter.git
# cd postgres_exporter
# make
Then I created a wrapper script I called run_postgres_exporter.sh:
# cat run_postgres_exporter.sh
#!/bin/bash
export DATA_SOURCE_NAME="postgres://dbuser:dbpassword@dbserver/dbname"; ./postgres_exporter
Note that the format for DATA_SOURCE_NAME is a bit different from the MySQL format.
I created an Upstart init script I called /etc/init/prometheus_postgres_exporter.conf:
# cat /etc/init/prometheus_postgres_exporter.conf
# Run prometheus postgres exporter
start on startup
chdir /opt/prometheus/dist/postgres_exporter
script
./run_postgres_exporter.sh
end script
I modified the Prometheus server configuration file (/etc/prometheus/prometheus.yml) and added a scrape job for the Postgres metrics:
- job_name: 'postgres'
  honor_labels: true
  target_groups:
    - targets:
        - prometheus.example.com:9113
I restarted the Prometheus server:
# stop prometheus
# start prometheus
Then I started up postgres_exporter via Upstart:
# start prometheus_postgres_exporter
If everything goes well, the metrics scraped from Postgres will be available at http://prometheus.example.com:9113/metrics.
Here are some of the available metrics:
# HELP pg_stat_database_tup_fetched Number of rows fetched by queries in this database
# TYPE pg_stat_database_tup_fetched counter
pg_stat_database_tup_fetched{datid="1",datname="template1"} 7.730469e+06
pg_stat_database_tup_fetched{datid="12998",datname="template0"} 0
pg_stat_database_tup_fetched{datid="13003",datname="postgres"} 7.74208e+06
pg_stat_database_tup_fetched{datid="16740",datname="mydb"} 2.18194538e+08
# HELP pg_stat_database_tup_inserted Number of rows inserted by queries in this database
# TYPE pg_stat_database_tup_inserted counter
pg_stat_database_tup_inserted{datid="1",datname="template1"} 0
pg_stat_database_tup_inserted{datid="12998",datname="template0"} 0
pg_stat_database_tup_inserted{datid="13003",datname="postgres"} 0
pg_stat_database_tup_inserted{datid="16740",datname="mydb"} 3.5467483e+07
# HELP pg_stat_database_tup_returned Number of rows returned by queries in this database
# TYPE pg_stat_database_tup_returned counter
pg_stat_database_tup_returned{datid="1",datname="template1"} 6.41976558e+08
pg_stat_database_tup_returned{datid="12998",datname="template0"} 0
pg_stat_database_tup_returned{datid="13003",datname="postgres"} 6.42022129e+08
pg_stat_database_tup_returned{datid="16740",datname="mydb"} 7.114057378094e+12
# HELP pg_stat_database_tup_updated Number of rows updated by queries in this database
# TYPE pg_stat_database_tup_updated counter
pg_stat_database_tup_updated{datid="1",datname="template1"} 1
pg_stat_database_tup_updated{datid="12998",datname="template0"} 0
pg_stat_database_tup_updated{datid="13003",datname="postgres"} 1
pg_stat_database_tup_updated{datid="16740",datname="mydb"} 4351
These metrics are also of type Counter, so to generate meaningful graphs for them, you need to plot their rates. For example, to see the rate of rows returned per second from the database called mydb, you would plot this expression:
rate(pg_stat_database_tup_returned{datid="16740",datname="mydb"}[5m])
The Prometheus expression evaluator available at http://prometheus.example.com:9090/graph is again your friend. BTW, if you start typing pg_ in the expression field, you'll see a drop-down filled automatically with all the available metrics starting with pg_. Handy!
Working with the AWS CloudWatch exporter
This is one of the officially supported Prometheus exporters, used for graphing and alerting on AWS CloudWatch metrics. I installed it on the Prometheus server box. It's a Java app, so it needs a JDK installed, as well as Maven for building the app.
# cd /opt/prometheus/dist
# git clone https://github.com/prometheus/cloudwatch_exporter.git
# apt-get install maven2 openjdk-7-jdk
# cd cloudwatch_exporter
# mvn package
The cloudwatch_exporter app needs AWS credentials in order to connect to CloudWatch and read the metrics. Here's what I did:
- created an AWS IAM user called cloudwatch_ro and downloaded its access key and secret key
- created an AWS IAM custom policy called CloudWatchReadOnlyAccess-201511181031, which includes the default CloudWatchReadOnlyAccess policy (the custom policy is not strictly necessary and you can use the default one, but I preferred a custom one because I may need to make further edits to the policy)
- attached the CloudWatchReadOnlyAccess-201511181031 policy to the cloudwatch_ro user
- created a file called ~/.aws/credentials with the contents:
[default]
aws_access_key_id=ACCESS_KEY_FOR_USER_CLOUDWATCH_RO
aws_secret_access_key=SECRET_KEY_FOR_USER_CLOUDWATCH_RO
The cloudwatch_exporter app also needs a json file containing the CloudWatch metrics we want it to retrieve from AWS. Here is an example of ELB-related metrics I specified in a file called cloudwatch.json:
{
  "region": "us-west-2",
  "metrics": [
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "RequestCount",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "BackendConnectionErrors",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_Backend_2XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_Backend_4XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_Backend_5XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_ELB_4XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_ELB_5XX",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "SurgeQueueLength",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Maximum", "Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "SpilloverCount",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Sum"]},
    {"aws_namespace": "AWS/ELB", "aws_metric_name": "Latency",
     "aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
     "aws_dimension_select": {"LoadBalancerName": ["LB1", "LB2"]},
     "aws_statistics": ["Average"]}
  ]
}
"region": "us-west-2",
"metrics": [
{"aws_namespace": "AWS/ELB", "aws_metric_name": "RequestCount",
"aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
"aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
"aws_statistics": ["Sum"]},
{"aws_namespace": "AWS/ELB", "aws_metric_name": "BackendConnectionErrors",
"aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
"aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
"aws_statistics": ["Sum"]},
{"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_Backend_2XX",
"aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
"aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
"aws_statistics": ["Sum"]},
{"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_Backend_4XX",
"aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
"aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
"aws_statistics": ["Sum"]},
{"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_Backend_5XX",
"aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
"aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
"aws_statistics": ["Sum"]},
{"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_ELB_4XX",
"aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
"aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
"aws_statistics": ["Sum"]},
{"aws_namespace": "AWS/ELB", "aws_metric_name": "HTTPCode_ELB_5XX",
"aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
"aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
"aws_statistics": ["Sum"]},
{"aws_namespace": "AWS/ELB", "aws_metric_name": "SurgeQueueLength",
"aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
"aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
"aws_statistics": ["Maximum", "Sum"]},
{"aws_namespace": "AWS/ELB", "aws_metric_name": "SpilloverCount",
"aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
"aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
"aws_statistics": ["Sum"]},
{"aws_namespace": "AWS/ELB", "aws_metric_name": "Latency",
"aws_dimensions": ["AvailabilityZone", "LoadBalancerName"],
"aws_dimension_select": {"LoadBalancerName": [“LB1”, “LB2”]},
"aws_statistics": ["Average"]},
]
}
Note that you need to look up the exact syntax for each metric name, its dimensions and its preferred statistics in the AWS CloudWatch documentation. For ELB metrics, the documentation is here. The CloudWatch metric name corresponds to the cloudwatch_exporter JSON parameter aws_metric_name, the dimensions correspond to aws_dimensions, and the preferred statistics correspond to aws_statistics.
I modified the Prometheus server configuration file (/etc/prometheus/prometheus.yml) and added a scrape job for the CloudWatch metrics:
- job_name: 'cloudwatch'
  honor_labels: true
  target_groups:
    - targets:
        - prometheus.example.com:9106
I restarted the Prometheus server:
# stop prometheus
# start prometheus
I created an Upstart init script I called /etc/init/prometheus_cloudwatch_exporter.conf:
# cat /etc/init/prometheus_cloudwatch_exporter.conf
# Run prometheus cloudwatch exporter
start on startup
chdir /opt/prometheus/dist/cloudwatch_exporter
script
/usr/bin/java -jar target/cloudwatch_exporter-0.2-SNAPSHOT-jar-with-dependencies.jar 9106 cloudwatch.json
end script
Then I started up cloudwatch_exporter via Upstart:
# start prometheus_cloudwatch_exporter
If everything goes well, the metrics scraped from CloudWatch will be available at http://prometheus.example.com:9106/metrics.
Here are some of the available metrics:
# HELP aws_elb_request_count_sum CloudWatch metric AWS/ELB RequestCount Dimensions: [AvailabilityZone, LoadBalancerName] Statistic: Sum Unit: Count
# TYPE aws_elb_request_count_sum gauge
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2a",} 1.0
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2c",} 1.0
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2c",} 2.0
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2a",} 12.0
# HELP aws_elb_httpcode_backend_2_xx_sum CloudWatch metric AWS/ELB HTTPCode_Backend_2XX Dimensions: [AvailabilityZone, LoadBalancerName] Statistic: Sum Unit: Count
# TYPE aws_elb_httpcode_backend_2_xx_sum gauge
aws_elb_httpcode_backend_2_xx_sum{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2a",} 1.0
aws_elb_httpcode_backend_2_xx_sum{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2c",} 1.0
aws_elb_httpcode_backend_2_xx_sum{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2c",} 2.0
aws_elb_httpcode_backend_2_xx_sum{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2a",} 12.0
# HELP aws_elb_latency_average CloudWatch metric AWS/ELB Latency Dimensions: [AvailabilityZone, LoadBalancerName] Statistic: Average Unit: Seconds
# TYPE aws_elb_latency_average gauge
aws_elb_latency_average{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2a",} 0.5571935176849365
aws_elb_latency_average{job="aws_elb",load_balancer_name="LB1",availability_zone="us-west-2c",} 0.5089397430419922
aws_elb_latency_average{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2c",} 0.035556912422180176
aws_elb_latency_average{job="aws_elb",load_balancer_name="LB2",availability_zone="us-west-2a",} 0.0031794110933939614
Note that there are 3 labels available to query the metrics above: job, load_balancer_name and availability_zone.
If we specify something like aws_elb_request_count_sum{job="aws_elb"} in the expression evaluator at http://prometheus.example.com:9090/graph, we'll see 4 graphs, one for each load_balancer_name/availability_zone combination.
To see only graphs related to a specific load balancer, say LB1, we can specify an expression of the form:
aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB1"}
In this case, we'll see 2 graphs for LB1, one for each availability zone.
In order to see the request count across all availability zones for a specific load balancer, we need to apply the sum function: sum(aws_elb_request_count_sum{job="aws_elb",load_balancer_name="LB1"}) by (load_balancer_name)
In this case, we'll see one graph with the request count across the 2 availability zones pertaining to LB1.
If we want to graph all load balancers but only show one graph per balancer, summing all availability zones for each balancer, we would use an expression like this: sum(aws_elb_request_count_sum{job="aws_elb"}) by (load_balancer_name)
So in this case we'll see 2 graphs, one for LB1 and one for LB2, with each graph summing the request count across the availability zones for LB1 and LB2 respectively.
Note that in all the expressions above, since the job label has the value "aws_elb" common to all metrics, it can be dropped from the queries because it doesn't produce any useful filtering.
For other AWS CloudWatch metrics, consult the Amazon CloudWatch Namespaces, Dimensions and Metrics Reference.
Instrumenting Go code with Prometheus
For me, the most interesting feature of Prometheus is that it allows for easy instrumentation of your code. Instead of pushing metrics a la statsd and Graphite, a web app needs to implement a /metrics handler and use the Prometheus client library code to publish app-level metrics via that handler. The Prometheus server will then hit /metrics on the client and pull/scrape the metrics.
More specifics for Go code instrumentation
1) Declare and register Prometheus metrics in your code
I have the following 2 variables defined in an init.go file in a common package that gets imported in all of the webapp code:
var PrometheusHTTPRequestCount = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Namespace: "myapp",
        Name:      "http_request_count",
        Help:      "The number of HTTP requests.",
    },
    []string{"method", "type", "endpoint"},
)

var PrometheusHTTPRequestLatency = prometheus.NewSummaryVec(
    prometheus.SummaryOpts{
        Namespace: "myapp",
        Name:      "http_request_latency",
        Help:      "The latency of HTTP requests.",
    },
    []string{"method", "type", "endpoint"},
)
Note that the first metric is a CounterVec, which in the Prometheus client_golang library specifies a Counter metric that can also get labels associated with it. The labels in my case are "method", "type" and "endpoint". The purpose of this metric is to measure the HTTP request count. Since it's a Counter, it will increase monotonically, so for graphing purposes we'll need to plot its rate and not its absolute value.
The second metric is a SummaryVec, which in the client_golang library specifies a Summary metric with labels. I use the same labels as for the CounterVec metric. The purpose of this metric is to measure the HTTP request latency. Because it's a Summary, it will provide the sum of the measurements and their count, as well as quantiles for the measurements.
These 2 variables then get registered in the init function:
func init() {
    // Register Prometheus metric trackers
    prometheus.MustRegister(PrometheusHTTPRequestCount)
    prometheus.MustRegister(PrometheusHTTPRequestLatency)
}
2) Let Prometheus handle the /metrics endpoint
The GitHub README for client_golang shows the simplest way of doing this:
http.Handle("/metrics", prometheus.Handler())
http.ListenAndServe(":8080", nil)
However, most of the Go webapp code will rely on some sort of web framework, so YMMV. In our case, I had to insert the prometheus.Handler function as a variable pretty deep in our framework code in order to associate it with the /metrics endpoint.
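For what it's worth, a minimal, framework-free way of wiring this up (with net/http's ServeMux standing in for the real framework's router, and a made-up handler) looks roughly like this:

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
)

// findCustomersHandler is a stand-in for one of the app's real handlers.
func findCustomersHandler(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("ok"))
}

func main() {
    // Register the Prometheus handler next to the app's own routes,
    // so the server scrapes /metrics on the same port as the app.
    mux := http.NewServeMux()
    mux.Handle("/metrics", prometheus.Handler())
    mux.HandleFunc("/customers/find", findCustomersHandler)
    http.ListenAndServe(":8080", mux)
}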
3) Modify Prometheus metrics in your code
The final step in getting Prometheus to instrument your code is to modify the Prometheus metrics you registered by incrementing Counter variables and taking measurements for Summary variables in the appropriate places in your app. In my case, I increment PrometheusHTTPRequestCount in every HTTP handler in my webapp by calling its Inc() method. I also measure the HTTP latency, i.e. the time it took for the handler code to execute, and call the Observe() method on the PrometheusHTTPRequestLatency variable.
The values I associate with the "method", "type" and "endpoint" labels come from the endpoint URL associated with each instrumented handler. As an example, for an HTTP GET request to a URL such as http://api.example.com/customers/find, "method" is the HTTP method used in the request ("GET"), "type" is "customers", and "endpoint" is "/customers/find".
Here is the code I use for modifying the Prometheus metrics (R is an object/struct which represents the HTTP request):
// Modify Prometheus metrics
pkg, endpoint := common.SplitUrlForMonitoring(R.URL.Path)
method := R.Method
PrometheusHTTPRequestCount.WithLabelValues(method, pkg, endpoint).Inc()
PrometheusHTTPRequestLatency.WithLabelValues(method, pkg, endpoint).Observe(float64(elapsed) / float64(time.Millisecond))
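SplitUrlForMonitoring is specific to our codebase and isn't shown here in full; as a rough sketch, a helper like it might do little more than split the path on slashes (the real one also normalizes IDs and the like):

package common

import "strings"

// SplitUrlForMonitoring derives the "type" and "endpoint" label values from
// a request path such as /customers/find. This is an illustrative sketch
// only: it just splits the path on slashes.
func SplitUrlForMonitoring(path string) (pkg, endpoint string) {
    trimmed := strings.Trim(path, "/")
    parts := strings.Split(trimmed, "/")
    if parts[0] != "" {
        pkg = parts[0] // e.g. "customers"
    }
    endpoint = "/" + trimmed // e.g. "/customers/find"
    return pkg, endpoint
}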
4) Retrieving your metrics
Assuming your web app runs on port 8080, you'll need to modify the Prometheus server configuration file and add a scrape job for app-level metrics. I have something similar to this in /etc/prometheus/prometheus.yml:
- job_name: 'myapp-api'
  target_groups:
    - targets:
        - api01.example.com:8080
        - api02.example.com:8080
      labels:
        group: 'production'
    - targets:
        - test-api01.example.com:8080
        - test-api02.example.com:8080
      labels:
        group: 'test'
Note an extra label called "group" defined in the configuration file. It has the values "production" and "test" respectively, and allows for the filtering of Prometheus measurements by the environment of the monitored nodes.
Whenever the Prometheus configuration file gets modified, you need to restart the Prometheus server:
# stop prometheus
# start prometheus
At this point, the metrics scraped from the webapp servers will be available at http://api01.example.com:8080/metrics.
Here are some of the available metrics:
# HELP myapp_http_request_count The number of HTTP requests.
# TYPE myapp_http_request_count counter
myapp_http_request_count{endpoint="/merchant/register",method="GET",type="admin"} 2928
# HELP myapp_http_request_latency The latency of HTTP requests.
# TYPE myapp_http_request_latency summary
myapp_http_request_latency{endpoint="/merchant/register",method="GET",type="admin",quantile="0.5"} 31.284808
myapp_http_request_latency{endpoint="/merchant/register",method="GET",type="admin",quantile="0.9"} 33.353354
myapp_http_request_latency{endpoint="/merchant/register",method="GET",type="admin",quantile="0.99"} 33.353354
myapp_http_request_latency_sum{endpoint="/merchant/register",method="GET",type="admin"} 93606.57930099976
myapp_http_request_latency_count{endpoint="/merchant/register",method="GET",type="admin"} 2928
Note that myapp_http_request_count and myapp_http_request_latency_count show the same value for the method/type/endpoint combination in this example. You could argue that myapp_http_request_count is redundant in this case. There could be instances where you want to increment a counter without taking a measurement for the summary, so it's still useful to have both.
Also note that myapp_http_request_latency, being a summary, computes 3 different quantiles: 0.5, 0.9 and 0.99 (so 50%, 90% and 99% of the measurements respectively fall under the given numbers for the latencies).
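Since the summary also exposes the _sum and _count series shown above, you can derive an average latency as well; for example, the average over the last 5 minutes (in milliseconds here, given how the Observe call is made):
rate(myapp_http_request_latency_sum[5m]) / rate(myapp_http_request_latency_count[5m])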
5) Graphing your metrics with PromDash
The PromDash tool provides an easy way to create dashboards with a look and feel similar to Graphite. PromDash is available at http://prometheus.example.com:3000.
First you need to define a server by clicking on the Servers link up top, then entering a name ("prometheus") and the URL of the Prometheus server ("http://prometheus.example.com:9090/").
Then click on Dashboards up top, and create a new directory, which offers a way to group dashboards. You can call it something like "myapp". Now you can create a dashboard (you also need to select the directory it belongs to). Once you are in the Dashboard create/edit screen, you'll see one empty graph with the default title "Title".
When you hover over the header of the graph, you'll see other buttons available. You want to click on the 2nd button from the left, called Datasources, then click Add Expression. Note that the server field is already pre-filled. If you start typing myapp in the expression field, you should see the metrics exposed by your application (for example myapp_http_request_count and myapp_http_request_latency).
To properly graph a Counter-type metric, you need to plot its rate. Here is the expression I use to show the HTTP requests/second rate, measured over the last minute, for all the production endpoints in my webapp:
rate(myapp_http_request_count{group="production",job="myapp-api"}[1m])
(the job and group values correspond to what we specified in /etc/prometheus/prometheus.yml)
If you want to show the HTTP request/second rate for test endpoints of "admin" type, use this expression:
rate(myapp_http_request_count{group="test",job="myapp-api",type="admin"}[1m])
If you want to show the HTTP request/second rate for a specific production endpoint, use an expression similar to this:
rate(myapp_http_request_count{group="production",job="myapp-api",endpoint="/merchant/register",type="admin"}[1m])
Once you enter the expression you want, close the Datasources form (it will save everything). Also change the title by clicking on the button called "Graph and Axis Settings". In that form, you can also specify that you want the plot lines stacked as opposed to regular lines.
For latency metrics, you don't need to look at the rate. Instead, you can look at a specific quantile. Let's say you want to plot the 99% quantile for latencies observed across all production endpoints, for write operations (corresponding to HTTP methods which are not GET). Then you would use an expression like this:
myapp_http_request_latency{method!="GET",quantile="0.99",group="production",job="myapp-api"}
As with the HTTP request/second graphs, you can refine the latency queries by specifying a type, an endpoint, or both:
myapp_http_request_latency{method!="GET",quantile="0.99",group="production",type="admin",endpoint="/merchant/register",job="myapp-api"}
Wrapping up
I wanted to write this blog post so I don't forget all the stuff that was involved in setting up and using Prometheus. It's a lot, but it's also not that bad once you get the hang of it. In particular, the Prometheus server itself is remarkably easy to set up and maintain, a refreshing change from other monitoring systems I've used before.
One thing I haven't touched on is the alerting mechanism used in Prometheus. I haven't looked at that yet, since I'm still using a combination of Pingdom, monit and Jenkins for my alerting. I'll tackle Prometheus alerting in another blog post.
I really like Prometheus so far and I hope you'll give it a try!