What prompted this post was an incident we had in the very early hours of this past Tuesday, when we started to see a lot of packet loss, increased latency and timeouts between some of our servers hosted at a data center on the US East Coast, and some instances we have running in EC2, also in the US East region. The symptoms were increased error rates in some application calls that we were making from one back-end server cluster at the data center into another back-end cluster in EC2. These errors weren't affecting our customers too much, because all failed requests were posted to various queues and reprocessed.
There had also been network maintenance done that night within the data center, so we weren't sure initially whether our outbound connectivity into EC2 or general inbound connectivity into EC2 was the culprit. What was strange (and unexpected) too was that several EC2 availability zones seemed to be affected -- mostly us-east-1d, but we were also seeing increased latency and timeouts into 1b and 1a. That made it hard to decide whether the issue was with EC2 or with us.
Running traceroutes from different source machines (some being our home machines in California, another one being a Rackspace cloud server instance in Chicago) revealed that packet loss and increased latency occurred almost all the time at the same hop: a router within the Level 3 network upstream from the Amazon EC2 network. What was frustrating too was that the AWS Status dashboard showed everything absolutely green. Now you can argue that this wasn't necessarily an EC2 issue, but if I were Amazon I would like to monitor the major inbound network paths into my infrastructure -- especially when it has the potential to affect several availability zones at once.
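To make the traceroute comparison a bit more concrete, here is a minimal Python sketch of the reasoning. It assumes you have already parsed each traceroute (or mtr) run into a list of (router address, packet loss %) pairs. The function names, the 5% threshold and the sample addresses in the usage note are all made up for illustration; they are not data from our incident.

```python
from typing import Dict, List, Optional, Tuple

Hop = Tuple[str, float]  # (router address, packet loss percentage)


def first_lossy_hop(trace: List[Hop], threshold: float = 5.0) -> Optional[str]:
    """Return the first hop whose loss meets the threshold, or None if none does."""
    for addr, loss in trace:
        if loss >= threshold:
            return addr
    return None


def shared_culprit(traces: Dict[str, List[Hop]], threshold: float = 5.0) -> Optional[str]:
    """Return the hop where loss starts, but only if every source agrees on it."""
    firsts = {first_lossy_hop(trace, threshold) for trace in traces.values()}
    if len(firsts) == 1:
        return firsts.pop()  # None here means no source saw any loss
    return None  # sources disagree (or there were no traces at all)
```

If traces from California and Chicago both start losing packets at the same (hypothetical) router, say `198.51.100.7`, then `shared_culprit` returns that address. If they disagree, it returns `None`, and the problem is more likely local to one of the sources.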
This whole issue lasted approximately 3.5 hours, then it miraculously stopped. Presumably somebody fixed a defective router. Twitter reports from other people experiencing the exact same issue showed that it cleared up for them at the very same minute it cleared up for us.
This incident brought home a valuable point for me though: we needed more monitors than we had available. We were monitoring connectivity 1) within the data center, 2) within EC2, and 3) between our data center and EC2. However, we also needed to monitor 4) inbound connectivity into EC2 from sources outside of our data center infrastructure. Only by triangulating (for lack of a better term) our monitoring in this manner could we be sure which network path was to blame. Note that we already had Pingdom set up to monitor various URLs within our site, but like I said, the front-end stuff wasn't affected too much by that particular issue that night.
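The triangulation logic can be sketched in a few lines of Python. This is only an illustration of the reasoning, not what we actually run: the four path names and the `diagnose` function are hypothetical, and a real setup would feed it the pass/fail state of the corresponding monitoring checks.

```python
from typing import Dict

# Hypothetical names for the four monitored paths listed above.
PATHS = ("dc_internal", "ec2_internal", "dc_to_ec2", "external_to_ec2")


def diagnose(ok: Dict[str, bool]) -> str:
    """Guess which network path is at fault from the pass/fail state of each check."""
    if not ok["dc_internal"]:
        return "data center internal network"
    if not ok["ec2_internal"]:
        return "EC2 internal network"
    if not ok["dc_to_ec2"] and not ok["external_to_ec2"]:
        # Both our data center and the outside probes have trouble reaching EC2.
        return "inbound path into EC2 (EC2 edge or an upstream carrier)"
    if not ok["dc_to_ec2"]:
        # Outside probes reach EC2 fine, so suspect our own outbound side.
        return "data center outbound path"
    if not ok["external_to_ec2"]:
        return "path from the external probe (possibly the probe's own network)"
    return "no fault detected"
```

Without the fourth check, the third and fourth branches collapse into one, which is exactly the ambiguity we were stuck with that night.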
So... the next day we started up a small Rackspace cloud server in Chicago, and a small Linode VPS in Fremont, California, and we added them to our Nagios installation. We run the same exact checks from these servers into EC2 that we run from our data center into EC2. This makes network issues faster to troubleshoot, although unfortunately not easier to solve -- because we could be depending on a third party to solve them.
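For a rough idea of what such a check might look like, here is a minimal sketch of a TCP connect-time check that follows the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING) and 2 (CRITICAL). It is not our actual check: the function names and the default thresholds are assumptions picked for the example.

```python
import socket
import time
from typing import Tuple

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def classify(latency_s: float, warn_s: float, crit_s: float) -> int:
    """Map a measured latency to a Nagios state using two thresholds."""
    if latency_s >= crit_s:
        return CRITICAL
    if latency_s >= warn_s:
        return WARNING
    return OK


def tcp_connect_time(host: str, port: int, timeout_s: float) -> float:
    """Return the seconds taken to open (and immediately close) a TCP connection."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout_s):
        pass
    return time.monotonic() - start


def run_check(host: str, port: int, warn_s: float = 0.2, crit_s: float = 1.0) -> Tuple[int, str]:
    """Run one check and return (exit code, status line) in Nagios style."""
    try:
        latency = tcp_connect_time(host, port, timeout_s=crit_s)
    except OSError as exc:  # covers timeouts, refused connections, DNS failures
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable: {exc}"
    state = classify(latency, warn_s, crit_s)
    label = ("OK", "WARNING", "CRITICAL")[state]
    return state, f"{label} - connect to {host}:{port} took {latency * 1000:.0f} ms"
```

A thin wrapper script would call `run_check` with the EC2 host and port, print the status line and exit with the returned code. Running the identical script from the data center, Chicago and Fremont is what makes the results comparable.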
I guess a bigger point to make, other than ABM/Always Be Monitoring, is OYA/Own Your Availability (I didn't come up with this; I first saw it mentioned by the @fastip guys). To me, what this means is to deploy your infrastructure across multiple providers (data centers/clouds) so that you don't have a single point of failure at the provider level. This is obviously easier said than done... but we're working on it as far as our infrastructure goes.