Friday, September 16, 2011

Slow database? Check RAID battery!

Executive Summary: 

If your Dell database servers suddenly get slow and I/O seems sluggish, do yourself a favor and check whether the RAID battery is going through its 'relearn' cycle. If it is, the Write-Back policy is disabled and Write-Through is enabled -- as a result, writes become much slower than in normal operation.

Details:

This turns out to be a fairly well-known problem with RAID controllers in Dell servers, specifically LSI controllers. By default, the RAID battery periodically goes through a so-called 'relearn cycle', in which it discharges completely, then recharges and recalibrates itself. During this window, as mentioned above, Write-Back is disabled and Write-Through is enabled.

For our MySQL servers, we have innodb_flush_log_at_trx_commit set to 1, which means that every commit is flushed to disk. As a consequence, the Write-Through mode severely impacts write performance on the database. A symptom is that CPU I/O wait is high and the database gets sluggish. Pain all around.
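For reference, the setting lives in my.cnf; here is a minimal sketch (the [mysqld] section name is standard, but the config file layout varies by distribution):

```ini
[mysqld]
# Flush and sync the InnoDB log at every transaction commit (full durability).
# With Write-Through forced on the controller, every one of these syncs
# has to hit the physical disks instead of the controller cache.
innodb_flush_log_at_trx_commit = 1
```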

We started to experience this database slowness on 3 database servers at almost the same time. Two of them were configured as slaves, and one as a master. The symptoms included high CPU I/O wait, slow queries on the master, and replication lag on the slaves. Nothing pointed to something specific to MySQL. We opened an emergency ticket with Percona and were fortunate to be assigned to Aurimas Mikalauskas, a Percona principal consultant and a MySQL/RAID hardware guru. It took him less than a minute to correctly diagnose the issue based on these symptoms. Now that we knew what the issue was, some Google searches turned up other articles and blog posts talking about it. One of the most frequently cited posts belongs to Robin Bowes, my ex-coworker from RIS Technology/Reliam! It also turns out Percona engineers have blogged about this issue extensively (see this post, which references other posts).

In any case, for future reference, here is what we did on all the servers that have the LSI MegaRaid controller (these servers are Dell C2100s in our case):

1) Install MegaCli utilities

I had a hard time finding these utilities, since the LSI support site doesn't seem to have them anymore. I found this blog post talking about a zip file containing the tools, then I googled the zip filename and I found an updated version on this Gentoo-related site. Then I followed the steps in the blog post above to extract the statically-linked binaries:

# apt-get install rpm2cpio
# mkdir megacli; cd megacli

# wget http://download.gocept.com/gentoo/mirror/distfiles/4.00.11_Linux_MegaCLI.zip
# unzip 4.00.11_Linux_MegaCLI.zip 
# unzip MegaCliLin.zip 
# rpm2cpio MegaCli-4.00.11-1.i386.rpm | cpio -idmv

At this point I had 2 statically-linked binaries called MegaCli and MegaCli64 in megacli/opt/MegaRAID/MegaCli.

2) Inspect event log for RAID controller to figure out what has been going on in that subsystem (this command was actually run by Aurimas during the troubleshooting he did):

# ./MegaCli64 -AdpEventLog -GetSinceReboot -f events.log -aALL
# cat events.log
...
Event Description: Time established as 09/09/11 15:27:18; (48 seconds since power on)
--
Time: Fri Sep 9 16:27:36 2011
Event Description: Battery temperature is normal
--
Time: Fri Sep 9 16:56:51 2011
Event Description: Battery started charging
--
Time: Fri Sep 9 17:08:46 2011
Event Description: Battery charge complete
--
Time: Sat Sep 10 19:54:16 2011
Event Description: Battery relearn will start in 4 days
--
Time: Mon Sep 12 19:53:46 2011
Event Description: Battery relearn will start in 2 day
--
Time: Tue Sep 13 19:54:36 2011
Event Description: Battery relearn will start in 1 day
--
Time: Wed Sep 14 14:54:16 2011
Event Description: Battery relearn will start in 5 hours
--
Time: Wed Sep 14 19:55:26 2011
Event Description: Battery relearn pending: Battery is under charge
--
Time: Wed Sep 14 19:55:26 2011
Event Description: Battery relearn started
--
Time: Wed Sep 14 19:55:29 2011
Event Description: BBU disabled; changing WB virtual disks to WT, Forced WB VDs are not affected
--
Time: Wed Sep 14 19:55:29 2011
Event Description: Policy change on VD 00/0 to [ID=00,dcp=01,ccp=00,ap=0,dc=0] from [ID=00,dcp=01,ccp=01,ap=0,dc=0]
Previous LD Properties
Current Cache Policy: 1
Default Cache Policy: 1
New LD Properties
Current Cache Policy: 0
Default Cache Policy: 1
--
Time: Wed Sep 14 19:56:31 2011
Event Description: Battery is discharging
--
Time: Wed Sep 14 19:56:31 2011
Event Description: Battery relearn in progress
--
Time: Wed Sep 14 22:43:21 2011
Event Description: Battery relearn completed
--
Time: Wed Sep 14 22:44:26 2011
Event Description: Battery started charging
--
Time: Wed Sep 14 23:53:46 2011
Event Description: BBU enabled; changing WT virtual disks to WB
--
Time: Wed Sep 14 23:53:46 2011
Event Description: Policy change on VD 00/0 to [ID=00,dcp=01,ccp=01,ap=0,dc=0] from [ID=00,dcp=01,ccp=00,ap=0,dc=0]
Previous LD Properties
Current Cache Policy: 0
Default Cache Policy: 1
New LD Properties
Current Cache Policy: 1
Default Cache Policy: 1


So as you can see, the battery relearn started at 19:55:26; 3 seconds later the Write-Back policy was changed to Write-Through, and it stayed that way until 23:53:46, when it was changed back to Write-Back. In other words, I/O was impacted for roughly 4 hours. Luckily for us this fell outside our high-traffic period for the day, but it was still painful.

3) Disable autoLearnMode for the RAID battery

This is so we don't get this type of surprise in the future. The autoLearnMode setting enables automatic relearning by default. You can see its current value by running this command:

# ./MegaCli64 -AdpBbuCmd -GetBbuStatus -a0

We followed Robin's blog post (thanks, Robin!) and set autoLearnMode to 1, which disables the automatic relearn cycle:

# echo "autoLearnMode=1" > tmp.txt
# ./MegaCli64 -AdpBbuCmd -SetBbuProperties -f tmp.txt -a0

4) Force battery relearn cycle

It is still recommended to periodically run the battery relearn cycle manually, so we did that on all servers not yet in production. For the rest of the servers we'll do it at night, during the time frame when traffic is lowest. Going forward, we'll take maintenance windows every N months (where N is probably 6 or 12) and force the relearn cycle.

Here's the command to force the relearn:

# ./MegaCli64 -AdpBbuCmd -BbuLearn -a0
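One way to automate the periodic manual relearn would be a cron entry; this is a hypothetical sketch (the schedule and the MegaCli64 install path are my assumptions -- adjust both to your maintenance window and install location):

```
# /etc/cron.d/raid-bbu-relearn (hypothetical)
# Force a BBU relearn at 3am on the 1st of the month, every 6 months.
0 3 1 */6 * root /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -BbuLearn -a0
```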

For reference, LSI has good documentation for the MegaCli utilities on one of their KB sites. Another good reference is this Dell PERC cheatsheet.

I hope this will be a good troubleshooting guide for people faced with mysterious I/O slowness. Thanks again to Aurimas from Percona for his help. These guys are awesome!

Thursday, September 08, 2011

Setting up RAID1 with mdadm on a running Ubuntu server

For the last couple of weeks we've been working on setting up a bunch of Dell C2100 servers with Ubuntu 10.04. These servers come with 2 x 500 GB internal disks that can be set up in RAID1 with the on-board RAID controller. However, when we did that during the Ubuntu install, we never managed to get back into the OS after the initial reboot, or in some cases GRUB refused to write to the array (I think that was when we tried 11.04 in desperation). To make matters worse, even software RAID via mdadm stopped working, with the servers dropping to the initramfs BusyBox prompt after the initial reboot. My guess is that it all stems from GRUB not writing properly to /dev/md0 (the root partition RAID-ed during the install) and instead writing to /dev/sda1 and /dev/sdb1. So we decided to install the root partition and the swap space on /dev/sda and leave /dev/sdb alone.

I started to look for articles on how to set up RAID1 post-install, when you have the OS installed on /dev/sda, and you want to add /dev/sdb to a RAID1 setup with mdadm. After some fruitless searches, I finally hit the jackpot with this article on HowtoForge written by Falko Timme. I really don't have much to add, just follow the instructions closely and it will work ;-) Kudos to Falko for a great resource.

Friday, August 19, 2011

New location for the Python Testing Tools Taxonomy

I was taken by surprise by Baiju Muthukadan's announcement which I read on Planet Python -- the Python Testing Tools Taxonomy page which I started years ago has a new incarnation on the Python wiki. I think it's a good thing (although I still wish I had been notified as a courtesy). In any case, feel free to add more tools to the page!

Wednesday, August 17, 2011

Anybody using lxc or OpenVZ in production?

I asked a similar question yesterday on Twitter ("Anybody using Linux Containers (lxc) in production, preferably with Ubuntu?") and it seemed to have struck a chord, because many people asked me to post the answers to this question, and many other people answered the question.

Both Linux Containers (or lxc, as the project is known) and OpenVZ are lightweight virtualization systems that operate at the operating system level, and as such they can be attractive to people looking to split a big physical server into containers while achieving per-container resource isolation. I personally want to look into both primarily as a means to run several MySQL instances per physical server while ensuring better resource isolation, especially in regards to RAM.

In any case, I thought it would be interesting to post the replies I got on Twitter to my question.

From AlTobey:

"I'm using straight cgroups without namespaces in production. It's pretty nice for fine-grained scheduler control."


From ohlol:

"I just began using lxc. Have three hosts in it so far as a test run. Not doing NAT, just plain bridging right now."


From vvuksan:

"I have been using it for about a week on my laptop to replace Vagrant/Virtualbox. Works great so far."

"I just posted a short write up of how I use LXC on my laptop http://t.co/CQXTPMv"

From ohlol:

"Have you tried lxc without libvirt? I found it to be a bit easier to deal with."

From vvuksan:

"Yes that is a red herring. You do not need libvirt. I had it installed already so went with it by default."

"It just helps me not have to set up dnsmasq, iptables etc. :-) But you can certainly do away with it."

From ohlol:
"Have you tried doing an apt-get upgrade in lxc? What a PITA :)"

"btw, if you ever get to that point: http://t.co/2WvYaeX helped get me to a working solution"

From ichilton:

"ive been using OpenVZ in production with Debian Stable (on the host and guests) for over a year with no problems...."

From griggheo:

"@ichilton you had to recompile the kernel for OpenVZ support in Debian right?"

From ichilton:

"I didn't, there was an OpenVZ kernel package but it was Lenny at the time and not upgraded yet - will have to check Squeeze."

From ichilton:

"@vvuksan interested why you did that originally and what the advantages are in hindsight?"

From vvuksan:

"Speed. The dev env needs 5-6 boxes running at the same time and with Vbox my laptop becomes really slow. With LXC not so much."

From sstatik:

"LXC should be considerably smoother in 11.10 for both 11.10/10.04 guests. I want to see laptop-based microclouds become common."

From mitchellh:

"@sstatik @griggheo Laptop based microclouds are the future. We're just missing quality software to help manage it."

From heckj:

"@sstatik @griggheo documentation and details getting better? its arcane to use in 11.04, and that is 1000x better than 10.x..."


So there you have it, a small snapshot of why and how people are using lxc/OpenVZ, especially on Ubuntu. I'll post my own experiences as I start playing with lxc and potentially OpenVZ.




Wednesday, July 27, 2011

Processing mail logs with Elastic MapReduce and Pig

These are some notes I took while trying out Elastic MapReduce (EMR), and more specifically its Pig functionality, by processing sendmail mail logs. A big help was Eric Lubow's blog post on EMR and Pig. Before I go into details, here's my general processing flow:

  • N mail servers (running sendmail) send their mail logs to a central server running syslog-ng.
  • A process running on the central logging server tails the aggregated mail log (at 5 minute intervals), parses the lines it finds, extracts relevant information from each line, and saves the output in JSON format to a local file (actually there are 2 types of files generated, one for sender information and one for recipient information, corresponding to the 'from' and 'to' lines in the mail log -- see below)
  • Another process compresses the generated files in bzip2 format and uploads them to S3.
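The parsing step above can be sketched in Python. This is a hypothetical, simplified version -- the exact sendmail log line format and the regex are my assumptions, with field names simply mirroring the JSON records shown below:

```python
import json
import re

# Hypothetical, simplified pattern for sendmail 'from=' lines; real logs vary.
FROM_RE = re.compile(
    r"(?P<month>\w+)\s+(?P<day>\d+)\s+(?P<time>[\d:]+)\s+"
    r"(?P<mailserver>\S+)\s+\w+\[(?P<pid>\d+)\]:\s+"
    r"(?P<sendmailid>\w+):\s+from=<(?P<src>[^>]*)>,\s+size=(?P<size>\d+),\s+"
    r"class=(?P<classnumber>\d+),\s+nrcpts=(?P<nrcpts>\d+),\s+"
    r"msgid=<(?P<msgid>[^>]*)>,.*relay=(?P<relay>\S+)"
)

def parse_from_line(line):
    """Return a dict of fields for a 'from' line, or None if it doesn't match."""
    m = FROM_RE.match(line)
    return m.groupdict() if m else None

line = ("Jul 12 20:53:00 mail5 sendmail[6229]: p6D0r0u1006229: "
        "from=<info@example.com>, size=57395, class=0, nrcpts=1, "
        "msgid=<WARQ...@example.com>, proto=ESMTP, relay=app03.example.com")
record = parse_from_line(line)
print(json.dumps(record))  # one JSON object per line, ready for the from-* files
```

Each parsed record is appended as one JSON line to the current from-* file; 'to' lines would get an analogous pattern for dest/stat/delay/xdelay.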
I have 2 sets of files, one set with names similar to "from-2011-07-12-20-58" and containing JSON records of the following form, one per line:

{"nrcpts": "1", "src": "info@example.com", "sendmailid": "p6D0r0u1006229", "relay": "app03.example.com", "classnumber": "0", "msgid": "WARQZCXAEMSSVWPPOOYZXRLQIKMFUY.155763@example.com", "pid": "6229", "month": "Jul", "time": "20:53:00", "day": "12", "mailserver": "mail5", "size": "57395"}

The second set contains files with names similar to "to-2011-07-12-20-58" and containing JSON records of the following form, one per line:

{"sendmailid": "p6D0qwvm006395", "relay": "gmail-smtp-in.l.google.com.", "dest": "somebody@gmail.com", "pid": "6406", "stat": "Sent (OK 1310518380 pd12si6025606vcb.162)", "month": "Jul", "delay": "00:00:02", "time": "20:53:00", "xdelay": "00:00:02", "day": "12", "mailserver": "mail2"}

For the initial EMR/Pig setup, I followed "Parsing Logs with Apache Pig and Elastic MapReduce". It's fairly simple to end up with an EC2 instance running Hadoop and Pig that you can play with.

I then ssh-ed into the EMR master instance (note that it was still shown in 'Waiting' state in the EMR console, but once it got assigned an IP and internal name I was able to ssh into it).

In order for Pig to be able to process input in JSON format, you need to use Kevin Weil's elephant-bird library. I followed Eric Lubow's post to get that set up:

$ mkdir git && mkdir pig-jars
$ cd git && wget --no-check-certificate https://github.com/kevinweil/elephant-bird/tarball/eb1.2.1_with_jsonloader
$ tar xvfz eb1.2.1_with_jsonloader
$ cd kevinweil-elephant-bird-ecf8356/
$ cp lib/google-collect-1.0.jar ~/pig-jars/
$ cp lib/json-simple-1.1.jar ~/pig-jars/
$ ant nonothing
$ cd build/classes/
$ jar -cf ../elephant-bird-1.2.1-SNAPSHOT.jar com
$ cp ../elephant-bird-1.2.1-SNAPSHOT.jar ~/pig-jars/

I then copied the 3 elephant-bird jar files to S3 so I could register them every time I run Pig. I did that via the grunt command prompt:

$ pig -x local

grunt> cp file:///home/hadoop/pig-jars/google-collect-1.0.jar s3://MY_S3_BUCKET/jars/pig/
grunt> cp file:///home/hadoop/pig-jars/json-simple-1.1.jar s3://MY_S3_BUCKET/jars/pig/                                               
grunt> cp file:///home/hadoop/pig-jars/elephant-bird-1.2.1-SNAPSHOT.jar s3://MY_S3_BUCKET/jars/pig/         
At this point, I was ready to process some of the files I uploaded to S3.

I first tried processing a single file, using Pig's local mode (which doesn't involve HDFS). It turns out that Pig doesn't load compressed files correctly via elephant-bird when you run in local mode, so I tested this on an uncompressed file previously uploaded to S3:

$ pig -x local

grunt> REGISTER s3://MY_S3_BUCKET/jars/pig/google-collect-1.0.jar;
grunt> REGISTER s3://MY_S3_BUCKET/jars/pig/json-simple-1.1.jar;
grunt> REGISTER s3://MY_S3_BUCKET/jars/pig/elephant-bird-1.2.1-SNAPSHOT.jar;
grunt> json = LOAD 's3://MY_S3_BUCKET/mail_logs/2011-07-12/to-2011-07-12-16-49' USING com.twitter.elephantbird.pig.load.JsonLoader();

Note that I used the JSON loader from the elephant-bird JAR file.

I wanted to know the top 3 mail servers from the file I loaded (this is again heavily inspired by Eric Lubow's example in his blog post):

grunt> mailservers = FOREACH json GENERATE $0#'mailserver' AS mailserver;
grunt> mailserver_count = FOREACH (GROUP mailservers BY $0) GENERATE $0, COUNT($1) AS cnt;
grunt> mailserver_sorted_count = LIMIT (ORDER mailserver_count BY cnt DESC) 3;
grunt> DUMP mailserver_sorted_count;

I won't go into detail as far as the actual Pig operations I ran -- I recommend going through some Pig Latin tutorials or buying the O'Reilly 'Programming Pig' book. Suffice to say that I extracted the 'mailserver' JSON field, then I grouped the records by mail server and counted how many there are in each group. Finally, I dumped the 3 top mail servers found.

Here's a slightly more interesting exercise: finding out the top 10 mail recipients by looking at all the to-* files uploaded to S3 (still uncompressed in this case):

grunt> to = LOAD 's3://MY_S3_BUCKET/mail_logs/2011-07-13/to*' USING com.twitter.elephantbird.pig.load.JsonLoader();
grunt> to_emails = FOREACH to GENERATE $0#'dest' AS dest;                                                                      
grunt> to_count = FOREACH (GROUP to_emails BY $0) GENERATE $0, COUNT($1) AS cnt;                                              
grunt> to_sorted_count = LIMIT(ORDER to_count BY cnt DESC) 10;                                                                                  
grunt> DUMP to_sorted_count;
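As a sanity check, the same top-N computation is easy to sketch in plain Python over a handful of JSON lines; this is purely illustrative (fake records, outside the EMR flow), but it mirrors the GROUP BY / COUNT / ORDER ... DESC / LIMIT pipeline above:

```python
import json
from collections import Counter

# A few fake 'to' records, one JSON object per line, as in the to-* files.
lines = [
    '{"dest": "alice@example.com"}',
    '{"dest": "bob@example.com"}',
    '{"dest": "alice@example.com"}',
]

# Group by 'dest', count each group, then take the N largest counts.
counts = Counter(json.loads(line)["dest"] for line in lines)
top = counts.most_common(2)
print(top)  # [('alice@example.com', 2), ('bob@example.com', 1)]
```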


I tried the same processing steps on bzip2-compressed files using Pig's Hadoop mode (which you invoke by just running 'pig' and not 'pig -x local'). The files were loaded correctly this time, but the MapReduce phase failed with messages similar to this in /mnt/var/log/apps/pig.log:

Pig Stack Trace
---------------
ERROR 6015: During execution, encountered a Hadoop error.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias to_sorted_count
at org.apache.pig.PigServer.openIterator(PigServer.java:482)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:546)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:374)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6015: During execution, encountered a Hadoop error.
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:862)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:474)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:109)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:255)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText
... 9 more

A quick Google search revealed JIRA Pig ticket #919, which offered a workaround. Basically, this happens when a value coming out of a map is used in a group/cogroup/join: by default the type of that value is bytearray, and you need to cast it to chararray to make things work (I confess I haven't dug too much into the nitty-gritty of this issue yet; I was just happy I made it work).

So what I had to do was to modify a single line and cast the value used in the GROUP BY clause to chararray:

grunt> to_count = FOREACH (GROUP to_emails BY (chararray)$0) GENERATE $0, COUNT($1) AS cnt;  

At this point, I was able to watch Elastic MapReduce in action, slower than in local mode because I only had 1 m1.small instance. I'll try it next with several instances and hopefully see a near-linear improvement.

That's it for now. This was just a toy example, but it got me started with EMR and Pig. Hopefully I'll follow up with more interesting log processing and analysis.

Friday, July 22, 2011

Results of a survey of the SoCal Piggies group

My colleague Warren Runk had the idea of putting together a survey to be sent to the mailing list of the SoCal Python Interest Group (aka SoCal Piggies), with the purpose of finding out which topics or activities would be most interesting to the members of the group in terms of future meetings. We had 10 topics in the survey, and people responded by choosing their top 5. We also had free-form response fields for 2 questions: "What do you like most about the meetings?" and "What meeting improvements are most important to you?".

We had 26 responses. Here are the vote results for the 10 topics we proposed:

#1 (18 votes): "Good practice, pitfall avoidance, and module introductions for beginners"

#2 (17 votes): "5 minute lightning talks"

#3 - #4 (15 votes): "Excellent code examples from established Python projects" and "New and upcoming Python open source projects"

#5 (14 votes): "30 minute presentations"

#6 (13 votes): "Ice breakers/new member introductions"

#7 (12 votes): "Algorithm discussions and dissections"

#8 (11 votes): "Good testing practices and pointers to new methods/tools"

#9 (10 votes): "Moderated relevant/cutting edge general tech discussions"

#10 (9 votes): "Short small group programming exercises"

It's pretty clear that people are most interested in good Python programming practices and practical examples of 'excellent' code from established projects. Presentations are popular too, with lightning talks edging out the longer 30-minute talks. A good percentage of the people attending our meetings are beginners, so we're going to try to focus on making our meetings more beginner-friendly.

As far as what people like most about the meetings, here are a few things:
  • "I love hearing about how Python is being used in multiple locations throughout large corporations.  It helps me to promote Python at every opportunity when I can say that Python is being used at Acme Corp for XYZ!"
  • "High level introductions to Python modules. Often this is not the main thrust of a  talk, but the speaker chose some module for a given task and that helps me expand my horizon."
  • "Becoming aware of how various companies use python, which libraries and tools are used most often, the opportunity to connect with members during breaks."
  • "I like being exposed to things I don't normally see at work, or if I've seen them I get to see them from a different angle."
  • "I don't have other geeks at my office so I like having the chance to hang out and get to know other Python programmers."
  • ...and many people expressed their satisfaction in seeing Raymond Hettinger's presentation at Disney Animation Studios (thanks to Paul Hildebrandt for putting that together!)
Here's what people said when asked about possible improvements:
  • "More best practices and module intros."
  • "Keep the meetings loose, don't have too many controls. "
  • "In addition to the aptly proposed "ice-breakers / introductions" how can we current members more-actively welcome beginners?"
  • "Time (and some format) to discuss the issues brought up in the talks. Sometimes I think it'd be useful for the group to get more directly involved in vetting/providing critique for some of the decisions a speaker made. Controversial points made in talks are great, but sometimes I think everyone might benefit from a few other perspectives."
  • "Friendlier onboarding of new members would be great."
  • "Keeping the total noobs in mind"
  • "I would like introductions. I have met a couple people at each of the meetings that I have attended, but I would also like to know who else is there."
  • "I would like the opportunity to meet resourceful programmers and learn techniques and abilities that I can't pick up from youtube or online tutorials!"
  • "I think we should try to come up with and stick with a consistent format. I like the discussion-style presentation so long as it does not detract from the topic at hand. I think we need to make sure that people stick with shorter presentations, so that there is plenty of time for Q&A without the risk of running on too long.  30 minutes should really be 30 minutes! "
  • "It would be good to identify the difficulty/skill level of a presentation ahead of time so that beginners are not scared off or at least know what they're getting into. Perhaps we could try to always mix it up by warming up with a beginner/intermediate preso and follow up with an intermediate/advanced."
I think you can see a theme here -- friendliness and attention towards beginners is a wish that many people have. I believe in the past we tended to ignore this side in our meetings, so we definitely need to do a better job at it.

We had a meeting last night where we discussed some of these topics. We tried to appoint point persons for given topics. These persons would be responsible for doing research on that topic (for example 'New and upcoming Python open source projects') and give a short presentation to the group at every meeting, while also looking for other group members to delegate this responsibility to in the future. I think this 'lieutenant' system will work well, but time will tell. My personal observation from the 7 years I've been organizing this group is that the hardest part is to get people to volunteer in any capacity, and most of all in presenting to the group. But this infusion of new ideas is very welcome, and I hope it will invigorate the participation in our group.

I hope the results of this survey and the feedback we got will be useful to other Python user groups out there.

I want to thank Warren Runk and Danny Greenfeld for their feedback, ideas and participation in making the SoCal Piggies Group better.

Wednesday, July 20, 2011

Accessing the data center from the cloud with OpenVPN

This post was inspired by a recent exercise I went through at the prompting of my colleague Dan Mesh. The goal was to have Amazon EC2 instances connect securely to servers at a data center using OpenVPN.

In this scenario, we have a server within the data center running OpenVPN in server mode. The server has a publicly accessible IP (via a firewall NAT) with port 1194 exposed via UDP. Cloud instances running OpenVPN in client mode connect to the server, get a route to an internal network within the data center pushed to them, and are then able to access servers on that internal network over a VPN tunnel.

Here are some concrete details about the network topology that I'm going to discuss.

Server A at the data center has an internal IP address of 10.10.10.10 and is part of the internal network 10.10.10.0/24. There is a NAT on the firewall mapping external IP X.Y.Z.W to the internal IP of server A. There is also a rule that allows UDP traffic on port 1194 to X.Y.Z.W.

I have an EC2 instance from which I want to reach server B on the internal data center network, with IP 10.10.10.20.

Install and configure OpenVPN on server A

Since server A is running Ubuntu (10.04 to be exact), I used this very good guide, with an important exception: I didn't want to configure the server in bridging mode, preferring the simpler tunneling mode. In bridging mode, the internal network which server A is part of (10.10.10.0/24 in my case) is directly exposed to OpenVPN clients. In tunneling mode, a tunnel is created between the clients and server A on a separate, dedicated network. I preferred the tunneling option because it doesn't require any modifications to the network setup of server A (no bridging interface required), and because it provides better security for my requirements (I can target individual servers on the internal network and configure them to be accessed via VPN). YMMV of course.

For the initial installation and key creation for OpenVPN, I followed the guide. When it came to configuring the OpenVPN server, I created these entries in /etc/openvpn/server.conf:

server 172.16.0.0 255.255.255.0
push "route 10.10.10.0 255.255.255.0"
tls-auth ta.key 0 

The first directive specifies that the OpenVPN tunnel will be established on a new 172.16.0.0/24 network. The server will get the IP 172.16.0.1, while OpenVPN clients that connect to it will get addresses such as 172.16.0.6, etc.

The second directive pushes a static route to the internal data center network 10.10.10.0/24 to all connected OpenVPN clients. This way each client will know how to get to machines on that internal network, without the need to create static routes manually on the client.

The tls-auth entry provides extra security to help prevent DoS attacks and UDP port flooding.

Note that I didn't have to include any bridging-related scripts or other information in server.conf.
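Putting the pieces together, a minimal tunneling-mode server.conf might look something like this; the certificate/key file names follow the Ubuntu guide's defaults and are assumptions on my part, so adjust them to your install:

```
port 1194
proto udp
dev tun
ca ca.crt
cert server.crt
key server.key
dh dh1024.pem
server 172.16.0.0 255.255.255.0
push "route 10.10.10.0 255.255.255.0"
tls-auth ta.key 0
```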

At this point, if you start the OpenVPN service on server A via 'service openvpn start', you should see an extra tun0 network interface when you run ifconfig. Something like this:


tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
          inet addr:172.16.0.1  P-t-P:172.16.0.2  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:168 (168.0 B)  TX bytes:168 (168.0 B)

Also, the routing information will now include the 172.16.0.0 network:

# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
172.16.0.2      0.0.0.0         255.255.255.255 UH        0 0          0 tun0
172.16.0.0      172.16.0.2      255.255.255.0   UG        0 0          0 tun0
...etc

Install and configure OpenVPN on clients

Here again I followed the Ubuntu OpenVPN guide. The steps are very simple:

1) apt-get install openvpn

2) scp the following files (which were created on the server during the OpenVPN server install process above) from server A to the client, into the /etc/openvpn directory: 

ca.crt
ta.key
client_hostname.crt 
client_hostname.key


3) Customize client.conf:

# cp /usr/share/doc/openvpn/examples/sample-config-files/client.conf /etc/openvpn

Edit client.conf and specify:

remote X.Y.Z.W 1194     (where X.Y.Z.W is the external IP of server A)

cert client_hostname.crt
key client_hostname.key
tls-auth ta.key 1

Now if you start the OpenVPN service on the client via 'service openvpn start', you should see a tun0 interface when you run ifconfig:


tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
          inet addr:172.16.0.6  P-t-P:172.16.0.5  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100 
          RX bytes:168 (168.0 B)  TX bytes:168 (168.0 B)

You should also see routing information related to both the tunneling network 172.16.0.0/24 and to the internal data center network 10.10.10.0/24 (which was pushed from the server):

# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
172.16.0.5      0.0.0.0         255.255.255.255 UH        0 0          0 tun0
172.16.0.1      172.16.0.5      255.255.255.255 UGH       0 0          0 tun0
10.10.10.0      172.16.0.5      255.255.255.0   UG        0 0          0 tun0
...etc

At this point, the client and server A should be able to ping each other on their 172.16 IP addresses. From the client you should be able to ping server A's IP 172.16.0.1, and from server A you should be able to ping the client's IP 172.16.0.6.

Create static route to tunneling network on server B and enable IP forwarding on server A

Remember that the goal was for the client to access server B on the internal data center network, with IP address 10.10.10.20. For this to happen, I needed to add a static route on server B to the tunneling network 172.16.0.0/24, with server A's IP 10.10.10.10 as the gateway:

# route add -net 172.16.0.0/24 gw 10.10.10.10

The final piece of the puzzle is to allow server A to act as a router at this point, by enabling IP forwarding (which is disabled by default). So on server A I did:

# sysctl -w net.ipv4.ip_forward=1
# echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf

At this point, I was able to access server B from the client by using server B's 10.10.10.20 IP address.
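A quick way to confirm that the client really can reach server B through the tunnel is a simple TCP reachability check. This is just a sketch; the 10.10.10.20 address comes from the setup above, but the port (3306 here) is a placeholder for whatever service you actually run on server B:

```python
import socket

def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except (socket.timeout, socket.error):
        return False

if __name__ == '__main__':
    # e.g. check server B's service port through the tunnel
    # (3306 is just a placeholder for your own setup)
    print(tcp_reachable('10.10.10.20', 3306))
```

The same check run before and after adding the static route on server B is a handy way to verify that routing and IP forwarding are both working.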

We've just started to experiment with this setup, so I'm not yet sure if it's production ready. I wanted to jot these notes down though, because these steps weren't necessarily obvious, despite some decent blog posts and the OpenVPN documentation. Hopefully they'll help somebody else out there too.



Thursday, June 30, 2011

A strategy for handling DNS in EC2 with Route 53

In my previous post I showed how to use the boto library to manage Route 53 DNS zones. Here I will show a strategy for handling DNS within an EC2 infrastructure using Route 53.

Let's assume you have a registered domain name called mycompanycloud.com. You want all your EC2 instances to use that domain name to communicate with each other. Assume you launch a database instance that you want to refer to as db01.mycompanycloud.com. To do that, add a CNAME record in the DNS zone for mycompanycloud.com pointing at the external AWS name assigned to that instance. For example:
# route53 add_record ZONEID db01.mycompanycloud.com CNAME ec2-51-10-11-89.compute-1.amazonaws.com 3600

The advantage of this method is that DNS queries for db01.mycompanycloud.com from within EC2 will eventually resolve the CNAME to the internal IP address of the instance, while DNS queries from outside EC2 will resolve it to the external IP address -- which is in general exactly what you want.
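You can observe this split behavior from any host with a small resolution check. A minimal sketch using the standard library (db01.mycompanycloud.com is of course the hypothetical name from the example above):

```python
import socket

def resolve_chain(name):
    """Resolve a hostname and return its canonical name and IP addresses.

    For a CNAME like db01.mycompanycloud.com pointing at an
    ec2-*.compute-1.amazonaws.com name, the canonical name resolves to
    the internal IP inside EC2 and the external IP outside EC2.
    """
    canonical, aliases, addresses = socket.gethostbyname_ex(name)
    return canonical, addresses

# e.g. resolve_chain('db01.mycompanycloud.com') run from inside and
# outside EC2 should return different addresses for the same record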

There's one more caveat: if you need the default DNS and search domain in /etc/resolv.conf to be mycompanycloud.com, you need to configure the DHCP client to use that domain, by adding this line to /etc/dhcp3/dhclient.conf:

supersede domain-name "mycompanycloud.com ec2.internal compute-1.internal" ;

Then edit/overwrite /etc/resolv.conf and specify:

nameserver 172.16.0.23
domain mycompanycloud.com
search mycompanycloud.com ec2.internal compute-1.internal

The line in dhclient.conf will ensure that your custom resolv.conf file will be preserved across reboots -- which is not usually the case in EC2 with the default DHCP behavior (thanks to Gerald Chao for pointing out this solution to me).

Of course, you should have all this in the Chef or Puppet recipes you use when you build out a new instance.
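The registration step itself can be scripted with boto (covered in my previous post). This is a sketch, not a drop-in recipe: the function names are mine, and the boto import is deferred so the naming helper works on its own:

```python
def fqdn(host, zone):
    """Build the fully-qualified record name Route 53 expects (trailing dot)."""
    return '%s.%s.' % (host, zone.rstrip('.'))

def register_instance(zone_id, host, zone, public_dns, ttl=3600):
    """Create a CNAME from host.zone to the instance's external AWS name.

    zone_id is the Route 53 hosted zone ID; credentials come from ~/.boto.
    """
    # imported here so fqdn() above is usable without boto installed
    from boto.route53.connection import Route53Connection
    from boto.route53.record import ResourceRecordSets

    conn = Route53Connection()
    changes = ResourceRecordSets(conn, zone_id)
    change = changes.add_change('CREATE', fqdn(host, zone), 'CNAME', ttl)
    change.add_value(public_dns)
    return changes.commit()

# e.g.: register_instance('ZONEID', 'db01', 'mycompanycloud.com',
#                         'ec2-51-10-11-89.compute-1.amazonaws.com')
```

A call like this fits naturally at the end of whatever Chef or Puppet bootstrap you run on a new instance.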

I've been applying this strategy for a while and it works out really well; it also means I don't have to run and maintain my own BIND servers in EC2.


Monday, June 20, 2011

Managing Amazon Route 53 DNS with boto

Here's a quick post that shows how to manage Amazon Route 53 DNS zones and records using the ever-useful boto library from Mitch Garnaat. Route 53 is a typical pay-as-you-go inexpensive AWS service which you can use to host your DNS zones. I wanted to play with it a bit, and some Google searches revealed two good blog posts: "Boto and Amazon Route53" by Chris Moyer and "Using boto to manage Route 53" by Rob Ballou. I want to thank those two guys for blogging about Route 53, their posts were a great help to me in figuring things out.

Install boto

My machine is running Ubuntu 10.04 with Python 2.6. I ran 'easy_install boto', which installed boto-2.0rc1. This also installs several utilities in /usr/local/bin; the one of interest for this article is /usr/local/bin/route53, which provides an easy command-line way of interacting with Route 53.

Create boto configuration file

I created ~/.boto containing the Credentials section with the AWS access key and secret key:
# cat ~/.boto
[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY


Interact with Route 53 via the route53 utility

If you just run 'route53', the command will print the help text for its usage. For our purpose, we'll make sure there are no errors when we run:

# route53 ls

If you don't have any DNS zones already created, this will return nothing.

Create a new DNS zone with route53

We'll create a zone called 'mytestzone.com':

# route53 create mytestzone.com
Pending, please add the following Name Servers:
 ns-674.awsdns-20.net
 ns-1285.awsdns-32.org
 ns-1986.awsdns-56.co.uk
 ns-3.awsdns-00.com

Note that you will have to properly register 'mytestzone.com' with a registrar, then point the name server information at that registrar to the name servers returned when the Route 53 zone was created (in our case the 4 name servers above).

At this point, if you run 'route53 ls' again, you should see your newly created zone. You need to make note of the zone ID:

root@m2:~# route53 ls
================================================================================
| ID:   MYZONEID
| Name: mytestzone.com.
| Ref:  my-ref-number
================================================================================
{}

You can also get the existing records from a given zone by running the 'route53 get' command which also takes the zone ID as an argument:

# route53 get MYZONEID
Name                                   Type  TTL                  Value(s)
mytestzone.com.                        NS    172800               ns-674.awsdns-20.net.,ns-1285.awsdns-32.org.,ns-1986.awsdns-56.co.uk.,ns-3.awsdns-00.com.
mytestzone.com.                        SOA   900                  ns-674.awsdns-20.net. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400

Adding and deleting DNS records using route53

Let's add an A record to the zone we just created. The route53 utility provides an 'add_record' command which takes the zone ID as an argument, followed by the name, type, value and TTL of the new record, and an optional comment. The TTL is also optional, and defaults to 600 seconds if not specified. Here's how to add an A record with a TTL of 3600 seconds:

# route53 add_record MYZONEID test.mytestzone.com A SOME_IP_ADDRESS 3600
{u'ChangeResourceRecordSetsResponse': {u'ChangeInfo': {u'Status': u'PENDING', u'SubmittedAt': u'2011-06-20T23:01:23.851Z', u'Id': u'/change/CJ2GH5O38HYKP0'}}}

Now if you run 'route53 get MYZONEID' you should see your newly added record.

To delete a record, use the 'route53 del_record' command, which takes the same arguments as add_record. Here's how to delete the record we just added:

# route53 del_record MYZONEID test.mytestzone.com. A SOME_IP_ADDRESS
{u'ChangeResourceRecordSetsResponse': {u'ChangeInfo': {u'Status': u'PENDING', u'SubmittedAt': u'2011-06-21T01:14:35.343Z', u'Id': u'/change/C2B0EHROD8HEG8'}}}

Managing Route 53 programmatically with boto

As useful as the route53 command-line utility is, sometimes you need to interact with the Route 53 service from within your program. Since this post is about boto, I'll show some Python code that uses the Route 53 functionality.

Here's how you open a connection to the Route 53 service:

from boto.route53.connection import Route53Connection
conn = Route53Connection()

(this assumes you have the AWS credentials in the ~/.boto configuration file)

Here's how you retrieve and walk through all your Route 53 DNS zones, selecting a zone by name:

ROUTE53_ZONE_NAME = "mytestzone.com."

zones = {}
conn = Route53Connection()

results = conn.get_all_hosted_zones()
zones = results['ListHostedZonesResponse']['HostedZones']
found = 0
for zone in zones:
    print zone
    if zone['Name'] == ROUTE53_ZONE_NAME:
        found = 1
        break
if not found:
    print "No Route53 zone found for %s" % ROUTE53_ZONE_NAME

(note that you need the ending period in the zone name that you're looking for, as in "mytestzone.com.")
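To avoid tripping over that trailing period, the lookup loop above can be wrapped in a small helper that normalizes the name first. A sketch (the function names are mine):

```python
def normalize(name):
    """Route 53 stores zone names with a trailing period; add one if missing."""
    return name if name.endswith('.') else name + '.'

def find_zone(conn, zone_name):
    """Return the hosted zone dict whose Name matches zone_name, or None."""
    results = conn.get_all_hosted_zones()
    for zone in results['ListHostedZonesResponse']['HostedZones']:
        if zone['Name'] == normalize(zone_name):
            return zone
    return None

# e.g.: zone = find_zone(conn, 'mytestzone.com')   # trailing dot optional
```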

Here's how you add a CNAME record with a TTL of 60 seconds to an existing zone (assuming the 'zone' variable contains the zone you're looking for). You need to operate on the zone ID, which is the identifier following the text '/hostedzone/' in the 'Id' field of the variable 'zone'.

from boto.route53.record import ResourceRecordSets
zone_id = zone['Id'].replace('/hostedzone/', '')
changes = ResourceRecordSets(conn, zone_id)
change = changes.add_change("CREATE", 'test2.%s' % ROUTE53_ZONE_NAME, "CNAME", 60)
change.add_value("some_other_name")
changes.commit()

To delete a record, you use the exact same code as above, but with "DELETE" instead of "CREATE".
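Since create and delete differ only in the action string, both can share one helper. This is a sketch around the same boto calls used above (the function names are mine; the boto import is deferred so the zone-ID helper works on its own):

```python
def zone_id_from(zone):
    """Extract the bare zone ID from the Id field of a zone dict."""
    return zone['Id'].replace('/hostedzone/', '')

def change_record(conn, zone, action, name, rtype, value, ttl=60):
    """Apply a single 'CREATE' or 'DELETE' change to a hosted zone.

    To delete a record, pass action='DELETE' with the exact same
    name/type/value/ttl the record was created with.
    """
    from boto.route53.record import ResourceRecordSets  # deferred import
    changes = ResourceRecordSets(conn, zone_id_from(zone))
    change = changes.add_change(action, name, rtype, ttl)
    change.add_value(value)
    return changes.commit()

# e.g.: change_record(conn, zone, 'CREATE', 'test2.mytestzone.com.',
#                     'CNAME', 'some_other_name')
```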

I leave other uses of the 'route53' utility and of the boto Route 53 API as an exercise to the reader.

Wednesday, June 01, 2011

Technical books that influenced my career

Here's a list of 25 technical books that had a strong influence on my career, presented in a somewhat chronological order of my encounters with them:

  1. "The Art of Computer Programming", esp. vol. 3 "Sorting and Searching" - Donald Knuth
  2. "Operating Systems" - William Stallings
  3. "Introduction to Algorithms" - Thomas Cormen et al.
  4. "The C Programming Language" - Brian Kernighan and Dennis Ritchie
  5. "Programming Windows" - Charles Petzold
  6. "Writing Solid Code" - Steve Maguire
  7. "The Practice of Programming" - Brian Kernighan and Rob Pike
  8. "Computer Networks - a Systems Approach" - Larry Peterson and Bruce Davie
  9. "TCP/IP Illustrated" - W. Richard Stevens
  10. "Distributed Systems - Concepts And Design" - George Coulouris et al.
  11. "DNS and BIND" - Cricket Liu and Paul Albitz
  12. "UNIX and Linux System Administration Handbook" - Evi Nemeth et al.
  13. "The Mythical Man-Month" - Fred Brooks
  14. "Programming Perl" - Larry Wall et al.
  15. "Counter Hack Reloaded: a Step-by-Step Guide to Computer Attacks and Effective Defenses" - Edward Skoudis and Tom Liston
  16. "Programming Python" - Mark Lutz
  17. "Lessons Learned in Software Testing" - Cem Kaner, James Bach, Bret Pettichord
  18. "Refactoring - Improving the Design of Existing Code" - Martin Fowler
  19. "The Pragmatic Programmer" - Andrew Hunt and David Thomas
  20. "Becoming a Technical Leader" - Gerald Weinberg
  21. "Extreme Programming Explained" - Kent Beck
  22. "Programming Amazon Web Services" - James Murty
  23. "Building Scalable Web Sites" - Cal Henderson
  24. "RESTful Web Services" - Leonard Richardson, Sam Ruby
  25. "The Art of Capacity Planning" - John Allspaw
What is your list?
