Friday, September 16, 2011

Slow database? Check RAID battery!

Executive Summary: 

If your Dell database servers suddenly get slow and I/O seems sluggish, do yourself a favor and check whether the RAID battery is currently going through its 'relearning' cycle. If it is, the Write-Back cache policy is disabled and Write-Through is enabled -- as a result, writes become much slower than in normal operation.

Details:

This turns out to be a fairly well known problem with RAID controllers in Dell servers, specifically LSI controllers. The default mode of operation for the RAID battery is to periodically go through a so-called 'relearn cycle', where it discharges, then charges and recalibrates itself by finding the current charge. In this timeframe, as I mentioned, Write-Back is disabled and Write-Through is enabled.

For our MySQL servers, we have innodb_flush_log_at_trx_commit set to 1, which means that every commit is flushed to disk. As a consequence, Write-Through mode severely impacts the performance of writes to the database, because each flush has to wait for the physical disks instead of being acknowledged from the controller cache. A symptom is that CPU I/O wait is high, and the database gets sluggish. Pain all around.
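
One rough way to put a number on the I/O wait symptom, if you don't have iostat or vmstat handy, is to compare two snapshots of the aggregate "cpu" line in /proc/stat. The snippet below is a minimal sketch under that assumption; the function name iowait_pct is made up for illustration, and it relies on the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# Hypothetical helper: percentage of CPU time spent in iowait between two
# snapshots of the first line of /proc/stat (the aggregate "cpu" line).
iowait_pct() {
    # $1 = earlier "cpu ..." line, $2 = later "cpu ..." line
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Sum user..steal (fields 2-9); guest time is already part of user.
            n = (NF < 9 ? NF : 9)
            total = 0
            for (i = 2; i <= n; i++) total += $i
            io = $6                      # iowait is the 5th counter
        }
        NR == 1 { t0 = total; io0 = io }
        NR == 2 {
            dt = total - t0
            if (dt > 0) printf "%.1f\n", 100 * (io - io0) / dt
            else        print "0.0"
        }
    '
}

# Usage on a live box (5-second sample):
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   iowait_pct "$a" "$b"
```

A value that sits well above what the server normally shows, with no matching change in query load, is a hint to look at the storage layer rather than at MySQL.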

We started to experience this database slowness on 3 database servers at almost the same time. Two of them were configured as slaves, and one as master. The symptoms included high CPU I/O wait, slow queries on the master, and replication lag on the slaves. Nothing pointed to something specific to MySQL. We opened an emergency ticket with Percona and were fortunate to be assigned to Aurimas Mikalauskas, a Percona principal consultant and a MySQL/RAID hardware guru. It took him less than a minute to correctly diagnose the issue based on these symptoms. Now that we knew what the issue was, some Google searches turned up other articles and blog posts talking about it. It turns out one of the most frequently cited posts belongs to Robin Bowes, my ex-coworker from RIS Technology/Reliam! It also turns out Percona engineers blogged about this issue extensively (see this post which references other posts).

In any case, for future reference, here is what we did on all the servers that have the LSI MegaRaid controller (these servers are Dell C2100s in our case):

1) Install MegaCli utilities

I had a hard time finding these utilities, since the LSI support site doesn't seem to have them anymore. I found this blog post talking about a zip file containing the tools, then I googled the zip filename and I found an updated version on this Gentoo-related site. Then I followed the steps in the blog post above to extract the statically-linked binaries:

# apt-get install rpm2cpio
# mkdir megacli; cd megacli

# wget http://download.gocept.com/gentoo/mirror/distfiles/4.00.11_Linux_MegaCLI.zip
# unzip 4.00.11_Linux_MegaCLI.zip 
# unzip MegaCliLin.zip 
# rpm2cpio MegaCli-4.00.11-1.i386.rpm | cpio -idmv

At this point I had 2 statically-linked binaries called MegaCli and MegaCli64 in megacli/opt/MegaRAID/MegaCli.

2) Inspect event log for RAID controller to figure out what has been going on in that subsystem (this command was actually run by Aurimas during the troubleshooting he did):

# ./MegaCli64 -AdpEventLog -GetSinceReboot -f events.log -aALL
# cat events.log
...
Event Description: Time established as 09/09/11 15:27:18; (48 seconds since power on)
--
Time: Fri Sep 9 16:27:36 2011
Event Description: Battery temperature is normal
--
Time: Fri Sep 9 16:56:51 2011
Event Description: Battery started charging
--
Time: Fri Sep 9 17:08:46 2011
Event Description: Battery charge complete
--
Time: Sat Sep 10 19:54:16 2011
Event Description: Battery relearn will start in 4 days
--
Time: Mon Sep 12 19:53:46 2011
Event Description: Battery relearn will start in 2 day
--
Time: Tue Sep 13 19:54:36 2011
Event Description: Battery relearn will start in 1 day
--
Time: Wed Sep 14 14:54:16 2011
Event Description: Battery relearn will start in 5 hours
--
Time: Wed Sep 14 19:55:26 2011
Event Description: Battery relearn pending: Battery is under charge
--
Time: Wed Sep 14 19:55:26 2011
Event Description: Battery relearn started
--
Time: Wed Sep 14 19:55:29 2011
Event Description: BBU disabled; changing WB virtual disks to WT, Forced WB VDs are not affected
--
Time: Wed Sep 14 19:55:29 2011
Event Description: Policy change on VD 00/0 to [ID=00,dcp=01,ccp=00,ap=0,dc=0] from [ID=00,dcp=01,ccp=01,ap=0,dc=0]
Previous LD Properties
Current Cache Policy: 1
Default Cache Policy: 1
New LD Properties
Current Cache Policy: 0
Default Cache Policy: 1
--
Time: Wed Sep 14 19:56:31 2011
Event Description: Battery is discharging
--
Time: Wed Sep 14 19:56:31 2011
Event Description: Battery relearn in progress
--
Time: Wed Sep 14 22:43:21 2011
Event Description: Battery relearn completed
--
Time: Wed Sep 14 22:44:26 2011
Event Description: Battery started charging
--
Time: Wed Sep 14 23:53:46 2011
Event Description: BBU enabled; changing WT virtual disks to WB
--
Time: Wed Sep 14 23:53:46 2011
Event Description: Policy change on VD 00/0 to [ID=00,dcp=01,ccp=01,ap=0,dc=0] from [ID=00,dcp=01,ccp=00,ap=0,dc=0]
Previous LD Properties
Current Cache Policy: 0
Default Cache Policy: 1
New LD Properties
Current Cache Policy: 1
Default Cache Policy: 1


So as you can see, the battery relearn started at 19:55:26, then 3 seconds later the Write-Back policy was changed to Write-Through, and it stayed like this until 23:53:46, when it was changed back to Write-Back. This shows that the I/O was impacted for almost 4 hours. Luckily for us it was outside of our high-traffic period for the day, but it was still painful.
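
If you'd rather not do the timestamp arithmetic by eye, the Write-Through window can be pulled out of events.log with a little awk. This is a sketch, not a polished tool: the function names event_time and wt_window_seconds are made up, it assumes each "Time:" line precedes its "Event Description:" line exactly as in the excerpt above, it only looks at the first "BBU disabled" and the first "BBU enabled" event in the file, and it needs GNU date for the -d option.

```shell
# Hypothetical helper: print the timestamp of the first event whose
# description matches a pattern.
event_time() {
    # $1 = pattern to look for, $2 = event log file
    awk -v pat="$1" '
        /^Time: /                          { t = substr($0, 7) }
        /^Event Description: / && $0 ~ pat { print t; exit }
    ' "$2"
}

# Hypothetical helper: seconds between the switch to Write-Through
# ("BBU disabled") and the switch back to Write-Back ("BBU enabled").
wt_window_seconds() {
    # $1 = event log file
    start=$(event_time "BBU disabled" "$1")
    end=$(event_time "BBU enabled" "$1")
    [ -n "$start" ] && [ -n "$end" ] || return 1
    # -u keeps both conversions in the same timezone (GNU date assumed)
    echo $(( $(date -u -d "$end" +%s) - $(date -u -d "$start" +%s) ))
}

# Usage:
#   wt_window_seconds events.log
```

For the two events shown above (19:55:29 and 23:53:46) this works out to 14297 seconds, i.e. 3 hours, 58 minutes and 17 seconds.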

3) Disable autoLearnMode for the RAID battery

This is so we don't have this type of surprise in the future. The autoLearnMode variable is ON by default. You can see its current setting if you run this command:

# ./MegaCli64 -AdpBbuCmd -GetBbuStatus -a0

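We followed Robin's blog post (thanks, Robin!) and did the following. Somewhat counterintuitively, setting autoLearnMode to 1 is what disables the automatic relearn; 0 means enabled.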

# echo "autoLearnMode=1" > tmp.txt
# ./MegaCli64 -AdpBbuCmd -SetBbuProperties -f tmp.txt -a0

4) Force battery relearn cycle

It is still recommended to run the battery relearn cycle by hand from time to time, so we did it on all servers that are not yet in production. For the rest of the servers we'll do it at night, during a time frame when traffic is lowest. In the future, we'll take maintenance windows every N months (where N is probably 6 or 12) and force the relearn cycle.

Here's the command to force the relearn:

# ./MegaCli64 -AdpBbuCmd -BbuLearn -a0
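
If you'd rather not depend on someone remembering the maintenance window, one option is to let cron issue the same command during a quiet hour. The following is only a sketch under several assumptions that are not from our actual setup: that MegaCli64 has been copied to /opt/MegaRAID/MegaCli/, that adapter 0 is the one you care about, that /etc/cron.d/bbu-relearn is an acceptable file name, and that 03:00 on January 1 and July 1 is a low-traffic time for you.

```shell
# Hypothetical: schedule a forced relearn twice a year at 03:00.
# Adjust the path, adapter number and schedule to your own environment.
cat > /etc/cron.d/bbu-relearn <<'EOF'
0 3 1 1,7 * root /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -BbuLearn -a0
EOF
```

Keep in mind that writes will still be slow while the relearn runs, so the schedule has to fall inside a window where that is acceptable.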

For reference, LSI has good documentation for the MegaCli utilities on one of their KB sites. Another good reference is this Dell PERC cheatsheet.

I hope this will be a good troubleshooting guide for people faced with mysterious I/O slowness. Thanks again to Aurimas from Percona for his help. These guys are awesome!

Thursday, September 08, 2011

Setting up RAID1 with mdadm on a running Ubuntu server

For the last couple of weeks we've been working on setting up a bunch of Dell C2100 servers with Ubuntu 10.04. These servers come with 2 x 500 GB internal disks that can be set up in RAID1 with the on-board RAID controller. However, when we did that during the Ubuntu install, we never managed to get back into the OS after the initial reboot, or in some cases GRUB refused to write to the array (I think it was when we tried 11.04 in a move of desperation). To make matters worse, even software RAID via mdadm stopped working, with the servers dropping to the initramfs BusyBox prompt after the initial reboot. My guess is that it all stems from GRUB not writing properly to /dev/md0 (the root partition RAID-ed during the install) and instead writing to /dev/sda1 and /dev/sdb1. So we decided to install the root partition and the swap space on /dev/sda and leave /dev/sdb alone.

I started to look for articles on how to set up RAID1 post-install, when you have the OS installed on /dev/sda, and you want to add /dev/sdb to a RAID1 setup with mdadm. After some fruitless searches, I finally hit the jackpot with this article on HowtoForge written by Falko Timme. I really don't have much to add: just follow the instructions closely and it will work ;-) Kudos to Falko for a great resource.
