Last night we got paged again because of increased load on one of the Dell C2100s. The load average went up to 25, when it is typically between 1 and 2. It turned out one of the drives in the RAID10 array managed by the LSI controller was going bad. You would think the RAID array would be fine even with a bad drive, but the drive never went completely offline, so the controller kept trying to service it and failing. This reduced the I/O throughput on the server and made our database slow.
For my own reference, and hopefully for others out there, here's what we did to troubleshoot the issue. We used the MegaCli utilities (see this post for how to install them).
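(In a nutshell: LSI ships the utilities as an RPM, and after installation the binaries typically land under /opt/MegaRAID/MegaCli, which is where the commands below are run from. The exact file name and version will differ for your download; it looks roughly like this:)
# rpm -ivh MegaCli-8.02.21-1.noarch.rpm
# cd /opt/MegaRAID/MegaCli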
Check RAID event log for errors
# ./MegaCli64 -AdpEventLog -GetSinceReboot -f events.log -aALL
This will save all RAID-related events since the last reboot to events.log. In our case, we noticed these lines:
===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
Error: 3
seqNum: 0x00001654
Time: Fri Oct 28 04:45:26 2011
Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 0b(e0x0c/s11) Path
5000c50034731385, CDB: 2a 00 0b a1 4c 00 00 00 18 00, Sense: 6/29/02
Event Data:
===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
CDB Length: 10
CDB Data:
002a 0000 000b 00a1 004c 0000 0000 0000 0018 0000 0000 0000 0000 0000
0000 0000 Sense Length: 18
Sense Data:
0070 0000 0006 0000 0000 0000 0000 000a 0000 0000 0000 0000 0029 0002
0002 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
seqNum: 0x00001655
Time: Fri Oct 28 04:45:26 2011
Code: 0x00000060
Class: 1
Locale: 0x02
Event Description: Predictive failure: PD 0b(e0x0c/s11)
Event Data:
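Both entries above ('Unexpected sense' and 'Predictive failure') point at the drive in enclosure 12, slot 11. If the log is long, a quick way to surface entries like these is to grep events.log for the failure-related descriptions (a rough filter, not an exhaustive one):
# grep -E 'Predictive failure|Unexpected sense' events.log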
Check state of particular physical drive
You need the enclosure index (12 in the output above) and the slot number (11 in the output above). Then you run:
# ./MegaCli64 -pdInfo -PhysDrv [12:11] -aALL
In our case, we saw that Media Error Count was non-zero.
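To check the same counters across every drive at once, you can grep the full PDList output for the relevant fields (the field names are the same ones shown by pdInfo):
# ./MegaCli64 -PDList -aALL | grep -E 'Slot Number|Media Error Count|Other Error Count|Predictive Failure Count'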
Mark physical drive offline
We decided to mark the faulty drive offline, to see if that would take load off the RAID controller and improve I/O throughput on the system.
# ./MegaCli64 -PDOffline -PhysDrv [12:11] -aALL
Indeed, the I/O wait time and the load average on the system dropped back to normal levels.
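Note that if you plan to physically pull the drive, MegaCli also has commands to mark the drive as missing and to prepare it for removal (spin it down); the flags below are the ones documented in the MegaCli command reference, but verify them against your version before relying on them:
# ./MegaCli64 -PDMarkMissing -PhysDrv [12:11] -aALL
# ./MegaCli64 -PdPrpRmv -PhysDrv [12:11] -aALL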
Hot swap faulty drive
We had a spare drive on hand, which we hot-swapped with the faulty one. The new drive started rebuilding. You can check its status by running:
# ./MegaCli64 -pdInfo -PhysDrv [12:11] -aALL
Enclosure Device ID: 12
Slot Number: 11
Device Id: 13
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 931.512 GB [0x74706db0 Sectors]
Non Coerced Size: 931.012 GB [0x74606db0 Sectors]
Coerced Size: 931.0 GB [0x74600000 Sectors]
Firmware state: Rebuild
SAS Address(0): 0x5000c500347da135
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST31000424SS KS689WK4Z08J
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device
Exit Code: 0x00
Check progress of the rebuilding process
# ./MegaCli64 -PDRbld -ShowProg -PhysDrv [12:11] -aALL
Rebuild Progress on Device at Enclosure 12, Slot 11 Completed 47% in 88 Minutes.
Exit Code: 0x00
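If you want to keep an eye on the rebuild without re-running the command by hand, wrapping it in watch works fine (the 60-second interval is just an example):
# watch -n 60 './MegaCli64 -PDRbld -ShowProg -PhysDrv [12:11] -aALL'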
Check event log again
Once the rebuild is done, you can check the event log again, asking for example for the last 10 events:
# ./MegaCli64 -AdpEventLog -GetLatest 10 -f t1.log -aALL
In our case we saw something like this:
# cat t1.log
seqNum: 0x00001720
Time: Fri Oct 28 12:48:27 2011
Code: 0x000000f9
Class: 0
Locale: 0x01
Event Description: VD 00/0 is now OPTIMAL
Event Data:
===========
Target Id: 0
seqNum: 0x0000171f
Time: Fri Oct 28 12:48:27 2011
Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/0 from DEGRADED(2) to OPTIMAL(3)
Event Data:
===========
Target Id: 0
Previous state: 2
New state: 3
seqNum: 0x0000171e
Time: Fri Oct 28 12:48:27 2011
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from REBUILD(14) to ONLINE(18)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 20
New state: 24
seqNum: 0x0000171d
Time: Fri Oct 28 12:48:27 2011
Code: 0x00000064
Class: 0
Locale: 0x02
Event Description: Rebuild complete on PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
seqNum: 0x0000171c
Time: Fri Oct 28 12:48:27 2011
Code: 0x00000063
Class: 0
Locale: 0x02
Event Description: Rebuild complete on VD 00/0
Event Data:
===========
Target Id: 0
seqNum: 0x000016b7
Time: Fri Oct 28 08:55:42 2011
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from OFFLINE(10) to REBUILD(14)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 16
New state: 20
seqNum: 0x000016b6
Time: Fri Oct 28 08:55:42 2011
Code: 0x0000006a
Class: 0
Locale: 0x02
Event Description: Rebuild automatically started on PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
seqNum: 0x000016b5
Time: Fri Oct 28 08:55:42 2011
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from UNCONFIGURED_GOOD(0) to OFFLINE(10)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 0
New state: 16
seqNum: 0x000016b4
Time: Fri Oct 28 08:55:42 2011
Code: 0x000000f7
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0d(e0x0c/s11) Info: enclPd=0c, scsiType=0, portMap=00, sasAddr=5000c500347da135,0000000000000000
Event Data:
===========
Device ID: 13
Enclosure Device ID: 12
Enclosure Index: 1
Slot Number: 11
SAS Address 1: 5000c500347da135
SAS Address 2: 0
seqNum: 0x000016b3
Time: Fri Oct 28 08:55:42 2011
Code: 0x0000005b
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Here are some other useful troubleshooting commands:
Check firmware state across all physical drives
This is a useful command if you want to see at a glance the state of all physical drives attached to the RAID controller:
# ./MegaCli64 -PDList -a0 | grep Firmware
In the normal case, all drives should report Online. If any drive reports Failed, you have a problem.
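If you have a lot of drives, a quick way to summarize is to count the states (plain shell, nothing MegaCli-specific):
# ./MegaCli64 -PDList -a0 | grep 'Firmware state' | sort | uniq -c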
Check virtual drive state
You can also check the virtual drive state:
# ./MegaCli64 -LDInfo -L0 -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (Target Id: 0)
Name:
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:5.455 TB
State: Degraded
Stripe Size: 64 KB
Number Of Drives per span:2
Span Depth:6
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None
Exit Code: 0x00
Here you can see that the state is Degraded. Running the PDList command from above shows that one drive has failed:
# ./MegaCli64 -PDList -aALL | grep Firmware
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Failed
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
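If you have more than one virtual drive on the adapter, you can pass -Lall instead of -L0 to the same LDInfo command:
# ./MegaCli64 -LDInfo -Lall -aALL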
Our next step is to write a custom Nagios plugin to check for events that are out of the ordinary. A good indication of an error state is the transition from 'Previous state: 0' to 'New state: N' where N > 0, e.g.:
Previous state: 0
New state: 16
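A minimal sketch of what such a plugin might look like, flagging any physical drive whose firmware state is not Online (the MegaCli path and the state string are assumptions you would adjust for your setup; a real plugin would also parse the event log for state transitions):
#!/bin/bash
# Nagios-style check: CRITICAL if any physical drive reports a firmware
# state other than "Online" (path below assumes the default RPM install).
MEGACLI=/opt/MegaRAID/MegaCli/MegaCli64

BAD=$($MEGACLI -PDList -aALL | grep "Firmware state" | grep -cv "Online")

if [ "$BAD" -gt 0 ]; then
    echo "CRITICAL: $BAD physical drive(s) not Online"
    exit 2
fi
echo "OK: all physical drives Online"
exit 0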
Thanks to my colleague Marco Garcia for digging deep into the MegaCli documentation and finding some of these obscure commands.
Resources
- MegaCli command reference from LSI
- Perc RAID cheatsheet from ronaldbradford.com