Friday, October 28, 2011

More fun with the LSI MegaRAID controllers

As I mentioned in a previous post, we've had some issues with the LSI MegaRAID controllers on our Dell C2100 database servers. Previously we noticed periodic slowdowns of the databases caused by decreased I/O throughput; the culprit turned out to be the LSI RAID battery going through its relearning cycle.

Last night we got paged again for increased load on one of the Dell C2100s. The load average went up to 25, when typically it's between 1 and 2. It turned out one of the drives in the RAID10 array managed by the LSI controller was going bad. You would think the array would be fine even with a bad drive, but the drive never went completely offline, so the controller kept trying (and failing) to service it. This decreased the I/O throughput on the server and made our database slow.

For my own reference, and hopefully for others out there, here's what we did to troubleshoot the issue. We used the MegaCli utilities (see this post for how to install them).

Check RAID event log for errors

# ./MegaCli64 -AdpEventLog -GetSinceReboot -f events.log -aALL

This will save all RAID-related events since the last reboot to events.log. In our case, we noticed these lines:

===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
Error: 3

seqNum: 0x00001654
Time: Fri Oct 28 04:45:26 2011

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 0b(e0x0c/s11) Path
5000c50034731385, CDB: 2a 00 0b a1 4c 00 00 00 18 00, Sense: 6/29/02
Event Data:
===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
CDB Length: 10
CDB Data:
002a 0000 000b 00a1 004c 0000 0000 0000 0018 0000 0000 0000 0000 0000
0000 0000 Sense Length: 18
Sense Data:
0070 0000 0006 0000 0000 0000 0000 000a 0000 0000 0000 0000 0029 0002
0002 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x00001655

Time: Fri Oct 28 04:45:26 2011

Code: 0x00000060
Class: 1
Locale: 0x02
Event Description: Predictive failure: PD 0b(e0x0c/s11)
Event Data:



Check state of particular physical drive

You need the enclosure index (12 in the output above) and the slot number (11 in the output above). Then you run:

# ./MegaCli64 -pdInfo -PhysDrv [12:11] -aALL

In our case, we saw that Media Error Count was non-zero.
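
A quick way to scan these counters across all drives at once (reusing the PDList command shown later in this post) is:

# ./MegaCli64 -PDList -aALL | grep -E 'Media Error Count|Predictive Failure Count'

Any non-zero counts point at a drive worth watching.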

Mark physical drive offline

We decided to mark the faulty drive offline, to see if that would take the load off the RAID controller and improve I/O throughput on the system.

# ./MegaCli64 -PDOffline -PhysDrv [12:11] -aALL

Indeed, the I/O wait time and the load average on the system dropped back to normal levels.

Hot swap faulty drive

We had a spare drive, which we hot-swapped with the faulty one. The new drive automatically started to rebuild. You can check its status by running:

# ./MegaCli64 -pdInfo -PhysDrv [12:11] -aALL

Enclosure Device ID: 12
Slot Number: 11
Device Id: 13
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 931.512 GB [0x74706db0 Sectors]
Non Coerced Size: 931.012 GB [0x74606db0 Sectors]
Coerced Size: 931.0 GB [0x74600000 Sectors]
Firmware state: Rebuild
SAS Address(0): 0x5000c500347da135
SAS Address(1): 0x0
Connected Port Number: 0(path0) 
Inquiry Data: SEAGATE ST31000424SS    KS689WK4Z08J            
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None 
Device Speed: Unknown 
Link Speed: Unknown 
Media Type: Hard Disk Device


Exit Code: 0x00

Check progress of the rebuilding process

# ./MegaCli64 -PDRbld -ShowProg -PhysDrv [12:11] -aALL
                                     
Rebuild Progress on Device at Enclosure 12, Slot 11 Completed 47% in 88 Minutes.

Exit Code: 0x00

Check event log again

Once the rebuild is done, you can check the event log again, this time specifying, for example, that you only want to see the last 10 events:

# ./MegaCli64 -AdpEventLog -GetLatest 10 -f t1.log -aALL

In our case, we saw something like this:


# cat t1.log 



seqNum: 0x00001720
Time: Fri Oct 28 12:48:27 2011

Code: 0x000000f9
Class: 0
Locale: 0x01
Event Description: VD 00/0 is now OPTIMAL
Event Data:
===========
Target Id: 0


seqNum: 0x0000171f
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/0 from DEGRADED(2) to OPTIMAL(3)
Event Data:
===========
Target Id: 0
Previous state: 2
New state: 3


seqNum: 0x0000171e
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from REBUILD(14) to ONLINE(18)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 20
New state: 24


seqNum: 0x0000171d
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000064
Class: 0
Locale: 0x02
Event Description: Rebuild complete on PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11


seqNum: 0x0000171c
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000063
Class: 0
Locale: 0x02
Event Description: Rebuild complete on VD 00/0
Event Data:
===========
Target Id: 0


seqNum: 0x000016b7
Time: Fri Oct 28 08:55:42 2011

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from OFFLINE(10) to REBUILD(14)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 16
New state: 20


seqNum: 0x000016b6
Time: Fri Oct 28 08:55:42 2011

Code: 0x0000006a
Class: 0
Locale: 0x02
Event Description: Rebuild automatically started on PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11


seqNum: 0x000016b5
Time: Fri Oct 28 08:55:42 2011

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from UNCONFIGURED_GOOD(0) to OFFLINE(10)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 0
New state: 16


seqNum: 0x000016b4
Time: Fri Oct 28 08:55:42 2011

Code: 0x000000f7
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0d(e0x0c/s11) Info: enclPd=0c, scsiType=0, portMap=00, sasAddr=5000c500347da135,0000000000000000
Event Data:
===========
Device ID: 13
Enclosure Device ID: 12
Enclosure Index: 1
Slot Number: 11
SAS Address 1: 5000c500347da135
SAS Address 2: 0


seqNum: 0x000016b3
Time: Fri Oct 28 08:55:42 2011

Code: 0x0000005b
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11

Here are some other useful troubleshooting commands:

Check firmware state across all physical drives


This is a useful command if you want to see at a glance the state of all physical drives attached to the RAID controller:

# ./MegaCli64 -PDList -a0 | grep Firmware

In the normal case, all drives should report as Online. If any drive reports as Failed, you have a problem.
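
This check is easy to automate. Here is a minimal Nagios-style sketch of such a wrapper; it is not from our original troubleshooting session, the MegaCli64 path is an assumption, and some MegaCli versions report states like 'Online, Spun Up', hence the startswith comparison:

#!/usr/bin/env python
# Sketch of a Nagios-style check that flags any physical drive whose
# firmware state is not Online. The MegaCli64 path is an assumption.
import subprocess
import sys

MEGACLI = '/opt/MegaRAID/MegaCli/MegaCli64'

output = subprocess.Popen([MEGACLI, '-PDList', '-aALL'],
                          stdout=subprocess.PIPE).communicate()[0]
states = [line.split(':', 1)[1].strip()
          for line in output.decode('utf-8', 'replace').splitlines()
          if line.startswith('Firmware state:')]

bad = [s for s in states if not s.startswith('Online')]
if bad:
    print('CRITICAL: %d of %d drives not Online (%s)'
          % (len(bad), len(states), ', '.join(bad)))
    sys.exit(2)
print('OK: all %d drives Online' % len(states))
sys.exit(0)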

Check virtual drive state

You can also check the virtual drive state:

# ./MegaCli64 -LDInfo -L0 -aALL

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (Target Id: 0)
Name:
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:5.455 TB
State: Degraded
Stripe Size: 64 KB
Number Of Drives per span:2
Span Depth:6
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None


Exit Code: 0x00

Here you can see that the state is Degraded. Running the PDList command shown earlier confirms that one drive has failed:

# ./MegaCli64 -PDList -aALL | grep Firmware
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Failed
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online


Our next step is to write a custom Nagios plugin to check for events that are out of the ordinary. A good indication of an error state is the transition from 'Previous state: 0' to 'New state: N' where N > 0, e.g.:


Previous state: 0
New state: 16
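
Here is a minimal sketch of what such a plugin could look like. Treat it as a starting point only: the MegaCli64 path is an assumption, and so is the window of 50 events between check runs.

#!/usr/bin/env python
# Sketch of a Nagios-style check for suspicious MegaRAID state transitions.
# Assumptions: MegaCli64 lives at the path below, and the latest 50 events
# are a wide enough window between consecutive check runs.
import re
import subprocess
import sys

MEGACLI = '/opt/MegaRAID/MegaCli/MegaCli64'
LOGFILE = '/tmp/megaraid_events.log'

# Dump the latest 50 events to a file, just like we did manually above
subprocess.call([MEGACLI, '-AdpEventLog', '-GetLatest', '50',
                 '-f', LOGFILE, '-aALL'])

suspicious = []
prev_state = None
for line in open(LOGFILE):
    m = re.match(r'(Previous|New) state: (\d+)', line.strip())
    if not m:
        continue
    if m.group(1) == 'Previous':
        prev_state = int(m.group(2))
    elif prev_state == 0 and int(m.group(2)) > 0:
        suspicious.append(m.group(2))

if suspicious:
    print('CRITICAL: transitions from state 0 to state(s) %s'
          % ', '.join(suspicious))
    sys.exit(2)
print('OK: no suspicious state transitions in the latest events')
sys.exit(0)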

Thanks to my colleague Marco Garcia for digging deep into the MegaCli documentation and finding some of these obscure commands.


Thursday, October 06, 2011

Good email sending practices


I'm not going to call these 'best practices', but I hope they'll be useful if you're looking for ways to improve your email sending setup so as to maximize the odds that a message intended for a given recipient actually reaches that recipient's inbox.

  • Make sure your mail servers are not configured as open relays
    • This should go without saying, but it should still be your #1 concern
    • Use ACLs and only allow relaying from the IPs of your application servers
    • Check your servers by using a mail relay testing service
  • Make sure you have reverse DNS entries for the IP addresses you're sending mail from
    • This is another one of the oldies but goldies that you should double-check
  • Use DKIM
    • From the Wikipedia entry: DomainKeys Identified Mail (DKIM) is a method for associating a domain name to an email message, thereby allowing a person, role, or organization to claim some responsibility for the message. The association is set up by means of a digital signature which can be validated by recipients. Responsibility is claimed by a signer —independently of the message's actual authors or recipients— by adding a DKIM-Signature: field to the message's header. The verifier recovers the signer's public key using the DNS, and then verifies that the signature matches the actual message's content.
    • Check out dkim.org
    • Check out my blog post on setting up OpenDKIM with postfix
  • Use SPF
    • From the Wikipedia entry: Sender Policy Framework (SPF) is an email validation system designed to prevent email spam by detecting email spoofing, a common vulnerability, by verifying sender IP addresses. SPF allows administrators to specify which hosts are allowed to send mail from a given domain by creating a specific SPF record (or TXT record) in the Domain Name System (DNS). Mail exchangers use the DNS to check that mail from a given domain is being sent by a host sanctioned by that domain's administrators.
    • Some SPF-testing tools are available at this openspf.org page
  • Use anti-virus/anti-spam filtering tools as a precaution against sending out malicious content
    • list of such tools for postfix
    • for sendmail there are 'milters' such as MIMEDefang 
  • Monitor the mail queues on your mail servers
    • Watch for out-of-the-ordinary spikes or drops, which can mean that somebody is trying to exploit your mail subsystem
    • Use tools such as nagios, munin and ganglia to monitor and alert on your mail queues (see the sketch after this list)
  • Monitor the rate of your outgoing email
    • Again, spikes or drops should alert you to suspicious activity, or even to errors in your application that can have an adverse effect on your mail subsystem
    • Check out my blog post on visualizing mail logs although these days we are using Graphite to store and visualize these data points
  • If you have the budget, use deliverability and reputation monitoring services such as ReturnPath
    • These services typically monitor the IP addresses of your mail servers and register them with various major ISPs
    • They can alert you when users at major ISPs are complaining about emails originating from you (most likely marking them as spam)
    • They can also whitelist your mail server IPs with some ISPs
    • They can provide lists of mailboxes they maintain at major ISPs and allow you to send email to those mailboxes so you can see the percentage of your messages that reach the inbox, or are missing, or go to a spam folder
  • Again, if you have the budget (it's not that expensive), use a service such as Litmus which shows you how your email messages look in various mail clients on a variety of operating systems and mobile devices
  • If you don't have the budget, at least check that your mail server IPs are not blacklisted by various RBL sites
    • You can use a multi-RBL check tool such as this one
  • Make sure the headers of your messages match the type of email you're sending
    • For example, if your messages are MIME multipart, make sure you mark them as such in the Content-type headers. You should have a Content-type header showing 'multipart/related' followed by other headers showing various types of content such as 'multipart/alternative', 'text/plain' etc.
    • This is usually done via an API; if you're using Python, this can be done with something like
    • from email.mime.multipart import MIMEMultipart
      from email.mime.text import MIMEText

      # plaintext_body and html_body hold the plain-text and HTML renderings
      message = MIMEMultipart("related")
      message_alternative = MIMEMultipart("alternative")
      message.attach(message_alternative)
      message_alternative.attach(MIMEText(plaintext_body.encode('utf-8'), 'plain', 'utf-8'))
      message_alternative.attach(MIMEText(html_body.encode('utf-8'), 'html', 'utf-8'))
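
As an example of the queue monitoring recommended above, here is a minimal Nagios-style sketch for Postfix. It is not from the original post: it assumes postqueue is in the PATH, and the WARN/CRIT thresholds are made up, so tune them to your normal volume.

#!/usr/bin/env python
# Sketch of a Nagios-style check for Postfix queue depth.
# Assumptions: postqueue is in the PATH, and the WARN/CRIT
# thresholds below make sense for your normal mail volume.
import re
import subprocess
import sys

WARN, CRIT = 100, 500  # hypothetical queue-depth thresholds

# The last line of 'postqueue -p' looks like '-- 5 Kbytes in 3 Requests.'
output = subprocess.Popen(['postqueue', '-p'],
                          stdout=subprocess.PIPE).communicate()[0]
m = re.search(r'in (\d+) Request', output.decode('utf-8', 'replace'))
count = int(m.group(1)) if m else 0  # 'Mail queue is empty' -> 0

if count >= CRIT:
    print('CRITICAL: %d messages in the mail queue' % count)
    sys.exit(2)
if count >= WARN:
    print('WARNING: %d messages in the mail queue' % count)
    sys.exit(1)
print('OK: %d messages in the mail queue' % count)
sys.exit(0)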
      

Perhaps the most important recommendation I have: unless you really know what you're doing, look into outsourcing your email sending needs to a third-party service, preferably one that offers an API.
