Friday, October 28, 2011

More fun with the LSI MegaRAID controllers

As I mentioned in a previous post, we've had some issues with the LSI MegaRAID controllers on our Dell C2100 database servers. Previously we noticed periodic slow-downs of the databases caused by decreased I/O throughput, which turned out to be the LSI RAID battery going through its relearning cycle.

Last night we got paged again because of increased load on one of the Dell C2100s. The load average went up to 25, when typically it's between 1 and 2. It turned out one of the drives in the RAID10 array managed by the LSI controller was going bad. You would think the RAID array would be OK even with a bad drive, but the drive didn't go completely offline, so the controller kept trying to service it and failing. This decreased the I/O throughput on the server and made our database slow.

For my own reference, and hopefully for others out there, here's what we did to troubleshoot the issue. We used the MegaCli utilities (see this post for how to install them).

Check RAID event log for errors

# ./MegaCli64 -AdpEventLog -GetSinceReboot -f events.log -aALL

This will save all RAID-related events since the last reboot to events.log. In our case, we noticed these lines:

===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
Error: 3

seqNum: 0x00001654
Time: Fri Oct 28 04:45:26 2011

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 0b(e0x0c/s11) Path
5000c50034731385, CDB: 2a 00 0b a1 4c 00 00 00 18 00, Sense: 6/29/02
Event Data:
===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
CDB Length: 10
CDB Data:
002a 0000 000b 00a1 004c 0000 0000 0000 0018 0000 0000 0000 0000 0000
0000 0000 Sense Length: 18
Sense Data:
0070 0000 0006 0000 0000 0000 0000 000a 0000 0000 0000 0000 0029 0002
0002 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x00001655

Time: Fri Oct 28 04:45:26 2011

Code: 0x00000060
Class: 1
Locale: 0x02
Event Description: Predictive failure: PD 0b(e0x0c/s11)
Event Data:
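
Since events.log can get fairly long, a quick way to pull out just the worrying entries and the drive they refer to is to grep for the event descriptions with some trailing context. The patterns below are just the ones that mattered in our case; adjust to taste:

# egrep -A 8 'Predictive failure|Unexpected sense' events.log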



Check state of particular physical drive

You need the enclosure ID (12 in the output above) and the slot number (11 in the output above), which you pass as [enclosure:slot]. Then you run:

# ./MegaCli64 -pdInfo -PhysDrv [12:11] -aALL

In our case, we saw that Media Error Count was non-zero.
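
If all you care about are the error counters and the firmware state, you can filter the output down to just those fields (the field names are the ones that appear in the full pdInfo output shown further below):

# ./MegaCli64 -pdInfo -PhysDrv [12:11] -aALL | egrep 'Error Count|Predictive Failure|Firmware state'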

Mark physical drive offline

We decided to mark the faulty drive offline, to see if that takes work off of the RAID controller and improves I/O throughput in the system.

# ./MegaCli64 -PDOffline -PhysDrv [12:11] -aALL

Indeed, the I/O wait time and the load average on the system dropped back to normal levels.
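
To confirm the improvement, we simply watched the extended I/O statistics and the load average for a few minutes with standard tools (iostat is part of the sysstat package):

# iostat -x 5
# uptime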

Hot swap faulty drive

We had a spare drive, which we hot-swapped in for the faulty one. The new drive started rebuilding. You can see its status if you run:

# ./MegaCli64 -pdInfo -PhysDrv [12:11] -aALL

Enclosure Device ID: 12
Slot Number: 11
Device Id: 13
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 931.512 GB [0x74706db0 Sectors]
Non Coerced Size: 931.012 GB [0x74606db0 Sectors]
Coerced Size: 931.0 GB [0x74600000 Sectors]
Firmware state: Rebuild
SAS Address(0): 0x5000c500347da135
SAS Address(1): 0x0
Connected Port Number: 0(path0) 
Inquiry Data: SEAGATE ST31000424SS    KS689WK4Z08J            
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None 
Device Speed: Unknown 
Link Speed: Unknown 
Media Type: Hard Disk Device


Exit Code: 0x00
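
Note that the rebuild competes with regular I/O for the array. If the database is still struggling while the rebuild runs, the controller's rebuild rate is worth looking at; as far as we can tell it can be queried and lowered like this (it's a percentage; a lower value favors host I/O over the rebuild; check the MegaCli help for the exact syntax on your version):

# ./MegaCli64 -AdpGetProp RebuildRate -aALL
# ./MegaCli64 -AdpSetProp RebuildRate -30 -aALL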

Check progress of the rebuilding process

# ./MegaCli64 -PDRbld -ShowProg -PhysDrv [12:11] -aALL
                                     
Rebuild Progress on Device at Enclosure 12, Slot 11 Completed 47% in 88 Minutes.

Exit Code: 0x00
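
If you don't want to rerun that command by hand, a simple loop will keep printing the progress line (the 5-minute interval is arbitrary):

# while true; do ./MegaCli64 -PDRbld -ShowProg -PhysDrv [12:11] -aALL | grep 'Rebuild Progress'; sleep 300; done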

Check event log again

Once the rebuild is done, you can check the event log again, for example by asking for the last 10 events:

# ./MegaCli64 -AdpEventLog -GetLatest 10 -f t1.log -aALL

In our case we saw something like this:


# cat t1.log 



seqNum: 0x00001720
Time: Fri Oct 28 12:48:27 2011

Code: 0x000000f9
Class: 0
Locale: 0x01
Event Description: VD 00/0 is now OPTIMAL
Event Data:
===========
Target Id: 0


seqNum: 0x0000171f
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/0 from DEGRADED(2) to OPTIMAL(3)
Event Data:
===========
Target Id: 0
Previous state: 2
New state: 3


seqNum: 0x0000171e
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from REBUILD(14) to ONLINE(18)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 20
New state: 24


seqNum: 0x0000171d
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000064
Class: 0
Locale: 0x02
Event Description: Rebuild complete on PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11


seqNum: 0x0000171c
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000063
Class: 0
Locale: 0x02
Event Description: Rebuild complete on VD 00/0
Event Data:
===========
Target Id: 0


seqNum: 0x000016b7
Time: Fri Oct 28 08:55:42 2011

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from OFFLINE(10) to REBUILD(14)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 16
New state: 20


seqNum: 0x000016b6
Time: Fri Oct 28 08:55:42 2011

Code: 0x0000006a
Class: 0
Locale: 0x02
Event Description: Rebuild automatically started on PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11


seqNum: 0x000016b5
Time: Fri Oct 28 08:55:42 2011

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from UNCONFIGURED_GOOD(0) to OFFLINE(10)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 0
New state: 16


seqNum: 0x000016b4
Time: Fri Oct 28 08:55:42 2011

Code: 0x000000f7
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0d(e0x0c/s11) Info: enclPd=0c, scsiType=0, portMap=00, sasAddr=5000c500347da135,0000000000000000
Event Data:
===========
Device ID: 13
Enclosure Device ID: 12
Enclosure Index: 1
Slot Number: 11
SAS Address 1: 5000c500347da135
SAS Address 2: 0


seqNum: 0x000016b3
Time: Fri Oct 28 08:55:42 2011

Code: 0x0000005b
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11

Here are some other useful troubleshooting commands:

Check firmware state across all physical drives


This is a useful command if you want to see at a glance the state of all physical drives attached to the RAID controller:

# ./MegaCli64 -PDList -a0 | grep Firmware

In the normal case, all drives should report 'Online'. If any drive reports 'Failed', you have a problem.
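
For a quick pass/fail answer, counting the drives that do not report Online works as a one-liner; 0 means everything is healthy (note that hot spares or rebuilding drives would also show up in the count):

# ./MegaCli64 -PDList -a0 | grep 'Firmware state' | grep -vc Online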

Check virtual drive state

You can also check the virtual drive state:

# ./MegaCli64 -LDInfo -L0 -aALL

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (Target Id: 0)
Name:
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:5.455 TB
State: Degraded
Stripe Size: 64 KB
Number Of Drives per span:2
Span Depth:6
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None


Exit Code: 0x00

Here you can see that the state is Degraded. Running the PDList command from above shows that one drive has failed:

# ./MegaCli64 -PDList -aALL | grep Firmware
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Failed
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online


Our next step is to write a custom Nagios plugin to check for events that are out of the ordinary. A good indication of an error state is the transition from 'Previous state: 0' to 'New state: N' where N > 0, e.g.:


Previous state: 0

New state: 16
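
Here is a minimal sketch of what such a plugin could look like, written in bash against the output formats shown above. The MegaCli64 path, the number of events to inspect and the output file are placeholders, and a real plugin would need more careful parsing:

#!/bin/bash
# Rough Nagios-style check for MegaRAID health (path and event count are placeholders).
MEGACLI=/opt/MegaRAID/MegaCli/MegaCli64

# Any physical drive not reporting Online is suspicious.
BAD_PD=$($MEGACLI -PDList -aALL | grep 'Firmware state' | grep -vc Online)

# Look for 'Previous state: 0' followed by a non-zero 'New state' in the recent events.
$MEGACLI -AdpEventLog -GetLatest 50 -f /tmp/megaraid_events.log -aALL > /dev/null
BAD_EVENTS=$(grep -A 2 'Previous state: 0' /tmp/megaraid_events.log | grep -c 'New state: [1-9]')

if [ "$BAD_PD" -gt 0 ]; then
    echo "CRITICAL: $BAD_PD physical drive(s) not Online"
    exit 2
elif [ "$BAD_EVENTS" -gt 0 ]; then
    echo "WARNING: $BAD_EVENTS suspicious state transition(s) in recent events"
    exit 1
else
    echo "OK: all physical drives Online"
    exit 0
fi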

Thanks to my colleague Marco Garcia for digging deep into the MegaCli documentation and finding some of these obscure commands.
