Agile Testing: More fun with the LSI MegaRAID controllers

As I mentioned in a previous post, we've had some issues with the LSI MegaRAID controllers on our Dell C2100 database servers. Previously we noticed periodical slow-downs of the databases related to decreased I/O throughput. It turned out it was the LSI RAID battery going through its relearning cycle.

Last night we got paged again by increased load on one of the Dell C2100s. The load average went up to 25, when typically it's between 1 and 2. It turns out one of the drives in the RAID10 array managed by the LSI controller was going bad. You would think the RAID array would be OK even with a bad drive, but the drive didn't go completely offline, so the controller was busy servicing it and failing. This had the effect of decreasing the I/O throughput on the server, and making our database slow.

For my own reference, and hopefully for others out there, here's what we did to troubleshoot the issue. We used the MegaCli utilities (see this post for how to install them).

Check RAID event log for errors

# ./MegaCli64 -AdpEventLog -GetSinceReboot -f events.log -aALL

This will save all RAID-related events since the last reboot to events.log. In our case, we noticed these lines:

===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
Error: 3

seqNum: 0x00001654
Time: Fri Oct 28 04:45:26 2011

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 0b(e0x0c/s11) Path
5000c50034731385, CDB: 2a 00 0b a1 4c 00 00 00 18 00, Sense: 6/29/02
Event Data:
===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
CDB Length: 10
CDB Data:
002a 0000 000b 00a1 004c 0000 0000 0000 0018 0000 0000 0000 0000 0000
0000 0000 Sense Length: 18
Sense Data:
0070 0000 0006 0000 0000 0000 0000 000a 0000 0000 0000 0000 0029 0002
0002 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x00001655

Time: Fri Oct 28 04:45:26 2011

Code: 0x00000060
Class: 1
Locale: 0x02
Event Description: Predictive failure: PD 0b(e0x0c/s11)
Event Data:

Check state of particular physical drive

You need the enclosure index (12 in the output above) and the device ID (11 in the output above). Then you run:

./MegaCli64 -pdInfo -PhysDrv [12:11] aALL

In our case, we saw that Media Error Count was non-zero.

Mark physical drive offline

We decided to mark the faulty drive offline, to see if that takes work off of the RAID controller and improves I/O throughput in the system.

# ./MegaCli64 -PDOffline -PhysDrv [12:11] aALL

Indeed, we noticed how the I/O wait time and the load average on the system dropped to normal levels.

Hot swap faulty drive

We had a spare drive which we hot-swapped with the faulty drive. The new drive started to rebuild. You can see its status if you run:

# ./MegaCli64 -pdInfo -PhysDrv [12:11] aALL

Enclosure Device ID: 12

Slot Number: 11

Device Id: 13

Sequence Number: 3

Media Error Count: 0

Other Error Count: 0

Predictive Failure Count: 0

Last Predictive Failure Event Seq Number: 0

PD Type: SAS

Raw Size: 931.512 GB [0x74706db0 Sectors]

Non Coerced Size: 931.012 GB [0x74606db0 Sectors]

Coerced Size: 931.0 GB [0x74600000 Sectors]

Firmware state: Rebuild

SAS Address(0): 0x5000c500347da135

SAS Address(1): 0x0

Connected Port Number: 0(path0)

Inquiry Data: SEAGATE ST31000424SS KS689WK4Z08J

FDE Capable: Not Capable

FDE Enable: Disable

Secured: Unsecured

Locked: Unlocked

Foreign State: None

Device Speed: Unknown

Link Speed: Unknown

Media Type: Hard Disk Device

Exit Code: 0x00

Check progress of the rebuilding process

# ./MegaCli64 -PDRbld -ShowProg -PhysDrv [12:11] -aALL

Rebuild Progress on Device at Enclosure 12, Slot 11 Completed 47% in 88 Minutes.

Exit Code: 0x00

Check event log again

Once the rebuilding is done, you can check the event log by specifying for example that you want to see the last 10 events:

# ./MegaCli64 -AdpEventLog -GetLatest 10 -f t1.log -aALL

In our case we saw something like this:

# cat t1.log

seqNum: 0x00001720

Time: Fri Oct 28 12:48:27 2011

Code: 0x000000f9

Class: 0

Locale: 0x01

Event Description: VD 00/0 is now OPTIMAL

Event Data:

===========

Target Id: 0

seqNum: 0x0000171f

Time: Fri Oct 28 12:48:27 2011

Code: 0x00000051

Class: 0

Locale: 0x01

Event Description: State change on VD 00/0 from DEGRADED(2) to OPTIMAL(3)

Event Data:

===========

Target Id: 0

Previous state: 2

New state: 3

seqNum: 0x0000171e

Time: Fri Oct 28 12:48:27 2011

Code: 0x00000072

Class: 0

Locale: 0x02

Event Description: State change on PD 0d(e0x0c/s11) from REBUILD(14) to ONLINE(18)

Event Data:

===========

Device ID: 13

Enclosure Index: 12

Slot Number: 11

Previous state: 20

New state: 24

seqNum: 0x0000171d

Time: Fri Oct 28 12:48:27 2011

Code: 0x00000064

Class: 0

Locale: 0x02

Event Description: Rebuild complete on PD 0d(e0x0c/s11)

Event Data:

===========

Device ID: 13

Enclosure Index: 12

Slot Number: 11

seqNum: 0x0000171c

Time: Fri Oct 28 12:48:27 2011

Code: 0x00000063

Class: 0

Locale: 0x02

Event Description: Rebuild complete on VD 00/0

Event Data:

===========

Target Id: 0

seqNum: 0x000016b7

Time: Fri Oct 28 08:55:42 2011

Code: 0x00000072

Class: 0

Locale: 0x02

Event Description: State change on PD 0d(e0x0c/s11) from OFFLINE(10) to REBUILD(14)

Event Data:

===========

Device ID: 13

Enclosure Index: 12

Slot Number: 11

Previous state: 16

New state: 20

seqNum: 0x000016b6

Time: Fri Oct 28 08:55:42 2011

Code: 0x0000006a

Class: 0

Locale: 0x02

Event Description: Rebuild automatically started on PD 0d(e0x0c/s11)

Event Data:

===========

Device ID: 13

Enclosure Index: 12

Slot Number: 11

seqNum: 0x000016b5

Time: Fri Oct 28 08:55:42 2011

Code: 0x00000072

Class: 0

Locale: 0x02

Event Description: State change on PD 0d(e0x0c/s11) from UNCONFIGURED_GOOD(0) to OFFLINE(10)

Event Data:

===========

Device ID: 13

Enclosure Index: 12

Slot Number: 11

Previous state: 0

New state: 16

seqNum: 0x000016b4

Time: Fri Oct 28 08:55:42 2011

Code: 0x000000f7

Class: 0

Locale: 0x02

Event Description: Inserted: PD 0d(e0x0c/s11) Info: enclPd=0c, scsiType=0, portMap=00, sasAddr=5000c500347da135,0000000000000000

Event Data:

===========

Device ID: 13

Enclosure Device ID: 12

Enclosure Index: 1

Slot Number: 11

SAS Address 1: 5000c500347da135

SAS Address 2: 0

seqNum: 0x000016b3

Time: Fri Oct 28 08:55:42 2011

Code: 0x0000005b

Class: 0

Locale: 0x02

Event Description: Inserted: PD 0d(e0x0c/s11)

Event Data:

===========

Device ID: 13

Enclosure Index: 12

Slot Number: 11

Here are some other useful troubleshooting commands:

Check firmware state across all physical drives

This is a useful command if you want to see at a glance the state of all physical drives attached to the RAID controller:

# ./MegaCli64 -PDList -a0 | grep Firmware

In the normal case, all drives should be ONLINE. If any drive reports as FAILED you have a problem.

Check virtual drive state

You can also check the virtual drive state:

# ./MegaCli64 -LDInfo -L0 -aALL

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (Target Id: 0)
Name:
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:5.455 TB
State: Degraded
Stripe Size: 64 KB
Number Of Drives per span:2
Span Depth:6
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None

Exit Code: 0x00

Here you can see that the state is degraded. Running the previous PDlist command shows that one drive is failed:

# ./MegaCli64 -PDList -aALL | grep Firmware
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Failed
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online

Our next step is to write a custom Nagios plugin to check for events that are out of the ordinary. A good indication of an error state is the transition from 'Previous state: 0' to 'New state: N' where N > 0, e.g.:

Previous state: 0

New state: 16

Thanks to my colleague Marco Garcia for digging deep into the MegaCli documentation and finding some of these obscure commands.

Resources

MegaCli command reference from LSI
Perc RAID cheatsheet from ronaldbradford.com

3 comments:

Robert said...: Software RAID ftw.; Sunday, October 30, 2011 at 9:36:00 AM PDT
Anonymous said...: Software raid is too slow. Be a man and use hardware raid.; Friday, March 30, 2012 at 3:04:00 PM PDT
Anonymous said...: Software raid won't recover from failures so easy. Hardware raid is faster and more tolerant to disk issues.; Friday, March 30, 2012 at 3:05:00 PM PDT

Agile Testing

Friday, October 28, 2011

More fun with the LSI MegaRAID controllers

3 comments:

Modifying EC2 security groups via AWS Lambda functions

Followers