Last night we got paged again by increased load on one of the Dell C2100s. The load average went up to 25, when it's typically between 1 and 2. It turned out one of the drives in the RAID10 array managed by the LSI controller was going bad. You would think the RAID array would be fine even with a bad drive, but the drive didn't go completely offline, so the controller kept trying to service it and kept failing. This decreased the I/O throughput on the server and made our database slow.
For my own reference, and hopefully for others out there, here's what we did to troubleshoot the issue. We used the MegaCli utilities (see this post for how to install them).
Check RAID event log for errors
# ./MegaCli64 -AdpEventLog -GetSinceReboot -f events.log -aALL
This will save all RAID-related events since the last reboot to events.log. In our case, we noticed these lines:
===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
Error: 3
seqNum: 0x00001654
Time: Fri Oct 28 04:45:26 2011
Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 0b(e0x0c/s11) Path
5000c50034731385, CDB: 2a 00 0b a1 4c 00 00 00 18 00, Sense: 6/29/02
Event Data:
===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
CDB Length: 10
CDB Data:
002a 0000 000b 00a1 004c 0000 0000 0000 0018 0000 0000 0000 0000 0000
0000 0000 Sense Length: 18
Sense Data:
0070 0000 0006 0000 0000 0000 0000 000a 0000 0000 0000 0000 0029 0002
0002 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
seqNum: 0x00001655
Time: Fri Oct 28 04:45:26 2011
Code: 0x00000060
Class: 1
Locale: 0x02
Event Description: Predictive failure: PD 0b(e0x0c/s11)
Event Data:
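Since the full log can get long, a quick way to get a timeline of what happened is to pull out just the timestamp and description lines (the grep pattern below simply matches the field names in the output above):
# grep -E 'Time:|Event Description:' events.log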
Check state of particular physical drive
You need the enclosure index (12 in the output above) and the slot number (11 in the output above); note that the -PhysDrv argument takes enclosure:slot, not the device ID. Then you run:
# ./MegaCli64 -pdInfo -PhysDrv [12:11] -aALL
In our case, we saw that Media Error Count was non-zero.
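If you're not sure which slot holds the suspect drive, a quick way to survey the error counters on all drives at once is to filter the output of the -PDList command (the field names below are the ones from the pdInfo output shown further down):
# ./MegaCli64 -PDList -aALL | grep -E 'Slot Number|Media Error Count|Predictive Failure Count'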
Mark physical drive offline
We decided to mark the faulty drive offline, to see if that takes work off of the RAID controller and improves I/O throughput in the system.
# ./MegaCli64 -PDOffline -PhysDrv [12:11] -aALL
Indeed, we noticed how the I/O wait time and the load average on the system dropped to normal levels.
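If you want to watch this happen in real time, the load average and the extended iostat output (from the sysstat package, assuming it's installed) are enough:
# iostat -x 5
# uptime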
Hot swap faulty drive
We had a spare drive, which we hot-swapped in place of the faulty one. The new drive started to rebuild automatically. You can see its status if you run:
# ./MegaCli64 -pdInfo -PhysDrv [12:11] -aALL
Enclosure Device ID: 12
Slot Number: 11
Device Id: 13
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 931.512 GB [0x74706db0 Sectors]
Non Coerced Size: 931.012 GB [0x74606db0 Sectors]
Coerced Size: 931.0 GB [0x74600000 Sectors]
Firmware state: Rebuild
SAS Address(0): 0x5000c500347da135
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST31000424SS KS689WK4Z08J
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: Unknown
Link Speed: Unknown
Media Type: Hard Disk Device
Exit Code: 0x00
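In our case the rebuild kicked off automatically after the swap. If it doesn't on your controller, MegaCli also has a -Start variant of the -PDRbld command shown in the next section; we didn't need it, so treat it as a pointer rather than a tested recipe:
# ./MegaCli64 -PDRbld -Start -PhysDrv [12:11] -aALL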
Check progress of the rebuilding process
# ./MegaCli64 -PDRbld -ShowProg -PhysDrv [12:11] -aALL
Rebuild Progress on Device at Enclosure 12, Slot 11 Completed 47% in 88 Minutes.
Exit Code: 0x00
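Rather than re-running this by hand, you can wrap it in watch to poll the progress periodically, for example every 5 minutes:
# watch -n 300 './MegaCli64 -PDRbld -ShowProg -PhysDrv [12:11] -aALL'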
Check event log again
Once the rebuild is done, you can check the event log again, this time asking for just the latest 10 events:
# ./MegaCli64 -AdpEventLog -GetLatest 10 -f t1.log -aALL
In our case we saw something like this:
# cat t1.log
seqNum: 0x00001720
Time: Fri Oct 28 12:48:27 2011
Code: 0x000000f9
Class: 0
Locale: 0x01
Event Description: VD 00/0 is now OPTIMAL
Event Data:
===========
Target Id: 0
seqNum: 0x0000171f
Time: Fri Oct 28 12:48:27 2011
Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/0 from DEGRADED(2) to OPTIMAL(3)
Event Data:
===========
Target Id: 0
Previous state: 2
New state: 3
seqNum: 0x0000171e
Time: Fri Oct 28 12:48:27 2011
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from REBUILD(14) to ONLINE(18)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 20
New state: 24
seqNum: 0x0000171d
Time: Fri Oct 28 12:48:27 2011
Code: 0x00000064
Class: 0
Locale: 0x02
Event Description: Rebuild complete on PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
seqNum: 0x0000171c
Time: Fri Oct 28 12:48:27 2011
Code: 0x00000063
Class: 0
Locale: 0x02
Event Description: Rebuild complete on VD 00/0
Event Data:
===========
Target Id: 0
seqNum: 0x000016b7
Time: Fri Oct 28 08:55:42 2011
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from OFFLINE(10) to REBUILD(14)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 16
New state: 20
seqNum: 0x000016b6
Time: Fri Oct 28 08:55:42 2011
Code: 0x0000006a
Class: 0
Locale: 0x02
Event Description: Rebuild automatically started on PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
seqNum: 0x000016b5
Time: Fri Oct 28 08:55:42 2011
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from UNCONFIGURED_GOOD(0) to OFFLINE(10)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 0
New state: 16
seqNum: 0x000016b4
Time: Fri Oct 28 08:55:42 2011
Code: 0x000000f7
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0d(e0x0c/s11) Info: enclPd=0c, scsiType=0, portMap=00, sasAddr=5000c500347da135,0000000000000000
Event Data:
===========
Device ID: 13
Enclosure Device ID: 12
Enclosure Index: 1
Slot Number: 11
SAS Address 1: 5000c500347da135
SAS Address 2: 0
seqNum: 0x000016b3
Time: Fri Oct 28 08:55:42 2011
Code: 0x0000005b
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Here are some other useful troubleshooting commands:
Check firmware state across all physical drives
This is a useful command if you want to see at a glance the state of all physical drives attached to the RAID controller:
# ./MegaCli64 -PDList -a0 | grep Firmware
In the normal case, all drives should report Online. If any drive reports Failed, you have a problem.
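For a quick pass/fail check you can count the drives that report anything other than Online (Rebuild and Failed both count); 0 means everything is healthy:
# ./MegaCli64 -PDList -aALL | grep 'Firmware state' | grep -c -v 'Online'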
Check virtual drive state
You can also check the virtual drive state:
# ./MegaCli64 -LDInfo -L0 -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (Target Id: 0)
Name:
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:5.455 TB
State: Degraded
Stripe Size: 64 KB
Number Of Drives per span:2
Span Depth:6
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None
Exit Code: 0x00
Here you can see that the state is Degraded. Running the previous PDList command shows that one drive has failed:
# ./MegaCli64 -PDList -aALL | grep Firmware
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Failed
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
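Similarly, if you only care about the virtual drive state and not the rest of the -LDInfo output, grepping for the State line is enough:
# ./MegaCli64 -LDInfo -L0 -aALL | grep State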
Our next step is to write a custom Nagios plugin to check for events that are out of the ordinary. A good indication of an error state is the transition from 'Previous state: 0' to 'New state: N' where N > 0, e.g.:
Previous state: 0
New state: 16
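We haven't written the plugin yet, but a first version will probably just check the physical and virtual drive states rather than parse the event log. Here is a minimal sketch along those lines; the MegaCli64 path and the exact checks are assumptions, so adapt them to your setup:
#!/bin/bash
# check_megaraid.sh -- minimal sketch of a Nagios check for an LSI MegaRAID controller
# (assumed install path below; adjust to wherever MegaCli64 lives on your system)
MEGACLI=/opt/MegaRAID/MegaCli/MegaCli64

# Count physical drives whose firmware state is not Online
# (a rebuilding or failed drive both show up here, which is what we want flagged)
BAD_PD=$($MEGACLI -PDList -aALL | grep 'Firmware state' | grep -c -v 'Online')

# Count virtual drives that are not Optimal (we only have VD 0; use -Lall to cover all VDs)
BAD_VD=$($MEGACLI -LDInfo -L0 -aALL | grep '^State' | grep -c -v 'Optimal')

if [ "$BAD_PD" -gt 0 ]; then
    echo "CRITICAL: $BAD_PD physical drive(s) not Online"
    exit 2
elif [ "$BAD_VD" -gt 0 ]; then
    echo "WARNING: $BAD_VD virtual drive(s) not Optimal"
    exit 1
else
    echo "OK: all physical drives Online, all virtual drives Optimal"
    exit 0
fi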
Thanks to my colleague Marco Garcia for digging deep into the MegaCli documentation and finding some of these obscure commands.
Resources
- MegaCli command reference from LSI
- Perc RAID cheatsheet from ronaldbradford.com