Henry Xie's blog: OS Crashes Due To MegaRAID Issues and LVM snapshots On DB Home

Wednesday, April 25, 2018

OS Crashes Due To MegaRAID Issues and LVM snapshots On DB Home

Symptom:

The host CPU load ramps up gradually.
The load average is not very high, around 20, given that the host has 72 CPUs.
The top command does not show any obvious top CPU consumers.

Eventually the OS crashes and reboots itself.
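
To put the numbers above in perspective, here is a minimal sketch that normalizes the load average by the CPU count. The function name `load_per_cpu` is only illustrative. On Linux the load average also counts tasks in uninterruptible sleep (typically blocked on I/O), which is one likely reason the load can climb while top shows no CPU consumers.

```python
def load_per_cpu(load_avg: float, cpu_count: int) -> float:
    """Return the load average divided by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_avg / cpu_count


# Figures from this incident: load average around 20 on a 72-CPU host.
incident_ratio = load_per_cpu(20.0, 72)
print(f"load per CPU: {incident_ratio:.2f}")
```

A ratio well below 1 means the CPUs are far from saturated, so the rising load points at blocked tasks rather than CPU pressure.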

Diagnosis:

The OS logs report a hung dm device, followed by the MegaRAID SAS controller going dead and being reset.
Errors from the OS messages log look like:
Apr 23 02:32:00 host1 kernel: [160668.787368] INFO: task jbd2/dm-0-8:1513 blocked for more than 120 seconds.
Apr 23 02:32:00 host1 kernel: [160668.795203]       Tainted: P           OE   4.1.12-94.7.8.el6uek.x86_64 #2
Apr 23 02:32:00 host1 kernel: [160668.803039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this messag

Apr 23 02:32:43 host1 kernel: [160711.104126] sd 0:2:0:0: [sda] tag#138 megasas: target reset FAILED!!
Apr 23 02:32:43 host1 kernel: [160711.104138] megaraid_sas 0000:23:00.0: IO/DCMD timeout is detected, forcibly FAULT Firmware
Apr 23 02:32:43 host1 kernel: [160711.493900] megaraid_sas 0000:23:00.0: Number of host crash buffers allocated: 512
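
If you want to scan a messages file for the same two signatures, something like the following sketch may help. The regular expressions are derived only from the log lines quoted above, and the names `classify` and `scan` are hypothetical. Other kernel or driver versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns taken from the log excerpts in this post (assumed, not exhaustive).
HUNG_TASK = re.compile(
    r"INFO: task (?P<task>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
MEGARAID_FAULT = re.compile(
    r"megasas: target reset FAILED|megaraid_sas \S+ IO/DCMD timeout is detected"
)


def classify(line: str):
    """Return 'hung_task', 'megaraid_fault', or None for a single log line."""
    if HUNG_TASK.search(line):
        return "hung_task"
    if MEGARAID_FAULT.search(line):
        return "megaraid_fault"
    return None


def scan(lines):
    """Count how many lines match each signature."""
    counts = Counter()
    for line in lines:
        kind = classify(line)
        if kind is not None:
            counts[kind] += 1
    return counts


sample = [
    "Apr 23 02:32:00 host1 kernel: [160668.787368] INFO: task jbd2/dm-0-8:1513 blocked for more than 120 seconds.",
    "Apr 23 02:32:43 host1 kernel: [160711.104126] sd 0:2:0:0: [sda] tag#138 megasas: target reset FAILED!!",
    "Apr 23 02:32:43 host1 kernel: [160711.104138] megaraid_sas 0000:23:00.0: IO/DCMD timeout is detected, forcibly FAULT Firmware",
]
print(dict(scan(sample)))
```

To run it against a real file, pass an open file handle to `scan`, for example `scan(open("/var/log/messages"))`, assuming that is where your distribution writes kernel messages.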
We also found an LVM snapshot on the DB home (/u01):
LV                       VG      Attr       LSize   Pool Origin   Data%
u01_backup               VGExaDb swi-a-s---  13.65g      LVDbOra1 14.54
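Data% is 14.54, so the snapshot is actively absorbing changes. Because /u01 takes most of the local disk writes, and each write to the origin volume incurs copy-on-write overhead while the snapshot exists, the snapshot slows overall performance.
Ultimately, the root cause was a MegaRAID firmware fault.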
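
The snapshot check above can be sketched as a small parser. It assumes the text comes from `lvs --noheadings --separator , -o lv_name,vg_name,lv_attr,origin,data_percent` (a different column selection from the output shown in this post), and the function name `find_snapshots` is hypothetical. The first sample row mirrors the snapshot above; the second row is an illustrative non-snapshot volume, not real output.

```python
def find_snapshots(lvs_text: str):
    """Return snapshot LVs from comma-separated lvs output.

    Expected columns: lv_name, vg_name, lv_attr, origin, data_percent.
    In lv_attr the first character is the volume type: 's' is a snapshot
    and 'S' is an invalid snapshot.
    """
    snapshots = []
    for raw in lvs_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 5:
            continue  # skip rows that do not match the assumed layout
        lv, vg, attr, origin, data = fields
        if attr and attr[0] in "sS":
            snapshots.append(
                {
                    "lv": lv,
                    "vg": vg,
                    "origin": origin,
                    "data_percent": float(data) if data else None,
                }
            )
    return snapshots


sample_lvs = """
  u01_backup,VGExaDb,swi-a-s---,LVDbOra1,14.54
  LVDbOra1,VGExaDb,owi-aos---,,
"""
for snap in find_snapshots(sample_lvs):
    print(f"{snap['vg']}/{snap['lv']} snapshots {snap['origin']}, Data% {snap['data_percent']}")
```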

Solutions:

Remove the LVM snapshot and replace the MegaRAID card. We recommend not keeping snapshots on DB homes. If you have to take one for a backup, remove it as soon as possible.
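
As a rough sketch of the snapshot removal step, the helper below only builds and prints the `lvremove` command for the snapshot shown in this post. It does not execute anything. The helper name `snapshot_removal_command` is hypothetical, and `-y` (answer yes to the confirmation prompt) is optional. Check that the snapshot is unmounted and no backup is still reading from it before running the printed command as root.

```python
import shlex


def snapshot_removal_command(vg: str, lv: str, assume_yes: bool = True):
    """Build the argv for removing one logical volume with lvremove."""
    if not vg or not lv:
        raise ValueError("both vg and lv are required")
    cmd = ["lvremove"]
    if assume_yes:
        cmd.append("-y")  # skip the interactive confirmation
    cmd.append(f"{vg}/{lv}")
    return cmd


# Names taken from the lvs output in this post.
cmd = snapshot_removal_command("VGExaDb", "u01_backup")
print(shlex.join(cmd))
```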
