Wednesday, April 25, 2018

OS Crashes Due To MegaRAID Issues and LVM Snapshots On DB Home

Symptom:

The host CPU load gradually ramps up.
The load average is not very high: around 20, on a host with 72 CPUs.
The top command does not show any obvious top CPU consumers.

Eventually the OS crashes and reboots itself.
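
On Linux, the load average counts tasks in uninterruptible sleep (state D, usually waiting on I/O) as well as runnable tasks, which is one possible reason the load can climb while top shows no busy process. The following is a minimal sketch, not part of the original investigation, that prints the load per CPU and lists tasks currently in state D. The helper names task_state and blocked_tasks are made up for this example.

```python
import glob
import os


def task_state(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so take the first field after the last ')'.
    return stat_text[stat_text.rfind(")") + 1:].split()[0]


def blocked_tasks():
    """List (pid, name) for tasks in uninterruptible sleep (state D)."""
    blocked = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                text = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        if task_state(text) == "D":
            pid = int(text.split()[0])
            name = text[text.find("(") + 1:text.rfind(")")]
            blocked.append((pid, name))
    return blocked


if __name__ == "__main__":
    load1 = os.getloadavg()[0]
    cpus = os.cpu_count() or 1
    print(f"1-min load {load1:.2f} on {cpus} CPUs ({load1 / cpus:.2f} per CPU)")
    for pid, name in blocked_tasks():
        print(f"blocked (D state): pid={pid} name={name}")
```

A growing list of D-state tasks alongside a low load per CPU would point toward storage rather than CPU as the bottleneck.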

Diagnosis:


The OS logs report a hung dm device and a MegaRAID SAS controller that is declared dead and reset.
Errors from the OS messages log look like this:
Apr 23 02:32:00 host1 kernel: [160668.787368] INFO: task jbd2/dm-0-8:1513 blocked for more than 120 seconds.
Apr 23 02:32:00 host1 kernel: [160668.795203]       Tainted: P           OE   4.1.12-94.7.8.el6uek.x86_64 #2
Apr 23 02:32:00 host1 kernel: [160668.803039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

  
Apr 23 02:32:43 host1 kernel: [160711.104126] sd 0:2:0:0: [sda] tag#138 megasas: target reset FAILED!!
Apr 23 02:32:43 host1 kernel: [160711.104138] megaraid_sas 0000:23:00.0: IO/DCMD timeout is detected, forcibly FAULT Firmware
Apr 23 02:32:43 host1 kernel: [160711.493900] megaraid_sas 0000:23:00.0: Number of host crash buffers allocated: 512
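
To check other hosts for the same signature, the messages above can be matched with a few regular expressions. This is only a sketch: the patterns are derived from the lines quoted in this post, the labels (hung_task, reset_failed, fw_fault) are names invented here, and the log path /var/log/messages is an assumption that may differ on your system.

```python
import re
from collections import Counter

# Patterns taken from the kernel messages quoted in this post.
# The label names are made up for this example.
PATTERNS = {
    "hung_task": re.compile(r"INFO: task (\S+) blocked for more than (\d+) seconds"),
    "reset_failed": re.compile(r"megasas: target reset FAILED"),
    "fw_fault": re.compile(r"megaraid_sas \S+ IO/DCMD timeout is detected"),
}


def classify(line):
    """Return the label of the first pattern that matches, or None."""
    for label, pattern in PATTERNS.items():
        if pattern.search(line):
            return label
    return None


def scan(lines):
    """Count how many lines match each pattern."""
    counts = Counter()
    for line in lines:
        label = classify(line)
        if label is not None:
            counts[label] += 1
    return counts


if __name__ == "__main__":
    # Assumed log location; adjust for your distribution.
    with open("/var/log/messages", errors="replace") as f:
        for label, count in scan(f).items():
            print(f"{label}: {count}")
```

Hung-task messages that are followed shortly by reset failures on the same host would match the sequence seen here.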
 

We found an LVM snapshot on the DB home (/u01):
LV          VG       Attr        LSize   Pool  Origin    Data%
u01_backup  VGExaDb  swi-a-s---  13.65g        LVDbOra1  14.54
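Data% is 14.54, meaning 14.54% of the snapshot's space is already used by copy-on-write data. Because /u01 receives most of the local disk writes, the snapshot slows overall performance: the first write to each unchanged chunk of the origin must first copy the old data into the snapshot.
Ultimately, the root cause is a MegaRAID firmware fault.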
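
Leftover snapshots like this one are easy to miss, so a small check can help. The sketch below assumes the input comes from `lvs --noheadings --separator , -o lv_name,vg_name,lv_attr,lv_size,origin,data_percent` and flags volumes whose Attr field starts with "s" or "S" (snapshot volume types in lvs). The function name find_snapshots is made up for this example, and it is not the method used in the original diagnosis.

```python
def find_snapshots(lvs_output, sep=","):
    """Return snapshot LVs found in comma-separated lvs output.

    Expected columns (an assumption of this sketch):
    lv_name, vg_name, lv_attr, lv_size, origin, data_percent
    """
    snapshots = []
    for line in lvs_output.splitlines():
        fields = [field.strip() for field in line.strip().split(sep)]
        if len(fields) != 6:
            continue  # blank or unexpected line
        name, vg, attr, size, origin, data = fields
        # The first Attr character is the volume type; s/S mark snapshots.
        if attr[:1] in ("s", "S"):
            snapshots.append({
                "lv": name,
                "vg": vg,
                "size": size,
                "origin": origin,
                "data_percent": float(data) if data else None,
            })
    return snapshots


if __name__ == "__main__":
    # Row taken from the lvs output shown in this post.
    sample = "  u01_backup,VGExaDb,swi-a-s---,13.65g,LVDbOra1,14.54\n"
    for snap in find_snapshots(sample):
        print(f"snapshot {snap['vg']}/{snap['lv']} of {snap['origin']}: "
              f"{snap['data_percent']}% used")
```

Running it periodically against real lvs output would surface snapshots that were left in place after a backup.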

Solutions:

Remove the LVM snapshot and replace the MegaRAID card. We recommend not keeping snapshots on DB homes. If you have to take one for a backup, remove it as soon as possible.


