Henry Xie's blog

Wednesday, May 09, 2018

How to PuTTY Into Public Cloud VMs From Your Corporate Network

Symptom:

Inside the corporate network there is no direct access to outside public IP addresses. We need to PuTTY into public cloud VMs to check services, and disconnecting from the intranet just for that would be painful.

Solution:

In your PuTTY session settings, go to Connection --> Proxy and enter your corporate proxy details. PuTTY will then be able to reach the VMs through the proxy.
Watch out for PuTTY timeout alerts: depending on how the corporate proxy is configured, it may cut idle sessions after a while. Setting "Seconds between keepalives" under Connection may help keep a session alive.

Tuesday, May 01, 2018

Session Blockers In DB After OS Panic or DB Aborted

Symptom:

After an OS panic, or after we ran shutdown abort on an Oracle DB, the DB started up fine.
However, after a while there were many blocking sessions (enq: TX - row lock contention), which caused application performance issues.

Diagnosis:

Starting from the blocking SIDs and checking them with SQL, we found the blockers were idle sessions with this wait event: "SQL*Net message from client".
For how idle sessions can be blockers, please refer to the Stack Exchange link.
So it turns out that the unreleased locks of the last sessions before the OS panic or shutdown abort remain in the DB. After the DB restarts, the locks are still in place.

Solutions:

We need to clear the blockers manually or bounce the MT to refresh all the connections.
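
As a minimal sketch of the manual cleanup, the query below lists the sessions that other sessions are waiting on, using the blocking_session column of v$session. The function name blocker_sql is made up for illustration, and the usage line assumes a local sqlplus login with enough privilege to read v$session.

```shell
#!/bin/sh
# Hypothetical helper: print a query listing sessions that block others.
# The quoted heredoc delimiter keeps the shell from expanding v$session.
blocker_sql() {
  cat <<'EOF'
SELECT sid, serial#, status, event, seconds_in_wait
  FROM v$session
 WHERE sid IN (SELECT blocking_session
                 FROM v$session
                WHERE blocking_session IS NOT NULL);
EOF
}

# Assumed usage (needs a running instance and OS authentication):
#   blocker_sql | sqlplus -s / as sysdba
```

An idle blocker should show up here with the "SQL*Net message from client" event. Once you are sure it is safe, it can be cleared with ALTER SYSTEM KILL SESSION 'sid,serial#' IMMEDIATE.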

Wednesday, April 25, 2018

OS Crashes Due To MegaRAID Issues and LVM snapshots On DB Home

Symptom:

The host CPU load gradually ramps up.
The load average is not that high, around 20, considering we have 72 CPUs.
With the top command, you can't find any obvious top CPU consumers.

Eventually the OS crashes and reboots itself.

Diagnosis:

The OS logs report a hung dm device and the MegaRAID SAS controller dying and being reset.
Errors in the OS messages look like this:
Apr 23 02:32:00 host1 kernel: [160668.787368] INFO: task jbd2/dm-0-8:1513 blocked for more than 120 seconds.
Apr 23 02:32:00 host1 kernel: [160668.795203]       Tainted: P           OE   4.1.12-94.7.8.el6uek.x86_64 #2
Apr 23 02:32:00 host1 kernel: [160668.803039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this messag

Apr 23 02:32:43 host1 kernel: [160711.104126] sd 0:2:0:0: [sda] tag#138 megasas: target reset FAILED!!
Apr 23 02:32:43 host1 kernel: [160711.104138] megaraid_sas 0000:23:00.0: IO/DCMD timeout is detected, forcibly FAULT Firmware
Apr 23 02:32:43 host1 kernel: [160711.493900] megaraid_sas 0000:23:00.0: Number of host crash buffers allocated: 512
We also found an LVM snapshot on the DB home (/u01):
LV                       VG      Attr       LSize   Pool Origin   Data%
u01_backup               VGExaDb swi-a-s---  13.65g      LVDbOra1 14.54
Data% is 14.54. Since /u01 takes most of the local disk writes, the snapshot slows overall performance.
Ultimately, the root cause is a MegaRAID firmware fault.
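
To pull hung-task and MegaRAID messages like these out of a busy log, a small filter can help. This is only a sketch: the function name storage_hang_lines is hypothetical, the patterns are taken from the messages quoted in this post, and /var/log/messages is an assumed log location.

```shell
#!/bin/sh
# Hypothetical helper: keep only hung-task and MegaRAID lines from stdin.
storage_hang_lines() {
  grep -E 'blocked for more than [0-9]+ seconds|megasas|megaraid_sas'
}

# Assumed usage on a host that logs to /var/log/messages:
#   storage_hang_lines < /var/log/messages
```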

Solutions:

Remove the LVM snapshot and replace the MegaRAID card. We recommend not keeping snapshots on DB homes. If you have to take one for a backup, remove it as soon as possible.
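
To catch forgotten snapshots, something like the sketch below could be run from time to time. It reads colon-separated lvs output and prints snapshots whose Data% has reached a threshold. The function name snapshots_over and the threshold of 10 are assumptions for illustration. The lvs field names (lv_name, vg_name, lv_attr, origin, data_percent) are standard LVM report fields.

```shell
#!/bin/sh
# Hypothetical helper: read "lv:vg:attr:origin:data_percent" rows on stdin
# and print the snapshots whose Data% is at or above a threshold.
snapshots_over() {
  awk -F: -v limit="$1" '
    { sub(/^[ \t]+/, "", $1) }             # lvs indents its rows
    $3 ~ /^[sS]/ && $5 + 0 >= limit + 0 {  # attr starting with s or S marks a snapshot
      print $2 "/" $1, "origin=" $4, "data%=" $5
    }'
}

# Assumed usage (needs root and the LVM tools):
#   lvs --noheadings --separator : -o lv_name,vg_name,lv_attr,origin,data_percent \
#     | snapshots_over 10
```

A snapshot that is no longer needed can then be dropped with lvremove, giving it the VG/LV name that the helper prints.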

Tuesday, April 24, 2018

ORA-15203: diskgroup RECO contains disks from an incompatible version of ASM

Symptom:

After upgrading Oracle 12c GI, we bounced ASM and hit this error when it tried to mount the ASM disk groups. The disk groups could not be mounted.

ORA-15203: diskgroup RECO contains disks from an incompatible version of ASM

Diagnosis:

Digging into the ASM alert logs, you find errors like these:

ORA-15203: diskgroup RECO contains disks from an incompatible version of ASM
ORA-15038: disk 'o/10.230.49.55/DBFS_DG_CD_02_test0008' mismatch on 'Time Stamp' with target disk group [2111006055] [2029972656]
ORA-15038: disk 'o/10.230.49.55/DBFS_DG_CD_07_test0008' mismatch on 'Time Stamp' with target disk group [2111006055] [2029972656]
ORA-15038: disk 'o/10.230.49.55/DBFS_DG_CD_09_test0008' mismatch on 'Time Stamp' with target disk group [2111006
This indicates something is wrong with cell test0008.

Check /etc/oracle/cell/network-config/cellip.ora
It turns out test0008 does not belong to this GI; it is placed in another GI environment.

Solution:

Remove test0008 from cellip.ora
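
Before editing cellip.ora, it can help to list which cells the ORA-15038 messages point at. The sketch below extracts the cell address and grid disk name from alert log lines. The function name mismatched_disks is hypothetical, and the parsing assumes the 'o/<cell address>/<grid disk>' disk path format seen in the errors above.

```shell
#!/bin/sh
# Hypothetical helper: from ASM alert log text on stdin, print the unique
# "cell-address grid-disk" pairs named in ORA-15038 messages.
mismatched_disks() {
  awk -F"'" '/ORA-15038/ {
    n = split($2, part, "/")       # o/<cell address>/<grid disk name>
    if (n == 3) print part[2], part[3]
  }' | sort -u
}

# Assumed usage (the alert log path is a placeholder):
#   mismatched_disks < /path/to/asm_alert.log
#   grep -n '10.230.49.55' /etc/oracle/cell/network-config/cellip.ora
```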

Saturday, April 21, 2018

Error: Authentication token is no longer valid

Symptom:

Editing the crontab via crontab -e was working before, but recently it started to error out:

Authentication token is no longer valid; new one required
You (oracle) are not allowed to access to (crontab) because of pam configuration.

Solution:

The Linux user has expired in the OS, which prevents it from running crontab.
Use this command to check the details:
chage -l <user>

Use the command below to update the expiry attributes:
chage <user>     (this runs interactively)
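
For scripted checks, the expiry lines can be picked out of the chage -l output. This is a rough sketch: the function name expiry_fields is hypothetical, and it assumes the usual English field labels ("Password expires", "Account expires"). The chage options in the comments are standard, but the values are examples only.

```shell
#!/bin/sh
# Hypothetical helper: from `chage -l` output on stdin, keep the two expiry
# lines and print them as "field=value".
expiry_fields() {
  grep -E '^(Password|Account) expires' | sed 's/[[:space:]]*:[[:space:]]*/=/'
}

# Assumed usage (oracle is the example user from this post):
#   chage -l oracle | expiry_fields
# Non-interactive changes, run as root; the values are examples only:
#   chage -M 90 oracle    # maximum days between password changes
#   chage -E -1 oracle    # remove the account expiration date
```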

Another possible cause is /etc/security/access.conf.
The user needs to be allowed to access the cron resource, e.g.
+ : oracle : ALL