Wednesday, April 25, 2018

OS Crashes Due To MegaRAID Issues and LVM Snapshots On DB Home

Symptom:

The host CPU load is gradually ramping up.
The load average is not very high (around 20) considering the host has 72 CPUs.
With the top command, you can't find any obvious top CPU consumers.

Eventually the OS crashes and reboots itself.

Diagnosis:


The OS logs report dm device hangs, and the MegaRAID SAS controller goes dead and resets.
Errors in /var/log/messages look like:
Apr 23 02:32:00 host1 kernel: [160668.787368] INFO: task jbd2/dm-0-8:1513 blocked for more than 120 seconds.
Apr 23 02:32:00 host1 kernel: [160668.795203]       Tainted: P           OE   4.1.12-94.7.8.el6uek.x86_64 #2
Apr 23 02:32:00 host1 kernel: [160668.803039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message

  
Apr 23 02:32:43 host1 kernel: [160711.104126] sd 0:2:0:0: [sda] tag#138 megasas: target reset FAILED!!
Apr 23 02:32:43 host1 kernel: [160711.104138] megaraid_sas 0000:23:00.0: IO/DCMD timeout is detected, forcibly FAULT Firmware
Apr 23 02:32:43 host1 kernel: [160711.493900] megaraid_sas 0000:23:00.0: Number of host crash buffers allocated: 512

We find there is an LVM snapshot on the DB home (/u01):

 LV         VG      Attr       LSize  Pool Origin   Data%
 u01_backup VGExaDb swi-a-s--- 13.65g      LVDbOra1 14.54

Data% is 14.54. Since /u01 takes most of the local disk writes, the snapshot's copy-on-write overhead slows overall performance, and eventually the MegaRAID firmware faults.

Solutions:

Remove the LVM snapshot and replace the MegaRAID card. We recommend not keeping snapshots on DB homes; if you must take one for a backup, remove it as soon as possible.
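A minimal sketch of the check and removal, using the LV and VG names from the lvs output above:

# list snapshot LVs and how full their copy-on-write space is
lvs -o lv_name,vg_name,lv_attr,lv_size,origin,data_percent

# remove the snapshot once it is no longer needed (destructive -- verify the name first)
lvremove /dev/VGExaDb/u01_backup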



Tuesday, April 24, 2018

ORA-15203: diskgroup RECO contains disks from an incompatible version of ASM

Symptom:

  After upgrading 12c Grid Infrastructure (GI), we bounce ASM and it hits this error while mounting the ASM disk groups; the disk groups cannot be mounted:

ORA-15203: diskgroup RECO contains disks from an incompatible version of ASM

  Diagnosis:

  Digging into the ASM alert log in more detail, you find errors like:

ORA-15203: diskgroup RECO contains disks from an incompatible version of ASM
ORA-15038: disk 'o/10.230.49.55/DBFS_DG_CD_02_test0008' mismatch on 'Time Stamp' with target disk group [2111006055] [2029972656]
ORA-15038: disk 'o/10.230.49.55/DBFS_DG_CD_07_test0008' mismatch on 'Time Stamp' with target disk group [2111006055] [2029972656]
ORA-15038: disk 'o/10.230.49.55/DBFS_DG_CD_09_test0008' mismatch on 'Time Stamp' with target disk group [2111006055] [2029972656]
   This points at something wrong with the cell test0008.

   Check /etc/oracle/cell/network-config/cellip.ora:
   we find that test0008 does not belong to this GI; it is placed in another GI environment.

Solution:

   Remove the test0008 line from cellip.ora.
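A minimal sketch (test0008 is the stray cell from the errors above):

cd /etc/oracle/cell/network-config
cp cellip.ora cellip.ora.bak                     # keep a backup first
grep -vi 'test0008' cellip.ora.bak > cellip.ora  # drop the line for the wrong cell

# then mount the affected disk group again from ASM, e.g.
# SQL> alter diskgroup RECO mount;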

Saturday, April 21, 2018

Error: Authentication token is no longer valid

Symptom:

When we try to edit the crontab via crontab -e,
it errors out, even though it was working before:

 Authentication token is no longer valid; new one required
You (oracle) are not allowed to access to (crontab) because of pam configuration.

Solution:

 The Linux user's password has expired in the OS, which prevents it from running crontab.
  Use this command to check the details:
  chage -l <user>

  Use the following to update the expiry-date attributes (it is interactive; a non-interactive sketch follows):
  chage <user>
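A minimal sketch, assuming the affected account is oracle:

chage -l oracle        # show password/account expiry details

# non-interactive alternatives to the interactive "chage oracle"
chage -M 99999 oracle  # raise the max password age so the password no longer expires
chage -E -1 oracle     # clear the account expiration date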

Another possible cause is /etc/security/access.conf:
the user needs to be allowed access to the cron resource, e.g.

+ : oracle : ALL

Sunday, April 15, 2018

ORA-19909: datafile 1 belongs to an orphan incarnation

Symptom:

Our physical standby DB's MRP process aborted with:

ORA-19909: datafile 1 belongs to an orphan incarnation

Reasons:

 Due to some application issues, we performed a flashback (to 1am) and resetlogs on the primary DB while MRP was running.
 The standby had already applied archive logs up to 2am, well past the 1am flashback point, so its datafiles now belong to an orphan incarnation.

 See the MOS note that documents this scenario well.

Solutions:

2 options:
1. Flash back the standby DB to 1am and start MRP (sketched below).
2. Restore a standby datafile backup taken before 1am and start MRP (no need to restore controlfiles on the standby DB).

In 12c, if flashback is enabled on the standby DB, the DG broker can automatically flash the DB back to 1am for you.
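A minimal sketch of option 1, run on the standby; the timestamp is an example -- use your primary's flashback point:

sqlplus -S / as sysdba <<'EOF'
shutdown immediate
startup mount
flashback database to timestamp
  to_timestamp('2018-04-15 01:00:00','YYYY-MM-DD HH24:MI:SS');
-- restart MRP
alter database recover managed standby database disconnect from session;
EOF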


Sunday, April 08, 2018

How To Modify SCAN of CRS(Grid Infrastructure)

 Requirement:

    We re-purpose a 12c RAC cluster for a new application and need to update the CRS SCAN name to reflect the new name.
    Old SCAN name: oldapps-r
    New SCAN name: newapps-r

Solution:

 In the OS, run nslookup newapps-r to make sure the new SCAN name is registered in DNS.

Stop the SCAN listener and SCAN VIPs:
$GI_HOME/bin/srvctl stop scan_listener
$GI_HOME/bin/srvctl stop scan

Modification (as root):
$GI_HOME/bin/srvctl modify scan -n newapps-r
$GI_HOME/bin/crsctl modify type ora.scan_vip.type -attr "ATTRIBUTE=SCAN_NAME,DEFAULT_VALUE=newapps-r" -unsupported
       --- the -unsupported flag must be added to bypass the type check; see MOS Note 1918102.1
Start the SCAN listener and check the configuration:
$GI_HOME/bin/srvctl start scan_listener
$GI_HOME/bin/srvctl config scan
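A quick verification sketch (output varies by version):

$GI_HOME/bin/srvctl status scan
$GI_HOME/bin/srvctl status scan_listener
nslookup newapps-r   # confirm DNS resolves the new SCAN name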

Thursday, April 05, 2018

How To Force a Listener To Register a DB

Requirement: 

 Sometimes we need to register a DB with multiple listeners that listen on different ports.
 The local_listener parameter in init<db>.ora normally registers the DB with one listener only.

  Example :
    testdb is registered with listener_1533 via the init<db>.ora file; we need to register testdb with listener_1521 as well.


Solution:

  We need to edit listener.ora and manually add the SID and DB home (static registration). The service will show as status UNKNOWN because the listener never verifies a statically registered instance; this does not matter, as clients can still connect.

Example
LISTENER_1521 =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = test-vip)(PORT = 1521)(IP = FIRST))
    )
  )

SID_LIST_LISTENER_1521 =
  (SID_LIST =
    (SID_DESC =
      (SDU = 32768)
      (GLOBAL_DBNAME = testdb)
      (ORACLE_HOME = /u01/app/oracle/product/12.1.0.2/dbhome_2)
      (SID_NAME = testdb1)
    )
    (SID_DESC =
      (SDU = 32768)
      (GLOBAL_DBNAME = testhr)
      (ORACLE_HOME = /u01/app/oracle/product/12.1.0.2/dbhome_2)
      (SID_NAME = testhr1)
    )
  )
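After editing listener.ora, reload the listener so it picks up the static entries; a minimal sketch:

lsnrctl reload LISTENER_1521
lsnrctl status LISTENER_1521   # testdb and testhr should now show with status UNKNOWN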

Wednesday, April 04, 2018

RMAN Catalog Resync Internal Error

Symptom:

 RMAN>  resync catalog;

starting partial resync of recovery catalog
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of resync command on default channel at 04/04/2018 19:01:45
RMAN-20999: internal error


Solution:

Run rman debug trace=mytrace.log target / and resync again; the debug trace shows some datafile information is inconsistent in the catalog DB. See MOS Doc ID 2240323.1.
 

One quick solution is to drop and recreate the catalog, then re-register the database (see the connection sketch after the commands):
RMAN> drop catalog;

RMAN> create catalog;

RMAN> register database;
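A hedged sketch of the connections each step needs; rcowner@catdb is an example catalog connect string:

# DROP CATALOG asks for confirmation, so it must be entered twice
rman catalog rcowner@catdb <<'EOF'
drop catalog;
drop catalog;
create catalog;
EOF

# REGISTER DATABASE needs both a target and a catalog connection
rman target / catalog rcowner@catdb <<'EOF'
register database;
EOF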

Monday, April 02, 2018

Find Top 10 Space used Segments Including LOB in Oracle DB

This query maps each LOB segment back to its owning table and column, and sums the space used:

select s.owner || '.' ||  l.table_name table_name,
       l.column_name,
       s.SEGMENT_TYPE,
       s.segment_NAME,
       s.PARTITION_NAME,
       s.TABLESPACE_NAME,
       count(1) segment_cnt,
       trunc(sum(bytes)/1024/1024) size_MB,
       l.SECUREFILE
from dba_segments s,
     dba_lobs l
WHERE s.segment_name = l.segment_name
AND s.owner = l.owner
group by  s.owner || '.' ||  l.table_name, l.column_name,
       s.SEGMENT_TYPE,
       s.segment_NAME, s.partition_name,
       s.TABLESPACE_NAME, l.SECUREFILE
ORDER BY  size_MB desc fetch first 10 rows only ;

or, for versions before 12c that lack the FETCH FIRST syntax:

select * from
(
select s.owner || '.' ||  l.table_name table_name,
       l.column_name,
       s.SEGMENT_TYPE,
       s.segment_NAME,
       s.PARTITION_NAME,
       s.TABLESPACE_NAME,
       count(1) segment_cnt,
       trunc(sum(bytes)/1024/1024) size_MB,
       l.SECUREFILE
from dba_segments s,
     dba_lobs l
WHERE s.segment_name = l.segment_name
AND s.owner = l.owner
group by  s.owner || '.' ||  l.table_name, l.column_name,
       s.SEGMENT_TYPE,
       s.segment_NAME, s.partition_name,
       s.TABLESPACE_NAME, l.SECUREFILE
ORDER BY  size_MB desc
)
where rownum < 11
/