Sunday, July 29, 2018

Bigip Irule Uri Findstr Causes Unexpected APPS Behaviour

Symptom:

   We have bigip irule which contains findstr to match "node="
       set nodeselect [findstr [string tolower [HTTP::uri] ] "node=" 5 ";" ]
    However we see some confluence pages are spinning after we implement the irule
   

Diagnosis:

     When I dig in a bit more,  I find  there is request url which contains "node=root" when you click troubled page. You can  get details in Chrome developer tools

      The "node=root" captured by bigip irule which causes the issue.

https://confluence-uat.test.com/confluence/pages/children.action?spaceKey=test&node=root
    

Solution:

      I update the irule :  change  findstr to match "testnode=" and the issue is fixed

       set nodeselect [findstr [string tolower [HTTP::uri] ] "testnode=" 9 ";" ]
   

Wednesday, July 25, 2018

A Shortcoming Of Auto Tuned Undo Retention

Symptom:

    When Apps / DB team are doing data migration , there are big batches with insert sql and  the migration tool failed with error:

  ORA-01628: max # extents (32765) reached for rollback segment    XXXXXXXX


Diagnosis:


       The error informs that there were high volume of data undo segments and it reached the limit 32k.
     Use sql to check status of undo segments:

SQL> SELECT DISTINCT STATUS, SUM(BYTES)/1024/1024, COUNT(*) FROM 
DBA_UNDO_EXTENTS GROUP BY STATUS; 

STATUS                      SUM(BYTES)/1024/1024   COUNT(*) 
--------------------------- -------------------- ---------- 
UNEXPIRED                             162676.563      55123 
EXPIRED                               12121.5625      12319 

So majority of the space is used by UNEXPIRED extents 

So something is wrong here, undo extents should not have majority as UNEXPIRED unless there are huge queries running right now.
   From AWR report, we can find the longest query is about 7654s. As long as undo tablespace has enough space for this longest query, the undo extents it uses should expire soon after the query is finished.

From 10g , oracle DB added auto tuned undo which is based on statistic data. Default it is TRUE
Use sql to check this hidden parameter

SELECT 
a.ksppinm Param , b.ksppstvl SessionVal ,c.ksppstvl InstanceVal,a.ksppdesc Descr 
FROM  x$ksppi a , x$ksppcv b , x$ksppsv c WHERE 
a.indx = b.indx AND 
a.indx = c.indx AND 
a.ksppinm LIKE '/_%undo_autotune%' escape '/'
ORDER BY 1;

The extents are not expired as we expect due to another hidden parameter _highthreshold_undoretention.  It is used to control auto tuned undo behaviour.
By default it is set 4294967294 (50 days) . It normally works if no big migration.
However giving big migration project,  it turns out to keep large volume of undo extents unexpired with this big value.  So the solution is to make it small 86400s (1 day) and they will expire in 1 day.

Solution:

 
ALTER SYSTEM SET "_highthreshold_undoretention"=86400 scope=both and sid='*';


   

Thursday, July 05, 2018

SQL*PLUS Date Output Format Change

Symptom:

   Sqlplus date output normally is like '11-MAY-11'  which does not have much detailed information we would like to see.

Solution:

  To change sqlplus date output, use sql:

alter session set NLS_DATE_FORMAT = 'yyyy-mm-dd HH24:mi:ss';

StuckThreadMaxTime of "600" seconds In WebLogic

Symptom:

   In weblogic out logs, we see quite a few stuck threads alerts
   Error Stack is like:

<Apr 30, 2018, 11:01:02,922 AM UTC> <Error> <WebLogicServer> <BEA-000337> <[STUCK] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "637" seconds working on the request "Http Request Information: weblogic.servlet.internal.ServletRequestImpl@49cb4b67[GET /sites/Satellite]
", which is more than the configured time (StuckThreadMaxTime) of "600" seconds in "server-failure-trigger". Stack trace:
    java.net.SocketInputStream.socketRead0(Native Method)
    java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    java.net.SocketInputStream.read(SocketInputStream.java:170)
    java.net.SocketInputStream.read(SocketInputStream.java:141)
    oracle.net.nt.MetricsEnabledInputStream.read(TcpNTAdapter.java:759)
    oracle.net.ns.Packet.receive(Packet.java:312)
    oracle.net.ns.DataPacket.receive(DataPacket.java:106)
    oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:306)
    oracle.net.ns.NetInputStream.read(NetInputStream.java:250)
    oracle.net.ns.NetInputStream.read(NetInputStream.java:172)
    oracle.net.ns.NetInputStream.read(NetInputStream.java:90)
    oracle.jdbc.driver.T4CSocketInputStreamWrapper.readNextPacket
(T4CSocketInputStreamWrapper.java:124)
    oracle.jdbc.driver.T4CSocketInputStreamWrapper.read
(T4CSocketInputStreamWrapper.java:80)


Diagnosis:

    The  [GET /sites/Satellite]  is causing problems. First we use web browser to test the url. ie http://test.domain.com/sites/Satellite
     It is possible that bad design of the page causes problem. Open developer tools of browser (Chrome) to check which sections take most of time

   Then we need to check any bad sql queries involved in the Database. Get AWR report and find top consumers.


Sunday, July 01, 2018

DISK IO Impact of Linux Audit Service

Symptom:

    We record that the local DISK IO is high, there are 2 majority consumers.

  • One is index operation.On Jira APPS, the apps update, delete index files quite intensively . Same apply to ELK apps.   They both use same lucent index technology. 
  • The other is Linux Audit Service

Use iostat 2 10 to find top used volumes
Use lsblk to find out which volume is related to APPS or Audit

You can see massive writes on /var/log/audit/audit.log , and it rotates very quickly
Audit entries like
type=CWD msg=audit(1529554416.727:219411617):  cwd="/data/atlassian/product/jira/bin"
type=PATH msg=audit(1529554416.727:219411617): item=0 name="/u01/app/atlassian/admin/jira/caches/indexes/changes/" inode=11288602 dev=ca:71 mode=040755 ouid=1000 ogid=1001 rdev=00:00
type=PATH msg=audit(1529554416.727:219411617): item=1 name="/data/atlassian/admin/jira/caches/indexes/changes/_13469.fdt" inode=11288601 dev=ca:71 mode=0100644 ouid=1000 ogid=1001 rdev=00:00
type=SYSCALL msg=audit(1529554416.727:219411618): arch=c000003e syscall=87 success=yes exit=0 a0=7fbbef337d60 a1=7f9e9c009ec8 a2=7fbbef337d60 a3=61632f6172696a2f items=2 ppid=1 pid=7146 auid=501742 uid=1000 gid=1001 euid=1000 suid=1000 fsuid=1000 egid=1001 sgid=1001 fsgid=1001 tty=(none) ses=38734 comm="java" exe="/u01/app/atlassian/product/jira/jdk/bin/java" key="delete

Jira index operation is expected. We can reduce IO via reduce unnecessary audit entries.

Solution:


  • Remove Linux Audit Configuration Immutable . comment "-e2"  in /etc/audit/audit.rules
  • Reboot OS
  • Remove SYSCALL CWD and PATH from audit. Add below into audit.rules
-a always,exclude -F msgtype=SYSCALL
-a always,exclude -F msgtype=CWD
-a always,exclude -F msgtype=PATH
  • Bounce the audit service
  • Unfortunately audit service only can exclude via msgtype , it can't exclude based on exe file or directory or key.