一次诡异的缓慢”死亡” CASE

今天下午遇到的一个诡异CASE,现象为系统诡异的死亡,目前判定为sqluldr2工具导致 (同事使用sqluldr2 用具 执行了sqluldr2_linux64_10204.bin -help 命令)
抛出的错误如下 (/var/log/message)

Sep 11 09:00:54 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808
Sep 11 09:07:08 db18 last message repeated 2 times
Sep 11 09:08:47 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808
Sep 11 09:09:54 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808
Sep 11 09:10:07 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808
Sep 11 09:10:46 db18 kernel: tnslsnr[19787]: segfault at 0000000000000000 rip 00002b92c7d5bebe rsp 00007fff0a1fea90 error 4
Sep 11 10:49:59 db18 auditd[8397]: Audit daemon rotating log files
Sep 11 10:53:41 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808
Sep 11 10:54:23 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808
Sep 11 12:50:44 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808
Sep 11 12:54:37 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808
Sep 11 13:56:56 db18 nrpe[15359]: Host 10.4.0.238 is not allowed to talk to us!
Sep 11 13:57:08 db18 nrpe[15425]: Host 10.4.0.238 is not allowed to talk to us!
Sep 11 14:17:49 db18 setroubleshoot: SELinux is preventing sqluldr2_linux6 from loading /home/oracle/product/10205/db1/lib/libnnz10.so which requires text relocation. For complete SELinux messages. run sealert -l 012919a7-837b-42af-9509-367d9d743808

QQ图片20130913002200

当时系统的SELinux为enable,阻止了sqluldr2对于oracle库文件的操作,多次尝试之后在09:10:46系统出现了异常,当时仅仅表现为listener的异常关闭,自此系统开始了”慢性”死忙

Last login: Tue Sep 10 14:59:41 2013 from 10.4.0.232
[oracle@db18 ~]$ lsnrctl status

LSNRCTL for Linux: Version 10.2.0.5.0 - Production on 11-SEP-2013 09:18:47

Copyright (c) 1991, 2010, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=0.0.0.0)(PORT=1522)))
TNS-12541: TNS:no listener
TNS-12560: TNS:protocol adapter error
  TNS-00511: No listener
   Linux Error: 111: Connection refused
[oracle@db18 ~]$ 
[oracle@db18 ~]$ 
[oracle@db18 ~]$ ora active
09:18:54 up 702 days, 14:57,  2 users,  load average: 3.06, 2.08, 0.93

no rows selected

[oracle@db18 ~]$ lsnrctl start

LSNRCTL for Linux: Version 10.2.0.5.0 - Production on 11-SEP-2013 09:19:16

Copyright (c) 1991, 2010, Oracle.  All rights reserved.

Starting /home/oracle/product/10205/db1/bin/tnslsnr: please wait...

TNSLSNR for Linux: Version 10.2.0.5.0 - Production
System parameter file is /home/oracle/product/10205/db1/network/admin/listener.ora
Log messages written to /home/oracle/product/10205/db1/network/log/listener.log
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=0.0.0.0)(PORT=1522)))

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=0.0.0.0)(PORT=1522)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 10.2.0.5.0 - Production
Start Date                11-SEP-2013 09:19:16
Uptime                    0 days 0 hr. 0 min. 0 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /home/oracle/product/10205/db1/network/admin/listener.ora
Listener Log File         /home/oracle/product/10205/db1/network/log/listener.log
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=0.0.0.0)(PORT=1522)))
Services Summary...
Service "misc" has 1 instance(s).
  Instance "misc", status UNKNOWN, has 1 handler(s) for this service...
The command completed successfully
[oracle@db18 ~]$ 
[oracle@db18 ~]$ ora active   
 09:19:22 up 702 days, 14:57,  2 users,  load average: 3.09, 2.18, 1.00

12号下午出现的问题,很多process占用大量的CPU资源,包括background process,多个进程处于running假死阶段,sys占用大量系统资源。

QQ图片20130913000507

TBQTS6(34${(BX$GLS~@PVP

database alert log如下:

Errors in file /data/oracle/admin/misc/udump/misc_ora_28401.trc:
ORA-27300: OS system dependent operation:invalid_process_id failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgpalive1

分析问题,查看ora background process进程状态 发现如下异常

[oracle@db18 arch]$ ps -ef | grep -i ora_
oracle   24019     1  0 May18 ?        01:49:30 ora_pmon_misc
oracle   24023     1  0 May18 ?        00:33:02 ora_psp0_misc
oracle   24027     1  0 May18 ?        00:34:32 ora_mman_misc
oracle   24031     1  0 May18 ?        00:59:23 ora_dbw0_misc
oracle   24033     1  0 May18 ?        01:02:13 ora_dbw1_misc
oracle   24035     1  0 May18 ?        00:58:57 ora_dbw2_misc
oracle   24037     1  0 May18 ?        00:59:24 ora_dbw3_misc
oracle   24039     1  0 May18 ?        00:59:06 ora_dbw4_misc
oracle   24043     1  0 May18 ?        00:59:35 ora_dbw5_misc
oracle   24047     1  0 May18 ?        04:28:36 ora_lgwr_misc
oracle   24049     1  0 May18 ?        09:41:15 ora_ckpt_misc
oracle   24053     1  0 May18 ?        00:36:44 ora_smon_misc
oracle   24057     1  0 May18 ?        00:28:39 ora_reco_misc
oracle   24061     1  1 May18 ?        1-06:42:09 ora_mmon_misc
oracle   24065     1  0 May18 ?        22:12:24 ora_mmnl_misc
oracle   24347     1  0 May18 ?        00:37:10 ora_arc0_misc
oracle   24351     1  0 May18 ?        00:43:10 ora_arc1_misc
oracle   24355     1  0 May18 ?        01:42:59 ora_lns1_misc
oracle   24543     1  0 May18 ?        00:28:31 ora_qmnc_misc
oracle   24780     1  0 May18 ?        00:10:05 ora_q000_misc
oracle   24969     1  0 May18 ?        00:09:25 ora_q001_misc
oracle   26077 25306  0 15:31 pts/8    00:00:00 tail -f misc_ora_21763.trc
oracle   27634 22109  0 15:44 pts/6    00:00:00 grep -i ora_
oracle   27867 24023 98 Sep11 ?        1-06:03:47 ora_psp0_misc   ------------刷新时间为Sep11
oracle   27868 27867 98 Sep11 ?        1-06:03:38 ora_psp0_misc   ------------为衍生进程

PSP进程出现异常状态,对于PSP0进程的解释如下:

PSP stands for Process SPawner and this process has the job of creating and managing other Oracle processes.
The initial set of background processes are spawned by the startup process. After the instance is started, PSP0 is responsible for spawning any
required background processes.

当时系统的状态为无法checkpoint 无法 syn log ,包括alter database set 等命令也出现hang死状态

[oracle@db18 ~]$ ora active
15:57:03 up 703 days, 21:35, 12 users,  load average: 24.11, 26.99, 50.27

   SID USERNAME   MACHINE          EVENT                      PARAM                W   WT SQL                        ST     LT LOGON_TIME
------ ---------- ---------------------- ------------------------------ -------------------- ---- ---- ---------------------------- -- ------ ----------
  3601 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006158/0       0    0 /frw83qw2fvsa1             K        0    1088732
  3607 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006175/0       0    0 /frw83qw2fvsa1             K        0        3857
  3611 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006137/0       0    0 /frw83qw2fvsa1             K        0    1088741
  3616 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006169/0       0    0 /frw83qw2fvsa1             K        0        3895
  3634 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006140/0       0    0 /frw83qw2fvsa1             K        0    1088735
  3635 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006173/0       0    0 /frw83qw2fvsa1             K        0        3889
  3646 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006209/0       0    0 /frw83qw2fvsa1             A        0        1023
  3650 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006174/0       0    0 /frw83qw2fvsa1             K        0        3882
  3661 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006152/0       0    0 /frw83qw2fvsa1             K        0        5346
  3664 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006138/0       0    0 /frw83qw2fvsa1             K        0      877657
  3668 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006200/0       0    0 /frw83qw2fvsa1             A        0        1027
  3671 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006198/0       0    0 /frw83qw2fvsa1             A        0        1027
  3676 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006184/0       0    0 /6p07vqnwqrsyw            A        0        1032
  3691 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006210/0       0    0 /frw83qw2fvsa1             A        0        1011
  3696 LOG_DATA2  JDBC Thin Client       log file sync                     10795/2810006141/0       0    0 /frw83qw2fvsa1             K        0        5346
  3698 SYS     sqlplus@db18         reliable message            70088494424/70158516   0    0 /                         A           0         390

表现为ckpt lgwr dbwn进程全部死亡 .. oradebug dump systemstat 速度极其缓慢 oradebug dump errorstack hang —- sys 占用大量资源

awk -f ass109.awk systemdump.txt > systemdump.log —并没有发现任何异常等待以及blocker信息

[oracle@db18 ~]$ tail -f -n 50 systemdump.log 
     Cmd: Select
296:waiting for 'SQL*Net message from client' 
     Cmd: Select
297:waiting for 'SQL*Net message from client' 
     Cmd: Select
298:waiting for 'SQL*Net message from client' 
     Cmd: PL/SQL Execute
299:waiting for 'SQL*Net message from client' 
     Cmd: PL/SQL Execute
300:waiting for 'SQL*Net message from client' 
     Cmd: PL/SQL Execute
302:waiting for 'SQL*Net message from client' 
     Cmd: Insert
444:waiting for 'SQL*Net message from client' 
     Cmd: Select
532:waiting for 'SQL*Net message from client' 
     Cmd: Select
536:waiting for 'SQL*Net message from client' 
     Cmd: Select
538:waiting for 'SQL*Net message from client' 
     Cmd: Select

NO BLOCKING PROCESSES FOUND

System State 2
~~~~~~~~~~~~~~~~
1:                                      
2:  waiting for 'pmon timer'            
3:  waiting for 'rdbms ipc message'     
4:  last wait for 'rdbms ipc message'   
5:  waiting for 'rdbms ipc message'     
6:  waiting for 'rdbms ipc message'     
7:  waiting for 'rdbms ipc message'     
8:  waiting for 'rdbms ipc message'     
9:  waiting for 'rdbms ipc message'     
10: waiting for 'rdbms ipc message'     
11: waiting for 'rdbms ipc message'     
12: waiting for 'rdbms ipc message'     
13: waiting for 'rdbms ipc message'     
14: waiting for 'smon timer'            
15: last wait for 'os thread startup'   
16: last wait for 'rdbms ipc message'   
17: waiting for 'SQL*Net message from client' 
     Cmd: Select
18:                                     

NO BLOCKING PROCESSES FOUND


171760 Lines Processed.

其现象为系统的一个”缓慢”的死亡行为 (通过SAR 定位到一个折点时间)

sar -f /var/log/sa/sar11   ---- sep 11

09:00:01 AM       all      0.01      0.00      0.01      0.00      0.00     99.97
09:10:01 AM       all      0.03      0.00      0.02      0.00      0.00     99.95   ----->   (09:10:46 db18 kernel: tnslsnr[19787]: segfault at 0000000000000000 rip 00002b92c7d5bebe rsp 00007fff0a1fea90 error 4)
09:20:01 AM       all      2.18      0.00      2.16      0.00      0.00     95.66
09:30:01 AM       all      3.15      0.00      3.13      0.00      0.00     93.71
09:40:01 AM       all      3.15      0.00      3.13      0.00      0.00     93.72
09:50:01 AM       all      3.15      0.00      3.13      0.00      0.00     93.72

03:00:01 PM       all      3.99      0.00      4.39      0.00      0.00     91.62
03:10:01 PM       all      4.04      0.00      4.32      0.00      0.00     91.64
03:20:01 PM       all      4.47      0.00      4.87      0.00      0.00     90.66
03:30:01 PM       all      4.98      0.00      5.47      0.00      0.00     89.55
03:40:01 PM       all      4.99      0.00      5.45      0.00      0.00     89.56
03:50:01 PM       all      4.99      0.00      5.45      0.00      0.00     89.56
04:00:01 PM       all      4.99      0.00      5.45      0.00      0.00     89.56
04:10:01 PM       all      4.94      0.00      5.50      0.00      0.00     89.56
04:20:01 PM       all      4.82      0.00      5.62      0.00      0.00     89.56


10:20:01 PM       all     10.72      0.00     12.22      0.00      0.00     77.06
10:30:01 PM       all     10.73      0.00     12.21      0.00      0.00     77.06
10:40:01 PM       all     10.69      0.00     12.25      0.00      0.00     77.06
10:50:01 PM       all     10.67      0.00     12.27      0.00      0.00     77.06
11:00:01 PM       all     10.52      0.00     12.41      0.00      0.00     77.06
11:10:01 PM       all     10.42      0.00     12.52      0.00      0.00     77.06
11:20:01 PM       all     10.38      0.00     12.56      0.00      0.00     77.06


sar -f /var/log/sa/sar12   ---- sep 12

12:00:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM       all     11.23      0.00     11.72      0.10      0.00     76.95
12:20:02 AM       all     11.29      0.00     11.65      0.10      0.00     76.96
12:30:01 AM       all     11.30      0.00     11.63      0.09      0.00     76.96
12:40:01 AM       all     11.71      0.00     12.36      0.10      0.00     75.83
12:50:01 AM       all     12.14      0.00     12.88      0.10      0.00     74.87
01:00:01 AM       all     12.08      0.00     12.94      0.10      0.00     74.88
01:10:01 AM       all     12.24      0.00     12.79      0.10      0.00     74.88
01:20:01 AM       all     12.25      0.00     12.77      0.10      0.00     74.88

10:00:01 AM       all     12.08      0.00     37.94      0.10      0.00     49.88
10:10:01 AM       all     12.07      0.00     37.95      0.09      0.00     49.88
10:20:01 AM       all     12.09      0.00     37.94      0.11      0.00     49.87
10:30:01 AM       all     12.10      0.00     37.93      0.10      0.00     49.88
10:40:01 AM       all     12.02      0.00     38.00      0.11      0.00     49.87
10:50:01 AM       all     12.06      0.00     37.96      0.10      0.00     49.88
11:00:01 AM       all     12.37      0.00     37.65      0.10      0.00     49.88
11:10:01 AM       all     12.32      0.00     37.70      0.10      0.00     49.88
11:20:01 AM       all     12.50      0.00     37.52      0.10      0.00     49.88
11:30:01 AM       all     12.59      0.00     37.43      0.10      0.00     49.88
11:40:01 AM       all     12.77      0.00     37.25      0.11      0.00     49.87

01:50:01 PM       all     13.41      0.00     40.77      0.10      0.00     45.71
02:00:01 PM       all     13.41      0.00     40.78      0.11      0.00     45.71
02:10:01 PM       all     13.41      0.00     40.78      0.10      0.00     45.71
02:20:01 PM       all     13.17      0.00     63.22      0.05      0.00     23.55
02:30:02 PM       all     13.21      0.00     86.78      0.00      0.00      0.00
02:40:01 PM       all     13.29      0.00     86.71      0.00      0.00      0.00
02:50:01 PM       all     13.42      0.00     86.58      0.00      0.00      0.00
03:00:01 PM       all     13.58      0.00     86.42      0.00      0.00      0.00
03:10:01 PM       all     13.38      0.00     86.62      0.00      0.00      0.00
03:20:01 PM       all     13.25      0.00     86.75      0.00      0.00      0.00
03:30:01 PM       all     13.19      0.00     86.80      0.00      0.00      0.00
03:40:01 PM       all     12.95      0.00     86.25      0.00      0.00      0.79

假想:sqluldr2 与 Selinux的冲突导致了oracle so文件的异常,表现为segfault–listener的异常关闭,进而系统缓慢的”死亡” (建议不要在生产系统直接使用sqluldr2而使用TNS远程连接)

vmcd

About vmcd

Phone: +86 18666668061 Email & Gtalk: ylouis83@gmail.com Personal Blog: http://www.vmcd.org 2010 Oracle Certified Database 10g Administrator Master SHOUG Member