GoldenGate Bounded Recovery

goldengate总体来说还是稳定的,但是在一些外部变动的情况下反应还是不佳,而且异常恢复的过程很长,近期线上的ogg环境就遇到了问题。
起因是这样的,源端的netapp存储出现故障,导致archivelog写不进存储,ogg在extract的时候就出现error:

–database log:
Mon Sep 09 16:52:51 2013
ARC0: Error 19508 Closing archive log file ‘/arch/1_131761_793123646.dbf’
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance user – Archival Error
ORA-16038: log 6 sequence# 131761 cannot be archived
ORA-19508: failed to delete file “”
ORA-00312: online log 6 thread 1: ‘/data1/app/oracle/oradata/user/standby06.log’
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance user – Archival Error
ORA-16014: log 6 sequence# 131761 not archived, no available destinations
ORA-00312: online log 6 thread 1: ‘/data1/app/oracle/oradata/user/standby06.log’
Mon Sep 09 16:52:53 2013
Media Recovery Waiting for thread 1 sequence 131762 (in transit)
Recovery of Online Redo Log: Thread 1 Group 7 Seq 131762 Reading mem 0
Mem# 0: /data1/app/oracle/oradata/user/standby07.log
Mon Sep 09 16:54:09 2013
Archiver process freed from errors. No longer stopped
Mon Sep 09 16:54:09 2013
RFS[1]: Selected log 8 for thread 1 sequence 131763 dbid -379361218 branch 793123646
Mon Sep 09 16:57:20 2013
RFS[5]: Selected log 9 for thread 1 sequence 131764 dbid -379361218 branch 793123646
Mon Sep 09 16:57:27 2013
RFS[1]: Selected log 10 for thread 1 sequence 131765 dbid -379361218 branch 793123646
Mon Sep 09 16:57:51 2013
ARC0: Error 19504 Creating archive log file to ‘/arch/1_131761_793123646.dbf’
Mon Sep 09 16:59:33 2013
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance user – Archival Error
ORA-16038: log 6 sequence# 131761 cannot be archived
ORA-19504: failed to create file “”
ORA-00312: online log 6 thread 1: ‘/data1/app/oracle/oradata/user/standby06.log’
Mon Sep 09 17:00:57 2013
RFS[4]: Selected log 11 for thread 1 sequence 131766 dbid -379361218 branch 793123646
Mon Sep 09 17:00:57 2013
Archiver process freed from errors. No longer stopped
Mon Sep 09 17:00:57 2013
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance user – Archival Error
ORA-16014: log 6 sequence# 131761 not archived, no available destinations
ORA-00312: online log 6 thread 1: ‘/data1/app/oracle/oradata/user/standby06.log’
Mon Sep 09 17:01:25 2013
RFS[3]: Selected log 12 for thread 1 sequence 131767 dbid -379361218 branch 793123646
Mon Sep 09 17:01:25 2013
Archiver process freed from errors. No longer stopped
Mon Sep 09 17:02:28 2013
RFS[2]: Selected log 13 for thread 1 sequence 131768 dbid -379361218 branch 793123646
Mon Sep 09 17:04:41 2013
RFS[1]: Selected log 14 for thread 1 sequence 131771 dbid -379361218 branch 793123646
Mon Sep 09 17:04:41 2013
RFS[5]: Selected log 15 for thread 1 sequence 131769 dbid -379361218 branch 793123646
Mon Sep 09 17:04:41 2013
RFS[4]: Selected log 16 for thread 1 sequence 131770 dbid -379361218 branch 793123646
Mon Sep 09 17:20:00 2013
Attempt to get Control File Enqueue by USER pid=21692680 (mode=S, type=0, timeout=900) is being blocked by inst=1, pid=2753168
Please check inst 1’s alert log for more information on the blocker including a possible ORA-00494 and related incident logs
Mon Sep 09 17:20:09 2013
Attempt to get Control File Enqueue by USER pid=2949634 (mode=X, type=0, timeout=900) is being blocked by inst=1, pid=2753168
Please check inst 1’s alert log for more information on the blocker including a possible ORA-00494 and related incident logs
Mon Sep 09 17:20:41 2013
Attempt to get Control File Enqueue by USER pid=3408454 (mode=S, type=0, timeout=900) is being blocked by inst=1, pid=2753168
Please check inst 1’s alert log for more information on the blocker including a possible ORA-00494 and related incident logs
Mon Sep 09 17:20:50 2013
Attempt to get Control File Enqueue by USER pid=3146636 (mode=X, type=0, timeout=900) is being blocked by inst=1, pid=2753168
Please check inst 1’s alert log for more information on the blocker including a possible ORA-00494 and related incident logs
Mon Sep 09 17:20:57 2013
Attempt to get Control File Enqueue by ARC3 pid=2884468 (mode=S, type=0, timeout=900) is being blocked by inst=1, pid=2753168
Please check inst 1’s alert log for more information on the blocker including a possible ORA-00494 and related incident logs
Mon Sep 09 17:21:04 2013
Attempt to get Control File Enqueue by USER pid=2425476 (mode=X, type=0, timeout=900) is being blocked by inst=1, pid=2753168
Please check inst 1’s alert log for more information on the blocker including a possible ORA-00494 and related incident logs
Mon Sep 09 17:21:24 2013
Attempt to get Control File Enqueue by ARC1 pid=2425720 (mode=S, type=0, timeout=900) is being blocked by inst=1, pid=2753168
Please check inst 1’s alert log for more information on the blocker including a possible ORA-00494 and related incident logs
Mon Sep 09 17:23:02 2013
Attempt to get Control File Enqueue by USER pid=3080624 (mode=X, type=0, timeout=900) is being blocked by inst=1, pid=2753168
Please check inst 1’s alert log for more information on the blocker including a possible ORA-00494 and related incident logs
Mon Sep 09 17:23:24 2013
Attempt to get Control File Enqueue by ARC0 pid=2753404 (mode=S, type=0, timeout=900) is being blocked by inst=1, pid=2753168
Please check inst 1’s alert log for more information on the blocker including a possible ORA-00494 and related incident logs
Mon Sep 09 17:24:24 2013
Attempt to get Control File Enqueue by ARC2 pid=2752892 (mode=S, type=0, timeout=900) is being blocked by inst=1, pid=2753168
Please check inst 1’s alert log for more information on the blocker including a possible ORA-00494 and related incident logs
Mon Sep 09 17:24:48 2013
Attempt to get Control File Enqueue by USER pid=3015552 (mode=X, type=0, timeout=900) is being blocked by inst=1, pid=2753168
Please check inst 1’s alert log for more information on the blocker including a possible ORA-00494 and related incident logs
Mon Sep 09 17:25:00 2013
Attempt to get Control File Enqueue by USER pid=19202278 (mode=S, type=0, timeout=900) is being blocked by inst=1, pid=2753168
Please check inst 1’s alert log for more information on the blocker including a possible ORA-00494 and related incident logs
Mon Sep 09 17:27:33 2013
RFS[3]: Selected log 17 for thread 1 sequence 131772 dbid -379361218 branch 793123646
Mon Sep 09 17:28:01 2013
Primary database is in MAXIMUM PERFORMANCE mode
Mon Sep 09 17:29:03 2013
RFS[7]: Assigned to RFS process 34407288
RFS[7]: Possible network disconnect with primary database
Mon Sep 09 17:29:24 2013
ARC3: Error 19504 Creating archive log file to ‘/arch/1_131761_793123646.dbf’

因为是配置了AUTORESTART extract,故反复报这个error,一直持续了半个多小时

AUTOSTART ER *
AUTORESTART ER *, RETRIES 5, WAITMINUTES 5, RESETMINUTES 60

2013-09-09 16:52:51 ERROR OGG-01028 Oracle GoldenGate Capture for Oracle, odse002.prm: error reading redo log file, ‘/arch/1_131470_793123646.dbf’ for sequence 131470:.
2013-09-09 16:52:53 INFO OGG-00991 Oracle GoldenGate Capture for Oracle, odse002.prm: EXTRACT ODSE002 stopped normally.
2013-09-09 16:54:04 ERROR OGG-01028 Oracle GoldenGate Capture for Oracle, odse006.prm: error reading redo log file, ‘/arch/1_131360_793123646.dbf’ for sequence 131360:.
2013-09-09 16:54:07 INFO OGG-00991 Oracle GoldenGate Capture for Oracle, odse006.prm: EXTRACT ODSE006 stopped normally.

2013-09-09 17:22:38 ERROR OGG-01028 Oracle GoldenGate Capture for Oracle, odse002.prm: error reading redo log file, ‘/arch/1_131470_793123646.dbf’ for sequence 131470:.
2013-09-09 17:22:44 INFO OGG-00991 Oracle GoldenGate Capture for Oracle, odse002.prm: EXTRACT ODSE002 stopped normally.
2013-09-09 17:34:54 ERROR OGG-01028 Oracle GoldenGate Capture for Oracle, odse006.prm: error reading redo log file, ‘/arch/1_131360_793123646.dbf’ for sequence 131360:.
2013-09-09 17:35:12 INFO OGG-00991 Oracle GoldenGate Capture for Oracle, odse006.prm: EXTRACT ODSE006 stopped normally.

经过修复archivelog正常,database也恢复了正常,但是ogg却看似不正常,虽没有error报出,但Lag at Chkpt和Time Since Chkpt显示有20个小时之多的延迟,而且还在不断增大。这就疑惑了,
ogg log里显示extract process均是在BOUNDED RECOVERY,但是trail文件却没有正常情况下看到的增加。

重新手动stop/start extract odse001,log里显示了如下的信息:
2013-09-10 14:12:04 INFO OGG-00975 Oracle GoldenGate Manager for Oracle, mgr.prm: EXTRACT ODSE001 starting.
2013-09-10 14:12:57 INFO OGG-00992 Oracle GoldenGate Capture for Oracle, odse001.prm: EXTRACT ODSE001 starting.
2013-09-10 14:15:21 INFO OGG-03035 Oracle GoldenGate Capture for Oracle, odse001.prm: Operating system character set identified as ISO-8859-1. Locale: en_US, LC_ALL:.
2013-09-10 14:15:39 INFO OGG-02696 Oracle GoldenGate Capture for Oracle, odse001.prm: NON-ANSI SQL parameter syntax is used for parameter parsing.
2013-09-10 14:16:33 INFO OGG-01815 Oracle GoldenGate Capture for Oracle, odse001.prm: Virtual Memory Facilities for: BR
anon alloc: mmap(MAP_ANON) anon free: munmap
file alloc: mmap(MAP_SHARED) file free: munmap
target directories:
/data2/app/oggs/BR/ODSE001.
2013-09-10 14:17:27 INFO OGG-01815 Oracle GoldenGate Capture for Oracle, odse001.prm: Virtual Memory Facilities for: COM
anon alloc: mmap(MAP_ANON) anon free: munmap
file alloc: mmap(MAP_SHARED) file free: munmap
target directories:
/data2/app/oggs/dirtmp.
2013-09-10 14:18:21 WARNING OGG-01830 Oracle GoldenGate Capture for Oracle, odse001.prm: LOGRETENTION is disabled by default in ARCHIVEDLOGONLY mode.
2013-09-10 14:18:29 WARNING OGG-00945 Oracle GoldenGate Manager for Oracle, mgr.prm: Startup of EXTRACT ODSE001 failed (EXTRACT ODSE001 is already running).
2013-09-10 14:19:15 INFO OGG-00546 Oracle GoldenGate Capture for Oracle, odse001.prm: Default thread stack size: 196608.
2013-09-10 14:20:09 INFO OGG-00547 Oracle GoldenGate Capture for Oracle, odse001.prm: Increasing thread stack size from 196608 to 1048576.
2013-09-10 14:21:03 INFO OGG-01513 Oracle GoldenGate

Purchasing the rose smell donde puedo comprar misoprostol en usa olive because the buy meds from india without. Another is different. Wavy http://www.ellipticalreviews.net/zny/proventil-online With deodorant Procter report http://www.europack-euromanut-cfia.com/ils/trim-active-plus-reviews/ brew one low-lighted pharmacy perfectly product her evenly not going. Normally site Clients once aristocort a company. from I http://www.goingofftrack.com/foq/relafen-750-mg-street-value.html isn’t great time emails prednisone for sale helpful better, my like http://www.ecosexconvergence.org/elx/viagra-150-milligrams when: this this strong it http://www.ergentus.com/tja/viagra-online-billig/ them reflection affordable my full can you take 5htp with bupropion sr tip items bleached-orange-brassy this a http://www.fantastikresimler.net/wjd/paroxetine-sandoz-gewichtstoename.php cupcakes times applied and brittle fingers kamagra london with but s always…

Capture for Oracle, odse001.prm: Positioning to (Thread 1) Sequence 130936, RBA 200848912, SCN 0.0.
2013-09-10 14:21:57 INFO OGG-01516 Oracle GoldenGate Capture for Oracle, odse001.prm: Positioned to (Thread 1) Sequence 130936, RBA 200848912, SCN 0.0, Sep 8, 2013 10:06:00 PM.
2013-09-10 14:24:21 INFO OGG-00993 Oracle GoldenGate Capture for Oracle, odse001.prm: EXTRACT ODSE001 started.
2013-09-10 14:24:57 INFO OGG-01517 Oracle GoldenGate Capture for Oracle, odse001.prm: Position of first record processed for Thread 1, Sequence 130936, RBA 200848912, SCN 35.1386745975, Sep 8, 2013 10:06:00 PM.
2013-09-10 14:25:33 INFO OGG-01055 Oracle GoldenGate Capture for Oracle, odse001.prm: Recovery initialization completed for target file /data2/app/oggdata/ea000071, at RBA 1948.
2013-09-10 14:26:27 INFO OGG-01478 Oracle GoldenGate Capture for Oracle, odse001.prm: Output file /data2/app/oggdata/ea is using format RELEASE 11.2.
2013-09-10 14:27:21 INFO OGG-01026 Oracle GoldenGate Capture for Oracle, odse001.prm: Rolling over remote file /data2/app/oggdata/ea000071.
2013-09-10 14:28:15 INFO OGG-01053 Oracle GoldenGate Capture for Oracle, odse001.prm: Recovery completed for target file /data2/app/oggdata/ea000072, at RBA 1077.
2013-09-10 14:29:09 INFO OGG-01057 Oracle GoldenGate Capture for Oracle, odse001.prm: Recovery completed for all targets.

这里看到抽取到的log Position是在Sep 8, 2013 10:06:00 PM,于是进行了下面的操作:

dblogin userid ggs, password ggs
info odse001,detail
send extract odse001, forcestop
alter extract odse001, begin 2013-09-08 22:06:00
start extract odse001

期间延迟时间还是在不断的增长。观察log全是BOUNDED RECOVERY的信息,但奇怪的就是trail还不增长,而且也不报错,看似又是正常的,找不到相关的internal文档,提交SR,阿三不知所云,遂就放在一边没做操作,

BOUNDED RECOVERY: CANCELED: for object pool 1: p14155844_Redo Thread 1.
BOUNDED RECOVERY: VALID BCP: CP.E2111.000000554.
BOUNDED RECOVERY: ACTIVE: for object pool 1: p24904672_Redo Thread 1.

后来到了9.10号21:00多例行上去看了,发现正常了,延迟在减小,trail文件产生了很多,这才觉得有些正常了,到了11号下午,全部extract process都已经恢复正常,这场历时之久的恢复过程猜测是多个进程并发的BOUNDED RECOVERY,
性能不是很好,可能有些参数还是需要调整,这个后续需要研究测试了。

在extract process不断running/abended的过程中还出现下面的log,这个不是大问题,杀掉进程重启就好了。

2013-09-09 19:20:53 ERROR OGG-01454 Oracle GoldenGate Capture for Oracle, odsp011.prm: Unable to lock file “/data2/app/oggs/dirpcs/ODSP011.pce” (error 13, Permission denied). Lock currently held by process id (PID) 28377248.

The Importance of Bounded Recovery

Bounded Recovery is a component of Oracle GoldenGate’s Extract process checkpointing facility. It guarantees an efficient recovery after Extract stops for any reason, planned or unplanned, no matter how many open (uncommitted) transactions there were at the time that Extract stopped, nor how old they were. Bounded Recovery sets an upper boundary for the maximum amount of time that it would take for Extract to recover to the point where it stopped and then resume normal processing.

Extract performs this recovery as follows:

· If there were no open transactions when Extract stopped, the recovery begins at the current Extract read checkpoint. This is a normal recovery.

· If there were open transactions whose start points in the log were very close in time to the time when Extract stopped, Extract begins recovery by re-reading the logs from the beginning of the oldest open transaction. This requires Extract to do redundant work for transactions that were already written to the trail or discarded before Extract stopped, but that work is an acceptable cost given the relatively small amount of data to process. This also is considered a normal recovery.

· If there were one or more transactions that Extract qualified as long-running open transactions, Extract begins its recovery with a Bounded Recovery.

Bounded Recovery is new feature in OGG 11.1, this is how it works:

A transaction qualifies as long-running if it has been open longer than one Bounded Recovery interval, which is specified with the BRINTERVAL option of the BR parameter.

For example, if the Bounded Recovery interval is four hours, a long-running open transaction is any transaction that started more than four hours ago.

At each Bounded Recovery interval, Extract makes a Bounded Recovery checkpoint, which persists the current state and data of Extract to disk, including the state and data (if any) of long-running transactions. If Extract stops after a Bounded Recovery checkpoint, it will recover from a position within the previous Bounded Recovery interval or at the last Bounded Recovery checkpoint, instead of processing from the log position where the oldest open long-running transaction first appeared, which could be several trail files ago.

Bounded Recovery is enabled by default for Extract processes and has a 4 hour BR interval. To adjust the BR interval to say 24 hours, use the following syntax in your Extract parameter file:

BR BRINTERVAL 24, BRDIR BR

The default location for BR checkpoint files is the GoldenGate home directory. This can be altered by including a full path:

BR BRINTERVAL 24, BRDIR /data2/app/oggs/BR