Sometimes we end up in a situation where a long-running transaction is not completing and we have no idea how much longer it is going to take. This happened to one of our DBAs, who found an MLOG bloated because of one orphan snapshot entry. Orphan entries are those where the snapshot site is not registered on the master (no entry in DBA_REGISTERED_SNAPSHOTS) but still has an entry against the snapshot log (an entry in DBA_SNAPSHOT_LOGS). This can happen if a snapshot is dropped on the downstream database and the drop does not get cleaned up on the upstream database.
In the situation I faced, the upstream team had an MLOG that had bloated to 18GB, and the index on the MLOG had bloated to 30GB. (Ya, I know it's bad :-))
They identified the orphan snapshot ID and wanted to purge it from the snapshot log to reduce the size of the MLOG (after the purge they planned to move the MLOG and rebuild the index).
They used the following procedure of DBMS_SNAPSHOT to purge the snapshot ID from the log:
PROCEDURE PURGE_SNAPSHOT_FROM_LOG
 Argument Name                  Type                    In/Out Default?
 ------------------------------ ----------------------- ------ --------
 SNAPSHOT_ID                    BINARY_INTEGER          IN
They started the activity in the morning and monitored it until evening, but it still had not completed. I helped them track the progress using the real-time SQL monitoring report, which showed that the session had already read around 60GB and had used around 48GB of undo by that time. It was still not clear how the command had read 60GB worth of data when the MLOG was only 18GB.
Also, the original base table was just 2GB.
At this point they wanted to kill the session. But killing the session does not help immediately, because the session then has to perform a huge rollback as well (48GB of UNDO).
Since the command was not completing and had already taken almost an entire shift, they decided to kill it anyway. The session was killed using "ALTER SYSTEM KILL SESSION '<sid>,<serial#>' IMMEDIATE". However, the session was only marked as killed and was still holding its lock (as seen in the V$LOCK view), because it was busy doing the rollback. We can monitor the progress of the rollback using the V$TRANSACTION view.
You can look at used_ublk in V$TRANSACTION to estimate how long the rollback is going to take to complete.
SQL> SELECT a.used_ublk FROM v$transaction a, v$session b WHERE a.addr = b.taddr AND b.sid = <SID>;
For example:
If used_ublk showed 29,900 twelve hours ago and is now 22,900, it has taken 12 hours to roll back 7,000 undo blocks. At that rate (about 583 blocks per hour) the remaining 22,900 blocks will take approximately another 39 hours, depending on the types of transactions that are rolling back.
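The same estimate can be scripted. This is a minimal Python sketch that assumes the rollback rate stays constant between samples, which is only an approximation; the function name estimate_rollback_hours is made up for illustration and is not part of any Oracle tooling.

```python
def estimate_rollback_hours(blocks_before, blocks_now, hours_elapsed):
    """Linearly extrapolate the remaining rollback time from two USED_UBLK samples.

    Assumption: the rollback rate observed between the two samples stays constant.
    """
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    rolled_back = blocks_before - blocks_now
    if rolled_back <= 0:
        raise ValueError("no rollback progress between the two samples")
    rate_per_hour = rolled_back / hours_elapsed
    return blocks_now / rate_per_hour


# Figures from the example: 29,900 blocks twelve hours ago, 22,900 now.
remaining = estimate_rollback_hours(29_900, 22_900, 12)
print(f"about {remaining:.1f} hours remaining")
```

Taking a few samples at different times and comparing the resulting estimates gives a feel for how stable the rate really is.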
Recovery was very slow because the session was doing serial recovery. Next we found the OS PID of the session and killed the OS process as well, so that recovery could happen in the background using SMON. Within a few minutes PMON performed the cleanup and the lock was released.
Rollback continued in the background, and this is faster than the rollback performed by the session itself. If we kill the session and its shadow process at OS level, SMON picks up the rollback and goes for parallel rollback, which is faster.
V$FAST_START_TRANSACTIONS & X$KTUXE
We can monitor the progress of the rollback in the V$FAST_START_TRANSACTIONS view.
V$FAST_START_TRANSACTIONS contains one row for each transaction that Oracle is recovering in parallel.
The FAST_START_PARALLEL_ROLLBACK parameter determines the maximum number of processes that may exist for performing parallel rollback.
In fast-start parallel rollback, the background process SMON acts as a coordinator and rolls back a set of transactions in parallel using multiple server processes.
Fast-start parallel rollback is mainly useful when a system has transactions that run for a long time before committing, especially parallel INSERT, UPDATE and DELETE operations. When SMON discovers that the amount of recovery work is above a certain threshold, it automatically begins parallel rollback by dispersing the work among several parallel processes.
The following queries are available to monitor the progress of the transaction recovery:
set linesize 100
alter session set NLS_DATE_FORMAT='DD-MON-YYYY HH24:MI:SS';
select usn, state, undoblockstotal "Total", undoblocksdone "Done",
       undoblockstotal-undoblocksdone "ToDo",
       decode(cputime, 0, 'unknown',
              sysdate+(((undoblockstotal-undoblocksdone) / (undoblocksdone / cputime)) / 86400)) "Estimated time to complete"
from v$fast_start_transactions;
Run the above query several times in a row; this will give you a good idea of how SMON is progressing.
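To make the arithmetic behind the "Estimated time to complete" column easier to follow, here is a small Python sketch of the same calculation: remaining blocks divided by the blocks-done-per-second rate. The function name and the sample numbers are made up for illustration. The sketch also treats "no blocks done yet" as unknown, since the division would otherwise be by zero.

```python
from datetime import datetime, timedelta


def estimated_completion(total_blocks, done_blocks, cputime_seconds, now):
    """Mirror the query's arithmetic: remaining / (done / cputime) seconds from now."""
    if cputime_seconds == 0 or done_blocks == 0:
        # The query prints 'unknown' when cputime is 0; in this sketch a zero
        # done count is treated the same way to avoid dividing by zero.
        return None
    remaining = total_blocks - done_blocks
    blocks_per_second = done_blocks / cputime_seconds
    return now + timedelta(seconds=remaining / blocks_per_second)


# Hypothetical sample: 250,000 of 1,000,000 undo blocks done in 3,600 seconds.
now = datetime(2016, 8, 1, 12, 0, 0)
print(estimated_completion(1_000_000, 250_000, 3_600, now))
```

With these made-up numbers the estimate lands about three hours after "now". As with the query, the result is only as good as the assumption that the rate stays constant.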
- In some versions cputime does not work (it is always 0), so the estimated completion time will not be displayed.
- In some cases the v$fast_start_transactions view will not work. If so, you can query the internal fixed table x$ktuxe instead.
The ‘ktuxesiz’ column represents the remaining number of undo blocks required for rollback:
select ktuxeusn, to_char(sysdate,'DD-MON-YYYY HH24:MI:SS') "Time", ktuxesiz, ktuxesta
from x$ktuxe
where ktuxecfl = 'DEAD';
I was not able to see the recovery progress using V$FAST_START_TRANSACTIONS, but I was able to see it in x$ktuxe:
select ktuxeusn, to_char(sysdate,'DD-MON-YYYY HH24:MI:SS') "Time", ktuxesiz, ktuxesta from x$ktuxe where ktuxecfl = 'DEAD';

  KTUXEUSN|Time                      |  KTUXESIZ|KTUXESTA
----------|--------------------------|----------|----------------
      2167|01-AUG-2016 12:05:14      |   5260156|ACTIVE

SQL>/

  KTUXEUSN|Time                      |  KTUXESIZ|KTUXESTA
----------|--------------------------|----------|----------------
      2167|01-AUG-2016 12:05:15      |   5259945|ACTIVE

SRW1NA>/

  KTUXEUSN|Time                      |  KTUXESIZ|KTUXESTA
----------|--------------------------|----------|----------------
      2167|01-AUG-2016 12:05:15      |   5259854|ACTIVE

..
..
..
<After 2-3 hours>

  KTUXEUSN|Time                      |  KTUXESIZ|KTUXESTA
----------|--------------------------|----------|----------------
      2167|01-AUG-2016 16:31:47      |    612697|ACTIVE
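Two samples of this output are enough for a rough rate and time-to-finish estimate. The Python sketch below is only an illustration: it parses the "Time" column in the format used above, assumes English month abbreviations and a constant rollback rate, and the helper name eta_from_samples is hypothetical.

```python
from datetime import datetime

# Matches the 'DD-MON-YYYY HH24:MI:SS' format used in the query above
# (assumes English month abbreviations such as AUG).
TIME_FORMAT = "%d-%b-%Y %H:%M:%S"


def eta_from_samples(first, last):
    """Each sample is a (time_string, ktuxesiz) pair taken from the query output.

    Returns (blocks_per_second, seconds_left), assuming a constant rate.
    """
    t0 = datetime.strptime(first[0], TIME_FORMAT)
    t1 = datetime.strptime(last[0], TIME_FORMAT)
    elapsed = (t1 - t0).total_seconds()
    blocks_done = first[1] - last[1]
    if elapsed <= 0 or blocks_done <= 0:
        raise ValueError("samples must be in time order and show progress")
    rate = blocks_done / elapsed
    return rate, last[1] / rate


# First and last samples from the output above.
rate, seconds_left = eta_from_samples(
    ("01-AUG-2016 12:05:14", 5260156),
    ("01-AUG-2016 16:31:47", 612697),
)
print(f"{rate:.0f} blocks/s, about {seconds_left / 60:.0f} minutes left")
```

For the two samples shown, this works out to roughly 290 undo blocks per second, with something on the order of 35 minutes of rollback left at that rate.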
Speeding up recovery
We can further improve the speed of recovery by taking the following steps:
1) There are cases where parallel transaction recovery is not as fast as serial transaction recovery, because the PQ slaves interfere with each other. To check the parallel recovery processes and their state, query:
select * from v$fast_start_servers;
Column STATE shows whether each server is IDLE or RECOVERING. If only one process is in state RECOVERING while the other processes are IDLE, then you should disable parallel transaction recovery. How to do this is outlined in the following note:
Note 238507.1: How to Disable Parallel Transaction Recovery When Parallel Txn Recovery is Active
2) If all the rows show RECOVERING in the STATE column of v$fast_start_servers, then you will benefit from adding more threads to do the recovery.
You can do so by setting the FAST_START_PARALLEL_ROLLBACK parameter. Set it to HIGH if you want to speed up the recovery.
The possible values of this parameter are:
- FALSE – Parallel rollback is disabled
- LOW – Limits the maximum degree of parallelism to 2 * CPU_COUNT
- HIGH – Limits the maximum degree of parallelism to 4 * CPU_COUNT
Note that the Oracle Database Reference lists this parameter as modifiable with ALTER SYSTEM, so a bounce should not strictly be required to change it, although in our case we changed it together with a database restart. Also, if you change the value of this parameter, transaction recovery is stopped and restarted with the new implied degree of parallelism. So if more than half of the rollback is already done and you think the change is not worth the interruption, you can leave the parameter as it is.
3) Increase the parameter ‘_cleanup_rollback_entries’
This hidden parameter determines the number of undo entries to apply per transaction cleanup. The default value is 100; you can change it to, say, 400. It cannot be changed dynamically, so the database has to be restarted for the new value to take effect.
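The decision logic in steps 1 and 2 can be written down as a small helper. This is only a Python sketch of the rules of thumb described above, not an Oracle utility. The function names are hypothetical, the input is simply the list of STATE values read from v$fast_start_servers, and representing FALSE as a cap of 0 is a convention of this sketch.

```python
def max_parallel_degree(setting, cpu_count):
    """Upper bound on recovery parallelism for a FAST_START_PARALLEL_ROLLBACK value.

    FALSE is represented as 0 here, meaning parallel rollback is disabled.
    """
    limits = {"FALSE": 0, "LOW": 2 * cpu_count, "HIGH": 4 * cpu_count}
    return limits[setting.upper()]


def recommend(states):
    """Apply the rules of thumb to the STATE column of v$fast_start_servers."""
    if not states:
        return "no recovery servers listed"
    recovering = sum(1 for s in states if s.upper() == "RECOVERING")
    if recovering == len(states):
        return "all servers busy: consider FAST_START_PARALLEL_ROLLBACK = HIGH"
    if recovering == 1:
        return "only one server working: consider disabling parallel recovery"
    return "mixed picture: keep monitoring"


print(max_parallel_degree("HIGH", 16))
print(recommend(["RECOVERING", "IDLE", "IDLE", "IDLE"]))
print(recommend(["RECOVERING"] * 8))
```

A single snapshot of the states can be misleading, so it is worth sampling v$fast_start_servers a few times before acting on the recommendation.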
In our specific situation, we knew that a huge rollback had to be performed and we were monitoring the rollback progress from the start. So we decided at the very beginning to set FAST_START_PARALLEL_ROLLBACK to HIGH and bounce the DB. This improved recovery speed right from the beginning.
References:
SMON: Parallel transaction recovery tried (Doc ID 1458738.1)
Troubleshooting Database Transaction Recovery (Doc ID 1494886.1)
Database Hangs Because SMON Is Taking 100% CPU Doing Transaction Recovery (Doc ID 414242.1)
How to Disable Parallel Transaction Recovery When Parallel Txn Recovery is Active (Doc ID 238507.1)
Filed under: Backup and Recovery, Oracle Database 11g Tagged: killing OS PID, KTUXESIZ, recovery slaves, transaction recovery, v$fast_start_servers, v$fast_start_transactions, x$ktuxe
