Skip Headers

Oracle® Data Guard Concepts and Administration
10g Release 1 (10.1)

Part Number B10823-01
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Contents
Contents
Go to Index
Index
Go to Master Index
Master Index
Go to Feedback page
Feedback

Go to previous page
Previous
Go to next page
Next
View PDF

A
Troubleshooting Data Guard

This appendix provides help troubleshooting a standby database. This appendix contains the following sections:

A.1 Common Problems

If you encounter a problem when using a standby database, it is probably because of one of the following reasons:

A.1.1 Standby Archive Destination Is Not Defined Properly

If the STANDBY_ARCHIVE_DEST initialization parameter does not specify a valid directory name on the standby database, the Oracle database will not be able to determine the directory in which to store the archived redo log files. Check the DESTINATION and ERROR columns in the V$ARCHIVE_DEST view by entering the following query and ensure the destination is valid:

SQL> SELECT DESTINATION, ERROR FROM V$ARCHIVE_DEST;

A.1.2 Renaming Datafiles with the ALTER DATABASE Statement

You cannot rename the datafile on the standby site when the STANDBY_FILE_MANAGEMENT initialization parameter is set to AUTO. When you set the STANDBY_FILE_MANAGEMENT initialization parameter to AUTO, use of the following SQL statements is not allowed:

If you attempt to use any of these statements on the standby database, an error is returned. For example:

SQL> ALTER DATABASE RENAME FILE '/disk1/oracle/oradata/payroll/t_db2.log' to 'dummy'; 
alter database rename file '/disk1/oracle/oradata/payroll/t_db2.log' to 'dummy' 
* 
ERROR at line 1: 
ORA-01511: error in renaming log/data files 
ORA-01270: RENAME operation is not allowed if STANDBY_FILE_MANAGEMENT is auto

See Section 8.3.1 to learn how to add datafiles to a physical standby database.

A.1.3 Standby Database Does Not Receive Redo Data from the Primary Database

If the standby site is not receiving redo data, query the V$ARCHIVE_DEST view and check for error messages. For example, enter the following query:

SQL> SELECT DEST_ID "ID",
  2> STATUS "DB_status",
  3> DESTINATION "Archive_dest",
  4> ERROR "Error"
  5> FROM V$ARCHIVE_DEST WHERE DEST_ID <=5;

ID DB_status Archive_dest                   Error   
-- --------- ------------------------------ ------------------------------------
 1  VALID    /vobs/oracle/work/arc_dest/arc                          
 2  ERROR    standby1                       ORA-16012: Archivelog standby database identifier mismatch  
 3  INACTIVE                            
 4  INACTIVE                    
 5  INACTIVE                                           
5 rows selected.

If the output of the query does not help you, check the following list of possible issues. If any of the following conditions exist, log transport services will fail to transmit redo data to the standby database:

A.1.4 You Cannot Mount the Physical Standby Database

You cannot mount the standby database if the standby control file was not created with the ALTER DATABASE CREATE [LOGICAL] STANDBY CONTROLFILE ... statement or RMAN command. You cannot use the following types of control file backups:

A.2 Log File Destination Failures

If you specify REOPEN for an OPTIONAL destination, it is possible for the Oracle database to reuse online redo log files even if there is an error archiving to the destination in question. If you specify REOPEN for a MANDATORY destination, log transport services stall the primary database when redo data cannot be successfully transmitted.

The REOPEN attribute is required when you use the MAX_FAILURE attribute. Example A-1 shows how to set a retry time of 5 seconds and limit retries to 3 times.

Example A-1 Setting a Retry Time and Limit

LOG_ARCHIVE_DEST_1='LOCATION=/arc_dest REOPEN=5 MAX_FAILURE=3'

Use the ALTERNATE attribute of the LOG_ARCHIVE_DEST_n parameter to specify alternate archive destinations. An alternate archiving destination can be used when the transmission of redo data to a standby database fails. If transmission fails and the NOREOPEN attribute was specified or the MAX_FAILURE attribute threshold was exceeded, log transport services attempts to transmit redo data to the alternate destination on the next archival operation.

Use the NOALTERNATE attribute to prevent the original archive destination from automatically changing to an alternate archive destination when the original archive destination fails.

Example A-2 shows how to set the initialization parameters so that a single, mandatory, local destination will automatically fail over to a different destination if any error occurs.

Example A-2 Specifying an Alternate Destination

LOG_ARCHIVE_DEST_1='LOCATION=/disk1 MANDATORY ALTERNATE=LOG_ARCHIVE_DEST_2'
LOG_ARCHIVE_DEST_STATE_1=ENABLE
LOG_ARCHIVE_DEST_2='LOCATION=/disk2 MANDATORY'
LOG_ARCHIVE_DEST_STATE_2=ALTERNATE

If the LOG_ARCHIVE_DEST_1 destination fails, the archiving process will automatically switch to the LOG_ARCHIVE_DEST_2 destination at the next log file switch on the primary database.

A.3 Handling Logical Standby Database Failures

An important tool for handling logical standby database failures is the DBMS_LOGSTDBY.SKIP_ERROR procedure. Depending on how important a table is, you might want to do one of the following:

Taking one of these actions prevents SQL Apply from stopping. Later, you can query the DBA_LOGSTDBY_EVENTS view to find and correct any problems that exist. See PL/SQL Packages and Types Reference for more information about using the DBMS_LOGSTDBY package with PL/SQL callout procedures.

A.4 Problems Switching Over to a Standby Database

In most cases, following the steps described in Chapter 7 will result in a successful switchover. However, if the switchover is unsuccessful, the following sections may help you to resolve the problem:

A.4.1 Switchover Fails Because Redo Data Was Not Transmitted

If the switchover does not complete successfully, you can query the SEQUENCE# column in the V$ARCHIVED_LOG view to see if the last redo data transmitted from the original primary database was applied on the standby database. If the last redo data was not transmitted to the standby database, you can manually copy the archived redo log file containing the redo data from the original primary database to the old standby database and register it with the SQL ALTER DATABASE REGISTER LOGFILE file_specification statement. If you then start log apply services, the archived redo log file will be applied automatically. Query the SWITCHOVER_STATUS column in the V$DATABASE view. The TO PRIMARY value in the SWITCHOVER_STATUS column verifies switchover to the primary role is now possible.

SQL> SELECT SWITCHOVER_STATUS FROM V$DATABASE; 
SWITCHOVER_STATUS 
----------------- 
TO PRIMARY 
1 row selected 

See Chapter 14 for information about other valid values for the SWITCHOVER_STATUS column of the V$DATABASE view.

To continue with the switchover, follow the instructions in Section 7.2.1 for physical standby databases or Section 7.3.1 for logical standby databases, and try again to switch the target standby database to the primary role.

A.4.2 Switchover Fails Because SQL Sessions Are Still Active

If you do not include the WITH SESSION SHUTDOWN clause as a part of the ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY statement, active SQL sessions might prevent a switchover from being processed. Active SQL sessions can include other Oracle Database processes.

When sessions are active, an attempt to switch over fails with the following error message:

SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY; 
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY * 
ORA-01093: ALTER DATABASE CLOSE only permitted with no sessions connected

Action: Query the V$SESSION view to determine which processes are causing the error. For example:

SQL> SELECT SID, PROCESS, PROGRAM FROM V$SESSION    
  2> WHERE TYPE = 'USER'
  3>  AND SID <> (SELECT DISTINCT SID FROM V$MYSTAT);
SID        PROCESS   PROGRAM 
---------  --------  ------------------------------------------------ 
        7      3537  oracle@nhclone2 (CJQ0)
       10
       14
       16
       19
       21
 6 rows selected.

In the previous example, the JOB_QUEUE_PROCESSES parameter corresponds to the CJQ0 process entry. Because the job queue process is a user process, it is counted as a SQL session that prevents switchover from taking place. The entries with no process or program information are threads started by the job queue controller.

Verify the JOB_QUEUE_PROCESSES parameter is set using the following SQL statement:

SQL> SHOW PARAMETER JOB_QUEUE_PROCESSES; 
NAME                           TYPE      VALUE
------------------------------ -------   -------------------- 
job_queue_processes            integer   5

Then, set the parameter to 0. For example:

SQL> ALTER SYSTEM SET JOB_QUEUE_PROCESSES=0; 
Statement processed.

Because JOB_QUEUE_PROCESSES is a dynamic parameter, you can change the value and have the change take effect immediately without having to restart the instance. You can now retry the switchover procedure.

Do not modify the parameter in your initialization parameter file. After you shut down the instance and restart it after the switchover completes, the parameter will be reset to the original value. This applies to both primary and physical standby databases.

Table A-1 summarizes the common processes that prevent switchover and what corrective action you need to take.

Table A-1  Common Processes That Prevent Switchover
Type of Process Process Description Corrective Action

CJQ0

Job Queue Scheduler Process

Change the JOB_QUEUE_PROCESSES dynamic parameter to the value 0. The change will take effect immediately without having to restart the instance.

QMN0

Advanced Queue Time Manager

Change the AQ_TM_PROCESSES dynamic parameter to the value 0. The change will take effect immediately without having to restart the instance.

DBSNMP

Oracle Enterprise Manager Management Agent

Issue the agentctl stop command from the operating system prompt.

A.4.3 Switchover Fails Because User Sessions Are Still Active

If the switchover fails and returns the error ORA-01093 "Alter database close only permitted with no sessions connected" it is usually because the ALTER DATABASE COMMIT TO SWITCHOVER statement implicitly closed the database, and if there are any other user sessions connected to the database, the close fails.

If you receive this error, disconnect any user sessions that are still connected to the database. To do this, query the V$SESSION fixed view to see which sessions are still active as shown in the following example:

SQL> SELECT SID, PROCESS, PROGRAM FROM V$SESSION;

       SID PROCESS   PROGRAM
---------- --------- ------------------------------------------------
         1 26900     oracle@dbuser-sun (PMON)
         2 26902     oracle@dbuser-sun (DBW0)
         3 26904     oracle@dbuser-sun (LGWR)
         4 26906     oracle@dbuser-sun (CKPT)
         5 26908     oracle@dbuser-sun (SMON)
         6 26910     oracle@dbuser-sun (RECO)
         7 26912     oracle@dbuser-sun (ARC0)
         8 26897     sqlplus@dbuser-sun (TNS V1-V3)
        11 26917     sqlplus@dbuser-sun (TNS V1-V3)

9 rows selected.

In this example, the first seven sessions are all Oracle Database background processes. Among the two SQL*Plus sessions, one is the current SQL*Plus session issuing the query, and the other is an extra session that should be disconnected before you re-attempt the switchover.

A.4.4 Switchover Fails with the ORA-01102 Error

Suppose the standby database and the primary database reside on the same site. After both the ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY and the ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY statements are successfully executed, shut down and restart the physical standby database and the primary database. However, the startup of the second database fails with ORA-01102 error "cannot mount database in EXCLUSIVE mode."

This could happen during the switchover if you did not set the DB_UNIQUE_NAME parameter in the initialization parameter file that is used by the standby database (that is, the original primary database). If the DB_UNIQUE_NAME parameter of the standby database is not set, the standby and the primary databases both use the same mount lock and cause the ORA-01102 error during the startup of the second database.

Action: Add DB_UNIQUE_NAME=unique_database_name to the initialization parameter file used by the standby database, and shut down and restart the standby and primary databases.

A.4.5 Switchover Fails Because Redo Data Is Not Applied After the Switchover

The archived redo log files are not applied to the standby database after the switchover.

This might happen because some environment or initialization parameters were not properly set after the switchover.

Action:

A.4.6 Roll Back After Unsuccessful Switchover and Start Over

For physical standby databases in situations where an error occurred and it is not possible to continue with the switchover, it might still be possible to revert the new physical standby database back to the primary role by using the following steps:

  1. Connect to the new standby database (old primary), and issue the following statement to convert it back to the primary role:
    SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
    
    

    If this statement is successful, then shut down and restart the database. Once restarted, the database will be running in the primary database role, and you do not need to perform any more steps.

    If this statement is unsuccessful, then continue with Step 3.

  2. When the switchover to change the role from primary to physical standby was initiated, a trace file was written in the log directory. This trace file contains the SQL statements required to re-create the original primary control file. Locate the trace file and extract the SQL statements into a temporary file. Execute the temporary file from SQL*Plus. This will revert the new standby database back to the primary role.
  3. Shut down the original physical standby database.
  4. Create a new standby control file. This is necessary to resynchronize the primary database and physical standby database. Copy the physical standby control file to the original physical standby system. Section 3.2.2 describes how to create a physical standby control file.

  5. Restart the original physical standby instance.

    If this procedure is successful and archive gap management is enabled, the FAL processes will start and re-archive any missing archived redo log files to the physical standby database. Force a log switch on the primary database and examine the alert logs on both the primary database and physical standby database to ensure the archived redo log file sequence numbers are correct.

    See Section 5.8 for information about archive gap management and Appendix E for information about locating the trace files.

  6. Try the switchover again.

    At this point, the Data Guard configuration has been rolled back to its initial state, and you can try the switchover operation again (after correcting any problems that might have led to the initial unsuccessful switchover).

A.5 What to Do If SQL Apply Stops

Log apply services cannot apply unsupported DML statements, DDL statements, and Oracle supplied packages to a logical standby database running SQL Apply.

When an unsupported statement or package is encountered, SQL Apply stops. You can take the actions described in Table A-2 to correct the situation and start SQL Apply on the logical standby database again.

Table A-2  Fixing Typical SQL Apply Errors
If... Then...

You suspect an unsupported statement or Oracle supplied package was encountered

Find the last statement in the DBA_LOGSTDBY_EVENTS view. This will indicate the statement and error that caused SQL Apply to fail. If an incorrect SQL statement caused SQL Apply to fail, transaction information, as well as the statement and error information, can be viewed. The transaction information can be used with LogMiner tools to understand the cause of the problem.

An error requiring database management occurred, such as running out of space in a particular tablespace

Fix the problem and resume SQL Apply using the ALTER DATABASE START LOGICAL STANDBY APPLY statement.

An error occurred because a SQL statement was entered incorrectly, such as an incorrect standby database filename being entered in a tablespace statement

Enter the correct SQL statement and use the DBMS_LOGSTDBY.SKIP_TRANSACTION procedure to ensure the incorrect statement is ignored the next time SQL Apply is run. Then, restart SQL Apply using the ALTER DATABASE START LOGICAL STANDBY APPLY statement.

An error occurred because skip parameters were incorrectly set up, such as specifying that all DML for a given table be skipped but CREATE, ALTER, and DROP TABLE statements were not specified to be skipped

Issue the DBMS_LOGSTDBY.SKIP('TABLE','schema_name','table_name',null) procedure, then restart SQL Apply.

See Chapter 14 for information about querying the DBA_LOGSTDBY_EVENTS view to determine the cause of failures.

A.6 Network Tuning for Redo Data Transmission

The process of transmitting redo data involves reading a buffer from the online redo log file and writing it to the archived redo log file location. When the destination is remote, the buffer is written to the archived redo log file location over the network using Oracle Net services.

The default archived redo log file buffer size is 1 megabyte. The default transfer buffer size for Oracle Net is 2 kilobytes. Therefore, the archived redo log file buffer is divided into units of approximately 2 kilobytes for transmission. These units could get further divided depending on the maximum transmission unit (MTU) of the underlying network interface.

The Oracle Net parameter that controls the transport size is session data unit (SDU). This parameter can be adjusted to reduce the number of network packets that are transmitted. This parameter allows a range of 512 bytes to 32 kilobytes.

For optimal performance, set the Oracle Net SDU parameter to 32 kilobytes for the associated SERVICE destination parameter.

The following example shows a database initialization parameter file segment that defines a remote destination netserv:

LOG_ARCHIVE_DEST_3='SERVICE=netserv'
SERVICE_NAMES=srvc

The following example shows the definition of that service name in the tnsnames.ora file:

netserv=(DESCRIPTION=(SDU=32768)(ADDRESS=(PROTOCOL=tcp)(HOST=host) (PORT=1521)) 
(CONNECT_DATA=(SERVICE_NAME=srvc)(ORACLE_HOME=/oracle)))

The following example shows the definition in the listener.ora file:

LISTENER=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)
(HOST=host)(PORT=1521))))

SID_LIST_LISTENER=(SID_LIST=(SID_DESC=(SDU=32768)(SID_NAME=sid)
(GLOBALDBNAME=srvc)(ORACLE_HOME=/oracle)))

If you archive to a remote site using high-latency or high-bandwidth connections, you can improve performance by increasing the TCP send and receive window sizes.

If high-speed WAN links are used to connect the sites in a Data Guard configuration, network throughput can often be substantially improved by using the SQLNET.SEND_BUF_SIZE and SQLNET.RECV_BUF_SIZE Oracle Net profile parameters to increase the size of the network send and receive I/O buffers.

See Oracle Net Services Administrator's Guide.

A.7 Managing Data Guard Network Timeout

For any given Oracle Data Guard network connection, there are two processes communicating with each other. When the network connection is unexpectedly broken, how these processes react differs greatly. This is a discussion of what actually occurs when a network connection is broken, and how it affects the Data Guard environment and configuration. This discussion applies to both physical and logical standby databases.

Data Guard uses a peer-to-peer connection protocol, whereby a primary database process, whether it is the log writer process (LGWR) or archiver processes (ARCn), establishes a network connection to the standby database. As a result of the network connection request, the listener on the standby site creates a separate process on the standby database called the Remote File Server (RFS) process. The RFS process uses network messages from the primary database; it reads from the network and sends an acknowledgment message back to the primary database when it is done processing the request.

During normal Data Guard operations, when redo data is transmitted from the primary database to the standby database, network messages are initiated from the primary database (the network client), and always acknowledged by the standby database (the network server). In this case, the LGWR and ARCH processes are the network clients, and the RFS process is the network server.

Consider the simple scenario where the network between the primary and standby systems is disconnected. When the LGWR process attempts to send a new message to the RFS process over this connection, the LGWR process receives an error from Oracle Net, after a TCP timeout, indicating that the connection is broken. In this way, the LGWR is able to establish that network connectivity is lost, and take corrective action. The Data Guard attributes [NO]MAX_FAILURE, [NO]REOPEN and [NO]NET_TIMEOUT, which are options for the LOG_ARCHIVE_DEST_n parameter, provide LGWR with the desired flexibility to control the timeout intervals and number of retries associated with a network connection that is not responding.

In contrast to the LGWR process, the RFS process on the standby database is always synchronously waiting for a new message to arrive from the primary database. The RFS process that is doing the network read operation is blocked until some data arrives, or until the underlying network software determines the connection is no longer valid.

Oracle Net periodically sends a network probe to verify a client/server connection is still active. This ensures connections are not left open indefinitely due to an abnormal client termination. If the probe finds a broken connection, it returns an error that causes the RFS process to exit.

You can use the Oracle Net SQLNET.EXPIRE_TIME parameter to specify the time interval, expressed in minutes, when to send a probe to verify the network session is active. Setting this parameter to a small value allows for more timely detections of broken connections. Connections that do not respond to this probe signal are disconnected. This parameter should be set up for the standby database, as well as the primary database, to prepare it for future switchover scenarios.

Limitations on using this feature are:

Once the RFS process receives notification of the broken network connection, it will terminate itself. However, until such time as the RFS process terminates itself, it will retain lock information on the archived redo log file on the standby site, or the standby redo log file, whose redo data was being received from the primary database. During this interval, no new RFS processes can receive redo data from the primary database for the same archived redo log file (or the standby redo log file).

Oracle recommends setting the Oracle Net SQLNET.EXPIRE_TIME parameter to 1 minute. This is a reasonable value for most systems, and setting the parameter to a small value does not significantly impact production systems.

Once the network problem is resolved, and the primary database processes are again able to establish network connections to the standby database, a new RFS process will automatically be started on the standby database for each new network connection. These new RFS processes will resume the reception of redo data from the primary database.

A.8 Slow Disk Performance on Standby Databases

If asynchronous I/O on the file system itself is showing performance problems, try mounting the file system using the Direct I/O option or setting the FILESYSTEMIO_OPTIONS=SETALL initialization parameter. The maximum I/O size you should set is 1 MB.

A.9 Log Files Must Match to Avoid Primary Database Shutdown

If you have configured a standby redo log on one or more standby databases in the configuration, ensure the size of the current standby redo log file on each standby database exactly matches the size of the current online redo log file on the primary database.

At log switch time, if there are no available standby redo log files that match the size of the new current online redo log file on the primary database:

For example, if the primary database uses two online redo log groups whose log files are 100K and 200K, respectively, then the standby database should have 4 standby redo log groups with log file sizes of 100K and 200K.

Also, whenever you add a redo log group to the primary database, you must add a corresponding standby redo log group to the standby database. This reduces the probability that the primary database will be adversely affected because a standby redo log file of the required size is not available at log switch time.

See Section 5.6.2, "Configuring Standby Redo Log Files" for more information.