Oracle® High Availability Architecture and Best Practices 10g Release 1 (10.1) Part Number B10726-01
This chapter describes the detailed recovery operations that are referred to in the outages and solutions tables in Chapter 9, "Recovering from Outages". Table 10-1 summarizes the recovery operations that are described in this chapter.
This section describes the following client failover scenarios:
In the complete site failover scenario, existing connections fail and new connections are routed to a secondary or failover site. This occurs when there is a true disaster and where the application stack is replicated.
In the partial site failover scenario, the primary site is intact, and the middle-tier applications need to be redirected after the database has failed over or switched over to a standby database on the secondary site. This configuration is not recommended if performance decreases significantly because of the greater latency between the application servers and the database.
A wide-area traffic manager is implemented on the primary and secondary sites to provide the site failover function. The wide-area traffic manager can redirect traffic automatically if the primary site or a specific application on the primary site is not accessible. It can also be triggered manually to switch to the secondary site for switchovers. Traffic is directed to the secondary site only when the primary site cannot provide service due to an outage or after a switchover. If the primary site fails, then user traffic is directed to the secondary site.
Figure 10-1 illustrates the network routes before site failover. Client requests enter the client tier of the primary site and pass through the WAN traffic manager. Client requests are sent through the firewall into the demilitarized zone (DMZ) to the application server tier. Requests are then forwarded through the active load balancer to the application servers. They are then sent through another firewall and into the database server tier. The application requests, if required, are routed to a RAC instance. Responses are sent back to the application and clients by a similar path.
Figure 10-2 illustrates the network routes after site failover. Client or application requests enter the secondary site at the client tier and follow exactly the same path on the secondary site that they followed on the primary site.
The following steps describe what happens to network traffic during a failover or switchover.
Failover also depends on the client's web browser. Most browser applications cache the DNS entry for a period of time. Consequently, sessions in progress during an outage may not fail over until the cache timeout expires. The only way to resume service to such clients is to close the browser and restart it.
This usually occurs after the database has been failed over or switched over to the secondary site and the middle-tier applications remain on the primary site. The following steps describe what happens to network traffic during a partial site failover:
Figure 10-3 shows the network routes after partial site failover. Client and application requests enter the primary site at the client tier and follow the same path to the database server tier as in Figure 10-1. When the requests enter the database server tier, they are routed to the database tier of the secondary site through any additional switches, routers, and possible firewalls.
Failover is the operation of taking the production database offline on one site and bringing one of the standby databases online as the new production database. A failover operation can be invoked when an unplanned catastrophic failure occurs on the production database, and there is no possibility of recovering the production database in a timely manner.
Data Guard enables you to fail over by issuing the SQL statements described in subsequent sections, by using Oracle Enterprise Manager, or by using the Oracle Data Guard broker command-line interface.
See Also:
Oracle Data Guard Broker for information about using Enterprise Manager or the Data Guard broker command-line interface for database failover
Data Guard failover is a series of steps to convert a standby database into a production database. The standby database essentially assumes the role of production. A Data Guard failover is accompanied by a site failover to fail over the users to the new site and database. After the failover, the secondary site contains the production database. The former production database needs to be re-created as a new standby database to restore resiliency. The standby database can be quickly re-created by using Flashback Database. See "Restoring the Standby Database After a Failover".
During a failover operation, little or no data loss may be experienced. The complete description of a failover can be found in Oracle Data Guard Concepts and Administration.
The rest of this section includes the following topics:
Data Guard failover should be used only in the case of an emergency and should be initiated due to an unplanned outage such as:
A failover requires that the initial production database be re-created as a standby database to restore fault tolerance to your environment. The standby database can be quickly re-created by using Flashback Database. See "Restoring the Standby Database After a Failover".
Do not use Data Guard failover when the problem can be fixed locally in a timely manner or when Data Guard switchover can be used. In failover with complete recovery scenarios, the production database is either not accessible or cannot be restarted. Data Guard failover should not be used where object recovery or flashback technology solutions provide a faster and more efficient alternative.
This section includes the following topics:
SELECT THREAD#, LOW_SEQUENCE#, HIGH_SEQUENCE# FROM V$ARCHIVE_GAP;
See Also:
Oracle Data Guard Concepts and Administration for more information about what to do if a gap exists
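If the query reports a gap, one way to resolve it before finishing the failover is to copy the missing archived redo logs from the production host to the standby host and register them there. The following is a sketch; the directory and file name are illustrative:

```sql
-- On the standby database, register each archived redo log that was
-- copied over manually (file name is illustrative):
ALTER DATABASE REGISTER PHYSICAL LOGFILE '/arch/dest/prod_1_345.arc';

-- Re-check that the gap has been resolved:
SELECT THREAD#, LOW_SEQUENCE#, HIGH_SEQUENCE# FROM V$ARCHIVE_GAP;
```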
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;
SELECT SWITCHOVER_STATUS FROM V$DATABASE;
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
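Taken together, the physical standby failover steps amount to a short SQL*Plus session on the target standby instance. The following sketch assumes no unresolvable gaps were found; the final OPEN is shown for completeness, and a restart may be required instead if the standby had been opened read-only since it was last started:

```sql
-- 1. Finish applying all available redo:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;

-- 2. Confirm the database is ready to change roles
--    (SWITCHOVER_STATUS should return TO PRIMARY):
SELECT SWITCHOVER_STATUS FROM V$DATABASE;

-- 3. Convert the standby database to the production role:
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;

-- 4. Make the new production database available to users:
ALTER DATABASE OPEN;
```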
ALTER DATABASE STOP LOGICAL STANDBY APPLY;
ALTER DATABASE REGISTER LOGICAL LOGFILE 'file_name';
Apply the remaining redo data by starting the logical standby apply with the NODELAY and FINISH clauses:
ALTER DATABASE START LOGICAL STANDBY APPLY NODELAY FINISH;
ALTER DATABASE ACTIVATE LOGICAL STANDBY DATABASE;
Failover has completed, and the new production database is available to process transactions.
A database switchover performed by Oracle Data Guard is a planned transition that includes a series of steps to switch roles between a standby database and a production database. Thus, following a successful switchover operation, the standby database assumes the production role and the production database becomes a standby database. In a RAC environment, a switchover requires that only one instance is active for each database, production and standby. At times the term "switchback" is also used within the scope of database role management. A switchback operation is a subsequent switchover operation to return the roles to their original state.
Data Guard enables you to change these roles dynamically by issuing the SQL statements described in subsequent sections, by using Oracle Enterprise Manager, or by using the Oracle Data Guard broker command-line interface. Using Oracle Enterprise Manager or the Oracle Data Guard broker command-line interface is described in Oracle Data Guard Broker.
This section includes the following topics:
Switchover is a planned operation. Switchover is the capability to switch database roles between the production and standby databases without needing to instantiate any of the databases. Switchover can occur whenever a production database is started, the target standby database is available, and all the archived redo logs are available. It is useful in the following situations:
Switchover is not possible or practical under the following circumstances:
Do not use Data Guard switchover when local recovery solutions provide a faster and more efficient alternative. The complete description of a switchover can be found in Oracle Data Guard Concepts and Administration.
If you are not using Oracle Enterprise Manager, then the high-level steps in this section can be executed with SQL*Plus. These steps are described in detail in Oracle Data Guard Concepts and Administration.
This section includes the following topics:
To identify active sessions, execute the following query:
SELECT SID, PROCESS, PROGRAM FROM V$SESSION WHERE TYPE = 'USER' AND SID <> (SELECT DISTINCT SID FROM V$MYSTAT);
Verify that the SWITCHOVER_STATUS column of V$DATABASE returns 'TO STANDBY'.
SELECT SWITCHOVER_STATUS FROM V$DATABASE;
ALTER DATABASE COMMIT TO SWITCHOVER TO STANDBY [WITH SESSION SHUTDOWN];
STARTUP MOUNT;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY [WITH SESSION SHUTDOWN];
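As a sketch, the physical standby switchover steps above can be combined as follows. The first group of statements is issued on the production database, the second on the target standby database, and the last on the old production database once it is running as a standby:

```sql
-- On the production database (expect SWITCHOVER_STATUS = 'TO STANDBY'):
SELECT SWITCHOVER_STATUS FROM V$DATABASE;
ALTER DATABASE COMMIT TO SWITCHOVER TO STANDBY WITH SESSION SHUTDOWN;
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;

-- On the target standby database (expect SWITCHOVER_STATUS = 'TO PRIMARY'):
SELECT SWITCHOVER_STATUS FROM V$DATABASE;
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
ALTER DATABASE OPEN;

-- Back on the old production database, now a standby: resume redo apply.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
```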
ALTER DATABASE PREPARE TO SWITCHOVER TO LOGICAL STANDBY;
Following this step, logs start to ship in both directions, although the current production database does not process the logs coming from the current logical standby database.
ALTER DATABASE PREPARE TO SWITCHOVER TO PRIMARY;
This is the phase where current transactions on the production database are cancelled. All DML-related cursors are invalidated, preventing new records from being applied. The end of redo (EOR) marker is recorded in the online redo log and then shipped (immediately if using real-time apply) to the logical standby database and registered.
ALTER DATABASE COMMIT TO SWITCHOVER TO LOGICAL STANDBY;
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
If real-time apply is required, execute the following statement:
ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
Otherwise execute the following statement:
ALTER DATABASE START LOGICAL STANDBY APPLY;
This section includes the following topics:
This section includes the following topics:
Instance failure occurs when software or hardware problems disable an instance. After instance failure, Oracle automatically uses the online redo log file to perform database recovery as described in this section.
Instance recovery in RAC does not include restarting the failed instance or the recovery of applications that were running on the failed instance. Applications that were running continue by using failure recognition and recovery as described in Oracle Real Application Clusters Installation and Configuration Guide. This provides consistent and uninterrupted service in the event of hardware or software failures. When one instance performs recovery for another instance, the surviving instance reads redo log entries generated by the failed instance and uses that information to ensure that committed transactions are recorded in the database. Thus, data from committed transactions is not lost. The instance that is performing recovery rolls back uncommitted transactions that were active at the time of the failure and releases resources used by those transactions.
When multiple node failures occur, as long as one instance survives, RAC performs instance recovery for any other instances that fail. If all instances of a RAC database fail, then Oracle automatically recovers the instances the next time one instance opens the database. The instance that is performing recovery can mount the database in either shared or exclusive mode from any node of a RAC database. This recovery procedure is the same for Oracle running in shared mode as it is for Oracle running in exclusive mode, except that one instance performs instance recovery for all the failed instances in exclusive mode.
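The amount of work an instance-recovery pass is likely to involve can be monitored from any surviving instance with the V$INSTANCE_RECOVERY view; for example:

```sql
-- Estimated recovery I/Os, estimated recovery time, and the configured
-- mean-time-to-recover target (times are in seconds):
SELECT RECOVERY_ESTIMATED_IOS, ESTIMATED_MTTR, TARGET_MTTR
FROM V$INSTANCE_RECOVERY;
```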
Service reliability is achieved by configuring and failing over among redundant instances. More instances are enabled to provide a service than would otherwise be needed. If a hardware failure occurs and adversely affects a RAC database instance, then RAC automatically moves any services on that instance to another available instance. Then Cluster Ready Services (CRS) attempts to restart the failed nodes and instances.
An installation can specify the "preferred" and "available" configuration for each service. This configuration describes the preferred way to run the system, and is used when the service first starts up. For example, the ERP service runs on instance1 and instance2, and the HR service runs on instance3 when the system first starts. instance2 is available to run HR in the event of a failure or planned outage, and instance3 and instance4 are available to run ERP. The service configuration can be designed several ways.
RAC recognizes when a failure affects a service and automatically fails over the service and redistributes the clients across the surviving instances supporting the service. In parallel, CRS attempts to restart and integrate the failed instances and dependent resources back into the system. Notification of failures occurs at various levels, including notifying external parties through Enterprise Manager and callouts, recording the fault for tracking, event logging, and interrupting applications. Notification occurs from a surviving fault domain when the failed domain is out of service. The location and number of fault domains serving a service is transparent to the applications. Restart and recovery are automatic, including all the subsystems, not just the database.
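To see where services are currently running across the cluster, before or after such a failover, a query like the following can be used (output depends on your configuration):

```sql
-- Which instances are currently offering which services:
SELECT INST_ID, NAME
FROM GV$ACTIVE_SERVICES
ORDER BY NAME, INST_ID;
```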
This section includes the following topics:
When an outage occurs, RAC automatically restarts essential components. Components that are eligible for automatic restart include instances, listeners, and the database, as well as several subcomponents. Some scheduled administrative tasks require that you prevent components from automatically restarting. To perform scheduled maintenance that requires a CRS-managed component to be down during the operation, the resource must be disabled to prevent CRS from trying to automatically restart the component. For example, to take a node and all of its instances and services offline for maintenance purposes, disable the instance and its services using either Enterprise Manager or SRVCTL, and then perform the required maintenance. Otherwise, if the node fails and then restarts, CRS attempts to restart the instance during the administrative operation.
For a scheduled outage that requires an instance, node, or other component to be isolated, RAC provides the ability to relocate, disable, and enable services. Relocation migrates the service to another instance. The sessions can also be relocated. These interfaces also allow services, instances and databases to be selectively disabled while a repair, change, or upgrade is made and re-enabled after the change is complete. This ensures that the service is not started at the instance being repaired because of a dependency or a start operation on the service. The service is disabled on the instance at the beginning of the planned outage. It is then enabled at the end of the maintenance outage.
For example, to relocate the SALES service from instance1 to instance3 in order to perform scheduled maintenance on node1, the tasks can be performed using Enterprise Manager or SRVCTL commands. The following shows how to use SRVCTL commands:
1. Relocate the SALES service to instance3:

   srvctl relocate service -d PROD -s SALES -i instance1 -t instance3

2. Disable the SALES service on instance1 to prevent it from being relocated back to instance1 while maintenance is performed:

   srvctl disable service -d PROD -s SALES -i instance1

3. Stop instance1:

   srvctl stop instance -d PROD -i instance1

4. Perform the scheduled maintenance, and then restart instance1:

   srvctl start instance -d PROD -i instance1

5. Enable the SALES service on instance1:

   srvctl enable service -d PROD -s SALES -i instance1

6. Relocate the SALES service running on instance3 back to instance1:

   srvctl relocate service -d PROD -s SALES -i instance3 -t instance1
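After the final relocation, you can confirm the service placement with srvctl status service -d PROD -s SALES, or from SQL; the service and instance names below are those of the example:

```sql
-- Run from any instance; once maintenance is complete, the SALES
-- service should again be listed only on its preferred instance:
SELECT INST_ID, NAME
FROM GV$ACTIVE_SERVICES
WHERE NAME = 'SALES';
```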
This section applies to MAA, with RAC and Data Guard on each site.
A standby database can have multiple standby instances. Only one instance can have the managed recovery process (MRP) or the logical standby apply process (LSP). The instance with the MRP or LSP is called the apply instance.
When you have a RAC-enabled standby database, you can fail over the apply instance of the standby RAC environment. Failing over to another apply instance may be necessary when incurring a planned or unplanned outage that affects the apply instance or node. Note the difference between apply instance failover, which utilizes multiple instances of the standby database at the secondary site, and Data Guard failover or Data Guard switchover, which converts the standby database into a production database. The following occurs as a result of apply instance failover:
For apply failover to work correctly, "Configuration Best Practices for MAA" must be followed:
When you follow these configuration recommendations, apply instance failover is automatic for a scheduled or unscheduled outage on the primary instance, and all standby instances have access to archived redo logs. By definition, all RAC standby instances already have access to standby redo logs because they must reside on shared storage.
The method of restarting the physical standby managed recovery process (MRP) or the logical standby apply process (LSP) depends on whether Data Guard Broker is being used. If the Data Guard Broker is in use, then the MRP or LSP is automatically restarted on the first available standby instance if the primary standby instance fails. If the Data Guard Broker is not being used, then the MRP or LSP must be manually restarted on the new standby instance. Consider using a shared file system, such as a clustered file system or a global file system, for the archived redo logs. A shared file system enables you to avoid reshipment of any unapplied archived redo logs that were already shipped to the standby.
See Also:
Oracle Data Guard Concepts and Administration for details about setting up cross-instance archiving
If apply instance failover does not happen automatically, then follow these steps to restart your production database, if necessary, and restart MRP or LSP following an unscheduled apply instance or node outage:
From the targeted standby instance, run the following query.
SELECT OPEN_MODE, DATABASE_ROLE FROM V$DATABASE;
% lsnrctl status listener_name
% tnsping standby_database_connection_service_name
If the connection cannot be made, then consult Oracle Net Services Administrator's Guide for further troubleshooting.
Use the following statements for a physical standby database:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
Use the following statements for a logical standby database:
ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
Optionally, copy the archived redo logs to the new apply host.
The copy is not necessary for a physical standby database. For a physical standby database, when the managed recovery process detects an archive gap, it requests the production archived redo logs to be resent automatically.
For a logical standby database, archived redo logs that have already been registered but not yet applied are not resent automatically to the logical standby database. These archived redo logs must be sent manually to the same directory structure in the new apply host. You can identify the registered unapplied archived redo logs by executing a statement similar to the following:
SELECT LL.FILE_NAME, LL.THREAD#, LL.SEQUENCE#, LL.FIRST_CHANGE#, LL.NEXT_CHANGE#,
       LP.APPLIED_SCN, LP.READ_SCN
FROM DBA_LOGSTDBY_LOG LL, DBA_LOGSTDBY_PROGRESS LP
WHERE LEAST(LP.APPLIED_SCN, LP.READ_SCN) <= LL.NEXT_CHANGE#;
Compare the results of the statement to the contents of the STANDBY_ARCHIVE_DEST directory.
See Also:
"Oracle9i Data Guard: SQL Apply Best Practices"
Query V$ARCHIVE_DEST and V$ARCHIVE_DEST_STATUS.
SELECT NAME_SPACE, STATUS, TARGET, LOG_SEQUENCE, TYPE, PROCESS, REGISTER, ERROR
FROM V$ARCHIVE_DEST WHERE STATUS != 'INACTIVE';

SELECT * FROM V$ARCHIVE_DEST_STATUS WHERE STATUS != 'INACTIVE';
Issue the following queries to ensure that the sequence number is advancing over time.
Use the following statements for a physical standby database:
SELECT MAX(SEQUENCE#), THREAD# FROM V$LOG_HISTORY GROUP BY THREAD#;

SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, CLIENT_PROCESS FROM V$MANAGED_STANDBY;
Use the following statements for a logical standby database:
SELECT MAX(SEQUENCE#), THREAD# FROM DBA_LOGSTDBY_LOG GROUP BY THREAD#;

SELECT APPLIED_SCN FROM DBA_LOGSTDBY_PROGRESS;
Recovering from a data failure is an unscheduled outage scenario. A data failure is usually, but not always, caused by some activity or failure that occurs outside the database, even though the problem may be evident within the database.
Data failure can affect the following types of database objects:
UNDO tablespace, temporary tablespace
Server parameter file (SPFILE)

Data failure can be categorized as either datafile block corruption or media failure:
In all environments, you can resolve a data failure outage by one of the following methods:
In a Data Guard environment, you can also use a Data Guard switchover or failover to a standby database to recover from data failures.
Another category of related outages that result in database objects becoming unavailable or inconsistent are caused by user error, such as dropping a table or erroneously updating table data. Information about recovering from user error can be found in "Recovering from User Error with Flashback Technology".
The rest of this section includes the following topics:
A corrupt datafile block can be accessed, but the contents within the block are invalid or inconsistent. The typical cause of datafile corruption is a faulty hardware or software component in the I/O stack, which includes, but is not limited to, the file system, volume manager, device driver, host bus adapter, storage controller, and disk drive.
The database usually remains available when corrupt blocks have been detected, but some corrupt blocks may cause widespread problems, such as corruption in a file header or with a data dictionary object, or corruption in a critical table that renders an application unusable.
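Detection of this kind of corruption can also be strengthened proactively. As a sketch, the relevant initialization parameters can be set as follows; note that DB_BLOCK_CHECKING adds CPU overhead, and the SCOPE clause assumes the instance uses a server parameter file:

```sql
-- Compute and verify checksums on every block read and write:
ALTER SYSTEM SET DB_BLOCK_CHECKSUM = TRUE SCOPE=BOTH;

-- Logically check block contents after changes (higher overhead):
ALTER SYSTEM SET DB_BLOCK_CHECKING = TRUE SCOPE=BOTH;
```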
The rest of this section includes the following topics:
A data fault is detected when it is recognized by the user, administrator, RMAN backup, or application because it has affected the availability of the application. For example:
Corruption in the SYSTEM tablespace caused by a failing disk controller

Regularly monitor application logs (which may be distributed across the data server, middle-tier, and client machines), the alert log, and Oracle trace files for errors such as ORA-1578 and ORA-1110:
ORA-01578: ORACLE data block corrupted (file # 4, block # 26)
ORA-01110: data file 4: '/u01/oradata/objrs/obj_corr.dbf'
After you have identified datafile block corruption, follow these steps:
Use the following methods to determine the extent of the corruption:
Gather the file number, file name, and block number from the error messages. For example:
ORA-01578: ORACLE data block corrupted (file # 22, block # 12698)
ORA-01110: data file 22: '/oradata/SALES/users01.dbf'
The file number is 22, the block number is 12698, and the file name is /oradata/SALES/users01.dbf.
Record additional error messages that appear in the alert log, Oracle trace files, or application logs. Note that log files may be distributed across the data server, middle tier, and client machines.
Use Oracle detection tools to find other data failure problems that may exist on the same disk or set of disks that have not yet been reported. For example, if the file number 22 has corrupt blocks, then it is prudent to run the RMAN BACKUP VALIDATE DATAFILE 22 command to detect additional corruption. Table 10-2 summarizes the Oracle tools that are available to detect datafile block corruption.
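For example, a validation run and the follow-up check might look like this, using the datafile number from the error message above; BACKUP VALIDATE reads and checks every block without producing backup output:

```sql
-- From RMAN, read and check every block of the suspect datafile:
--   RMAN> BACKUP VALIDATE DATAFILE 22;

-- Then, from SQL*Plus, list any corrupt blocks the validation found:
SELECT FILE#, BLOCK#, BLOCKS, CORRUPTION_TYPE
FROM V$DATABASE_BLOCK_CORRUPTION;
```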
Some corruption problems are caused by faulty hardware. If there is a hardware fault or a suspect component, then it is sensible to either repair the problem, or make disk space available on a separate disk subsystem before proceeding with a recovery option.
If there are multiple errors, if there are operating system-level errors against the affected file, or if the errors are transient and keep moving about, then there is little point in proceeding until the underlying problem has been addressed or space is available on alternative disks. Ask your hardware vendor to verify system integrity.
In a Data Guard environment, a switchover can be performed to bring up the application and restore service quickly while the corruption problem is handled offline.
Using the file ID (fid) and block ID (bid) gathered from error messages and the output from Oracle block checking utilities, determine which database objects are affected by the corruption by using a query similar to the following:
SELECT tablespace_name, partition_name, segment_type, owner, segment_name
FROM dba_extents
WHERE file_id = fid
AND bid BETWEEN block_id AND block_id + blocks - 1;
The following is an example of the query and its resulting output:
SQL> select tablespace_name, partition_name, segment_type,
  2  owner, segment_name from dba_extents
  3  where file_id=4 and 11 between block_id and block_id + blocks -1;

TABLESPACE_NAME PARTITION_NAME SEGMENT_TYPE OWNER SEGMENT_NAME
--------------- -------------- ------------ ----- ------------
USERS                          TABLE        SCOTT EMP
The integrity of a table or index can be determined by using the ANALYZE statement.
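For example, to check a table and all of its indexes; the schema and table names are illustrative:

```sql
-- Verify the structural integrity of the table and, with CASCADE,
-- of its indexes as well; an error is raised if corruption is found:
ANALYZE TABLE scott.emp VALIDATE STRUCTURE CASCADE;
```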
The recommended recovery methods are summarized in Table 10-3 and Table 10-4. The recovery methods depend on whether Data Guard is being used.
Table 10-3 summarizes recovery methods for data failure when Data Guard is not used.
Table 10-4 summarizes recovery methods for data failure when Data Guard is present.
The proper recovery method to use depends on the following criteria, as indicated in Table 10-3 and Table 10-4:
An object may be critical for the application to function. This includes objects that are critical for the performance and usability of the application. It could also be a history, reporting, or logging table, which may not be as critical. It could also be an object that is no longer in use or a temporary segment.
This criterion only applies to a Data Guard environment and should be used to decide between recovering the affected object locally and using Data Guard failover. Possible values are:
This criterion only applies to a Data Guard environment and should be used to decide between recovering the affected object locally and using Data Guard failover. This is not a business cost (which is assumed to be implicit in deciding how critical an object is to the application), but rather the cost in terms of the feasibility of recovery, the resources required, and their impact on performance and the total time taken.
The cost of local recovery should include: the time to restore and recover the object from a valid source; the time to recover dependent objects such as indexes, constraints, and related tables with their indexes and constraints; the availability of resources such as disk space, data or index tablespaces, and temporary tablespace; and the impact on the performance and functionality of current application functions due to the absence of the corrupt object.
Corruption may be localized so that it affects a known number of blocks within one or a few objects, or it may be widespread so that it affects a large portion of an object.
When media failure occurs, follow these steps:
Use the following methods to determine the extent of the media failure:
Gather the file number and file name from the error messages reported. Typical error message pairs are ORA-1157 with ORA-1110, and ORA-1115 with ORA-1110.
For example, from the following error message:
ORA-01115: IO error reading block from file 22 (block # 12698)
ORA-01110: data file 22: '/oradata/SALES/users01.dbf'

the file number is 22 and the file name is /oradata/SALES/users01.dbf.
Record additional error messages that appear in the system logs, volume manager logs, alert log, Oracle trace files, or application logs. Note that log files may be distributed across the data server, middle tier, and the client machines.
If there is a hardware fault or a suspect component, then it is sensible to either repair the problem or make disk space available on a separate disk subsystem before proceeding with a recovery option.
If there are multiple errors, if there are operating system-level errors against the affected file, or if the errors are transient and keep moving about, then there is little point in proceeding until the underlying problem has been addressed or space is available on alternative disks. Ask your hardware vendor to verify system integrity.
The appropriate recovery action depends on what type of file is affected by the media failure. Table 10-5 shows the type of file and the appropriate recovery.
Datafile
Media failure of a datafile is resolved in the same manner in which widespread datafile block corruption is handled.
Control file
Loss of a control file causes the primary database to shut down. The steps to recover from control file failure include making a copy of a good control file, restoring a backup copy of the control file, or manually re-creating the control file with the CREATE CONTROLFILE statement.
See Also: "Performing User-Managed Flashback and Recovery" in Oracle Database Backup and Recovery Advanced User's Guide
Standby control file
Loss of a standby control file causes the standby database to shut down. It may also, depending on the primary database protection mode, cause the primary database to shut down. To recover from a standby control file failure, a new standby control file must be created from the primary database and transferred to the standby system.
See Also: "Creating a Physical Standby Database" in Oracle Data Guard Concepts and Administration
Online redo log file
If a media failure has affected the online redo logs of a database, then the appropriate recovery procedure depends on the following:
See Also: "Advanced User-Managed Recovery Scenarios" in Oracle Database Backup and Recovery Advanced User's Guide
If the online redo log failure causes the primary database to shut down and incomplete recovery must be used to make the database operational again, then Flashback Database can be used instead of restoring all datafiles. Use Flashback Database to take the database back to an SCN before the SCN of the lost online redo log group. The resetlogs operation that is done as part of the Flashback Database procedure reinitializes all online redo log files. Using Flashback Database is faster than restoring all datafiles.
If the online redo log failure causes the primary database to shut down in a Data Guard environment, it may be desirable to perform a Data Guard failover to reduce the time it takes to restore service to users and to reduce the amount of data loss incurred (when using the proper database protection mode). The decision to perform a failover (instead of recovering locally at the primary site with Flashback Database, for example) depends on the estimated time to recover from the outage at the primary site, the expected amount of data loss, and the impact the recovery procedures taken at the primary site may have on the standby database.
For example, if the decision is to recover at the primary site, then the recovery steps may require a Flashback Database and open resetlogs, which may incur a full redo log file of lost data. A standby database will have less data loss in most cases than recovering at the primary site because all redo data is available to the standby database. If recovery is done at the primary site and the standby database is ahead of the point to which the primary database is recovered, then the standby database must be re-created or flashed back to a point before the resetlogs SCN on the primary database.
See Also: "Creating a Physical Standby Database" in Oracle Data Guard Concepts and Administration
Standby redo log file

Standby redo log failure affects only the standby database in a Data Guard environment. Most standby redo log failures are handled automatically by the standby database without affecting the primary database. However, if a standby redo log file fails while being archived to, then the primary database treats it as a log archive destination failure.

See Also: "Determine the Data Protection Mode"
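For reference, standby redo log groups are created on the standby database with standard DDL. The group number, file name, and size below are illustrative only; size the groups to match the online redo logs of the primary database:

```sql
-- On the standby database: add a standby redo log group.
-- Group number, path, and size are example values.
ALTER DATABASE ADD STANDBY LOGFILE GROUP 4
  ('/u01/oradata/sales/srl_4a.log') SIZE 50M;
```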
Archived redo log file

Loss of an archived redo log does not affect availability of the primary database directly, but it may significantly affect availability if another media failure occurs before the next scheduled backup, or if Data Guard is being used and the archived redo log had not been fully received by the standby system and applied to the standby database before losing the file.

See Also: "Advanced User-Managed Recovery Scenarios" in Oracle Database Backup and Recovery Advanced User's Guide

If an archived redo log is lost in a Data Guard environment and the log has already been applied to the standby database, then there is no impact. If there is no valid backup copy of the lost file, then a backup should be taken immediately of either the primary or standby database, because the lost log will be unavailable for media recovery that may be required for some other outage. If the lost archived redo log has not yet been applied to the standby database, then a backup copy of the file must be restored and made available to the standby database. If there is no valid backup copy of the lost archived redo log, then the standby database must be reinstantiated from a backup of the primary database taken after the time of the lost archived redo log.
Server parameter file (SPFILE)

Loss of the server parameter file does not affect availability of the database.

See Also: "Performing Recovery" of Oracle Database Backup and Recovery Basics
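A lost server parameter file can be restored from an RMAN control file autobackup, assuming autobackups were enabled. A minimal sketch; the DBID shown is a placeholder for your database's actual DBID:

```sql
-- RMAN session; assumes control file autobackup was enabled
STARTUP FORCE NOMOUNT;
SET DBID 1092850516;             -- placeholder DBID
RESTORE SPFILE FROM AUTOBACKUP;
STARTUP FORCE;                   -- restart using the restored SPFILE
```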
Oracle Cluster Registry (OCR)

Loss of the Oracle Cluster Registry file affects the availability of RAC and Cluster Ready Services. The OCR file can be restored from a physical backup that is automatically created or from an export file that is manually created by using the ocrconfig tool.

See Also: "Administering Storage in Real Application Clusters" in Oracle Real Application Clusters Administrator's Guide
The following recovery methods can be used in all environments:
Always use local recovery methods when Data Guard is not being used. Local recovery methods may also be appropriate in a Data Guard environment. This section also includes the following topic:
Datafile media recovery recovers an entire datafile or set of datafiles for a database by using the RMAN RECOVER command. When a large or unknown number of data blocks are marked media corrupt and require media recovery, or when an entire file is lost, the affected datafiles must be restored and recovered.
Use RMAN file media recovery when the following conditions are true:
See Also: "Advanced User-Managed Recovery Scenarios" in Oracle Database Backup and Recovery Advanced User's Guide
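A restore-and-recover of a single damaged datafile with RMAN can be sketched as follows; the datafile number (4) is hypothetical:

```sql
-- RMAN session; the rest of the database can remain open
SQL 'ALTER DATABASE DATAFILE 4 OFFLINE';
RESTORE DATAFILE 4;              -- restore the file from backup
RECOVER DATAFILE 4;              -- apply redo to make it current
SQL 'ALTER DATABASE DATAFILE 4 ONLINE';
```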
Block media recovery (BMR) recovers a data block or set of data blocks marked "media corrupt" within a datafile by using the RMAN BLOCKRECOVER command. When a small number of data blocks are marked media corrupt and require media recovery, you can selectively restore and recover damaged blocks rather than whole datafiles. This results in lower mean time to recovery (MTTR) because only blocks that need recovery are restored and only the corrupt blocks undergo recovery. Block media recovery minimizes redo application time and avoids I/O overhead during recovery. It also enables affected datafiles to remain online during recovery of the corrupt blocks. The corrupt blocks, however, remain unavailable until they are completely recovered.
Use block media recovery when the corrupt blocks have been identified (for example, by the RMAN BACKUP VALIDATE command) and only when complete recovery is required.

Block media recovery cannot be used to recover from the following:
The following are useful practices when using block media recovery:

- If BACKUP VALIDATE is run once a week, then the flash recovery area should have a retention policy greater than one week. This ensures that corrupt blocks can be recovered quickly using block media recovery with disk-based backups.
- The V$DATABASE_BLOCK_CORRUPTION view has a list of blocks validated as corrupt by RMAN. RMAN can be instructed to recover all corrupt blocks listed in V$DATABASE_BLOCK_CORRUPTION using block media recovery.
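For example, after a validation run has populated V$DATABASE_BLOCK_CORRUPTION, all listed blocks can be repaired in a single operation. The specific datafile and block numbers in the alternative form below are illustrative:

```sql
-- RMAN session
BACKUP VALIDATE DATABASE;        -- populates V$DATABASE_BLOCK_CORRUPTION
BLOCKRECOVER CORRUPTION LIST;    -- repair every block listed in the view

-- Alternatively, repair specific blocks by file and block number:
BLOCKRECOVER DATAFILE 7 BLOCK 3 DATAFILE 2 BLOCK 235;
```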
Some database objects, such as small look-up tables or indexes, can be recovered quickly by manually re-creating the object instead of doing media recovery.
Use manual object re-creation when:
Failover is the operation of taking the production database offline on one site and bringing one of the standby databases online as the new production database. A database switchover is a planned transition in which a standby database and a production database switch roles.
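For a physical standby database, a switchover is driven by SQL statements issued on each site. This is a minimal sketch showing only the role-transition statements, omitting the pre-switchover checks (such as verifying SWITCHOVER_STATUS in V$DATABASE):

```sql
-- On the primary database:
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY;

-- On the target standby database, after the primary completes:
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
-- Then restart (or open) the new primary, and restart redo apply
-- on the new standby.
```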
Use Data Guard switchover or failover for data failure when:
Oracle flashback technology revolutionizes data recovery. In the past it took seconds to damage a database but hours to days to recover it. With flashback technology, the time to correct errors can be as short as the time it took to make the error. Fixing user errors that require rewinding the database, table, transaction, or row level changes to a previous point in time is easy and does not require any database or object restoration. Flashback technology provides fine-grained analysis and repair for localized damage such as erroneous row deletion. Flashback technology also enables correction of more widespread damage such as accidentally running the wrong application batch job. Furthermore, flashback technology is substantially faster than a database restoration.
Flashback technologies are applicable only to repairing the following user errors:
- DROP TABLE statements

Flashback technologies cannot be used for media or data corruption such as block corruption, bad disks, or file deletions. See "Recovery Solutions for Data Failures" and "Database Failover" to repair these outages.
Table 10-6 summarizes the flashback solutions for each type of outage.
Impact of Outage | Examples of User Errors | Flashback Solutions
---|---|---
Row or transaction level | Erroneous or malicious update, delete, or insert of rows | Flashback Query, Flashback Version Query, Flashback Transaction Query (see "Flashback Query")
Table level | Erroneous or malicious DROP TABLE statement; erroneous changes to a table | Flashback Table, Flashback Drop (see "Resolving Table Inconsistencies")
Database-wide | Accidentally running the wrong application batch job | Flashback Database
Table 10-7 summarizes each flashback feature.
Flashback Database uses the Oracle Database flashback logs, while all other features of flashback technology use the Oracle Database unique undo and multiversion read consistency capabilities. See "Configuration Best Practices for the Database" for configuring flashback technologies to ensure that the resources from these solutions are available at a time of failure.
The rest of this section includes the following topics:
See Also: Oracle Database Administrator's Guide, Oracle Database Backup and Recovery Basics, and Oracle Database Concepts for more information about flashback technology and automatic undo management
Resolving row and transaction inconsistencies may require a combination of Flashback Query, Flashback Version Query, Flashback Transaction Query, and the suggested undo statements to rectify the problem. The following sections describe a general approach using a human resources example to resolve row and transaction inconsistencies caused by erroneous or malicious user errors.
This section includes the following topics:
Flashback Query, a feature introduced in the Oracle9i Database, enables an administrator or user to query any data at some point in time in the past. This powerful feature can be used to view and reconstruct data that may have been deleted or changed by accident. For example:
SELECT * FROM EMPLOYEES AS OF TIMESTAMP TO_DATE('28-Aug-03 14:00','DD-Mon-YY HH24:MI') WHERE ...
This partial statement displays rows from the EMPLOYEES table as they existed at 2 p.m. on August 28, 2003. Developers can use this feature to build self-service error correction into their applications, empowering end users to undo and correct their errors without delay, rather than burdening administrators to perform this task. Flashback Query is very simple to manage, because the database automatically keeps the necessary information to reconstruct data for a configurable time into the past.
Flashback Version Query provides a way to view changes made to the database at the row level. It is an extension to SQL and enables the retrieval of all the different versions of a row across a specified time interval. For example:
SELECT * FROM EMPLOYEES VERSIONS BETWEEN TIMESTAMP TO_DATE('28-Aug-03 14:00','dd-Mon-YY hh24:mi') AND TO_DATE('28-Aug-03 15:00','dd-Mon-YY hh24:mi') WHERE ...
This statement displays each version of the row, each entry changed by a different transaction, between 2 and 3 p.m. today. A DBA can use this to pinpoint when and how data is changed and trace it back to the user, application, or transaction. This enables the DBA to track down the source of a logical corruption in the database and correct it. It also enables application developers to debug their code.
Flashback Transaction Query provides a way to view changes made to the database at the transaction level. It is an extension to SQL that enables you to see all changes made by a transaction. For example:
SELECT UNDO_SQL FROM FLASHBACK_TRANSACTION_QUERY WHERE XID = HEXTORAW('000200030000002D');
This statement shows all of the changes that resulted from this transaction. In addition, compensating SQL statements are returned and can be used to undo changes made to all rows by this transaction. Using a precision tool like this, the DBA and application developer can precisely diagnose and correct logical problems in the database or application.
Consider a human resources (HR) example involving the SCOTT schema. The HR manager reports to the DBA that there is a potential discrepancy in Ward's salary. Sometime before 9:00 a.m., Ward's salary was increased to $1875. The HR manager is uncertain how this occurred and wishes to know when the employee's salary was increased. In addition, he has instructed his staff to reset the salary to the previous level of $1250, and this was completed around 9:15 a.m.
The following steps show how to approach the problem.
Fortunately, the HR manager has provided information about the time when the change occurred. We can query the information as it was at 9:00 a.m. with Flashback Query.
SELECT EMPNO, ENAME, SAL FROM EMP AS OF TIMESTAMP TO_DATE('03-SEP-03 09:00','dd-Mon-yy hh24:mi') WHERE ENAME = 'WARD'; EMPNO ENAME SAL ---------- ---------- ---------- 7521 WARD 1875
We can confirm we have the correct employee by the fact that Ward's salary was $1875 at 09:00 a.m. Rather than using Ward's name, we can now use the employee number for subsequent investigation.
Although it is possible to restrict the row version information to a specific date or SCN range, we decide to query all the row information that we have available for the employee WARD using Flashback Version Query.
SELECT EMPNO, ENAME, SAL, VERSIONS_STARTTIME, VERSIONS_ENDTIME FROM EMP VERSIONS BETWEEN TIMESTAMP MINVALUE AND MAXVALUE WHERE EMPNO = 7521 ORDER BY NVL(VERSIONS_STARTSCN,1); EMPNO ENAME SAL VERSIONS_STARTTIME VERSIONS_ENDTIME -------- ---------- ---------- ---------------------- ---------------------- 7521 WARD 1250 03-SEP-03 08.48.43 AM 03-SEP-03 08.54.49 AM 7521 WARD 1875 03-SEP-03 08.54.49 AM 03-SEP-03 09.10.09 AM 7521 WARD 1250 03-SEP-03 09.10.09 AM
We can see that WARD's salary was increased from $1250 to $1875 at 08:54:49 the same morning and was subsequently reset to $1250 at approximately 09:10:09.
Also, we can modify the query to determine the transaction information for each of the changes affecting WARD using a similar Flashback Version Query. This time we use the VERSIONS_XID pseudocolumn.
SELECT EMPNO, ENAME, SAL, VERSIONS_XID FROM EMP VERSIONS BETWEEN TIMESTAMP MINVALUE AND MAXVALUE WHERE EMPNO = 7521 ORDER BY NVL(VERSIONS_STARTSCN,1); EMPNO ENAME SAL VERSIONS_XID ---------- ---------- ---------- ---------------- 7521 WARD 1250 0006000800000086 7521 WARD 1875 0009000500000089 7521 WARD 1250 000800050000008B
With the transaction information (VERSIONS_XID pseudocolumn), we can now query the database to determine the scope of the transaction, using Flashback Transaction Query.
SELECT UNDO_SQL FROM FLASHBACK_TRANSACTION_QUERY WHERE XID = HEXTORAW('0009000500000089'); UNDO_SQL ---------------------------------------------------------------------------- update "SCOTT"."EMP" set "SAL" = '950' where ROWID = 'AAACV4AAFAAAAKtAAL'; update "SCOTT"."EMP" set "SAL" = '1500' where ROWID = 'AAACV4AAFAAAAKtAAJ'; update "SCOTT"."EMP" set "SAL" = '2850' where ROWID = 'AAACV4AAFAAAAKtAAF'; update "SCOTT"."EMP" set "SAL" = '1250' where ROWID = 'AAACV4AAFAAAAKtAAE'; update "SCOTT"."EMP" set "SAL" = '1600' where ROWID = 'AAACV4AAFAAAAKtAAB'; 5 rows selected.
We can see that WARD's salary was not the only change that occurred in the transaction. The information that was changed for the other four employees at the same time as WARD can now be passed back to the HR manager for review.
If the HR manager decides that the corrective changes suggested by the UNDO_SQL column are correct, then the DBA can execute these statements individually.
Oracle provides a FLASHBACK DROP statement to recover from an accidental DROP TABLE statement, and a FLASHBACK TABLE statement to restore a table to a previous point in time.
This section includes the following topics:
Flashback Table provides the DBA the ability to recover a table, or a set of tables, to a specified point in time quickly and easily. In many cases, Flashback Table alleviates the need to perform more complicated point in time recovery operations. For example:
FLASHBACK TABLE orders, order_items TO TIMESTAMP TO_DATE('29-AUG-03 14:00:00','dd-Mon-yy hh24:mi:ss');
This statement rewinds any updates to the ORDERS and ORDER_ITEMS tables that have been done between the current time and the specified timestamp in the past. Flashback Table performs this operation online and in place, and it maintains referential integrity constraints between the tables.
Dropping or deleting database objects by accident is a common mistake. Users soon realize their mistake, but by then it is too late, and there has been no way to easily recover a dropped table and its indexes, constraints, and triggers. Objects once dropped were dropped forever. Loss of very important tables or other objects (like indexes, partitions, or clusters) required DBAs to perform a point-in-time recovery, which can be time-consuming and lead to loss of recent transactions.
Flashback Drop provides a safety net when dropping objects in Oracle Database 10g. When a user drops a table, Oracle places it in a recycle bin. Objects in the recycle bin remain there until the user decides to permanently remove them or until space limitations begin to occur on the tablespace containing the table. The recycle bin is a virtual container where all dropped objects reside. Users can look in the recycle bin and undrop the dropped table and its dependent objects. For example, the employees table and all its dependent objects would be undropped by the following statement:
FLASHBACK TABLE employees TO BEFORE DROP;
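Before undropping, the recycle bin can be inspected to confirm the dropped table is still available. The restored name in the second statement is hypothetical; the RENAME TO clause is optional and useful when an object with the original name already exists:

```sql
-- List dropped objects still held in the current user's recycle bin
SELECT object_name, original_name, type, droptime
  FROM USER_RECYCLEBIN;

-- Restore the table under a new name to avoid a naming conflict
FLASHBACK TABLE employees TO BEFORE DROP RENAME TO employees_restored;
```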
Oracle provides Flashback Database to rewind the entire database to a previous point in time. This section includes the following topics:
To bring an Oracle database to a previous point in time, the traditional method is point-in-time recovery. However, point-in-time recovery can take hours or even days, because it requires the whole database to be restored from backup and then recovered to the point in time just before the error was introduced. With database sizes constantly growing, the restore step alone can take hours or days.
Flashback Database is a new strategy for doing point-in-time recovery. It quickly rewinds an Oracle database to a previous time to correct any problems caused by logical data corruption or user error. Flashback logs are used to capture old versions of changed blocks. One way to think of it is as a continuous backup or storage snapshot. When recovery needs to be performed, the flashback logs are quickly replayed to restore the database to a point in time before the error, and only the changed blocks are restored. It is extremely fast and reduces recovery time from hours to minutes. In addition, it is easy to use. A database can be recovered to 2:00 p.m. by issuing a single statement. Before the database can be recovered, all instances of the database must be shut down and one of the instances subsequently mounted. The following is an example of a FLASHBACK DATABASE statement.
FLASHBACK DATABASE TO TIMESTAMP TIMESTAMP'2002-11-05 14:00:00';
No restoration from tape, no lengthy downtime, and no complicated recovery procedures are required to use it. You can also use Flashback Database and then open the database in read-only mode and examine its contents. If you determine that you flashed back too far or not far enough, then you can reissue the FLASHBACK DATABASE statement or continue recovery to a later time to find the proper point in time before the database was damaged. Flashback Database works with a production database, a physical standby database, and a logical standby database.
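Flashback Database must be enabled before a failure occurs. A sketch of the one-time setup follows; the retention target is an example value, and the database is assumed to already have a flash recovery area configured:

```sql
-- The database must be in ARCHIVELOG mode with a flash recovery
-- area configured (DB_RECOVERY_FILE_DEST and
-- DB_RECOVERY_FILE_DEST_SIZE) before enabling flashback logging.
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER SYSTEM SET DB_FLASHBACK_RETENTION_TARGET=1440;  -- minutes (example: 1 day)
ALTER DATABASE FLASHBACK ON;
ALTER DATABASE OPEN;
```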
These steps are recommended for using Flashback Database:

1. Determine how far back in time the database can be flashed back:

SELECT OLDEST_FLASHBACK_SCN, TO_CHAR(OLDEST_FLASHBACK_TIME, 'mon-dd-yyyy HH:MI:SS') FROM V$FLASHBACK_DATABASE_LOG;

2. With the database mounted, flash back to an SCN or timestamp within that window:

FLASHBACK DATABASE TO SCN scn;

or

FLASHBACK DATABASE TO TIMESTAMP TO_DATE(date);

3. Open the database read-only and verify that it is at the desired point in time:

ALTER DATABASE OPEN READ ONLY;

4. If more flashback data is required, then issue another FLASHBACK DATABASE statement. (The database must be mounted to perform a Flashback Database.) If you want to move forward in time, issue a statement similar to the following:

RECOVER DATABASE UNTIL [TIME | CHANGE] date | scn;

5. When the database is at the correct point in time, open it with the RESETLOGS option:

ALTER DATABASE OPEN RESETLOGS;
Other considerations when using Flashback Database are as follows:
Flashback Database does not automatically fix this problem, but it can be used to dramatically reduce the downtime. You can flash back the production database to a point before the tablespace was dropped and then restore a backup of the corresponding datafiles from the affected tablespace and recover to a time before the tablespace was dropped.
Follow these recommended steps to use Flashback Database to repair a dropped tablespace:

1. With the database mounted, flash back the database to an SCN just before the tablespace was dropped:

FLASHBACK DATABASE TO BEFORE SCN drop_scn;

2. Restore backups of the datafiles that belonged to the dropped tablespace, and rename the placeholder entries in the control file to the restored files:

ALTER DATABASE RENAME FILE '.../UNNAMED00005' TO 'restored_file';

3. Bring the restored datafiles online:

ALTER DATABASE DATAFILE 'name' ONLINE;

4. Recover the database up to the checkpoint SCN of datafile 1, a point before the tablespace was dropped:

SELECT CHECKPOINT_CHANGE# FROM V$DATAFILE_HEADER WHERE FILE#=1;
RECOVER DATABASE UNTIL CHANGE checkpoint_change#;

5. Open the database:

ALTER DATABASE OPEN RESETLOGS;
"One-off" patches or interim patches to database software are usually applied to implement known fixes for software problems an installation has encountered or to apply diagnostic patches to gather information regarding a problem. Such patch application is often carried out during a scheduled maintenance outage.
Oracle now provides the capability to do rolling patch upgrades with Real Application Clusters with little or no database downtime. The tool used to achieve this is the opatch command-line utility.
The advantage of a RAC rolling upgrade is that it enables at least some instances of the RAC installation to be available during the scheduled outage required for patch upgrades. Only the RAC instance that is currently being patched needs to be brought down. The other instances can continue to remain available. This means that the impact on the application downtime required for such scheduled outages is further minimized. Oracle's opatch utility enables the user to apply the patch successively to the different instances of the RAC installation.
Rolling upgrade is available only for patches that have been certified by Oracle to be eligible for rolling upgrades. Typically, patches that can be installed in a rolling upgrade include:
Rolling upgrade of patches is currently available for one-off patches only. It is not available for patch sets.
Rolling patch upgrades are not available for deployments where the Oracle Database software is shared across the different nodes. This is the case where the Oracle home is on Cluster File System (CFS) or on shared volumes provided by file servers or NFS-mounted drives. The feature is only available where each node has its own copy of the Oracle Database software.
This section includes the following topics:
The opatch utility applies a patch successively to nodes of the RAC cluster. The nature of the patch enables a RAC installation to run in a mixed environment. Different instances of the database may be operating at the same time, and the patch may have been applied to some instances and not others. The opatch utility automatically detects the nodes of the cluster on which a specific RAC deployment has been implemented. The patch is applied to each node, one at a time. For each node, the DBA is prompted to shut down the instance. The patch is applied to the database software installation on that node. After the current node has been patched, the instance can be restarted. After the patch is applied on the current node, the DBA is allowed to choose the next RAC node to apply the patch to. The cycle of instance shutdown, patch application, and instance startup is repeated. Thus, at any time during the patch application, only one node needs to be down.
To check if a patch is a rolling patch, execute the following on UNIX platforms (on Windows, execute opatch.bat):

opatch query -is_rolling

Enter the patch location after the prompt.
To apply a patch to all nodes of the RAC cluster, execute the following command:
opatch apply patch_location
opatch automatically recognizes the patch to be a rolling patch and provides the required behavior.
To apply a patch to only the local node, enter the following command:
opatch apply -local patch_location
To check the results of a patch application, check the logs in the following location:
$ORACLE_HOME/.patch_storage/patch_id/patch_id_Apply_timestamp.log
Patches can be rolled back with the opatch utility. This enables the DBA to remove a troublesome patch or a patch that is no longer required. This can be done as a rolling procedure.
To roll back a patch across all nodes of a RAC cluster, execute the following command:
opatch rollback -id patch_id -ph patch_location
To roll back a patch on the local node only, enter the following command:
opatch rollback -local -id patch_id -ph patch_location
To check the results of a patch rollback, check the logs in the following location:
$ORACLE_HOME/.patch_storage/patch_id/patch_id_RollBack_timestamp.log
The opatch utility also provides an option to list the installed software components as well as the installed patches. Enter the following command:
opatch lsinventory
For details on usage and the other options to these commands, see MetaLink Notes 242993.1 and 189489.1 at http://metalink.oracle.com.
The following are recommended practices for all database patch upgrades:
The following are additional recommended practices for RAC rolling upgrades.
However, if this was not done or is not feasible for some reason, adding information about an existing Oracle database software installation to the Oracle inventory can be done with the attach option of the opatch utility. Node information can also be added with this option.
When validating a patch on a single node before rolling it out to the remaining nodes, applying it with the -local option is the recommended way to do this.
In the interest of keeping all instances of the RAC cluster at the same patch level, it is strongly recommended that after a patch has been validated, it should be applied to all nodes of the RAC installation. When instances of a RAC cluster have similar patch software, services can be migrated among instances without running into the problem a patch may have fixed.
The patches should be stored in a location that is accessible by all nodes of the cluster. Thus all nodes of the cluster are equivalent in their capability to apply or roll back a patch.
For patches that cannot be applied in a rolling fashion, downtime can be reduced by using the minimize_downtime option of the apply command.

Using a logical standby database enables you to accomplish upgrades for database software and patch sets with almost no downtime.
If a logical standby database does not currently exist, then verify that a logical standby database supports all of the essential datatypes of your application.
See Also: Oracle Data Guard Concepts and Administration for a list of datatypes supported by the logical standby database. If you cannot use a logical standby database because of the datatypes in your application, then perform the upgrade as documented in Oracle Database Upgrade Guide.
First, create or establish a logical standby database. Figure 10-4 shows a production database and a logical standby database, which are both version X databases.
Second, stop the SQL Apply process and upgrade the database software on the logical standby database to version X+1. Figure 10-5 shows the production database, version X, and the logical standby database, version X+1.
Third, restart SQL Apply and operate with version X on the production database and version X+1 on the standby database. The configuration can run in the mixed mode shown in Figure 10-6 for an arbitrary period to validate the upgrade in the production environment.
When you are satisfied that the upgraded software is operating properly, you can reverse the database roles by performing a switchover. This may take only a few seconds. Switch the database clients to the new production database, so that the application becomes active. If application service levels degrade for some reason, then you can open the previous production database again, switch users back, and back out the previous steps.
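The role reversal between a production database and a logical standby database is driven by switchover statements on each site; as a sketch:

```sql
-- On the current production database:
ALTER DATABASE COMMIT TO SWITCHOVER TO LOGICAL STANDBY;

-- On the current (upgraded) logical standby database:
ALTER DATABASE COMMIT TO SWITCHOVER TO LOGICAL PRIMARY;

-- On the new standby (the former production database),
-- restart SQL Apply:
ALTER DATABASE START LOGICAL STANDBY APPLY;
```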
Figure 10-7 shows that the former standby database (version X+1) is now the production database, and the former production database (version X) is now the standby database. The clients are connected to the new production database.
Upgrade the new standby database. Figure 10-8 shows the system after both databases have been upgraded to version X+1.
The role reversal that was just described includes stopping SQL Apply, performing the switchover, switching clients to the new production database, and then upgrading the software on the new standby database.
Oracle's online object reorganization capabilities have been available since Oracle8i. These capabilities enable object reorganization to be performed even while the underlying data is being modified.
Table 10-8 describes a few of the object reorganization capabilities available with Oracle Database 10g.
In highly available systems, it is occasionally necessary to redefine large tables that are constantly accessed to improve the performance of queries or DML. Oracle provides the DBMS_REDEFINITION PL/SQL package to redefine tables online. This package provides a significant increase in availability compared to traditional methods of redefining tables, which require tables to be taken offline.
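An online redefinition follows a fixed sequence of package calls. The schema, table, and interim-table names below are hypothetical:

```sql
-- 1. Verify the table can be redefined online
EXECUTE DBMS_REDEFINITION.CAN_REDEF_TABLE('SCOTT', 'EMP');

-- 2. Create the interim table EMP_INT with the desired new layout,
--    then start the redefinition
EXECUTE DBMS_REDEFINITION.START_REDEF_TABLE('SCOTT', 'EMP', 'EMP_INT');

-- 3. Optionally resynchronize the interim table to shorten the
--    final step
EXECUTE DBMS_REDEFINITION.SYNC_INTERIM_TABLE('SCOTT', 'EMP', 'EMP_INT');

-- 4. Complete the redefinition; the table definitions are swapped
--    during a brief lock
EXECUTE DBMS_REDEFINITION.FINISH_REDEF_TABLE('SCOTT', 'EMP', 'EMP_INT');
```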
An index can be rebuilt online using the previous index definition, optionally moving the index to a new tablespace.
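For example (the index and tablespace names are illustrative):

```sql
-- Rebuild the index without blocking concurrent DML, moving it to
-- a different tablespace at the same time
ALTER INDEX emp_name_ix REBUILD ONLINE TABLESPACE users;
```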
Oracle Database 10g introduces the ability to rename a tablespace, similar to the ability to rename a column, table, or datafile. Previously, the only way to change a tablespace name was to drop and re-create the tablespace, but this meant that the contents of the tablespace had to be dropped and rebuilt later. With the ability to rename a tablespace online, there is no interruption to the users.
ALTER TABLESPACE USERS RENAME TO new_tablespace_name;

Tablespace altered.