Oracle® High Availability Architecture and Best Practices 10g Release 1 (10.1) Part Number B10726-01
This chapter provides recommendations for configuring the subcomponents that make up the database server tier and the network. It includes the following sections:
The goal of configuring a highly available environment is to create a redundant, reliable system and database without sacrificing simplicity and performance. This chapter includes recommendations for configuring the subcomponents of the database server tier.
These principles apply to all of the subcomponent recommendations:
Electronic data is one of the most important assets of any business. Storage arrays that house this data must protect it and keep it accessible to ensure the success of the company. This section describes characteristics of a fault-tolerant storage subsystem that protects data while providing manageability and performance. The following storage recommendations for all architectures are discussed in this section:
The following section pertains specifically to RAC environments:
All hardware components of a storage array must be fully redundant, from physical interfaces to physical disks, including redundant power supplies and connectivity to the array itself.
The storage array should contain one or more spare disks (often called hot spares). When a physical disk starts to report errors to the monitoring infrastructure, or fails suddenly, the firmware should immediately restore fault tolerance by mirroring the contents of the failed disk onto a spare disk.
Connectivity to the storage array must be fully redundant (referred to as multipathing) so that the failure of any single component in the data path from any node to the shared disk array (such as controllers, interface cards, cables, and switches) is transparent and keeps the array fully accessible. Multipathing is achieved by addressing the same logical device through multiple physical paths; a host-based device driver reissues the I/O to one of the surviving physical interfaces.
If the storage array includes a write cache, then it must be protected to guard against memory board failures. The write cache should be protected by multiple battery backups to guard against a failure of all external power supplies. The battery backup must keep the cache intact either until external power returns or until all of the dirty blocks in the cache are guaranteed to be fully flushed to a physical disk.
If a physical component fails, then the array must allow the failed device to be repaired or replaced without requiring the array to be shut down or taken offline. Also, the storage array must allow the firmware to be upgraded and patched without shutting down the storage array.
Data should be mirrored to protect against disk and other component failures and should be striped over a large number of disks to achieve optimal performance. This method of storage configuration is known as Stripe and Mirror Everything (SAME). The SAME methodology provides a simple, efficient, and highly available storage configuration.
Oracle's automatic storage management (ASM) feature always evenly stripes data across all drives within a disk group, with the added benefit of automatically rebalancing files across new disks when disks are added, or across the remaining disks when disks are removed. In addition, ASM can provide redundancy protection against component failures, or it can allow mirroring to be provided by the underlying storage array.
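As a minimal sketch of how this rebalancing is triggered (the disk group name and device path below are hypothetical), adding a disk to an ASM disk group causes existing file extents to be redistributed automatically across all member disks:

-- Adding a disk triggers an automatic rebalance of existing data across
-- all disks in the group; REBALANCE POWER controls how aggressively the
-- rebalance runs (disk group name and device path are examples only).
ALTER DISKGROUP data
  ADD DISK '/dev/rdsk/c4t1d0s4'
  REBALANCE POWER 4;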
Load balancing of I/O operations across physical interfaces is usually provided by a software package that is installed on the host. The purpose of this load balancing software is to redirect I/O operations to a less busy physical interface if a single host bus adapter (HBA) is overloaded by the current workload.
Create separate, independent storage areas for software, active database files, and recovery-related files.
The following storage areas are needed:
The storage containing the database area should be as physically distinct as possible from the flash recovery area. At a minimum, the database area and flash recovery area should not share the same physical disk drives or controllers. This practice ensures that the failure of a component that provides access to a datafile in the database area does not also cause the loss of the backups or redo logs in the flash recovery area needed to recover that datafile.
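A minimal sketch of this separation, assuming an spfile is in use and two ASM disk groups named DATA and RECO that reside on physically distinct disks (the names and size are hypothetical), uses the standard file-destination initialization parameters:

-- Direct new datafiles to the database area and recovery-related files
-- (backups, flashback logs, archived redo logs) to the flash recovery area.
ALTER SYSTEM SET db_create_file_dest = '+DATA';
ALTER SYSTEM SET db_recovery_file_dest_size = 100G;
ALTER SYSTEM SET db_recovery_file_dest = '+RECO';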
Storage options are defined as follows:
The rest of this section includes these topics:
For the "Data Guard only" architecture and MAA, the primary site and secondary site should each contain their own identically configured storage areas.
For the "RAC only" architecture and MAA, follow these recommendations:
Table 6-1 summarizes the independent storage recommendations by architecture.
ASM uses a concept called failure groups to protect data against disk or component failures. When using failure groups, ASM optimizes file layout to reduce the unavailability of data due to the failure of a shared resource. Failure groups define disks that share components, so that if one disk fails, then other disks sharing the component might also fail. An example of what might be defined as a failure group is a string of SCSI disks on the same SCSI controller. Failure groups are used to determine which ASM disks to use for storing redundant data. For example, if 2-way mirroring is specified for a file, then redundant copies of file extents will be stored in separate failure groups.
Failure groups are used with storage that does not provide its own redundancy capability, such as disks that have not been configured according to RAID. The manner in which failure groups are defined for an ASM disk group is site-specific because the definition depends on the configuration of the storage and how it is connected to the database systems. ASM attempts to maintain three copies of its metadata, requiring a minimum of three failure groups for proper protection.
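For illustration only (the disk group name, failure group names, and device paths are hypothetical), a normal-redundancy disk group might place the disks on each SCSI controller in their own failure group so that mirrored extents never share a controller:

-- Disks on the same controller share a failure group; ASM mirrors each
-- file extent across disks in different failure groups.
CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP controller1 DISK '/dev/rdsk/c1t1d0s4', '/dev/rdsk/c1t2d0s4'
  FAILGROUP controller2 DISK '/dev/rdsk/c2t1d0s4', '/dev/rdsk/c2t2d0s4';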
When using ASM with intelligent storage arrays, the storage array's protection features (such as hardware RAID-1 mirroring) are typically used instead of ASM's redundancy capabilities; this configuration is specified with the EXTERNAL REDUNDANCY clause of the CREATE DISKGROUP statement. When using external redundancy, ASM failure groups are not used, because disks in an external redundancy disk group are presumed to be highly available.
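A minimal sketch of such a disk group (the name and device paths are hypothetical), where mirroring is left entirely to the storage array:

-- The array provides RAID protection, so ASM performs no mirroring and
-- no failure groups are defined.
CREATE DISKGROUP data EXTERNAL REDUNDANCY
  DISK '/dev/rdsk/lun1', '/dev/rdsk/lun2';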
Disks, as presented by the storage array to the operating system, should not be aggregated or subdivided by the array because doing so can hide the physical disk boundaries needed for proper data placement and protection. If, however, physical disk boundaries are always hidden by the array, and if each logical device, as presented to the operating system, has the same size and performance characteristics, then simply place each logical device in the same ASM disk group (with redundancy defined as EXTERNAL REDUNDANCY), and place the database and flash recovery areas in that one ASM disk group. An alternative approach is to create two disk groups, each consisting of a single logical device, and place the database area in one disk group and the flash recovery area in the other disk group. This method provides additional protection against disk group metadata failure and corruption.
For example, suppose a storage array takes eight physical disks of size 36GB and configures them in a RAID 0+1 manner for performance and protection, giving a total of 144GB of mirrored storage. Furthermore, this 144GB of storage is presented to the operating system as two 72GB logical devices, with each logical device being striped and mirrored across all eight drives. When configuring ASM, place each 72GB logical device in the same ASM disk group, and place the database area and flash recovery area on that disk group.
If the two logical devices have different performance characteristics (for example, one corresponds to the inner half and the other to the outer half of the underlying physical drives), then the logical devices should be placed in separate disk groups. Because the outer portion of a disk has a higher transfer rate, the outer half disk group should be used for the database area; the inner half disk group should be used for the flash recovery area.
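Under that assumption (the logical device names below are hypothetical), the two halves would map to separate external-redundancy disk groups, with the database area placed on the outer-half group and the flash recovery area on the inner-half group:

-- Outer-half logical device: higher transfer rate, used for the database area.
CREATE DISKGROUP data_outer EXTERNAL REDUNDANCY DISK '/dev/rdsk/lun_outer';
-- Inner-half logical device: used for the flash recovery area.
CREATE DISKGROUP reco_inner EXTERNAL REDUNDANCY DISK '/dev/rdsk/lun_inner';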
If multiple, distinct storage arrays are used with a database under ASM control, then multiple options are available. One option is to create multiple ASM disk groups that do not share storage across multiple arrays and place the database and flash recovery areas in separate disk groups, thus physically separating the database area from the flash recovery area. Another option is to create a single disk group across arrays that consists of failure groups, where each failure group contains disks from just one array. These options provide protection against the failure of an entire storage array.
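The second option could look like the following sketch (disk group, failure group, and device names are hypothetical), with each failure group drawing its disks from a single array so that ASM mirrors every extent across arrays:

CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP array1 DISK '/dev/rdsk/array1_lun1', '/dev/rdsk/array1_lun2'
  FAILGROUP array2 DISK '/dev/rdsk/array2_lun1', '/dev/rdsk/array2_lun2';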
The Hardware Assisted Resilient Data (HARD) initiative is a program designed to prevent data corruptions before they happen. Data corruptions are very rare, but when they do occur, they can have a catastrophic effect on business. Under the HARD initiative, Oracle's storage partners implement Oracle's data validation algorithms inside storage devices. This makes it possible to prevent corrupted data from being written to permanent storage. The goal of HARD is to eliminate a class of failures that the computer industry has so far been powerless to prevent. RAID has gained a wide following in the storage industry by ensuring the physical protection of data; HARD takes data protection to the next level by going beyond protecting physical data to protecting business data.
To prevent corruptions before they happen, Oracle tightly integrates with advanced storage devices to create a system that detects and eliminates corruptions before they reach permanent storage. Oracle has worked with leading storage vendors to implement Oracle's data validation and checking algorithms in the storage devices themselves. The classes of data corruptions that Oracle addresses with HARD include:
End-to-end block validation is the technology employed by the operating system or storage subsystem to validate the Oracle data block contents. When Oracle data is validated in the storage devices, corruptions are detected and eliminated before they can be written to permanent storage. This goes beyond the current Oracle block validation features, which do not detect a stray, lost, or corrupted write until the next physical read.
Storage vendors are given the opportunity to implement validation checks based on a specification. A vendor's implementation may offer features specific to its storage technology. Oracle maintains a Web site that shows a comparison of each vendor's solution by product and Oracle version. For the most recent information, see http://otn.oracle.com/deploy/availability/htdocs/HARD.html.
The following recommendation applies to the "RAC only" architecture and MAA:
The shared volumes created for the OCR and the voting disk should be configured using RAID to protect against media failure. This requires the use of an external cluster volume manager, cluster file system, or storage hardware that provides RAID protection.
The main server hardware components are the nodes for the database and application server farm and the components within each node such as CPU, memory, interface boards (such as I/O and network), storage, and the cluster interconnect in a RAC environment.
This section includes the following topics:
The following recommendations apply to all architectures:
Use fewer, faster CPUs instead of more, slower CPUs. Use fewer higher-density memory modules instead of more lower-density memory modules. This reduces the number of components that can fail while providing the same service level. This needs to be balanced with cost and redundancy.
Using redundant hardware components enables the system to fail over to the working component while taking the failed component offline. Choose components (such as power supplies, cooling fans, and interface boards) that can be repaired or replaced while the system is running to prevent unscheduled outages caused by the repair of hardware failures.
Use systems that can automatically detect failures and provide alternate paths around subsystems that have failed or isolate the subsystem. Choose a system that continues to run despite a component failure and automatically works around the failed component without incurring a full outage. For example, find a system that can use an alternate path for I/O requests if an adapter fails or can avoid a bad physical memory location if a memory board fails.
Because mirroring does not protect against accidental removal of a file or most corruptions of the boot disk, an online copy of the boot disk should be maintained so that the system can be quickly rebooted using the same operating system image if a critical file is removed or corruption occurs on the primary boot image. Operating system vendors often provide a mechanism to easily maintain multiple boot images.
The following recommendations apply to the "RAC only" and MAA environments:
RAC provides fast, automatic recovery from node and instance failures. Using a properly supported configuration is necessary for a successful RAC implementation. A supported hardware configuration may encompass server hardware, storage arrays, interconnect technology, and other hardware components.
Select an interconnect that has redundancy, high speed, low latency, low host resource consumption, and the ability to balance loads across multiple available paths. In two-node configurations, it may be possible to use a direct-connect interconnect between the two nodes, but a switch should be used instead to provide a degree of isolation between the network interface cards of the two nodes in the cluster. If you plan to have more than two nodes in the future, then choose a switch-based interconnect solution instead of direct-connect to reduce the complexity of adding additional nodes in the future. If you have more than two nodes, then a switch-based interconnect is highly recommended and, in many cases, a requirement of the cluster solution being used.
The following recommendation applies to both the primary and secondary sites in "Data Guard only" and MAA environments.
Using identical hardware for machines at both sites provides a symmetric environment that is easier to administer and maintain. Such a symmetric configuration mitigates the risk of failures or performance inconsistencies caused by dissimilar hardware following a switchover or failover.
The recommendations for server software apply to all nodes in a RAC environment and to both the primary and secondary sites in Data Guard and MAA environments because those sites contain identical configurations.
This section includes the following topics:
The following recommendations apply to all architectures:
Use the same operating system version, patch level, single patches, and driver versions on all machines. Consistency with operating system versions and patch levels reduces the likelihood of encountering incompatibilities or small inconsistencies between the software on all machines. In an open standards environment, it is impractical for any vendor to test each and every piece of new software with every combination of software and hardware that has been previously released. Temporary differences can be tolerated with RAC or Data Guard while individual systems or groups of systems are upgraded or patched one at a time to minimize scheduled outages, provided that the goal is to bring all machines to the same versions and patch levels.
Use an operating system that, when coupled with the proper hardware, supports the ability to automatically detect failures and provide alternate paths around subsystems that have failed or isolate subsystems. Choose a system that can continue running if a component, such as a CPU or memory board, fails and automatically provides paths around failed components, such as an alternate path for I/O requests in the event of an adapter failure.
Mirror disks containing swap partitions so that if a disk that contains a swap partition fails, it will not result in an application or system failure.
Use disk-based swap instead of RAM-based swap. It is always good practice to make all system memory available for database and application use unless the amount available is sufficiently oversized to accommodate swap.
Do not use TMPFS (a Solaris implementation that stores files in virtual memory rather than on disk) or other file systems that exclusively use virtual memory instead of disks for storage for the /tmp file system or other scratch file systems. This prevents runaway programs that write to /tmp from hanging the system by exhausting all virtual memory.
An example on UNIX is setting shared memory and semaphore kernel parameters high enough to enable future growth, thus preventing an outage for reconfiguring the kernel and rebooting the system. Verify that using settings higher than required either presents no overhead or incurs such small overhead on large systems that the effect is negligible.
Journal file systems and logging reduce or eliminate the number of file system checks required following a system reboot, thereby facilitating faster restarting of the system.
As with server hardware and software, the recommendations for Oracle software apply to both the primary and secondary sites because they contain identical configurations. Mirror disks containing Oracle and application software to prevent a disk failure from causing an unscheduled outage.
The following recommendations apply to a "RAC only" environment:
RAC provides fast, automatic recovery from node and instance failures. Using a properly supported configuration is a key component of success. A supported software configuration encompasses operating system versions and patch levels and clustering software versions, possibly including Oracle-supplied cluster software. On some platforms (such as Solaris, Linux, and Windows), Oracle supplies the cluster software required for use with RAC.
Use NTP to synchronize the clocks on all nodes in the cluster to facilitate analysis of tracing information based on timestamps.
This section includes the following topics:
The following recommendations apply to all architectures:
Table 6-2 describes the necessary redundant network components that are illustrated in Figure 6-1.
Figure 6-1 depicts a single site in an MAA environment, emphasizing the redundant network components.
Application layer load balancers sit logically in front of the application server farm and publish to the outside world a single IP address for the service running on a group of application servers. All requests are initially handled by the load balancer, which then distributes them to a server within the application server farm. End users only need to address their requests to the published IP address; the load balancer determines which server should handle the request.
If one of the middle-tier application servers cannot process a request, the load balancer routes all subsequent requests to functioning servers. Because all client requests pass through a single hardware-based load balancer, and the failure of that piece of hardware would be detrimental to the entire site, the load balancer itself must be redundant to avoid being a single point of failure. A backup load balancer is therefore configured with a heartbeat to the primary load balancer: one load balancer is active and the other is a standby that becomes active only if the primary load balancer becomes unavailable.
This recommendation applies to RAC environments.
Use the Oracle Interface Configuration (OIFCFG) tool to classify network interfaces as public, cluster interconnect, or storage so that RAC properly selects a network interface for internode network traffic.
The following recommendations apply to Data Guard:
Configure system TCP parameters that control the sending and receiving buffer sizes so that the bandwidth between sites can be fully utilized for log transport services. The proper buffer size is often governed by the bandwidth delay product (BDP) formula, particularly when using a high-speed, high-latency network.
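For example, under the assumption of a 155 Mbps WAN link with a 30 millisecond round-trip time, the bandwidth delay product is 155,000,000 bits/s x 0.030 s = 4,650,000 bits, or roughly 581 KB, so the TCP send and receive buffers on both ends of the log transport connection should be sized to at least approximately that value.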
WAN traffic managers provide the initial access to the services located at the primary site. These managers are implemented at the primary and secondary sites to provide site failover capabilities when the primary site becomes completely unavailable. Geographically separating the WAN traffic managers on separate networks reduces the impact of network failures or a natural disaster that causes the primary site to become unavailable.