Oracle® Real Application Clusters Deployment and Performance Guide 10g Release 1 (10.1) Part Number B10768-01 |
|
|
View PDF |
This chapter describes how to monitor Oracle Real Application Clusters (RAC) performance and includes the following topics:
Automatic Workload Repository in Real Application Clusters Environments
Monitoring RAC Statistics and Events
See Also: Chapter 4, " Monitoring Performance with Oracle Enterprise Manager" for information about monitoring RAC performance with Oracle Enterprise Manager |
All single-instance Oracle database tuning practices apply to applications running on RAC databases. Therefore, implement the single-instance tuning methodologies described in Oracle Database Performance Tuning Guide.
The interconnect and internode communication protocols can affect Cache Fusion performance. In addition, the interconnect bandwidth, its latency, and the efficiency of the IPC protocol determine the speed with which Cache Fusion processes block transfers.
Once your interconnect is operative, you cannot significantly influence its performance. However, you can influence an interconnect protocol's efficiency by adjusting the IPC buffer sizes.
See Also: Your vendor-specific interconnect documentation for more information about adjusting IPC buffer sizes |
Although you should rarely need to set this parameter, you can use the CLUSTER_INTERCONNECTS
parameter to assign a private network IP address or NIC as in the following example:
CLUSTER_INTERCONNECTS=10.0.0.1
If you are using an operating system-specific vendor IPC protocol, then the trace information may not reveal the IP address. However, Oracle uses the correct network interface based on the use of vendor-specific IPC libraries.
Each instance has a set of instance-specific views. You can also query global dynamic performance views to retrieve performance information from all of the qualified instances. Global dynamic performance view names are prefixed with GV$. A global view contains all columns from its respective instance-specific view as well as the INST_ID
column. The instance number is also appended to the names of the archived redo log threads to create a unique identifier for each instance's archived redo logs.
See Also: Oracle Database Reference for restrictions on GV$ views and complete descriptions of related parameters and views |
If you did not create your RAC database with the Database Configuration Assistant (DBCA), then you must run the CATCLUST.SQL
script to create RAC-related views and tables. You must have SYSDBA
privileges to run this script.
See Also: Oracle Real Application Clusters Installation and Configuration Guide for more information about creating your RAC database |
This section provides an overview of the V$ and GV$ views that provide statistics that you can use evaluate block transfers in your cluster. Use these statistics to analyze interconnect block transfer rates as well as the overall performance of your RAC database.
RAC-specific statistics appear as message request counters or as timed statistics. Message request counters include statistics showing the number of certain types of block mode conversions. Timed statistics reveal the total or average time waited for read and write I/O for particular types of operations.
In RAC environments, each Automatic Workload Repository (AWR) snapshot captures data from all active instances within the cluster. The data for each snapshot set that is captured for all active instances is from the same point in time. In addition, the data for each instance is stored separately and is identified with an instance identifier. For example, the buffer_busy_wait
statistic shows the number of buffer waits on each instance. AWR does not store data that is aggregated from across the entire cluster. In other words, the data is stored for each individual instance.
This section explains wait events and statistics specific to RAC and how to interpret them when assessing performance data generated by the Automatic Workload Repository, Statspack, or by ad-hoc queries of the dynamic performance views.
See Also: Oracle Database Performance Tuning Guide for more information about wait event analysis and thespdoc.txt file for details about the Statspack utility. |
The statistics snapshots generated by AWR and Statspack can be evaluated by producing reports displaying summary data such as load and cluster profiles based on regular statistics and wait events gathered on each instance.
Most of the relevant data is summarized on the and Cluster Statistics Page and the GES Statistics Page. This information includes:
Analyzing and interpreting what sessions are waiting for is an important method to determine where time is spent. In RAC, the wait time is attributed to an event which reflects the exact outcome of a request. For example, when a session on an instance is looking for a block in the global cache, it does not know whether it will receive the data cached by another instance or whether it will receive a message to read from disk. The wait events for the global cache now convey precise information and waiting for global cache blocks or messages is:
Summarized in a broader category called Cluster Wait Class
Temporarily represented by a placeholder event which is active while waiting for a block, for example:
gc current block request
gc cr block request
Attributed to precise events when the outcome of the request is known, for example:
gc current block 3-way
gc current block busy
gc cr block grant 2-way
In summary, the wait events for RAC convey information valuable for performance analysis. They are used in Automatic Database Diagnostic Monitor (ADDM) to enable precise diagnostics of the impact of cache fusion.
In order to determine the amount of work and cost related to inter-instance messaging and contention, examine block transfer rates, remote requests made by each transaction, the number and time waited for global cache events as described under the following headings:
The effect of accessing blocks in the global cache and maintaining coherency is represented by
The Global Cache Service statistics for current and cr blocks, for example, gc current blocks received, gc cr blocks received, and so on)
The Global Cache Service wait events, for gc current block 3-way, gc cr grant 2-way, and so on.
The response time for cache fusion transfers is determined by the messaging and processing times imposed by the physical interconnect components, the IPC protocol and the GCS protocol. It is not affected by disk I/O factors other than occasional log writes. The cache fusion protocol does not require I/O to data files in order to guarantee cache coherency and RAC inherently does not cause any more I/O to disk than a non-clustered instance.
This section describes how to monitor Global Cache Service performance by identifying data blocks and objects which are frequently used ("hot") by all instances. High concurrency on certain blocks may be identified by Global Cache Service wait events and times.
The following wait events indicate that the access to cached data blocks was held up because they were busy either in the remote or the local cache, respectively:
gc current block busy
gc current block 2-way busy
gc current block 3-way busy
gc cr block 2-way busy
gc cr block 3-way busy
This means that the blocks were pinned or held up by sessions or delayed by a log write on a remote instance (for example, gc current, cr 2-way busy, or cr 3-way busy), or that a session on the same instance is already accessing a block which is in transition between instances and the current session needs to wait behind it (for example, gc current block busy).
The V$SESSION_WAIT view to identify objects and data blocks with contention. The gc wait events contain the file and block number for a block request in p1 and p2, respectively.
An additional segment statistic, gc buffer busy, has been added to quickly determine the "busy" objects without recourse to the query on V$SESSION_WAIT mentioned earlier.
The AWR infrastructure provides a view of active session history which can also be used to trace recent wait events and their arguments. It is therefore useful for hot block analysis.
Most of the reporting facilities used by AWR and Statspack contain the object statistics and cluster wait class category, so that sampling of the views mentioned earlier is largely unnecessary.
It is advisable to run ADDM on the snapshot data collected by the AWR infrastructure to obtain an overall evaluation of the impact of the global cache. The advisory will also identify the busy objects and SQL highest cluster wait time.
This section describes how to monitor Global Cache Service performance by identifying objects read and modified frequently and the service times imposed by the remote access. Waiting for blocks to arrive may constitute a significant portion of the response time, in the same way that reading from disk could increase the block access delays, only that cache fusion transfers in most cases are faster than disk access latencies.
The following wait events indicate that the remotely cached blocks were shipped to the local instance without having been busy, pinned or requiring a log flush:
gc current block 2-way
gc current block 3-way
gc cr block 2-way
gc cr block 3-way
The object statistics for gc current blocks received and gc cr blocks received enable quick identification of the indexes and tables which are shared by the active instances. As mentioned earlier, creating an ADDM analysis will, in most cases, point you to the SQL statements and database objects that could be impacted by inter-instance contention.
Note: You must run Statspack at level 7 to collect statistics related to block contention and segment block waits. |
Any increases in the average wait times for the events mentioned earlier could be caused by the following:
High load: CPU shortages, long run queues, scheduling delays
Misconfiguration: using public instead of private interconnect for message and block traffic
If the average wait times are acceptable and no interconnect or load issues can be diagnosed, then the accumulated time waited can usually be attributed to a few SQL statements which need to be tuned to minimize the number of blocks accessed.
The column CLUSTER_WAIT_TIME in V$SQLAREA represents the wait time incurred by individual SQL statements for global cache events and will identify the SQL which may need to be tuned.
Most global cache wait events that show a high total time as reported in the AWR and Statspack reports or in the dynamic performance views are normal and may present themselves as the top database time consumers without actually indicating a problem. This section describes the most important and frequent wait events that you should be aware of when interpreting performance data.
If user response times increases and a high proportion of time waited is for global cache (gc), then the cause should be determined. Most reports include a breakdown of events sorted by percentage of the total time.
It is useful to start with an ADDM report, which would analyze the routinely collected performance statistics with respect to their impact and point to the objects and SQL contributing most to the time waited, and then move on to the more detailed reports produced by AWR and Statspack.
The most important wait events for RAC fall into various categories, namely
Block-oriented
gc current block 2-way
gc current block 3-way
gc cr block 2-way
gc cr block 3-way
Message-oriented
gc current grant 2-way
gc cr grant 2-way
Contention-oriented
gc current block busy
gc cr block busy
gc current buffer busy
Load-oriented
gc current block congested
gc cr block congested
The block-oriented wait event statistics indicate that a block was received as either the result of a 2-way or a 3-way message, that is, the block was sent from either the resource master requiring 1 message and 1 transfer, or was forwarded to a third node from which it was sent, requiring 2 messages and 1 block transfer.
These events are usually the most frequent in the absence of block contention and the length of the wait is determined by the time it takes on the physical network, the time to process the request in the serving instances and the time it takes for the requesting process to wake up after the block arrives.
The average wait time and the total wait time should be considered when being alerted to performance issues where these particular waits have a high impact. Usually, either interconnect or load issues or SQL execution against a large shared working set can be found to be the root cause.
The message-oriented wait event statistics indicate that no block was received because it was not cached in any instance. Instead a global grant was given, allowing the requesting instance to read the block from disk or modify it.
If the time consumed by these events is high, then it may be assumed that the frequently executed SQL causes a lot of disk I/O (in the event of the cr grant) or that the workload inserts a lot of data and needs to find and format new blocks frequently (in the event of the current grant).
The contention-oriented wait event statistics indicate that a block was received which was pinned by a session on another node, was deferred because a change had not yet been flushed to disk or because of high concurrency, and therefore could not be shipped immediately. A buffer may also be busy locally when a session has already initiated a cache fusion operation and is waiting for its completion when another session on the same node is trying to read or modify the same data. High service times for blocks exchanged in the global cache may exacerbate the contention, which can be caused by frequent concurrent read and write accesses to the same data.
The load-oriented wait events indicate that a delay in processing has occurred in the GCS, which is usually caused by high load, CPU saturation and would have to be solved by additional CPUs, load-balancing, offloading processing to different times or a new cluster node.For the events mentioned, the wait time encompasses the entire round trip from the time a session starts to wait after initiating a block request until the block arrives.