B
Troubleshooting

Specific topics covered in this chapter are:

Cluster Configuration Tips

A large fraction of the cluster problems reported to Oracle Corporation are due to incorrect cluster configuration, particularly of the Cluster Manager (CM) and interconnect components.

The information in this section is based on Oracle Corporation's reference implementation of the cluster Operating System Dependent (OSD) modules. Consequently, some of this information may not be applicable to your particular cluster environment.

Additional Information:

Consult your hardware vendor for more details about installing and configuring your particular cluster configuration.

Note:

The registry instructions in this section assume REGEDT32, not REGEDIT.

Cluster Software

Make sure all nodes have exactly the same cluster OSD software installed, as well as the same registry configuration. To verify the software, check that the OSD files on every node have the same time stamps and file sizes.

CM Configuration

Typically, each node in a cluster has at least two network cards: one for the corporate network and one for the cluster interconnect. A computer, however, can have only one host name associated with it. To work around this limitation, assign the computer an additional host name that is used only for the cluster interconnect.

To specify a host name for the cluster interconnect:

For each node, ping the host name. For example:

C:\> PING OPS1-NT.US.ORACLE.COM

A message similar to the one below appears:
Reply from 144.25.188.247: bytes=32 time<10ms TTL=126

The IP address returned is for the corporate network, not the cluster interconnect.

For each node, determine which Ethernet card will be used for the cluster interconnect by entering:

C:\> IPCONFIG /ALL

The output looks similar to the sample shown below:

Windows NT IP Configuration 
 
              Host Name . . . . . . . . . : ops1-nt.us.oracle.com 
 
Ethernet adapter El90x1: 
 
              Description . . . . . . . . : 3Com 3C90x Ethernet Adapter 
              IP Address. . . . . . . . . : 144.25.188.247 
 
Ethernet adapter CpqNF31: 
 
              Description . . . . . . . . : Compaq NetFlex-3 Driver 
              IP Address. . . . . . . . . : 144.25.190.247

In this case, the first interface is used for the corporate network, while the second interface (144.25.190.247) is the one intended for the cluster interconnect.

Specify a new host name for each node's interconnect IP address in the HOSTS file (SYSTEMROOT\SYSTEM32\DRIVERS\ETC\HOSTS). For example:

144.25.190.247 ops1-ipc 
144.25.190.248 ops2-ipc 
144.25.190.249 ops3-ipc 
144.25.190.250 ops4-ipc

The HOSTS file should have one entry for each node's interconnect, and should be copied to all nodes of the cluster so that they can see each other. To verify that they can see each other, ping each interconnect host name from each node. For example:

C:\> PING OPS3-IPC

For each node, ensure the DefinedNodes value is specified in HKEY_LOCAL_MACHINE>SOFTWARE>ORACLE>OSD>CM. DefinedNodes specifies the member nodes in the cluster.

DefinedNodes: REG_MULTI_SZ: ops1-ipc ops2-ipc ops3-ipc ops4-ipc

Note:

DefinedNodes must be of value class REG_MULTI_SZ, and each host name entry must be entered on a separate line in the Multi-String Editor dialog box.

For each node, ensure the CmHostName value is specified in HKEY_LOCAL_MACHINE>SOFTWARE>ORACLE>OSD>CM. CmHostName specifies the node's interconnect host name.

CmHostName: REG_SZ: ops1-ipc
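The three settings in this procedure (the HOSTS entries, DefinedNodes, and CmHostName) have to agree with one another. The following is a minimal Python sketch of such a cross-check, not an Oracle-supplied tool. It works on values you paste in rather than reading the registry. The function names are hypothetical, and the rule that CmHostName should appear in DefinedNodes is an assumption inferred from the descriptions above.

```python
def parse_hosts(text):
    """Map host name -> IP address from HOSTS-file text, ignoring comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            mapping[name.lower()] = ip
    return mapping


def check_cm_settings(hosts_text, defined_nodes, cm_host_name):
    """Return a list of problems; an empty list means the values agree."""
    hosts = parse_hosts(hosts_text)
    problems = []
    for node in defined_nodes:
        if node.lower() not in hosts:
            problems.append(f"{node} is in DefinedNodes but not in HOSTS")
    # Assumption: a node's own interconnect name should be a cluster member.
    if cm_host_name.lower() not in [n.lower() for n in defined_nodes]:
        problems.append(f"CmHostName {cm_host_name} is not in DefinedNodes")
    return problems


HOSTS_TEXT = """\
144.25.190.247 ops1-ipc
144.25.190.248 ops2-ipc
144.25.190.249 ops3-ipc
144.25.190.250 ops4-ipc
"""

if __name__ == "__main__":
    nodes = ["ops1-ipc", "ops2-ipc", "ops3-ipc", "ops4-ipc"]
    for problem in check_cm_settings(HOSTS_TEXT, nodes, "ops1-ipc"):
        print(problem)
```

Running the check on every node with that node's own CmHostName is one way to catch a host name that was mistyped in the Multi-String Editor.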

Cluster Configuration Verification

To verify your cluster configuration:

Start PGMS on each node, using either of the following methods:

From the MS-DOS command line, enter:

C:\> NET START ORACLEPGMSSERVICE

From the Control Panel's Services window, select OraclePGMSService, and click Start.

Check the end of the PGMS.LOG file (SYSTEMROOT\SYSTEM32\PGMS.LOG) to ensure that each time a node is brought up, PGMS reconfigures with the correct number of nodes. For example, if two nodes are up, the following should be in the log file:

15:06:46 | MESSAGE | 006f | HandleReconfig(): Reconfig OK - nodes(2) rcfgGen(5) master(0)

If you are unable to bring up PGMS, check your cluster configuration to make sure that it is correct.
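The log check above can be sketched in Python as follows. This is only an illustration that assumes the reconfiguration message always has the layout shown in the sample line. The function name is hypothetical, and the pattern tolerates the message being wrapped across two lines.

```python
import re

# Matches the tail of a HandleReconfig() message; \s+ also spans a line wrap.
RECONFIG = re.compile(
    r"Reconfig OK - nodes\((\d+)\)\s+rcfgGen\((\d+)\)\s+master\((\d+)\)"
)


def last_reconfig(log_text):
    """Return (nodes, rcfgGen, master) from the last reconfiguration, or None."""
    matches = RECONFIG.findall(log_text)
    if not matches:
        return None
    nodes, generation, master = matches[-1]
    return int(nodes), int(generation), int(master)


SAMPLE = (
    "15:06:46 | MESSAGE | 006f | HandleReconfig(): "
    "Reconfig OK - nodes(2) rcfgGen(5) master(0)\n"
)

if __name__ == "__main__":
    result = last_reconfig(SAMPLE)
    if result is None:
        print("no reconfiguration message found")
    else:
        print("nodes in last reconfiguration:", result[0])
```

Compare the node count returned with the number of nodes you have actually started.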

CM Troubleshooting

During normal operation, the CM on each node checks in with the CMs on the other nodes to verify the health of each member. These check-ins occur every N milliseconds, as specified by the PollInterval registry value in HKEY_LOCAL_MACHINE>SOFTWARE>ORACLE>OSD>CM. A node is allowed to miss M check-ins before it is cast out of the cluster, as specified by the MissCount value in the same key.

Failed check-ins are recorded to the CM error log file (CM.LOG).

These check-in packets are typically UDP packets, and may be lost:

under heavy activity
due to network congestion (if there is not a dedicated interconnect)

If one of your database instances is dropping out of the cluster under heavy activity, you may see messages in the CM.LOG file similar to:

05:01:25 | MESSAGE | PollingThread(): node(1) missed(3) checkin(s) 
05:01:27 | MESSAGE | PollingThread(): node(1) missed(5) checkin(s) 
05:01:28 | MESSAGE | PollingThread(): node(1) failure detected

This occurs if the check-in messages were lost because of the heavy activity. Make sure there is a dedicated interconnect for Oracle Parallel Server that is separate from the rest of the network. Slightly increasing the MissCount value may also help.
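To see quickly which nodes are missing check-ins, messages like the ones above can be summarized with a short script. The Python sketch below assumes the message layout of the sample lines; the function name and return shape are hypothetical.

```python
import re

MISSED = re.compile(r"node\((\d+)\) missed\((\d+)\) checkin")
FAILED = re.compile(r"node\((\d+)\) failure detected")


def summarize_cm_log(log_text):
    """Return ({node: highest missed count}, {nodes reported as failed})."""
    worst = {}
    failed = set()
    for line in log_text.splitlines():
        missed = MISSED.search(line)
        if missed:
            node, count = int(missed.group(1)), int(missed.group(2))
            worst[node] = max(worst.get(node, 0), count)
            continue
        failure = FAILED.search(line)
        if failure:
            failed.add(int(failure.group(1)))
    return worst, failed


SAMPLE = """\
05:01:25 | MESSAGE | PollingThread(): node(1) missed(3) checkin(s)
05:01:27 | MESSAGE | PollingThread(): node(1) missed(5) checkin(s)
05:01:28 | MESSAGE | PollingThread(): node(1) failure detected
"""

if __name__ == "__main__":
    worst, failed = summarize_cm_log(SAMPLE)
    for node in sorted(worst):
        status = "FAILED" if node in failed else "ok"
        print(f"node {node}: up to {worst[node]} missed check-ins ({status})")
```

A node that repeatedly approaches MissCount without failing is a hint that the interconnect is congested.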

Note:

MissCount * PollInterval should never be greater than 20 seconds.
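The limit in this note is simple arithmetic on the two registry values. The following Python sketch shows the calculation; the example values are hypothetical, not Oracle recommendations, and the product is treated as an approximation of how long a silent node stays in the cluster.

```python
def failure_detection_seconds(miss_count, poll_interval_ms):
    """Approximate time before a silent node is cast out, in seconds."""
    return miss_count * poll_interval_ms / 1000.0


def within_limit(miss_count, poll_interval_ms, limit_seconds=20.0):
    """True if MissCount * PollInterval does not exceed the limit."""
    return failure_detection_seconds(miss_count, poll_interval_ms) <= limit_seconds


if __name__ == "__main__":
    # Hypothetical settings: MissCount = 5, PollInterval = 1000 ms.
    print(failure_detection_seconds(5, 1000), within_limit(5, 1000))
```

Re-run the calculation whenever you raise MissCount, since a small increase combined with a long PollInterval can exceed the limit.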

CM Secondary Backup

If you are using the secondary disk backup feature of the CM, try to use a partition on a disk that is not heavily used. The backup disk file is written to by every node member during each check-in. If the backup disk is heavily used, it may cause the CM to miss check-ins and falsely drop node members.

Note:

If you are using the secondary disk backup feature, do not lower PollInterval below 500 milliseconds, because every node writes to the disk backup partition every PollInterval.

CM Error Log File Specification

The CM error log file (CM.LOG) is specified by the ErrorLog value in HKEY_LOCAL_MACHINE>SOFTWARE>ORACLE>OSD>CM:

ErrorLog: REG_SZ: c:\orant\rdbms80\trace\cm.log

Oracle Corporation recommends specifying an error log location of ORANT\RDBMS80\TRACE\CM.LOG.

Performance and Manager Configuration Tips

You must configure the Performance and Management (PM) module so that PGMS can determine the cluster configuration. Each OPS database corresponds to a PGMS group or domain. For example, the INITSID.ORA and INIT_COM.ORA files could have the following parameters defined:

INITOPS1.ORA:

instance_number=1

INITOPS2.ORA:

instance_number=2

INITOPS3.ORA:

instance_number=3

INITOPS4.ORA:

instance_number=4

INIT_COM.ORA:

db_name=ops

The HKEY_LOCAL_MACHINE>SOFTWARE>ORACLE>OSD>PM key would then contain:

where:

Note:

Each row entry must be entered on a separate line in the Multi-String Editor dialog box. Instance numbers must be sequential, such as 0, 1, 2. Do not skip instance numbers, such as 0, 1, 3. Also, the key name (OPS) must match the value of DB_NAME in INIT_COM.ORA.
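The numbering rule in this note can be checked mechanically before you start the instances. The Python sketch below takes the instance numbers as a plain list and does not read the registry. The function names are hypothetical, and the duplicate check is an added assumption that the note does not state explicitly.

```python
def missing_instance_numbers(numbers):
    """Return the numbers skipped between the lowest and highest value."""
    present = set(numbers)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def duplicate_instance_numbers(numbers):
    """Return the numbers that appear more than once, in ascending order."""
    seen = set()
    duplicates = set()
    for n in numbers:
        if n in seen:
            duplicates.add(n)
        seen.add(n)
    return sorted(duplicates)


if __name__ == "__main__":
    for candidate in ([0, 1, 2], [0, 1, 3]):
        print(candidate, "skipped:", missing_instance_numbers(candidate))
```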

ORA-29702

If the instance numbers in the PM key do not match those specified in the INITSID.ORA file, you will receive the following error in ORACLE_HOME\RDBMS80\TRACE\SIDLMON.TRC upon instance startup:

ORA-29702: error occurred in Group Membership Service operation

Starting Services

If you are having difficulty starting services or the database, check the PGMS.LOG file stored in SYSTEMROOT\SYSTEM32\PGMS.LOG.

If you used the CRTSRV script in "Step 4: Create Services" in Chapter 5, "Configuring Oracle Parallel Server", OraclePGMSService automatically starts up and shuts down whenever the OracleServiceSID service is started or stopped.

If you did not use the CRTSRV script, you can still have OraclePGMSService start up automatically with an OracleServiceSID service by entering the following at the command line on each node:

C:\> OPSREG80 ADD SID

You can also stop OraclePGMSService from starting automatically with the OracleServiceSID service by entering the following at the command line on each node:

C:\> OPSREG80 DEL SID

DYNAMIC RESOURCES ALLOCATED or DYNAMIC LOCKS ALLOCATED

The following messages appear if LM_RESS and LM_LOCKS values are not sufficient, and additional IDLM locks or resources must be allocated dynamically from the SGA:

DYNAMIC RESOURCES ALLOCATED
DYNAMIC LOCKS ALLOCATED

If these messages appear often, the dynamic allocations may eventually exhaust the SGA. To prevent this, increase the LM_RESS and LM_LOCKS parameters to values appropriate for your database.

Additional Information:

See Chapter 15, "Allocating PCM Instance Locks Oracle Parallel Server," of the Oracle8 Parallel Server Concepts and Administration guide.

Understanding the Trace Files

This section discusses the following trace file subjects:

Background Thread Trace Files

Oracle Parallel Server background threads use trace files to record occurrences and exceptions of database operations, as well as errors. These detailed trace logs are helpful to Oracle support to debug problems in your cluster configuration. Background thread trace files are created regardless of whether the BACKGROUND_DUMP_DEST parameter is set in the INIT_COM.ORA initialization parameter file. If BACKGROUND_DUMP_DEST is set, the trace files are stored in the directory specified. If the parameter is not set, the trace files are stored in the ORACLE_HOME\RDBMS80\TRACE directory.

The Oracle8 database creates a different trace file for each background thread. The name of the trace file contains the name of the background thread, followed by the extension .TRC, such as:

SIDDBWR.TRC
SIDSMON.TRC

Oracle Parallel Server trace information is reported in the following trace files:

Trace File	Description
SIDLCKN.TRC	Trace file for the LCKn process. This trace file shows lock requests for other background processes.
SIDLMDN.TRC	Trace file for the LMDn process. This trace file shows lock requests.
SIDLMON.TRC	Trace file for the LMON process. This trace file shows the status of the cluster, including the "Reconfiguration complete" message.
SIDP00N.TRC	Trace file for the parallel query slaves.

User Thread Trace Files

Trace files are also created for user threads if the USER_DUMP_DEST parameter is set in the initialization parameter file. The trace files for the user threads have the form ORAXXXXX.TRC, where XXXXX is a 5-digit number indicating the Windows NT thread ID.

Alert File

The alert file, SIDALRT.LOG, contains important information about error messages and exceptions that occur during database operations. Each instance has one alert file; information is appended to the file each time you start the instance. All threads can write to the alert file.

SIDALRT.LOG is found in the directory specified by the BACKGROUND_DUMP_DEST parameter in the INIT_COM.ORA initialization parameter file. If the BACKGROUND_DUMP_DEST parameter is not set, the SIDALRT.LOG file is generated in ORACLE_HOME\RDBMS80\TRACE.
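The lookup rule above (use BACKGROUND_DUMP_DEST if it is set, otherwise ORACLE_HOME\RDBMS80\TRACE) can be sketched in Python. The function names are hypothetical, the parameter file is passed in as text, and the sketch assumes the simple name=value form with # comments. It uses the ntpath module so that Windows paths are joined correctly on any platform.

```python
import ntpath
import re


def background_dump_dest(init_text):
    """Return the BACKGROUND_DUMP_DEST value from parameter-file text, or None."""
    for line in init_text.splitlines():
        line = line.split("#", 1)[0].strip()
        match = re.match(r"background_dump_dest\s*=\s*(.+)$", line, re.IGNORECASE)
        if match:
            return match.group(1).strip().strip("'\"")
    return None


def alert_file_path(sid, oracle_home, init_text):
    """Build the expected alert file location for an instance."""
    directory = background_dump_dest(init_text)
    if directory is None:
        # Default location when the parameter is not set.
        directory = ntpath.join(oracle_home, "RDBMS80", "TRACE")
    return ntpath.join(directory, sid + "ALRT.LOG")


if __name__ == "__main__":
    print(alert_file_path("OPS1", r"C:\ORANT", "db_name=ops\n"))
```

The same directory logic applies to the background thread trace files, so the helper can be reused to locate them.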

Error Call Trace Stack

Oracle Worldwide Support may ask you to create an error call trace stack for a particular trace file. An error call trace stack provides a program trace of a specific background or user thread in the database.

To create an error call trace:

Obtain the Oracle process ID for the background processes:

C:\> SVRMGR30
SVRMGR30> CONNECT INTERNAL/PASSWORD
SELECT PID "Oracle Process Id", 
       NAME 
    FROM V$PROCESS, V$BGPROCESS 
    WHERE V$PROCESS.ADDR = V$BGPROCESS.PADDR;

Output displayed looks like this:

Oracle Pro NAME 
---------- ----- 
         2 PMON 
         3 LMON 
         4 LMD0 
         5 DBW0 
         6 LGWR 
         7 CKPT 
         8 SMON 
         9 RECO 
        10 SNP0 
        11 SNP1 
        13 LCK0

Dump the trace stack to the trace file. For example, to dump out the trace stack of LMON, enter:

Set the Oracle process ID to LMON, which is 3 in this example:

SVRMGR30> ORADEBUG SETORAPID 3

Dump the error stack to SIDLMON.TRC:

SVRMGR30> ORADEBUG DUMP ERRORSTACK 3
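When several background threads need to be dumped, the lookup in the steps above can be scripted. The Python sketch below parses a listing in the format shown earlier and prints the two ORADEBUG commands for a named process. It only generates text to paste into Server Manager and does not connect to the database. The function names are hypothetical.

```python
def parse_process_listing(output):
    """Map background process name -> Oracle process ID from the query output."""
    pids = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly two fields: a numeric PID and a name.
        if len(fields) == 2 and fields[0].isdigit():
            pids[fields[1]] = int(fields[0])
    return pids


def errorstack_commands(pids, name, level=3):
    """Return the ORADEBUG commands that dump the error stack of one process."""
    pid = pids[name]
    return [f"ORADEBUG SETORAPID {pid}", f"ORADEBUG DUMP ERRORSTACK {level}"]


LISTING = """\
Oracle Pro NAME
---------- -----
         2 PMON
         3 LMON
         4 LMD0
        13 LCK0
"""

if __name__ == "__main__":
    for command in errorstack_commands(parse_process_listing(LISTING), "LMON"):
        print(command)
```

The level argument defaults to 3 to match the example in the procedure.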

Cluster Tracing

CM and PGMS tracing can be helpful to Oracle Worldwide Support in debugging your cluster configuration problems in cases where the database is not starting, a particular node is hanging, or there is a node crash.

PGMS Tracing

PGMS tracing is stored in the PGMS log file, SYSTEMROOT\SYSTEM32\PGMS.LOG.

Note:

Do not enable detailed tracing during normal database operation.

To enable detailed PGMS tracing:

De-install the OraclePGMSService:

PGMS /R

Re-install OraclePGMSService with debug flags turned on:

PGMS /I:"C:\ORANT\BIN\PGMS.EXE /D /V /S"

where:

/D	debug tracing
/V	verbose tracing
/S	spy on PGMS network packets

To disable tracing:

De-install the OraclePGMSService:

PGMS /R

Re-install OraclePGMSService with debug flags turned off:

PGMS /I:"C:\ORANT\BIN\PGMS.EXE"

CM Tracing

CM tracing is stored in the error log file, CM.LOG. The location of CM.LOG is defined by the ErrorLog value in HKEY_LOCAL_MACHINE>SOFTWARE>ORACLE>OSD>CM.

To enable detailed CM tracing:

Stop CMSRVR.EXE by rebooting the node.
Specify the CMSrvrpath value in HKEY_LOCAL_MACHINE>SOFTWARE>ORACLE>OSD>CM. CMSrvrpath specifies the CM executable and the trace flags it is started with.

CMSrvrpath: REG_SZ: c:\orant\osdbin\cmsrvr.exe /v /c /s

where:

/v	verbose
/c	trace client requests
/s	spy on CM network traffic

Using PhysicalDrive for Raw Partitions

When creating symbolic links for the logical partitions with the SETLINKS utility, do not use the prefix \\.\PhysicalDrive. If you use \\.\PhysicalDrive as a symbolic link, you may corrupt your database files. Use the symbolic links provided in the ORALINKx.TBL file(s), as described in Chapter 5, "Configuring Oracle Parallel Server".

SHUTDOWN ABORT

SHUTDOWN ABORT is not recommended. Oracle Corporation recommends shutting down the OracleServiceSID service instead, so that the Windows NT operating system correctly cleans up resources such as memory and files.

To shut down OracleServiceSID:

From the MS-DOS command line, enter:

C:\> NET STOP OracleServiceSID

From the Control Panel's Services window, select the OracleServiceSID service, then choose Stop.

Contacting Oracle Worldwide Customer Support

If, after reading this appendix, you still cannot resolve your problems, call Oracle Worldwide Customer Support to report the error. Please have the following information at hand:

cluster hardware (for example, a two-node cluster of Dell PowerEdge 6100 servers)
Windows NT version (for example, Windows NT Workstation, Server, or Enterprise 4.0 with Service Pack 3)
all five digits in the release number of the Oracle RDBMS (for example, 8.0.4.1.0)
all five digits in the release number of the Oracle Parallel Server Option
version number of PGMS, which can be obtained from SYSTEMROOT\SYSTEM32\PGMS.LOG
contents of the HKEY_LOCAL_MACHINE>SOFTWARE>ORACLE>OSD key
cluster OSD upgrades from the vendor
particular operation that failed (for example, database startup or a query)
steps to reproduce the problem

Severe Errors

If an ORA-600 error occurs, it is written to the SIDALRT.LOG file. If an ORA-600 error or any other severe error appears in the SIDALRT.LOG file, provide all files in ORACLE_HOME\RDBMS80\TRACE, together with the PGMS.LOG file located in SYSTEMROOT\SYSTEM32.
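Before calling support, it can help to list every Oracle error recorded in the alert file. The Python sketch below scans alert file text for ORA- error codes and flags ORA-600. The function names and the sample text are hypothetical, and the pattern assumes errors are written as ORA- followed by digits, with or without leading zeros.

```python
import re

# Captures the error number without leading zeros, e.g. ORA-00600 -> 600.
ORA_ERROR = re.compile(r"\bORA-0*(\d+)\b")


def find_ora_errors(alert_text):
    """Return (line number, error number) pairs for every ORA- code found."""
    hits = []
    for lineno, line in enumerate(alert_text.splitlines(), start=1):
        for match in ORA_ERROR.finditer(line):
            hits.append((lineno, int(match.group(1))))
    return hits


def has_internal_error(alert_text):
    """True if the text contains at least one ORA-600 error."""
    return any(code == 600 for _, code in find_ora_errors(alert_text))


SAMPLE = """\
Starting up ORACLE RDBMS Version: 8.0.4.1.0.
ORA-00600: internal error code
ORA-29702: error occurred in Group Membership Service operation
"""

if __name__ == "__main__":
    for lineno, code in find_ora_errors(SAMPLE):
        print(f"line {lineno}: ORA-{code}")
    print("ORA-600 present:", has_internal_error(SAMPLE))
```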
