Possible node failures can be caused by the following conditions:

HPMC. This is a High Priority Machine Check, a system panic caused by a hardware error.

TOC. This is a Transfer of Control.

Panics

Hangs

Power failures

In the event of a TOC, a system dump is performed on the failed node and numerous messages are also displayed on the console.

You can use the following commands to check the status of your network and subnets:

netstat -in - to display LAN status and check to see if the package IP is stacked on the LAN card.

lanscan - to see if the LAN is on the primary interface or has switched to the standby interface.

arp -a - to check the ARP tables.

lanadmin - to display, test, and reset the LAN cards.
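
The non-interactive checks in this list can be collected in a small wrapper. The following is a minimal sketch, not a supported tool: the function names run_check and network_checks are hypothetical, and lanadmin is left out because it is normally run interactively. Commands that are not installed on the host are reported and skipped.

```shell
#!/bin/sh
# Sketch only: run_check and network_checks are hypothetical helper names.

run_check() {
    # $1 = short description; remaining arguments = command to run
    desc=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $desc: $*"
        # Report a failing command instead of aborting the remaining checks
        "$@" 2>&1 || echo "   (command exited with status $?)"
    else
        echo "== $desc: $1 not found, skipped"
    fi
}

network_checks() {
    run_check "LAN status and package IP addresses" netstat -in
    run_check "primary or standby interface in use" lanscan
    run_check "ARP tables" arp -a
}
```

Calling network_checks on a cluster node prints the output of each command under its own heading, which makes it easier to compare nodes or to attach the result to a problem report.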

Since your cluster is unique, there are no cookbook solutions to all possible problems. But if you apply these checks and commands and work your way through the log files, you will be successful in identifying and solving problems.

Troubleshooting the Quorum Server

NOTE: See the HP Serviceguard Quorum Server Version A.04.00 Release Notes for information about configuring the Quorum Server. Do not proceed without reading the Release Notes for your version.

Authorization File Problems

The following kind of message in a Serviceguard node’s syslog file or in the output of cmviewcl -v may indicate an authorization problem:

Access denied to quorum server 192.6.7.4

The reason may be that you have not updated the authorization file. Verify that the node is included in the file, and try using /usr/lbin/qs -update to re-read the Quorum Server authorization file.
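
The check of the authorization file can be scripted on the Quorum Server system. The sketch below makes several assumptions that you should verify against the Release Notes for your version: the default file path /etc/cmcluster/qs_authfile, a format of one host name per line with # comments, and a lone + meaning that all hosts are allowed. The function name node_in_authfile and the node name in the usage example are hypothetical.

```shell
#!/bin/sh
# Assumed default location of the authorization file; override if yours differs.
QS_AUTHFILE=${QS_AUTHFILE:-/etc/cmcluster/qs_authfile}

node_in_authfile() {
    # $1 = node name exactly as it should appear in the file
    # $2 = authorization file (optional, defaults to $QS_AUTHFILE)
    node=$1
    file=${2:-$QS_AUTHFILE}
    # Strip comments and surrounding whitespace, then look for a whole-line
    # match of the node name or of "+" (assumed to mean "allow all hosts").
    sed -e 's/#.*//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$file" |
        grep -Fxq -e "$node" -e "+"
}
```

For example, node_in_authfile node1.example.com || echo "node1.example.com is not authorized" reports a missing entry. The match is on the whole line, so a short host name does not match a fully qualified entry. After correcting the file, run /usr/lbin/qs -update as described above.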

Timeout Problems

The following kinds of message in a Serviceguard node’s syslog file may indicate timeout problems:

Unable to set client version at quorum server 192.6.7.2: reply timed out

Probe of quorum server 192.6.7.2 timed out

These messages could indicate an intermittent network problem, or the default Quorum Server timeout may not be sufficient. You can set QS_TIMEOUT_EXTENSION to increase the timeout, or you can increase the MEMBER_TIMEOUT value. See “Cluster Configuration Parameters” (page 109) for more information about these parameters.
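
Before changing either parameter, it can help to see how often the timeout messages occur. The following is a minimal sketch that counts them in a node's syslog file. The path /var/adm/syslog/syslog.log is an assumption about your system, the function name qs_timeout_count is hypothetical, and the pattern is derived only from the two sample messages shown above.

```shell
#!/bin/sh
# Assumed syslog location; override SYSLOG if your system logs elsewhere.
SYSLOG=${SYSLOG:-/var/adm/syslog/syslog.log}

qs_timeout_count() {
    # $1 = log file to scan (optional, defaults to $SYSLOG)
    file=${1:-$SYSLOG}
    # Both sample messages contain "quorum server <address>" followed by
    # "timed out". grep -c prints the number of matching lines; it exits
    # non-zero when that number is 0, which is not an error here.
    grep -c 'quorum server .*timed out' "$file" || true
}
```

A count that grows steadily over time points toward a timeout that is too short for the environment, while isolated bursts are more consistent with an intermittent network problem.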

A message such as the following in a Serviceguard node’s syslog file indicates that the node did not receive a reply to its lock request in time. The cause could be a delay in communication between the node and the Quorum Server, or between the Quorum Server and other nodes in the cluster:
