
7.3.4 Changing the Failure Detection Rate
Use the SMIT Change/Show a Cluster Network Module screen to change the failure detection rate for your network module only if enabling I/O pacing or extending the syncd frequency did not resolve deadman problems in your cluster. By changing the failure detection rate to “Slow”, you can extend the time required before the deadman switch is invoked on a hung node and before a takeover node detects a node failure and acquires a hung node’s resources. See the HACMP for AIX, Version 4.3: Administration Guide,
Note
I/O pacing must be enabled before completing these procedures; it regulates the number of I/O data transfers. Also, keep in mind that the Slow setting for the Failure Detection Rate is network-specific.
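As a rough illustration of why a slower failure detection rate gives a hung node more time before it is declared failed, the following Python sketch models a detection rate as a keepalive interval multiplied by a number of tolerated missed keepalives. The class name, the function name, and all numeric values are assumptions made for this example; they are not the values HACMP uses, and the real settings should be taken from the SMIT screen and the Administration Guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionRate:
    """Hypothetical model of a failure detection rate (not an HACMP structure)."""
    name: str
    keepalive_interval_s: float  # assumed seconds between keepalives
    missed_limit: int            # assumed number of missed keepalives tolerated

    def detection_time_s(self) -> float:
        # Time a node may stay silent before its partner declares it failed.
        return self.keepalive_interval_s * self.missed_limit


# Placeholder values chosen only to show the ordering Fast < Normal < Slow.
RATES = {
    "Fast": DetectionRate("Fast", 1.0, 4),
    "Normal": DetectionRate("Normal", 1.0, 10),
    "Slow": DetectionRate("Slow", 2.0, 10),
}


def is_declared_failed(rate: DetectionRate, silent_for_s: float) -> bool:
    """Return True if a node silent for silent_for_s seconds counts as failed."""
    return silent_for_s >= rate.detection_time_s()


if __name__ == "__main__":
    # A node that hangs for 15 seconds under each placeholder setting.
    for rate in RATES.values():
        print(rate.name, rate.detection_time_s(), is_declared_failed(rate, 15.0))
```

Under these placeholder numbers, a node that is unresponsive for 15 seconds would be treated as failed at the Normal setting but not at the Slow setting, which is the effect the text describes.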
7.4 Node Isolation and Partitioned Clusters
Node isolation occurs when all networks connecting the nodes fail but the nodes remain up and running. One or more nodes can then be completely isolated from the others. A cluster in which this has happened is called a partitioned cluster. A partitioned cluster has two groups of nodes (one or more in each), neither of which can communicate with the other. Let’s consider a two-node cluster where all networks have failed between the two nodes, but each node remains up and running.
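The definition above can be sketched in Python by treating the cluster as a graph: nodes are vertices, each working network connection is an edge, and the groups of nodes that can still communicate are the connected components. The function names and node names here are hypothetical and exist only for this illustration; the sketch also reports any number of groups greater than one as a partition, which is slightly more general than the two-group case described in the text.

```python
from typing import FrozenSet, Iterable, List, Sequence, Set, Tuple


def node_groups(nodes: Sequence[str],
                working_links: Iterable[Tuple[str, str]]) -> List[FrozenSet[str]]:
    """Group nodes that can still reach each other over working links.

    Assumes every link endpoint is listed in nodes.
    """
    adjacency = {node: set() for node in nodes}
    for a, b in working_links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen: Set[str] = set()
    groups: List[FrozenSet[str]] = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from start.
        group: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(frozenset(group))
    return groups


def is_partitioned(nodes: Sequence[str],
                   working_links: Iterable[Tuple[str, str]]) -> bool:
    """A cluster is partitioned when its nodes fall into more than one group."""
    return len(node_groups(nodes, working_links)) > 1


if __name__ == "__main__":
    cluster = ["nodeA", "nodeB"]
    print(is_partitioned(cluster, [("nodeA", "nodeB")]))  # one working network
    print(is_partitioned(cluster, []))                    # all networks failed
```

For the two-node case, a single working link keeps both nodes in one group, while an empty list of working links leaves each node in a group of its own even though both nodes are still running.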
The problem with a partitioned cluster is that each node interprets the absence of keepalives from its partner to mean that the other node has failed, and then generates node failure events. Once this occurs, each node attempts to take over resources from a node that is still active and therefore still legitimately owns those resources. These attempted takeovers can cause unpredictable results in the
To guard against a TCP/IP subsystem failure causing node isolation, the nodes should also be connected by a
It is important to understand that the serial network does not carry TCP/IP communication between nodes; it only allows nodes to exchange keepalives