IBM SG24-5131-00 Changing the Failure Detection Rate, Node Isolation and Partitioned Clusters


7.3.4 Changing the Failure Detection Rate

Use the SMIT Change/Show a Cluster Network Module screen to change the failure detection rate for your network module only if enabling I/O pacing or extending the syncd frequency did not resolve deadman problems in your cluster. By changing the failure detection rate to “Slow”, you can extend the time required before the deadman switch is invoked on a hung node and before a takeover node detects a node failure and acquires a hung node’s resources. See the HACMP for AIX, Version 4.3: Administration Guide, SC23-4279, for more information and instructions on changing the Failure Detection Rate.

Note

I/O pacing must be enabled before completing these procedures; it regulates the number of I/O data transfers. Also, keep in mind that the Slow setting for the Failure Detection Rate is network-specific.
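
As a rough illustration of why a slower failure detection rate extends the time before a node is declared failed, the sketch below models detection time as the keepalive interval multiplied by the number of missed keepalives tolerated. The class name, field names, and all numeric values are assumptions chosen for illustration; they are not HACMP defaults and are not taken from the Administration Guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionRate:
    """Hypothetical failure detection setting (illustrative values only)."""
    name: str
    keepalive_interval_s: float  # assumed seconds between keepalives
    missed_limit: int            # assumed number of missed keepalives tolerated

    def detection_time_s(self) -> float:
        # Time that must pass in silence before the partner is declared failed.
        return self.keepalive_interval_s * self.missed_limit


# Made-up settings; real values depend on the network module configuration.
RATES = {
    "Fast": DetectionRate("Fast", 0.5, 12),
    "Normal": DetectionRate("Normal", 1.0, 12),
    "Slow": DetectionRate("Slow", 2.0, 12),
}

if __name__ == "__main__":
    for rate in RATES.values():
        print(f"{rate.name}: {rate.detection_time_s():.1f} s before failure is declared")
```

Under these assumed numbers, moving from Normal to Slow doubles the silent period a hung node is allowed before the deadman switch or a takeover would be triggered.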

7.4 Node Isolation and Partitioned Clusters

Node isolation occurs when all networks connecting nodes fail but the nodes remain up and running. One or more nodes can then be completely isolated from the others. A cluster in which this has happened is called a partitioned cluster. A partitioned cluster has two groups of nodes (one or more in each), neither of which can communicate with the other. Let’s consider a two-node cluster where all networks have failed between the two nodes, but each node remains up and running.

The problem with a partitioned cluster is that each node interprets the absence of keepalives from its partner to mean that the other node has failed, and then generates node failure events. Once this occurs, each node attempts to take over resources from a node that is still active and therefore still legitimately owns those resources. These attempted takeovers can cause unpredictable results in the cluster—for example, data corruption due to a disk being reset.
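
The behavior described above can be sketched as a small simulation. It is a simplified model, not the Cluster Manager's actual logic: the function names, node names, and timeout value are hypothetical. Each running node tracks when it last heard a keepalive from its partner and treats prolonged silence as a node failure.

```python
def partner_looks_failed(last_keepalive_s: float, now_s: float, timeout_s: float) -> bool:
    """True if no keepalive has arrived within the timeout window."""
    return now_s - last_keepalive_s > timeout_s


def nodes_attempting_takeover(last_seen: dict[str, float], now_s: float,
                              timeout_s: float) -> list[str]:
    """Return the running nodes that would generate a node failure event.

    last_seen maps each running node to the time (in seconds) at which it
    last received a keepalive from its partner.
    """
    return sorted(
        node for node, heard_at in last_seen.items()
        if partner_looks_failed(heard_at, now_s, timeout_s)
    )


if __name__ == "__main__":
    # All networks failed at t=10 s; both nodes are still up at t=30 s.
    partitioned = {"nodeA": 10.0, "nodeB": 10.0}
    print(nodes_attempting_takeover(partitioned, now_s=30.0, timeout_s=12.0))
```

In the partitioned case both nodes appear in the result, which is the situation where each tries to acquire resources that the other still legitimately owns.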

To guard against a TCP/IP subsystem failure causing node isolation, the nodes should also be connected by a point-to-point serial network. This connection reduces the chance of node isolation by allowing the Cluster Managers to communicate even when all TCP/IP-based networks fail.
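
A minimal sketch of why an additional serial path helps, assuming a node declares its partner failed only when keepalives have stopped on every configured network. The function name, network labels, and timeout are hypothetical, and the model ignores everything else the Cluster Manager does.

```python
def partner_failed(last_keepalive_by_network: dict[str, float], now_s: float,
                   timeout_s: float) -> bool:
    """True only if every network has been silent longer than the timeout."""
    if not last_keepalive_by_network:
        raise ValueError("at least one network must be configured")
    return all(
        now_s - heard_at > timeout_s
        for heard_at in last_keepalive_by_network.values()
    )


if __name__ == "__main__":
    # TCP/IP networks went silent at t=10 s; the serial link is still delivering.
    with_serial = {"ethernet": 10.0, "token_ring": 10.0, "rs232": 29.0}
    tcpip_only = {"ethernet": 10.0, "token_ring": 10.0}
    print(partner_failed(with_serial, now_s=30.0, timeout_s=12.0))
    print(partner_failed(tcpip_only, now_s=30.0, timeout_s=12.0))
```

With the serial link present, the loss of all TCP/IP networks alone does not lead to a node failure event in this model; without it, the same outage does.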

It is important to understand that the serial network does not carry TCP/IP communication between nodes; it only allows nodes to exchange keepalives
