Arbitration for Data Integrity in Serviceguard Clusters

Cluster Membership Concepts

Cluster Membership Concepts

What is arbitration? Why is it necessary? When and how is it carried out? To answer these questions, it is necessary to explain a number of clustering concepts that are central to the processes of cluster formation and re-formation. These concepts are membership, quorum, split-brain, and tie-breaking.

Membership

A cluster is a networked collection of nodes. The key to success in controlling the location of applications in the cluster and ensuring there is no inappropriate duplication is maintaining a well-defined cluster node list. When the cluster starts up, all the nodes communicate and build this membership list, a copy of which is in the memory of every node. The list is validated continuously as the cluster runs; this is done by means of heartbeat messages that are transmitted among all the nodes. As nodes enter and leave the cluster, the list is changed in memory. Changes in membership can result from an operator’s issuing a command to run or halt a node, or from system events that cause a node to halt, reboot, or crash. Some of these events are routine, and some may be unexpected. There are frequent cases in cluster operation when cluster membership is changing and when the cluster software must determine which node in the cluster should run an application.

How does the cluster software tell where an application should run? In a running cluster, when one system cannot communicate with the others for a significant amount of time, there can be several possible reasons:

1.The node has crashed.

2.The node is experiencing a kernel hang, and processing has stopped.

3.The cluster is partitioned because of a network problem. Either all the network cards connecting the node to the rest of the cluster have failed, or all the cables connecting the cards to the network have failed, or there has been a failure of the network itself.

It is often impossible for the cluster manager software to distinguish (1) from (2) and (3), and therein lies a problem, because in case (1), it is safe to restart the application on another node in the cluster, but in (2) and (3), it is not safe.

4

Page 4
Image 4
HP Serviceguard manual Cluster Membership Concepts