HP Serviceguard Metrocluster 40

Monitoring over a Network

A monitor package running on one cluster tracks the health of the other cluster in the recovery pair and sends notifications to configured destinations when the state of the monitored cluster changes. (Any cluster that contains packages to be recovered must be monitored.) The monitor software polls the monitored cluster at the MONITOR_INTERVAL defined in an ASCII configuration file; the same file specifies when and where to send messages after a state change.
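The polling behavior described above can be illustrated with a minimal Python sketch. It is not the Continentalclusters monitor: the function `monitor`, its `poll` and `notify` callbacks, and the 60-second value are hypothetical names and values chosen for illustration. Only the idea of polling at a fixed MONITOR_INTERVAL and notifying on a state change comes from the text.

```python
import time
from typing import Callable, Optional

# Hypothetical value in seconds; the real interval comes from the
# ASCII configuration file described in the text.
MONITOR_INTERVAL = 60


def monitor(
    poll: Callable[[], str],
    notify: Callable[[str, str], None],
    interval: float = MONITOR_INTERVAL,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll a cluster at a fixed interval and notify on each state change.

    poll      -- returns the current state name, e.g. "Up" or "Unreachable"
    notify    -- called with (old_state, new_state) when the state changes
    max_polls -- stop after this many polls (None means run indefinitely)
    """
    previous: Optional[str] = None
    polls = 0
    while max_polls is None or polls < max_polls:
        current = poll()
        # The first poll only establishes a baseline; later polls are
        # compared with the previously observed state.
        if previous is not None and current != previous:
            notify(previous, current)
        previous = current
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval)


if __name__ == "__main__":
    # Simulated poll results standing in for real network inquiries.
    samples = iter(["Up", "Up", "Unreachable", "Down"])
    monitor(
        poll=lambda: next(samples),
        notify=lambda old, new: print(f"cluster event: {old} -> {new}"),
        interval=0,
        max_polls=4,
    )
```

Passing `poll`, `notify`, and `sleep` as parameters keeps the loop independent of any particular network probe or notification destination, which also makes it easy to exercise with simulated states.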

The physical separation between clusters requires communication over a Local or Wide Area Network (LAN or WAN). Because polling takes place across the network, an interruption of network service cannot always be distinguished from a cluster failure. If the network is unreliable, the monitoring facility will therefore often report an Unreachable state for the monitored cluster when the real problem is an interruption of network service.

Because the monitoring is indeterminate in some instances, information from independent sources must be gathered to decide whether to proceed with recovery. For this reason, cluster recovery is not automatic; it must be initiated by a root user. Once initiated, however, cluster recovery is automated to reduce the chance of human error that manual steps would introduce. Continentalclusters provides a system of cluster events and notifications so that events can be easily tracked and users know when to seek additional information before initiating recovery.

Cluster Events

A cluster event is a change of state in a monitored cluster. The four cluster states reported by the monitor are Unreachable, Down, Up, and Error. Table 6 summarizes possible causes for the cluster events with regard to both the monitored cluster and the network. In many cases, however, the cause of a cluster event cannot be determined without additional information that is not available to the software.

Table 6 Monitored States and Possible Causes

Cluster Event (Old state -> New state)	Cluster-related Causes	Network-related Causes
Up -> Unreachable	Cluster went down; no nodes are responding to network inquiries	Network failure
Down -> Unreachable	Cluster was down and nodes are no longer responding	Network failure
Error -> Unreachable	Error resolved but cluster down and nodes not responding	Network failure
Up -> Down	Cluster has been halted, but at least one node is still responding to network inquiries	No network problems
Error -> Down	Error resolved, cluster is down	Network problem was fixed, cluster is down
Unreachable -> Down	Cluster nodes were rebooted but the cluster was not started	Network came up but the cluster was not running
Up -> Error	Serviceguard version or security file mismatch, software error	Network is misconfigured, or DNS server crashed or set up incorrectly
Down -> Error	Serviceguard version or security file mismatch, software error	Network is misconfigured, or DNS server crashed or set up incorrectly
Unreachable -> Error	Serviceguard version or security file mismatch, software error	Network problem was fixed, but the error condition still exists
Down -> Up	Cluster started	No network problems
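
As an illustration only, the rows of Table 6 listed above can be encoded as a lookup keyed by the (old state, new state) pair, so that a notification handler could attach the candidate causes to each event. The names `POSSIBLE_CAUSES` and `possible_causes` are hypothetical and are not part of Continentalclusters. The cause strings are copied from the table, and a transition that is not listed above returns None rather than a guess.

```python
from typing import Dict, Optional, Tuple

SW_ERROR = "Serviceguard version or security file mismatch, software error"
NET_MISCONFIG = ("Network is misconfigured, or DNS server crashed "
                 "or set up incorrectly")

# (old state, new state) -> (cluster-related causes, network-related causes)
POSSIBLE_CAUSES: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("Up", "Unreachable"): (
        "Cluster went down; no nodes are responding to network inquiries",
        "Network failure"),
    ("Down", "Unreachable"): (
        "Cluster was down and nodes are no longer responding",
        "Network failure"),
    ("Error", "Unreachable"): (
        "Error resolved but cluster down and nodes not responding",
        "Network failure"),
    ("Up", "Down"): (
        "Cluster has been halted, but at least one node is still "
        "responding to network inquiries",
        "No network problems"),
    ("Error", "Down"): (
        "Error resolved, cluster is down",
        "Network problem was fixed, cluster is down"),
    ("Unreachable", "Down"): (
        "Cluster nodes were rebooted but the cluster was not started",
        "Network came up but the cluster was not running"),
    ("Up", "Error"): (SW_ERROR, NET_MISCONFIG),
    ("Down", "Error"): (SW_ERROR, NET_MISCONFIG),
    ("Unreachable", "Error"): (
        SW_ERROR,
        "Network problem was fixed, but the error condition still exists"),
    ("Down", "Up"): ("Cluster started", "No network problems"),
}


def possible_causes(old: str, new: str) -> Optional[Tuple[str, str]]:
    """Return (cluster causes, network causes) for a listed event, else None."""
    return POSSIBLE_CAUSES.get((old, new))


if __name__ == "__main__":
    causes = possible_causes("Up", "Unreachable")
    if causes is not None:
        cluster_cause, network_cause = causes
        print("Cluster-related:", cluster_cause)
        print("Network-related:", network_cause)
```

The lookup returns both columns together because, as the text notes, the software usually cannot tell which of the two applies; the operator still has to consult independent sources before initiating recovery.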
