Monitoring over a Network

A monitor package running on one cluster tracks the health of the other cluster in the recovery pair and sends notifications to configured destinations if the state of the monitored cluster changes. (If a cluster contains any packages to be recovered, it must be monitored.) The monitor software polls the monitored cluster at the MONITOR_INTERVAL defined in an ASCII configuration file, which also specifies when and where to send messages if there is a state change.
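The polling behavior described above can be sketched as a loop that compares each polled state with the previous one and reports only the changes. This is a minimal illustration, not the product's implementation: the names `watch_cluster` and `probe` are hypothetical, the interval is assumed to be in seconds, and the first poll is assumed to establish a baseline without raising an event.

```python
import time
from typing import Callable, Iterator, Tuple

# The four states reported by the monitor, as named in this section.
STATES = ("Up", "Down", "Unreachable", "Error")


def watch_cluster(
    probe: Callable[[], str],
    monitor_interval: float,
    max_polls: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, str]]:
    """Poll `probe` and yield (old_state, new_state) on each state change.

    `probe` is a hypothetical callable standing in for the network inquiry
    to the monitored cluster; it must return one of STATES.
    """
    previous = None
    for poll in range(max_polls):
        current = probe()
        if current not in STATES:
            raise ValueError(f"unknown cluster state: {current!r}")
        # Assumption: the first poll only records a baseline, no event.
        if previous is not None and current != previous:
            yield (previous, current)
        previous = current
        if poll < max_polls - 1:
            sleep(monitor_interval)


if __name__ == "__main__":
    # Simulated poll results instead of a real network inquiry.
    samples = iter(["Up", "Up", "Unreachable", "Down", "Up"])
    for old, new in watch_cluster(lambda: next(samples), 0.0, 5):
        print(f"cluster event: {old} -> {new}")
```

In a real deployment each yielded event would be matched against the notification rules in the configuration file; that step is omitted here because the file syntax is not shown in this section.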

The physical separation between clusters requires communication by way of a Local or Wide Area Network (LAN or WAN). Since the polling takes place across the network, interruptions of network service cannot always be differentiated from cluster failure states. This means that if the network is unreliable, the monitoring facility will often report the monitored cluster as Unreachable when the actual cause is an interruption of network service.

Because the monitoring is indeterminate in some instances, information from independent sources must be gathered to decide whether to proceed with the recovery process. For this reason, cluster recovery is not automatic, but must be initiated by a root user. Once initiated, however, the cluster recovery is automated to reduce the chance of human error that might occur if manual steps were needed. Continentalclusters provides a system of cluster events and notifications so that events can be easily tracked and users know when to seek additional information before initiating recovery.

Cluster Events

A cluster event is a change of state in a monitored cluster. The four cluster states reported by the monitor are Unreachable, Down, Up, and Error. Table 6 summarizes possible causes for the cluster events with regard to both the monitored cluster and the network. In many cases, however, the cause of a cluster event cannot be determined without additional information that is not available to the software.

Table 6 Monitored States and Possible Causes

Cluster Event             Cluster-related Causes                           Network-related Causes
(Old state -> New state)
------------------------  -----------------------------------------------  -------------------------------------
Up -> Unreachable         Cluster went down; no nodes are responding       Network failure
                          to network inquiries

Down -> Unreachable       Cluster was down and nodes are no longer         Network failure
                          responding

Error -> Unreachable      Error resolved but cluster down and nodes not    Network failure
                          responding

Up -> Down                Cluster has been halted, but at least one node   No network problems
                          is still responding to network inquiries

Error -> Down             Error resolved, cluster is down                  Network problem was fixed, cluster is
                                                                           down

Unreachable -> Down       Cluster nodes were rebooted but the cluster was  Network came up but the cluster was
                          not started                                      not running

Up -> Error               Serviceguard version or security file mismatch,  Network is misconfigured, or DNS
                          software error                                   server crashed or set up incorrectly

Down -> Error             Serviceguard version or security file mismatch,  Network is misconfigured, or DNS
                          software error                                   server crashed or set up incorrectly

Unreachable -> Error      Serviceguard version or security file mismatch,  Network problem was fixed, but the
                          software error                                   error condition still exists

Down -> Up                Cluster started                                  No network problems

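The transitions in Table 6 lend themselves to a lookup keyed by (old state, new state), which an operator's tooling could use to attach the candidate causes to a notification. The sketch below is only an illustration: the names `POSSIBLE_CAUSES` and `explain_event` are hypothetical, and the cause strings are copied from Table 6. Because each event lists both a cluster-related and a network-related cause, the lookup narrows the possibilities but cannot by itself decide between them.

```python
from typing import Dict, Tuple

# (old_state, new_state) -> (cluster-related cause, network-related cause),
# transcribed from Table 6.
POSSIBLE_CAUSES: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("Up", "Unreachable"): (
        "Cluster went down; no nodes are responding to network inquiries",
        "Network failure",
    ),
    ("Down", "Unreachable"): (
        "Cluster was down and nodes are no longer responding",
        "Network failure",
    ),
    ("Error", "Unreachable"): (
        "Error resolved but cluster down and nodes not responding",
        "Network failure",
    ),
    ("Up", "Down"): (
        "Cluster has been halted, but at least one node is still "
        "responding to network inquiries",
        "No network problems",
    ),
    ("Error", "Down"): (
        "Error resolved, cluster is down",
        "Network problem was fixed, cluster is down",
    ),
    ("Unreachable", "Down"): (
        "Cluster nodes were rebooted but the cluster was not started",
        "Network came up but the cluster was not running",
    ),
    ("Up", "Error"): (
        "Serviceguard version or security file mismatch, software error",
        "Network is misconfigured, or DNS server crashed or set up incorrectly",
    ),
    ("Down", "Error"): (
        "Serviceguard version or security file mismatch, software error",
        "Network is misconfigured, or DNS server crashed or set up incorrectly",
    ),
    ("Unreachable", "Error"): (
        "Serviceguard version or security file mismatch, software error",
        "Network problem was fixed, but the error condition still exists",
    ),
    ("Down", "Up"): ("Cluster started", "No network problems"),
}


def explain_event(old_state: str, new_state: str) -> str:
    """Return a one-line summary of the candidate causes for a cluster event."""
    causes = POSSIBLE_CAUSES.get((old_state, new_state))
    if causes is None:
        return f"{old_state} -> {new_state}: transition not listed in Table 6"
    cluster_cause, network_cause = causes
    return (
        f"{old_state} -> {new_state}: cluster-related: {cluster_cause}; "
        f"network-related: {network_cause}"
    )


if __name__ == "__main__":
    print(explain_event("Up", "Unreachable"))
```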