Monitoring over a Network
A monitor package running on one cluster tracks the health of another cluster in the recovery pair and sends notifications to configured destinations if the state of the monitored cluster changes. (If a cluster contains any packages to be recovered, it must be monitored.) The monitor software polls the monitored cluster at the interval set by MONITOR_INTERVAL in an ASCII configuration file, which also specifies when and where to send messages if the state changes.
Because the clusters are physically separated, they must communicate by way of a Local or Wide Area Network (LAN or WAN). Since the polling takes place across the network, interruptions of network service cannot always be distinguished from cluster failure states. If the network is unreliable, the monitoring facility will therefore often report an Unreachable state for the monitored cluster when the actual problem is an interruption of network service.
Because the monitoring is indeterminate in some instances, information from independent sources must be gathered to determine whether to proceed with recovery. For these reasons, cluster recovery is not automatic, but must be initiated by a root user. Once initiated, however, the cluster recovery is automated to reduce the chance of human error that might occur if manual steps were needed. Continentalclusters provides a system of cluster events and notifications so that events can be easily tracked and users know when to seek additional information before initiating recovery.
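As a rough illustration of the polling behavior described above, the following Python sketch polls a cluster at a fixed interval and records an event only when the reported state changes. It is not the Continentalclusters monitor. The `watch` function, the `probe` callable, and the scripted samples are hypothetical names introduced for this example; only MONITOR_INTERVAL and the state names (Up, Down, Unreachable, Error) come from this guide.

```python
import time
from typing import Callable, List, Optional, Tuple

# State names as reported by the monitor, per this guide.
STATES = ("Up", "Down", "Unreachable", "Error")


def watch(
    probe: Callable[[], str],
    monitor_interval: float,
    polls: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, str]]:
    """Poll `probe` `polls` times and return (old_state, new_state) changes.

    `probe` is a hypothetical stand-in for whatever check the monitor makes
    across the network. A probe cannot tell a network outage from a cluster
    failure, so "Unreachable" is recorded as an event, never acted on.
    """
    events: List[Tuple[str, str]] = []
    previous: Optional[str] = None
    for i in range(polls):
        current = probe()
        if current not in STATES:
            raise ValueError(f"unknown cluster state: {current!r}")
        # Only a change of state counts as a cluster event.
        if previous is not None and current != previous:
            events.append((previous, current))
        previous = current
        if i < polls - 1:
            sleep(monitor_interval)
    return events


# Scripted readings replace a real network probe; the no-op sleep keeps
# the example from actually waiting for the interval.
samples = iter(["Up", "Up", "Unreachable", "Unreachable", "Down"])
events = watch(lambda: next(samples), monitor_interval=60, polls=5,
               sleep=lambda seconds: None)
print(events)
```

The sketch deliberately stops at recording events. In keeping with the text, deciding whether an Unreachable event reflects a failed cluster or a failed network is left to an administrator with information from independent sources.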
Cluster Events
A cluster event is a change of state in a monitored cluster. The four cluster states reported by the monitor are Unreachable, Down, Up, and Error. Table 6 summarizes possible causes for the cluster events with regard to both the monitored cluster and the network. In many cases, however, the cause of a cluster event cannot be determined without additional information that is not available to the software.
Table 6 Monitored States and Possible Causes
| Cluster Event: Old state | Cluster Event: New state | Monitored cluster | Network |
| --- | --- | --- | --- |
| Up | Unreachable | Cluster went down; no nodes are responding to network inquiries | Network failure |
| Down | Unreachable | Cluster was down and nodes are no longer responding | Network failure |
| Error | Unreachable | Error resolved but cluster down and nodes not responding | Network failure |
| Up | Down | Cluster has been halted, but at least one node is still responding to network inquiries | No network problems |
| Error | Down | Error resolved, cluster is down | Network problem was fixed, cluster is down |
| Unreachable | Down | Cluster nodes were rebooted but the cluster was not started | Network came up but the cluster was not running |
| Up | Error | Serviceguard version or security file mismatch, software error | Network is misconfigured, or DNS server crashed or set up incorrectly |
| Down | Error | Serviceguard version or security file mismatch, software error | Network is misconfigured, or DNS server crashed or set up incorrectly |
| Unreachable | Error | Serviceguard version or security file mismatch, software error | Network problem was fixed, but the error condition still exists |
| Down | Up | Cluster started | No network problems |
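The rows of Table 6 that appear in this section can be treated as a lookup keyed on the (old state, new state) pair. The Python sketch below is one hypothetical way to encode them so that a notification can list both candidate explanations side by side. The names `Causes`, `POSSIBLE_CAUSES`, and `possible_causes` are invented for this example. The new state in each pair is an assumption inferred from the listed causes, not a value read from the product.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Causes(NamedTuple):
    cluster: str   # possible cause in the monitored cluster
    network: str   # possible cause in the network


MISMATCH = "Serviceguard version or security file mismatch, software error"
DNS = "Network is misconfigured, or DNS server crashed or set up incorrectly"

# Keys are (old_state, new_state); new states are inferred, see lead-in.
POSSIBLE_CAUSES: Dict[Tuple[str, str], Causes] = {
    ("Up", "Unreachable"): Causes(
        "Cluster went down; no nodes are responding to network inquiries",
        "Network failure"),
    ("Down", "Unreachable"): Causes(
        "Cluster was down and nodes are no longer responding",
        "Network failure"),
    ("Error", "Unreachable"): Causes(
        "Error resolved but cluster down and nodes not responding",
        "Network failure"),
    ("Up", "Down"): Causes(
        "Cluster has been halted, but at least one node is still "
        "responding to network inquiries",
        "No network problems"),
    ("Error", "Down"): Causes(
        "Error resolved, cluster is down",
        "Network problem was fixed, cluster is down"),
    ("Unreachable", "Down"): Causes(
        "Cluster nodes were rebooted but the cluster was not started",
        "Network came up but the cluster was not running"),
    ("Up", "Error"): Causes(MISMATCH, DNS),
    ("Down", "Error"): Causes(MISMATCH, DNS),
    ("Unreachable", "Error"): Causes(
        MISMATCH,
        "Network problem was fixed, but the error condition still exists"),
    ("Down", "Up"): Causes("Cluster started", "No network problems"),
}


def possible_causes(old: str, new: str) -> Optional[Causes]:
    """Return both candidate explanations, or None for an unlisted pair."""
    return POSSIBLE_CAUSES.get((old, new))


causes = possible_causes("Up", "Unreachable")
if causes is not None:
    print(f"Cluster: {causes.cluster}")
    print(f"Network: {causes.network}")
```

Returning both columns rather than picking one reflects the point made in the text: the software cannot tell which explanation applies, so the choice is left to the person reading the notification.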