HP Serviceguard Metrocluster 42

•Notification that a cluster came down for any reason.

•Notification that a cluster has been in an unreachable state for a short period of time. An alert is sent in this case as a warning that an alarm might be issued later if the cluster’s state remains unreachable for a longer time.

The expected process in dealing with alerts is to continue watching for additional notifications and to contact individuals at the site of the monitored cluster to see whether problems exist.

Alarms

Alarms are intended to indicate that a cluster failure might have taken place. The most common example of an alarm is the following:

•Notification that a specified cluster has been in an unreachable state for a significant amount of time.

The expected process in dealing with cluster events that persist at the alarm level is to obtain as much information as possible, including authorization to recover, if your business practices require this. At that point, issue the Continentalclusters recovery command, cmrecovercl.

Creating Notifications for Failure Events

For events that indicate potential cluster failure, show the escalating concern about cluster health by defining one or more alerts followed by one or more alarms. The following is a typical sequence:

•cluster alert at 5 minutes

•cluster alert at 10 minutes

•cluster alarm at 15 minutes

This could be accomplished by entering two CLUSTER_ALERT lines in the configuration file, and one CLUSTER_ALARM line. A detailed example is provided in the comments in the ASCII configuration file template, shown in “Editing Section 3—Monitoring Definitions” (page 82).
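
The escalation sequence above can be pictured with a short Python sketch. It is only an illustration of the threshold logic, not the Continentalclusters monitor or its configuration syntax. The names Notification, UNREACHABLE_SCHEDULE, due_notifications, and highest_level are hypothetical and do not appear in the product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Notification:
    """One notification threshold (hypothetical model, not product syntax)."""
    level: str          # "ALERT" or "ALARM"
    after_minutes: int  # minutes the cluster must remain in the state


# Assumed schedule mirroring the typical sequence: two alerts, then one alarm.
UNREACHABLE_SCHEDULE = [
    Notification("ALERT", 5),
    Notification("ALERT", 10),
    Notification("ALARM", 15),
]


def due_notifications(schedule: List[Notification],
                      minutes_in_state: int) -> List[Notification]:
    """Return every notification whose threshold has been reached."""
    return [n for n in schedule if minutes_in_state >= n.after_minutes]


def highest_level(schedule: List[Notification],
                  minutes_in_state: int) -> Optional[str]:
    """Return the most severe level reached so far, or None if nothing is due."""
    due = due_notifications(schedule, minutes_in_state)
    if not due:
        return None
    return "ALARM" if any(n.level == "ALARM" for n in due) else "ALERT"


if __name__ == "__main__":
    for minutes in (4, 5, 12, 15):
        print(minutes, highest_level(UNREACHABLE_SCHEDULE, minutes))
```

In this model, a threshold of zero minutes is due as soon as the state is detected, which corresponds to the immediate notification used for return-of-service events.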

Creating Notifications for Events that Indicate a Return of Service

For those events that indicate that the cluster is back online or that communication with the monitor has been restored, use cluster alerts to show the de-escalation of concern. In this case, use a CLUSTER_ALERT line in the configuration file with a time of zero (0), so that notifications are sent as soon as the return to service is detected.

Maintenance Mode for Recovery Groups

Placing a recovery group in maintenance mode exempts it from recovery. This means that its recovery package cannot be started on the recovery cluster. By default, no recovery group in the Continentalclusters configuration is in maintenance mode. To move a recovery group in Continentalclusters into maintenance mode, you must disable it. To move a recovery group out of maintenance mode, you must enable it. You can complete rehearsal operations on a recovery group only when the recovery group is in maintenance mode. For more information on rehearsal operations, see “Performing a Rehearsal Operation in your Environment” (page 103).

Use the cmrecovercl -d -g command to move a recovery group into maintenance mode. To move the recovery group out of maintenance mode, use the cmrecovercl -e -g command.
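
The rules for maintenance mode can be summarized as a small state model. The following Python sketch is a conceptual illustration only. It does not invoke cmrecovercl or read any cluster state, and the class RecoveryGroup, its methods, and the group name "sales_group" are hypothetical.

```python
class RecoveryGroup:
    """Hypothetical model of a recovery group's maintenance-mode state."""

    def __init__(self, name: str) -> None:
        self.name = name
        # By default, a recovery group is enabled (not in maintenance mode).
        self.enabled = True

    @property
    def in_maintenance_mode(self) -> bool:
        # A disabled recovery group is in maintenance mode.
        return not self.enabled

    def disable(self) -> None:
        """Move the group into maintenance mode."""
        self.enabled = False

    def enable(self) -> None:
        """Move the group out of maintenance mode."""
        self.enabled = True

    def can_recover(self) -> bool:
        # Groups in maintenance mode are exempt from recovery.
        return not self.in_maintenance_mode

    def can_rehearse(self) -> bool:
        # Rehearsal is allowed only while in maintenance mode.
        return self.in_maintenance_mode


if __name__ == "__main__":
    group = RecoveryGroup("sales_group")  # assumed name for illustration
    group.disable()
    print(group.in_maintenance_mode, group.can_recover(), group.can_rehearse())
```

The model makes the trade-off explicit: while a group is disabled for rehearsal, it is not protected by recovery, so it should be enabled again once the rehearsal is complete.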
