Responses to Failures, Example | HP Serviceguard specs

Table 5 Pros and Cons of Volume Managers with Serviceguard (continued)

Product Advantages

• Supports shared activation.

• Supports exclusive activation.

• Supports activation in different modes on different nodes at the same time.

• RAID 1+0 striped mirrors

• RAID 0+1 mirrored stripes

• CVM versions 4.1 and later support the Veritas Cluster File System (CFS)

Tradeoffs

• Requires purchase of additional license

• No support for RAID 5

• CVM requires all nodes to have connectivity to the shared disk groups

• Not currently supported on all versions of HP-UX

Responses to Failures

Serviceguard responds to different kinds of failures in specific ways. For most hardware failures, the response is not user-configurable, but for package and service failures, you can choose the system’s response, within limits.

System Reset When a Node Fails

The most dramatic response to a failure in a Serviceguard cluster is an HP-UX TOC or INIT, which is a system reset without a graceful shutdown (normally referred to in this manual simply as a system reset). This allows packages to move quickly to another node, protecting the integrity of the data.

A system reset occurs if a cluster node cannot communicate with the majority of cluster members for the predetermined time, or under other circumstances such as a kernel hang or failure of the cluster daemon (cmcld).

The first case, loss of communication with the other cluster members, is covered in more detail under “What Happens when a Node Times Out” (page 88). For the cluster daemon, see “Cluster Daemon: cmcld” (page 41).

A system reset is also initiated by Serviceguard itself under specific circumstances; see “Responses to Package and Service Failures” (page 90).

What Happens when a Node Times Out

Each node sends a heartbeat message to all other nodes at an interval equal to one-fourth of the value of the configured MEMBER_TIMEOUT or 1 second, whichever is less. You configure MEMBER_TIMEOUT in the cluster configuration file (see “Cluster Configuration Parameters” (page 109)); the heartbeat interval is not directly configurable. If a node fails to send a heartbeat message within the time set by MEMBER_TIMEOUT, the cluster is re-formed without the node that is no longer sending heartbeat messages.
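The heartbeat interval rule can be illustrated with a short calculation. The following Python sketch is not part of Serviceguard; the function name heartbeat_interval_us is hypothetical, and the sample MEMBER_TIMEOUT values are chosen only for illustration. It assumes MEMBER_TIMEOUT is expressed in microseconds.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def heartbeat_interval_us(member_timeout_us: int) -> int:
    """Return the heartbeat interval in microseconds.

    Hypothetical helper: applies the rule "one-fourth of MEMBER_TIMEOUT
    or 1 second, whichever is less". Integer division is an assumption
    made to keep the result in whole microseconds.
    """
    if member_timeout_us <= 0:
        raise ValueError("MEMBER_TIMEOUT must be a positive number of microseconds")
    return min(member_timeout_us // 4, MICROSECONDS_PER_SECOND)


if __name__ == "__main__":
    # Illustrative values only, not recommendations.
    for timeout_us in (2_000_000, 14_000_000):
        interval_us = heartbeat_interval_us(timeout_us)
        print(f"MEMBER_TIMEOUT={timeout_us} us -> heartbeat every {interval_us} us")
```

With a MEMBER_TIMEOUT of 2 seconds, one-fourth of the value (0.5 second) is the smaller quantity, so it becomes the interval. With any MEMBER_TIMEOUT above 4 seconds, the 1-second cap applies.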

When a node detects that another node has failed (that is, no heartbeat message has arrived within MEMBER_TIMEOUT microseconds), the following sequence of events occurs:

1. The node contacts the other nodes and tries to re-form the cluster without the failed node.

2. If the remaining nodes are a majority or can obtain the cluster lock, they form a new cluster without the failed node.

3. If the remaining nodes are not a majority or cannot get the cluster lock, they halt (system reset).
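The decision in steps 2 and 3 can be sketched as a small function. This is a simplified model that follows the wording of the steps above, not Serviceguard's actual implementation. The names Outcome and reformation_outcome are hypothetical, "majority" is assumed to mean more than half of the previous cluster membership, and the result of the attempt to obtain the cluster lock is passed in as a plain boolean.

```python
from enum import Enum


class Outcome(Enum):
    FORM_NEW_CLUSTER = "form a new cluster without the failed node"
    SYSTEM_RESET = "halt (system reset)"


def reformation_outcome(
    surviving_nodes: int,
    previous_cluster_size: int,
    obtained_cluster_lock: bool,
) -> Outcome:
    """Model steps 2 and 3 of the re-formation sequence.

    Assumption: a majority is strictly more than half of the nodes
    that were members before the failure.
    """
    if not 0 < surviving_nodes <= previous_cluster_size:
        raise ValueError("surviving_nodes must be between 1 and previous_cluster_size")
    is_majority = surviving_nodes * 2 > previous_cluster_size
    if is_majority or obtained_cluster_lock:
        return Outcome.FORM_NEW_CLUSTER
    return Outcome.SYSTEM_RESET


if __name__ == "__main__":
    # Two of three nodes survive: a majority, so no lock is needed.
    print(reformation_outcome(2, 3, obtained_cluster_lock=False).value)
    # One of two nodes survives: not a majority, so the lock decides.
    print(reformation_outcome(1, 2, obtained_cluster_lock=True).value)
    print(reformation_outcome(1, 2, obtained_cluster_lock=False).value)
```

In a two-node cluster, a single surviving node is never a majority under this assumption, so whether it keeps running depends entirely on the cluster lock.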

Example

Situation. Assume a two-node cluster, with Package1 running on SystemA and Package2 running on SystemB. Volume group vg01 is exclusively activated on SystemA; volume group vg02 is exclusively activated on SystemB. Package IP addresses are assigned to SystemA and SystemB, respectively.
