Table 5 Pros and Cons of Volume Managers with Serviceguard (continued)

Product Advantages

Supports shared activation.

Supports exclusive activation.

Supports activation in different modes on different nodes at the same time

RAID 1+0 striped mirrors

RAID 0+1 mirrored stripes

CVM versions 4.1 and later support the Veritas Cluster File System (CFS)

Tradeoffs

Requires purchase of additional license

No support for RAID 5

CVM requires all nodes to have connectivity to the shared disk groups

Not currently supported on all versions of HP-UX

Responses to Failures

Serviceguard responds to different kinds of failures in specific ways. For most hardware failures, the response is not user-configurable, but for package and service failures, you can choose the system’s response, within limits.

System Reset When a Node Fails

The most dramatic response to a failure in a Serviceguard cluster is an HP-UX TOC or INIT, which is a system reset without a graceful shutdown (normally referred to in this manual simply as a system reset). This allows packages to move quickly to another node, protecting the integrity of the data.

A system reset occurs if a cluster node cannot communicate with the majority of cluster members for the predetermined time, or under other circumstances such as a kernel hang or failure of the cluster daemon (cmcld).

The node-timeout case is covered in more detail under “What Happens when a Node Times Out” (page 88). See also “Cluster Daemon: cmcld” (page 41).

A system reset is also initiated by Serviceguard itself under specific circumstances; see “Responses to Package and Service Failures” (page 90).

What Happens when a Node Times Out

Each node sends a heartbeat message to all other nodes at an interval equal to one-fourth of the value of the configured MEMBER_TIMEOUT or 1 second, whichever is less. You configure MEMBER_TIMEOUT in the cluster configuration file (see “Cluster Configuration Parameters” (page 109)); the heartbeat interval is not directly configurable. If a node fails to send a heartbeat message within the time set by MEMBER_TIMEOUT, the cluster re-forms without the node that is no longer sending heartbeat messages.
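The interval rule above can be written out as a short calculation. The following Python sketch is illustrative only: the helper name `heartbeat_interval_us` and the sample timeout values are assumptions made for this example, not part of Serviceguard, and it assumes MEMBER_TIMEOUT is expressed in microseconds.

```python
ONE_SECOND_US = 1_000_000  # one second, in microseconds


def heartbeat_interval_us(member_timeout_us: int) -> int:
    """Return the heartbeat interval implied by MEMBER_TIMEOUT.

    Follows the rule in the text: one-fourth of MEMBER_TIMEOUT or
    1 second, whichever is less. Hypothetical helper for illustration.
    """
    if member_timeout_us <= 0:
        raise ValueError("MEMBER_TIMEOUT must be positive")
    return min(member_timeout_us // 4, ONE_SECOND_US)


if __name__ == "__main__":
    # Sample values chosen for illustration only.
    for timeout_us in (2_000_000, 14_000_000):
        interval = heartbeat_interval_us(timeout_us)
        print(f"MEMBER_TIMEOUT={timeout_us} us -> heartbeat every {interval} us")
```

Under this rule, a MEMBER_TIMEOUT of 2 seconds yields a heartbeat every half second, while any MEMBER_TIMEOUT of 4 seconds or more is capped at a 1-second interval.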

When a node detects that another node has failed (that is, no heartbeat message has arrived within MEMBER_TIMEOUT microseconds), the following sequence of events occurs:

1. The node contacts the other nodes and tries to re-form the cluster without the failed node.

2. If the remaining nodes are a majority or can obtain the cluster lock, they form a new cluster without the failed node.

3. If the remaining nodes are not a majority or cannot get the cluster lock, they halt (system reset).
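The decision in steps 2 and 3 can be sketched as a small function. This is a simplified model rather than Serviceguard's implementation: the function name `reformation_outcome` and its parameters are hypothetical, "majority" is assumed to mean strictly more than half of the previous cluster membership, and the result of the cluster lock attempt is passed in as a plain boolean.

```python
def reformation_outcome(total_nodes: int, surviving_nodes: int,
                        lock_obtained: bool) -> str:
    """Model what the surviving nodes do after a node times out.

    total_nodes:     size of the cluster before the failure (assumed input)
    surviving_nodes: nodes still exchanging heartbeats with each other
    lock_obtained:   whether the survivors obtained the cluster lock
    """
    if not 0 < surviving_nodes <= total_nodes:
        raise ValueError("surviving_nodes must be between 1 and total_nodes")

    # Assumption: a majority is strictly more than half the old membership.
    has_majority = 2 * surviving_nodes > total_nodes

    if has_majority or lock_obtained:
        return "re-form"  # step 2: new cluster without the failed node
    return "halt"         # step 3: system reset


if __name__ == "__main__":
    # Two-node cluster, one node lost: the lock decides the outcome.
    print(reformation_outcome(2, 1, lock_obtained=True))
    print(reformation_outcome(2, 1, lock_obtained=False))
```

In a two-node cluster, a single surviving node is exactly half of the previous membership and therefore not a majority, so in this model the outcome depends entirely on whether it obtains the cluster lock.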

Example

Situation. Assume a two-node cluster, with Package1 running on SystemA and Package2 running on SystemB. Volume group vg01 is exclusively activated on SystemA; volume group vg02 is exclusively activated on SystemB. Package IP addresses are assigned to SystemA and SystemB respectively.

88 Understanding Serviceguard Software Components
