Disaster Tolerance and Recovery in a Serviceguard Cluster

Understanding Types of Disaster Tolerant Clusters

Disk resynchronization is independent of CPU failure (that is, if the hosts at the primary site fail but the disk remains up, the disk knows it does not have to be resynchronized).

Differences Between Extended Distance Cluster and CLX

The major differences between an Extended Distance Cluster and a CLX cluster are:

The methods used to replicate data between the storage devices in the two data centers. The two basic methods available for replicating data between the data centers for Linux clusters are host-based replication and storage array-based replication. Extended Distance Cluster always uses host-based replication (MD mirroring on Linux). Any mix of Serviceguard-supported Fibre Channel storage can be used in an Extended Distance Cluster. CLX always uses array-based replication (mirroring) and requires storage from the same vendor in both data centers (that is, a pair of XPs with Continuous Access, or a pair of EVAs with Continuous Access).

Data centers in an Extended Distance Cluster can span up to 100 km, whereas the distance between data centers in a Metrocluster is defined by the shortest of the following distances:

Maximum distance that guarantees a network latency of no more than 200 ms

Maximum distance supported by the data replication link

Maximum supported distance for DWDM as stated by the provider
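The "shortest of the following distances" rule can be illustrated with a minimal sketch. The function name and the kilometer figures in the usage note are hypothetical and chosen only for illustration; actual limits come from the network, the replication link, and the DWDM provider.

```python
def governing_distance_km(latency_limit_km, replication_link_km, dwdm_provider_km):
    """Return (distance, limiting_factor) for the shortest of the three limits.

    All three arguments are assumed inputs, in kilometers:
      latency_limit_km    - farthest separation that keeps network latency <= 200 ms
      replication_link_km - farthest separation the data replication link supports
      dwdm_provider_km    - farthest separation the DWDM provider states as supported
    """
    limits = {
        "network latency": latency_limit_km,
        "data replication link": replication_link_km,
        "DWDM provider": dwdm_provider_km,
    }
    # The permitted separation is bounded by whichever limit is smallest.
    limiting_factor = min(limits, key=limits.get)
    return limits[limiting_factor], limiting_factor
```

For example, with hypothetical limits of 300 km (latency), 120 km (replication link), and 200 km (DWDM), the replication link is the governing constraint and the sites may be at most 120 km apart.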

In an Extended Distance Cluster, there is no built-in mechanism for determining the state of the data being replicated. When an application fails over from one data center to another, the package is allowed to start up if the volume group(s) can be activated. A CLX implementation provides a higher degree of data integrity; that is, the application is only allowed to start up based on the state of the data and the disk arrays.
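The difference in start-up gating can be sketched as two small predicates. This is an illustrative model only, not Serviceguard or CLX code: the `SiteState` class and its field names are hypothetical, and the assumption that CLX also needs the volume groups to activate is ours rather than something stated above.

```python
from dataclasses import dataclass


@dataclass
class SiteState:
    """Hypothetical snapshot of the failover target site."""
    volume_groups_activate: bool  # can the volume group(s) be activated?
    data_current: bool            # assumed flag: replicated data is up to date
    arrays_usable: bool           # assumed flag: disk arrays are in a usable state


def edc_allows_start(site: SiteState) -> bool:
    # Extended Distance Cluster: no built-in check of replication state;
    # the package may start whenever the volume group(s) activate.
    return site.volume_groups_activate


def clx_allows_start(site: SiteState) -> bool:
    # CLX: start-up additionally depends on the state of the data and
    # of the disk arrays (modeled here as two boolean flags).
    return (site.volume_groups_activate
            and site.data_current
            and site.arrays_usable)
```

In this model, a target site whose volume groups activate but whose replica is stale would pass `edc_allows_start` and fail `clx_allows_start`, which is the data-integrity distinction the paragraph describes.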

It is possible for data to be updated on the disk system local to a server running a package without the remote data being updated. This happens if the data link between sites is lost, usually as a precursor to a site going down. If that occurs and the site with the latest data then goes down, that data is lost. The period of time between the loss of the link and the site going down is called the "recovery point". An
