HP serviceguard t2808-90006 manual How is the cluster maintained?

Disaster Tolerance and Recovery in a Serviceguard Cluster

Managing a Disaster Tolerant Environment

Even if recovery is automated, you may choose to, or need to, recover from some types of disasters manually. A rolling disaster, which is a disaster that happens before the cluster has recovered from a previous disaster, is an example of when you may want to switch over manually. If the data link failed and then, while it was coming back up and resynchronizing data, a data center failed, you would want human intervention to make judgment calls on which site had the most current and consistent data before failing over.

• Who manages the nodes in the cluster and how are they trained?

Putting a disaster tolerant architecture in place without planning for the people aspects is a waste of money. Training and documentation are more complex because the cluster is in multiple data centers.

Each data center often has its own operations staff with their own processes and ways of working. These operations people will now be required to communicate with each other, coordinate maintenance and failover rehearsals, and work together to recover from an actual disaster. If the remote nodes are placed in a “lights-out” data center, the operations staff may want to put additional processes or monitoring software in place to maintain the nodes in the remote location.

Rehearsing failover scenarios is important for staying prepared. A written plan should describe what to do in each disaster case and set a rehearsal schedule: at minimum once every 6 months, ideally once every 3 months.

• How is the cluster maintained?

Planned downtime and maintenance, such as backups or upgrades, must be more carefully thought out because they may leave the cluster vulnerable to another failure. For example, nodes need to be brought down for maintenance in pairs, one node at each site, so that the sites stay balanced and quorum calculations do not prevent automated recovery if a disaster occurs during planned maintenance.
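The effect of paired maintenance on quorum can be illustrated with a small Python sketch. It assumes a simplified quorum model that this page does not spell out: a surviving group re-forms the cluster automatically if it holds more than half of the nodes that were running before the failure, needs a tie-breaker (such as a quorum service) if it holds exactly half, and cannot re-form automatically if it holds less than half. The names `Outcome`, `quorum_outcome`, and `after_site_loss` are hypothetical and are not part of Serviceguard.

```python
from enum import Enum


class Outcome(Enum):
    MAJORITY = "re-forms automatically (more than half survive)"
    TIE = "needs a tie-breaker (exactly half survive)"
    NO_QUORUM = "cannot re-form automatically (less than half survive)"


def quorum_outcome(survivors: int, running_before: int) -> Outcome:
    """Classify a surviving group under the simplified quorum model (assumption)."""
    if 2 * survivors > running_before:
        return Outcome.MAJORITY
    if 2 * survivors == running_before:
        return Outcome.TIE
    return Outcome.NO_QUORUM


def after_site_loss(running_a: int, running_b: int, failed_site: str) -> Outcome:
    """Outcome for the surviving site when site 'A' or 'B' is lost entirely.

    running_a and running_b are the nodes still running at each site,
    that is, after any nodes have been taken down for maintenance.
    """
    survivors = running_b if failed_site == "A" else running_a
    return quorum_outcome(survivors, running_a + running_b)


if __name__ == "__main__":
    # Hypothetical four-node cluster, two nodes per site.
    # Unpaired maintenance: one node down at site A only, then site B fails.
    print("unpaired:", after_site_loss(1, 2, "B").value)
    # Paired maintenance: one node down at each site, then site B fails.
    print("paired:  ", after_site_loss(1, 1, "B").value)
```

In the unpaired case the single survivor at site A holds one of three running nodes and cannot re-form the cluster on its own. In the paired case it holds exactly half, so a tie-breaker can still let automated recovery proceed.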

Rapid detection of failures and rapid repair of hardware are essential so that the cluster is not left vulnerable to additional failures.

Testing is more complex and requires personnel in each of the data centers. Site failure testing should be added to the current cluster testing plans.
