Deadman Switch | IBM SG24-5131-00 specs

hang. After a certain amount of time, by default 360 seconds, the cluster manager will issue a config_too_long message into the /tmp/hacmp.out file.

The message issued looks like this:

The cluster has been in reconfiguration too long;Something may be wrong.

In most cases, this is because an event script has failed. You can find out more by analyzing the /tmp/hacmp.out file. The error messages in the /var/adm/cluster.log file may also be helpful. Once you have fixed the problem identified in the log file, run the clruncmd command from the command line or use the SMIT Cluster Recovery Aids screen. The clruncmd command signals the Cluster Manager to resume cluster processing.

Note, however, that a script sometimes simply takes a long time, so the message is not always an error; it can be only a warning. Its appearance does not necessarily mean that the script failed or will never finish. A script running for more than 360 seconds can still be working on something and eventually get the job done. It is therefore essential to look at the /tmp/hacmp.out file to find out what is actually happening.
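
As a starting point for that analysis, the following is a minimal sketch of a helper that lists where the message shows up in the log. The function name find_config_too_long is hypothetical, and the two search patterns are an assumption about how the message appears in /tmp/hacmp.out; adjust them to what your log actually contains.

```shell
#!/bin/sh
# Hypothetical helper: list config_too_long occurrences in an HACMP log.
# The default path is the one named in the text; the match patterns are
# an assumption about how the message is written to the log.
find_config_too_long() {
    log_file="${1:-/tmp/hacmp.out}"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    # -n prints line numbers so the event script output around each hit
    # can be inspected in the log afterwards.
    grep -n -e 'config_too_long' -e 'reconfiguration too long' "$log_file"
}
```

Calling find_config_too_long with no argument scans /tmp/hacmp.out. A return status of 1 means no match was found, and 2 means the file could not be read. The line numbers it prints indicate where to look in the log for the event script that was running at the time.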

7.3 Deadman Switch

The term “deadman switch” describes the AIX kernel extension that causes a system panic and dump under certain cluster conditions if it is not reset. The deadman switch halts a node when it enters a hung state that extends beyond a certain time limit. This enables another node in the cluster to acquire the hung node’s resources in an orderly fashion, avoiding possible contention problems.
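
The reset-or-halt behavior can be pictured with a small conceptual model. This sketch is not the AIX kernel extension and does not reflect its real interface; the function name dms_state and the timestamps and timeout passed to it are hypothetical, chosen only to illustrate the idea of a timer that must be reset before it expires.

```shell
#!/bin/sh
# Conceptual model only; this is not how the kernel extension is invoked.
# dms_state LAST_RESET NOW TIMEOUT
#   LAST_RESET  time (in seconds) the timer was last reset
#   NOW         current time (in seconds)
#   TIMEOUT     allowed gap between resets (hypothetical value)
# Prints "halt" if the timer was not reset in time, otherwise "ok".
dms_state() {
    last_reset=$1
    now=$2
    timeout=$3
    elapsed=$((now - last_reset))
    if [ "$elapsed" -gt "$timeout" ]; then
        echo "halt"
    else
        echo "ok"
    fi
}
```

For example, dms_state 100 105 10 prints ok because the timer was reset 5 seconds ago, while dms_state 100 120 10 prints halt because 20 seconds have passed without a reset. In the model, anything that keeps the resetting process from running long enough produces the halt outcome.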

If the deadman switch is halting a node and it isn’t obvious why the Cluster Manager was kept from resetting the timer (for example, because some application ran at a higher priority than the clstrmgr process), make performance-related customizations in the following order:

1. Tune the system using I/O pacing.

2. Increase the syncd frequency.

3. If needed, increase the amount of memory available for the communications subsystem.

4. Change the Failure Detection Rate.

Each of these options is described in the following sections.
