hang. After a set interval, 360 seconds by default, the cluster manager writes a config_too_long message to the /tmp/hacmp.out file.
The message issued looks like this:
The cluster has been in reconfiguration too long;Something may be wrong.
In most cases, this is because an event script has failed. You can find out more by analyzing the /tmp/hacmp.out file. The error messages in the /var/adm/cluster.log file may also be helpful. Once you have fixed the problem identified in the log file, run the clruncmd command, either from the command line or through the SMIT Cluster Recovery Aids screen. The clruncmd command signals the Cluster Manager to resume cluster processing.
Note, however, that some scripts simply take a long time to run, so the message is not always an error; sometimes it is only a warning. Its appearance does not necessarily mean that a script failed or never finished. A script that has been running for more than 360 seconds may still be working and may eventually get the job done. It is therefore essential to look at the /tmp/hacmp.out file to find out what is actually happening.
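As a rough aid for that inspection, the following Python sketch lists the lines of a log that mention the message. It assumes only that the string config_too_long, or the wording of the message quoted earlier, appears literally on the relevant lines. The names find_config_too_long and scan_file are hypothetical, and the real layout of /tmp/hacmp.out may differ from what this sketch expects.

```python
from typing import Iterable, List, Tuple

# Markers taken from the text: the message name and part of its wording.
# How they actually appear in /tmp/hacmp.out is an assumption.
MARKERS = ("config_too_long", "in reconfiguration too long")


def find_config_too_long(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path: str = "/tmp/hacmp.out") -> List[Tuple[int, str]]:
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, "r", errors="replace") as handle:
        return find_config_too_long(handle)
```

The line numbers returned by scan_file() only show where to start reading. Whether the event script failed or is merely slow still has to be judged from the surrounding entries in the log.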
7.3 Deadman Switch
The term “deadman switch” describes the AIX kernel extension that causes a system panic and dump under certain cluster conditions if it is not reset. The deadman switch halts a node when it enters a hung state that extends beyond a certain time limit. This enables another node in the cluster to acquire the hung node’s resources in an orderly fashion, avoiding possible contention problems.
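The reset-or-expire idea behind the deadman switch can be illustrated with a toy model. The Python sketch below is purely conceptual: it is not the AIX kernel extension, the class name DeadmanTimer is hypothetical, and the timeout used in the usage note is an arbitrary illustration value rather than a real HACMP setting. Time is passed in explicitly so that the behavior is easy to follow.

```python
class DeadmanTimer:
    """Toy model of a timer that must be reset before it expires."""

    def __init__(self, timeout_seconds: float, now: float = 0.0) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_reset = now

    def reset(self, now: float) -> None:
        """Record that the monitored process was still alive at time `now`."""
        self.last_reset = now

    def expired(self, now: float) -> bool:
        """True once more than `timeout_seconds` have passed since the last reset."""
        return now - self.last_reset > self.timeout_seconds
```

With DeadmanTimer(timeout_seconds=10.0) and a reset at time 8.0, expired(15.0) is False and expired(19.0) is True. In the real system, the expiry case corresponds to the node being halted so that another node can take over its resources.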
If the deadman switch is halting a node, and it isn’t obvious why the cluster manager was kept from resetting the timer (for example, because some application ran at a higher priority than the clstrmgr process), customizations related to performance problems should be performed in the following order:
1. Tune the system using I/O pacing.
2. Increase the syncd frequency.
3. If needed, increase the amount of memory available for the communications subsystem.
4. Change the Failure Detection Rate.
Each of these options is described in the following sections.