HP UX 11i Workload Management (gWLM/WLM) Software How the automatic restart works, Related events


2. (Optional) Edit the property com.hp.gwlm.node.HA.minimumTimeout in the file /etc/opt/gwlm/conf/gwlmagent.properties to set the minimum number of seconds that must pass before a managed node considers itself separated from its SRD. Set this property to ensure that minor network problems do not cause a managed node to prematurely consider itself separated.

gWLM uses this value only if it is larger than 10 multiplied by gWLM’s allocation interval. For example, with an allocation interval of 15 seconds, a node can go 2.5 minutes without communicating with its SRD before the node’s gWLM agent attempts to re-connect with the SRD.
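The rule above can be written as a short calculation. This is a minimal sketch in Python, not gWLM code: the function name and parameter names are hypothetical, and it assumes the effective wait is simply the larger of the configured property and 10 times the allocation interval, as the paragraph describes.

```python
def effective_separation_timeout(minimum_timeout_s, allocation_interval_s):
    """Return the seconds a node waits before considering itself separated.

    Hypothetical helper: models the documented rule that
    com.hp.gwlm.node.HA.minimumTimeout is used only when it exceeds
    10 multiplied by the allocation interval.
    """
    interval_based_timeout = 10 * allocation_interval_s
    return max(minimum_timeout_s, interval_based_timeout)


# With a 15-second allocation interval, a 60-second property value is
# ignored in favor of 10 * 15 = 150 seconds (2.5 minutes).
print(effective_separation_timeout(60, 15))   # 150
# A 300-second property value is larger, so it takes effect.
print(effective_separation_timeout(300, 15))  # 300
```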

This feature works best when one managed node is lost at a time or all managed nodes are lost.

NOTE: If a vPar is borrowing cores from other vPars when it loses contact with its SRD, those borrowed cores might be separated from the SRD. If the vPar might be down for an extended time, check that the SRD has reformed without that vPar and that it has enough cores to meet its commitments. If not, try using vparmodify to reclaim some of the cores. (With the vPar down, you will not be able to modify it locally, and only some versions of HP-UX Virtual Partitions allow you to easily modify a remote vPar.)

Similarly, if an nPar has several active cores (due to Instant Capacity) when it loses contact with its SRD, you might have to manually size the nPar to reclaim those cores for nPars still in the SRD. For more information, see the Instant Capacity documentation.

How the automatic restart works

When a managed node boots, the gWLM agent (gwlmagent) starts automatically if GWLM_AGENT_START is set to 1 in the file /etc/rc.config.d/gwlmCtl. The agent then checks the file /etc/opt/gwlm/deployed.config to determine its CMS. Next, it attempts to contact the CMS to have the CMS re-deploy its view of the SRD. If the CMS cannot be contacted, the SRD in the deployed.config file is deployed as long as all nodes agree.
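The boot-time sequence above can be summarized as a small decision function. The following Python sketch is illustrative only and is not how gwlmagent is implemented: the names BootAction, agent_start_enabled, and boot_action are hypothetical, the parser assumes a plain shell-style GWLM_AGENT_START=value line, and the final NO_SRD_DEPLOYED outcome is an assumption for the case the text does not describe (CMS unreachable and nodes not in agreement).

```python
from enum import Enum


class BootAction(Enum):
    AGENT_NOT_STARTED = "agent not started"
    CMS_REDEPLOYS_SRD = "CMS re-deploys its view of the SRD"
    DEPLOY_SAVED_SRD = "SRD from deployed.config is deployed"
    NO_SRD_DEPLOYED = "no SRD deployed"  # assumption: not stated in the text


def agent_start_enabled(rc_config_text):
    """Return True if the text sets GWLM_AGENT_START to 1.

    Assumes simple shell-style assignments; the last assignment wins.
    """
    enabled = False
    for raw_line in rc_config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("GWLM_AGENT_START="):
            value = line.split("=", 1)[1].strip().strip("\"'")
            enabled = (value == "1")
    return enabled


def boot_action(start_enabled, cms_reachable, all_nodes_agree):
    """Map the documented boot conditions to an outcome."""
    if not start_enabled:
        return BootAction.AGENT_NOT_STARTED
    if cms_reachable:
        return BootAction.CMS_REDEPLOYS_SRD
    if all_nodes_agree:
        return BootAction.DEPLOY_SAVED_SRD
    return BootAction.NO_SRD_DEPLOYED


# Example: agent enabled, CMS down, all nodes agree on the saved SRD.
enabled = agent_start_enabled("GWLM_AGENT_START=1\n")
print(boot_action(enabled, cms_reachable=False, all_nodes_agree=True).value)
```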

In general, when an SRD is disrupted because a node goes down, the CMS goes down, or network communication fails, gWLM attempts to reform the SRD. gWLM maintains the concept of a cluster for the nodes in an SRD. In a cluster, one node is the master and the other nodes are nonmasters. If the master node loses contact with the rest of the SRD, the rest of the SRD can continue without it, as a partial cluster, by unanimously agreeing on a new master. If a nonmaster loses communication with the rest of the SRD, the resulting partial cluster continues operation without the lost node. The master simply omits the missing node until it becomes available again.
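The two cases described above (a lost nonmaster and a lost master) can be modeled with a short Python sketch. It is a simplified illustration under stated assumptions, not gWLM's actual protocol: the function reform_cluster and the votes mapping are hypothetical, the text does not say how nodes pick a candidate master, and returning None when agreement is not unanimous is an assumption.

```python
def reform_cluster(members, master, lost, votes=None):
    """Return (master, remaining_members) for the partial cluster, or None.

    members: list of node names in the SRD
    master:  name of the current master node
    lost:    set of node names that lost contact
    votes:   hypothetical mapping of remaining node -> proposed new master,
             consulted only when the master itself is lost
    """
    remaining = [node for node in members if node not in lost]
    if not remaining:
        return None

    # A nonmaster was lost: the master keeps running and omits the lost node.
    if master not in lost:
        return (master, remaining)

    # The master was lost: the remaining nodes must agree unanimously.
    votes = votes or {}
    choices = {votes.get(node) for node in remaining}
    if len(choices) == 1:
        new_master = choices.pop()
        if new_master in remaining:
            return (new_master, remaining)
    return None  # assumption: no partial cluster forms without unanimity


nodes = ["node1", "node2", "node3"]
# Nonmaster node3 is lost: node1 stays master of a two-node partial cluster.
print(reform_cluster(nodes, "node1", {"node3"}))
# Master node1 is lost: node2 and node3 both propose node2.
print(reform_cluster(nodes, "node1", {"node1"},
                     votes={"node2": "node2", "node3": "node2"}))
```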

You can use the gwlmstatus command to monitor availability. It can tell you whether any hosts are unable to rejoin a node's SRD as well as whether hosts in the SRD are nonresponsive. For more information, see gwlmstatus(1M).

NOTE: Attempts to reform SRDs might time out, leaving no SRD deployed and consequently no management of resource allocations. If this occurs, see the HP Matrix Operating Environment Release Notes and follow the actions suggested in the section titled “Data Missing in Real-time Monitoring.”

Related events

You can configure the following System Insight Manager events regarding this automatic restart feature:

Node Failed to Rejoin SRD on Start-up

SRD Reformed with Partial Set of Nodes

SRD Communication Issue

