Member failures
In the absence of HA, the master node detects the failure of a member by receiving regular heartbeat messages. If
no heartbeat is received for 600 seconds, the master assumes the member is dead. There are two ways
to recover from this problem:
1. Repair the dead host (e.g. by physically rebooting it). When the connection to the member is restored, the
master marks the member as alive again.
2. Shut down the host and instruct the master to forget about the member node using the xe host-forget CLI
command. Once the member has been forgotten, all the VMs which were running there are marked as
offline and can be restarted on other XenServer hosts. Note that it is very important to ensure that the XenServer
host is actually offline, otherwise VM data corruption might occur. Be careful not to split your pool into multiple
pools of a single host by using xe host-forget, since this could result in them all mapping the same shared
storage and corrupting VM data.
Warning:
If you are going to use the forgotten host as a XenServer host again, perform a fresh
installation of the XenServer software.
Do not use the xe host-forget command if HA is enabled on the pool. Disable HA first, then
forget the host, and then re-enable HA.
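As a minimal sketch of the second recovery path (the UUID is a placeholder), the following commands, run on
the pool master, identify the dead member and remove it from the pool:

# List hosts and their liveness to find the UUID of the dead member
xe host-list params=uuid,name-label,host-metrics-live

# Remove the dead member from the pool (only once you are sure it is offline)
xe host-forget uuid=<host_UUID>

The exact fields accepted by the params filter can vary between XenServer releases; host-metrics-live reports
whether the master currently considers the host alive.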
When a member XenServer host fails, there may be VMs still registered in the running state. If you are sure that
the member XenServer host is down, use the xe vm-reset-powerstate CLI command to set the power
state of the VMs to halted. See the section called “vm-reset-powerstate” for more details.
Warning:
Incorrect use of this command can lead to data corruption. Only use this command if
absolutely necessary.
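As an illustrative sketch (both UUIDs are placeholders), you might first list the VMs still registered as running
on the failed host, and then reset each one:

# List VMs still marked as resident on the failed host
xe vm-list resident-on=<host_UUID> params=uuid,name-label,power-state

# Force the power state of an affected VM to halted
xe vm-reset-powerstate uuid=<VM_UUID> --force

The --force flag is required because the command changes the power-state record in the pool database without
contacting the (unreachable) host.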
Before you can start VMs on another XenServer host, you must also release the locks on VM storage.
Each disk in an SR can only be used by one host at a time, so it is essential to make the disk accessible to other
XenServer hosts once a host has failed. To do so, run the following script on the pool master for each SR that
contains disks of any affected VMs:
/opt/xensource/sm/resetvdis.py <host_UUID> <SR_UUID> [master]
You need only supply the third string ("master") if the failed host was the SR master (the pool master, or a
XenServer host using local storage) at the time of the crash.
Warning:
Be absolutely sure that the host is down before executing this command. Incorrect use of this
command can lead to data corruption.
If you attempt to start a VM on another XenServer host before running the script above, you will receive
the following error message: VDI <UUID> already attached RW.
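As a hedged example of the full unlock sequence (all UUIDs are placeholders), you could locate the SR behind
each affected VM's disks and then run the reset script:

# Find the VDIs attached to the affected VM, and the SR containing each one
xe vbd-list vm-uuid=<VM_UUID> params=vdi-uuid
xe vdi-list uuid=<VDI_UUID> params=sr-uuid

# Release the storage locks held by the failed host; append "master"
# only if the failed host was the SR master, as described above
/opt/xensource/sm/resetvdis.py <host_UUID> <SR_UUID>

After the script completes, the affected VMs can be started on another host with xe vm-start.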
Master failures
Every member of a resource pool contains all the information necessary to take over the role of master if required.
When a master node fails, the following sequence of events occurs:
1. If HA is enabled, another master is elected automatically.
2. If HA is not enabled, each member will wait for the master to return.
If the master comes back up at this point, it re-establishes communication with its members, and operation
returns to normal.
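If you need to confirm that a member has noticed the master's absence, one rough check (behavior may vary
between XenServer releases) is to ask the member directly whether it has entered emergency mode:

# Run on a member: prints true if the host has lost contact with the
# pool master and entered emergency mode, false otherwise
xe host-is-in-emergency-mode

This is one of the few CLI commands that a member will answer while the master is unreachable.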