
Network Partitions Across Sites
A network partition across sites is similar to a site failure. The Serviceguard cluster nodes on both sites detect this failure and try to reform the cluster using the Quorum Server. The nodes from only one of the sites will receive the quorum and form the cluster. The nodes on the other site restart and deliberately fail the active
The Site Controller Package running on the site nodes that failed to form the cluster will now fail over to the adoptive node on the site where the cluster is reformed. When the Site Controller Package starts on the adoptive node at the remote site, it detects that the active complex workload's packages have failed. Consequently, the Site Controller Package performs a site failover and starts the corresponding complex workload's packages on the site where the cluster has reformed.
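The sequence above can be sketched as a small model. This is only an illustration, not Serviceguard code: the function name `partition_outcome` and the site labels are hypothetical, and the sketch assumes that the Site Controller Package was running on the same site as the active complex workload when the partition occurred.

```python
def partition_outcome(active_site: str, quorum_winner: str) -> list[str]:
    """List the recovery steps after a network partition across sites.

    active_site   -- site running the active complex workload (hypothetical label)
    quorum_winner -- site whose nodes received the quorum from the Quorum Server
    """
    if quorum_winner == active_site:
        # Assumption (not stated in the text): if the active site keeps the
        # quorum, nothing has to move, so the model reports no steps.
        return []
    return [
        f"nodes on {active_site} fail to get the quorum and restart",
        f"Site Controller Package fails over to an adoptive node on {quorum_winner}",
        f"Site Controller Package detects that the workload packages of {active_site} have failed",
        f"site failover: workload packages start on {quorum_winner}",
    ]


if __name__ == "__main__":
    for step in partition_outcome("site_a", "site_b"):
        print(step)
```

Keeping the steps as an ordered list mirrors the order in which the text describes them; a real cluster performs these actions through Serviceguard itself.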
Disk Array and SAN Failure
When a disk array or the host access SAN at a site fails, the active complex workload database running on the site could hang or fail, depending on which component has failed. If the SAN failure causes the complex workload database processes to fail and consequently the
Replication Link Failure
A failure in a replication link between sites stalls the replication from the active
In synchronous replication mode, with the fence level set to Data, the primary site disk array starts failing I/Os. This causes the active complex workload configuration to fail. The Site Controller Package then performs a site failover, if a
If the fence level is set to Never, the I/O on the PVOL side is not failed, and the active complex workload continues to run successfully.
In asynchronous replication mode, the complex workload configuration is not interrupted and continues to run.
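The three cases described for a replication link failure can be summarized in a lookup sketch. The names `LinkFailureOutcome` and `replication_link_failure_outcome` are hypothetical and exist only for this illustration; combinations that the text does not describe are rejected rather than guessed.

```python
from typing import NamedTuple, Optional


class LinkFailureOutcome(NamedTuple):
    primary_io_fails: bool        # does the primary site disk array fail I/Os?
    workload_keeps_running: bool  # does the active complex workload keep running?


def replication_link_failure_outcome(
    mode: str, fence_level: Optional[str] = None
) -> LinkFailureOutcome:
    """Return the documented effect of a replication link failure."""
    mode = mode.lower()
    if mode == "asynchronous":
        # Asynchronous replication: the workload is not interrupted.
        return LinkFailureOutcome(primary_io_fails=False, workload_keeps_running=True)
    if mode == "synchronous":
        level = (fence_level or "").lower()
        if level == "data":
            # Fence level Data: the primary array fails I/Os, the workload fails.
            return LinkFailureOutcome(primary_io_fails=True, workload_keeps_running=False)
        if level == "never":
            # Fence level Never: I/O on the PVOL side is not failed.
            return LinkFailureOutcome(primary_io_fails=False, workload_keeps_running=True)
        raise ValueError(f"fence level {fence_level!r} is not covered by this sketch")
    raise ValueError(f"replication mode {mode!r} is not covered by this sketch")


if __name__ == "__main__":
    print(replication_link_failure_outcome("synchronous", "Data"))
    print(replication_link_failure_outcome("synchronous", "Never"))
    print(replication_link_failure_outcome("asynchronous"))
```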
When the complex workload is mounted as read only or is idle or is completing
Site Controller Package Failure
The Site Controller Package can fail for many reasons, such as a node crash, while the active
NOTE: When the adoptive node is in the same site where the active complex workload stack is currently running, the failover is considered a local failover for the Site Controller Package.
On a Site Controller Package local failover, the disaster tolerant complex workload remains uninterrupted on that site. The Site Controller Package continues to monitor the managed packages or the critical packages on the site, as configured from the current node.
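The local failover rule from the NOTE above can be expressed as a simple site comparison. This is a minimal sketch: the node names, the `node_sites` mapping, and the function `classify_site_controller_failover` are hypothetical, and any adoptive node outside the active workload's site is treated here as a failover across sites.

```python
def classify_site_controller_failover(
    adoptive_node: str, active_workload_site: str, node_sites: dict[str, str]
) -> str:
    """Classify a Site Controller Package failover as 'local' or 'cross-site'.

    node_sites maps each cluster node name to the site it belongs to
    (a hypothetical mapping used only for this illustration).
    """
    if adoptive_node not in node_sites:
        raise KeyError(f"unknown node: {adoptive_node}")
    if node_sites[adoptive_node] == active_workload_site:
        # Same site as the active workload stack: local failover,
        # and the workload on that site remains uninterrupted.
        return "local"
    return "cross-site"


if __name__ == "__main__":
    sites = {"node1": "site_a", "node2": "site_a", "node3": "site_b"}
    print(classify_site_controller_failover("node2", "site_a", sites))
    print(classify_site_controller_failover("node3", "site_a", sites))
```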
When the Site Controller Package fails over to an adoptive node at the remote site, it is considered a failover across sites for the Site Controller Package. When the Site Controller Package fails over across sites while the active