Process Monitoring and Integrity

6.7.9 Excessive Restarts, Failed Escalate Failover/Reboot, Non-Critical

In this scenario, the PMS detects a process fault. The severity of the process is configured as non-critical, and the configured recovery action is to restart the process. However, the PMS also detects that the process has exceeded the threshold for excessive process restarts, so it executes the escalation action. The configured escalation recovery action is to fail over to the standby CMM and, once the failover has succeeded, reboot the now-standby CMM.

The failover recovery action is unsuccessful (for example, because the standby CMM is not available). Because the monitored process is not of critical severity, the reboot of the CMM is not performed.
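The decision path described above can be sketched as follows. This is a minimal illustration in Python, not the CMM implementation: the function name `plan_recovery`, its parameters, and the returned action labels are hypothetical, and treating "exceeded the threshold" as a strict greater-than comparison is an assumption. Only the non-critical case of this section is modeled.

```python
from typing import List


def plan_recovery(restart_count: int, restart_threshold: int,
                  failover_succeeded: bool) -> List[str]:
    """Return the ordered actions for a faulted non-critical process.

    Hypothetical sketch: names and action labels are illustrative only.
    """
    # Assumption: "exceeded" means strictly greater than the threshold.
    if restart_count <= restart_threshold:
        return ["restart process"]

    # Excessive restarts: escalate to the configured failover/reboot action.
    actions = ["attempt failover to standby CMM"]
    if failover_succeeded:
        # The reboot is performed only after a successful failover.
        actions.append("reboot the now standby CMM")
    else:
        # Non-critical process: no reboot; the PMS stops monitoring it.
        actions.append("skip reboot")
        actions.append("disable monitoring of the process")
    return actions
```

Calling `plan_recovery(4, 3, False)` corresponds to this section's scenario: the failover is attempted, the reboot is skipped, and monitoring of the process is disabled.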

Table 14. Excessive Restarts, Failed Escalate Failover/Reboot, Non-Critical

| Description | Event String | UID | Assert | Severity |
|---|---|---|---|---|
| PMS detects a faulty process. The mechanism (existence, thread watchdog, or integrity) used to detect the fault will determine which of the event type strings will be used. | "Process existence fault; attempting recovery" or "Thread watchdog fault; attempting recovery" or "Process integrity fault; attempting recovery" | # | Assert | Configure |
| The recovery action specified is "restart process" | "Attempting process restart recovery action" | # | N/A | Configure |
| PMS detects that the process has been restarted excessively. | "Recovery failure due to excessive restarts" | # | N/A | Configure |
| The escalated recovery action specified is "failover and reboot" | "Attempting failover & reboot escalated recovery action" | # | N/A | Configure |
| PMS executes a failover. | The existing code generates the events for failover. They are separate from process monitoring events and are not described here. | - | N/A | N/A |
| PMS detects that it is still running on the active CMM. The process is not critical and therefore the reboot operation will not be performed. | "Failover & reboot escalated recovery failure" | # | N/A | Configure |
| No attempt will be made to recover the process. The PMS will stop monitoring the process. See Section 6.7.11, "Process Administrative Action" on page 53, for information about how to re-enable monitoring and de-assert the event. | "Process existence fault; monitoring disabled" or "Thread watchdog fault; monitoring disabled" or "Process integrity fault; monitoring disabled" | # | Assert | Configure |

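As an illustration of how the rows of Table 14 combine, the sketch below lists the process monitoring event strings in the order the table gives them for one detection mechanism. The event strings are taken from the table; the names `FAULT_PREFIX` and `scenario_events` are hypothetical and do not come from the CMM software. Failover events are omitted because they are generated separately.

```python
from typing import List

# Fault-type prefixes from Table 14, keyed by detection mechanism.
FAULT_PREFIX = {
    "existence": "Process existence fault",
    "thread watchdog": "Thread watchdog fault",
    "integrity": "Process integrity fault",
}


def scenario_events(mechanism: str) -> List[str]:
    """Return the Table 14 event strings, in table order (illustrative only)."""
    prefix = FAULT_PREFIX[mechanism]  # raises KeyError for an unknown mechanism
    return [
        f"{prefix}; attempting recovery",
        "Attempting process restart recovery action",
        "Recovery failure due to excessive restarts",
        "Attempting failover & reboot escalated recovery action",
        # Failover events are generated separately and are not listed here.
        "Failover & reboot escalated recovery failure",
        f"{prefix}; monitoring disabled",
    ]


for line in scenario_events("thread watchdog"):
    print(line)
```

Only the first and last strings vary with the detection mechanism; the four in between are the same for every fault type in this scenario.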
6.7.10 Excessive Restarts, Failed Escalate Failover/Reboot, Critical

In this scenario, PMS detects a process fault. The severity of the process is configured as critical. The configured recovery action is: restart the process. However, the PMS also detects that the process has exceeded the threshold for excessive process restarts. Therefore, the PMS will execute the escalation recovery action. The configured escalation recovery action is: failover to the standby CMM and upon successfully executing the failover, reboot the now standby CMM. The failover

52 MPCMM0001 Chassis Management Module Software Technical Product Specification
