6 Process Monitoring and Integrity

6.1 Overview

The Chassis Management Module monitors the general health of processes running on the CMM and can take recovery actions upon detection of failed processes. This is handled by the Process Monitoring Service (PMS).

Upon detecting an unhealthy process, the PMS takes a configurable recovery action. Examples of recovery actions include restarting the process and failing over to the standby CMM.

The PMS itself is also monitored to ensure that it is operating correctly. The PMS is monitored in both a single CMM configuration and a redundant CMM configuration. When faults are detected in the PMS, corrective actions are taken.

The PMS also provides dynamic configuration and status information through the CLI, RPC, and SNMP interfaces. For example, users can administratively lock (disable) monitoring of a process while the PMS is running to suit their particular needs. The PMS also provides static configuration that lets customers tune the static system parameters for the given platform. Examples of these parameters include the monitoring interval, the number of retries, and ramp-up times.
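As a rough illustration of how the static parameters and the administrative lock described above might interact, the following Python sketch models a per-process configuration record. It is not the PMS configuration format: the names `MonitorConfig`, `RecoveryAction`, and `should_recover`, the default values, and the exact meaning given to retries and ramp-up time are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryAction(Enum):
    # Hypothetical identifiers; the text names only these two examples.
    RESTART_PROCESS = "restart"
    FAILOVER_TO_STANDBY = "failover"


@dataclass
class MonitorConfig:
    """Per-process settings; field names and defaults are illustrative."""
    name: str
    interval_s: float = 5.0      # static: time between health checks
    retries: int = 3             # static: failed checks tolerated (assumed meaning)
    ramp_up_s: float = 30.0      # static: grace period after start (assumed meaning)
    recovery: RecoveryAction = RecoveryAction.RESTART_PROCESS
    admin_locked: bool = False   # dynamic: monitoring disabled by an operator


def should_recover(cfg: MonitorConfig, failed_checks: int, uptime_s: float) -> bool:
    """Return True when the configured thresholds call for a recovery action."""
    if cfg.admin_locked:
        return False             # monitoring is administratively locked
    if uptime_s < cfg.ramp_up_s:
        return False             # process is still ramping up
    return failed_checks > cfg.retries
```

In this sketch a locked process is never recovered, and failures during the ramp-up window are ignored, which is one plausible reading of why such parameters would be tunable per platform.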

6.1.1 Process Existence Monitoring

Process existence monitoring utilizes the operating system's process table to determine the existence of the process. When the CMM software is started, the PMS initializes and determines the set of processes to monitor for process existence. The PMS periodically queries the operating system for the existence of that set of processes. When a monitored process is found not to exist, the PMS will generate a SEL entry and take a recovery action.

Process existence monitoring can be utilized on all permanent processes (processes which exist for the life of the CMM software as a whole). It is particularly useful when monitoring processes that were not specifically developed for running on the CMM. Applications that are provided by the operating system vendor are examples of these types of processes. For the Linux* operating system, processes like syslogd and crond would be good examples.
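The polling pass described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PMS implementation: it assumes a Linux system that exposes `/proc/<pid>/comm`, and the function names `running_process_names` and `check_existence` and the `on_missing` callback (a stand-in for generating a SEL entry and taking a recovery action) are hypothetical.

```python
import os
from typing import Callable, Iterable, List, Set


def running_process_names(proc_root: str = "/proc") -> Set[str]:
    """Snapshot the names of running processes from the Linux process table."""
    names = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue                      # only numeric entries are PIDs
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                names.add(f.read().strip())
        except OSError:
            continue                      # process exited during the scan
    return names


def check_existence(monitored: Iterable[str],
                    snapshot: Callable[[], Set[str]] = running_process_names,
                    on_missing: Callable[[str], None] = print) -> List[str]:
    """Run one polling pass; report and return monitored names not found."""
    present = snapshot()
    missing = [name for name in monitored if name not in present]
    for name in missing:
        on_missing(name)   # placeholder for "generate SEL entry and recover"
    return missing
```

A monitor loop would call `check_existence(["syslogd", "crond"])` once per monitoring interval. Because the check relies only on the process table, it needs no cooperation from the monitored process, which is why it suits vendor-supplied applications. Note that the kernel truncates the name in `comm` to 15 characters, so longer process names would need a different lookup.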

6.1.2 Thread Watchdog Monitoring

Thread watchdog monitoring requires that the process being monitored notify the PMS of its continued operation. These notifications allow the PMS to monitor the process both for existence and for conditions where the process locks up. Each thread that requires monitoring within a process using the thread watchdog registers with the PMS. The PMS loops through its list of registered threads and determines whether each one is still operating. When any thread is determined to be unresponsive (i.e., not notifying the PMS of its continued operation), the PMS generates a SEL entry and takes a recovery action.
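The register, notify, and check cycle described above can be sketched in Python as follows. This does not reproduce the PMS thread watchdog API, whose functions are not given here: the class `ThreadWatchdog`, its method names, and the single `timeout_s` threshold are assumptions chosen for illustration.

```python
import threading
import time
from typing import Callable, Dict, List


class ThreadWatchdog:
    """Tracks the last check-in time of each registered thread."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_seen: Dict[str, float] = {}
        self._lock = threading.Lock()

    def register(self, thread_name: str) -> None:
        """Called once by each thread that wants to be monitored."""
        with self._lock:
            self._last_seen[thread_name] = self._clock()

    def notify(self, thread_name: str) -> None:
        """Called periodically by a registered thread to show it is alive."""
        with self._lock:
            if thread_name not in self._last_seen:
                raise KeyError(f"{thread_name} is not registered")
            self._last_seen[thread_name] = self._clock()

    def unresponsive(self) -> List[str]:
        """One monitoring pass: threads whose last check-in is too old."""
        now = self._clock()
        with self._lock:
            return [name for name, seen in self._last_seen.items()
                    if now - seen > self._timeout_s]
```

Unlike an existence check, this scheme also catches a thread that is still present but has stopped making progress, since a hung thread stops calling `notify` and eventually appears in the list returned by `unresponsive`. The monitor would then log the event and apply the configured recovery action.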

Thread watchdog monitoring can be used on all processes that are instrumented with the PMS thread watchdog API. It provides more functionality than process existence monitoring and can be used in conjunction with process integrity monitoring to provide a comprehensive solution. Thread

MPCMM0001 Chassis Management Module Software Technical Product Specification
