How PCI / PCI-Express Error Recovery Works
When an I/O driver detects a PCI bus parity error, it reports the error to the error recovery infrastructure, and the report is then forwarded to the core platform support module (PSM).
The core PSM implements error recovery using platform-independent interfaces. It determines whether the error is a device error or a bus error. If the error is a device error, the PSM ignores it. Otherwise, the PSM handles the I/O error and notifies the error recovery infrastructure about the error handling. While handling the error, the core PSM invokes a firmware interface (such as the Health checker daemon shown in the figure below), which logs and clears the error.
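The decision described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the actual PSM code: the names ErrorKind, FirmwareLog, and core_psm_handle, and the callback used to notify the error recovery infrastructure, are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class ErrorKind(Enum):
    DEVICE = "device"
    BUS = "bus"


@dataclass
class FirmwareLog:
    """Hypothetical stand-in for the firmware interface that logs and clears errors."""
    entries: list = field(default_factory=list)

    def log_and_clear(self, node: str) -> None:
        # Record that the error on this node was logged and cleared.
        self.entries.append(node)


def core_psm_handle(kind: ErrorKind, node: str, firmware: FirmwareLog, notify) -> bool:
    """Return True if the PSM handled the error, False if it ignored it."""
    if kind is ErrorKind.DEVICE:
        return False               # device errors are ignored by the PSM
    firmware.log_and_clear(node)   # bus error: firmware logs and clears it
    notify(node)                   # inform the error recovery infrastructure
    return True
```

For example, calling core_psm_handle with ErrorKind.BUS records the node in the firmware log and invokes the notify callback once, while ErrorKind.DEVICE does neither.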
Figure 3 depicts the PCI I/O error recovery control flow.
Figure 3: PCI Error Recovery Flow Diagram
A system that supports error recovery handles PCI I/O errors as follows:
Determine whether the platform and the drivers are error recovery capable. If that is the case, set the I/O paths to SoftFail mode.
When a driver detects an error, it reports the error to the error recovery infrastructure. The report is sent to the core PSM.
To handle and recover from a PCI error, the core PSM completes the following three phases:
-Diagnose Phase
-Synchronization (suspension) Phase
-Release (resumption) Phase
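As a rough sketch of the steps above, the capability check and the phase ordering might look like the following Python. The function names use_softfail and recover are hypothetical, and stopping at the first phase that fails is an assumption made for illustration; the text does not say how a failed phase is treated.

```python
from enum import Enum


class Phase(Enum):
    DIAGNOSE = 1
    SYNCHRONIZE = 2   # suspension
    RELEASE = 3       # resumption


def use_softfail(platform_capable: bool, driver_capable: bool) -> bool:
    """I/O paths go to SoftFail mode only if platform and driver both support recovery."""
    return platform_capable and driver_capable


def recover(run_phase) -> list:
    """Run the three phases in order; stop at the first that reports failure (assumption)."""
    completed = []
    for phase in Phase:            # Enum members iterate in definition order
        if not run_phase(phase):
            break
        completed.append(phase)
    return completed
```

Here run_phase is a caller-supplied function that performs one phase and returns True on success, so the ordering logic stays separate from whatever each phase actually does.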
Diagnose Phase: During this phase, the I/O node information is passed from the driver to the core PSM. This node forms the initial root of the error path. The primary goal of this state is to gather additional information and determine the actual root of the error path. On
No attempt is made during this state to recover the path from the error, and some hardware may be inaccessible.
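One way to picture the Diagnose Phase is as a read-only walk over the I/O tree, starting from the node the driver reported. The Python sketch below assumes that the actual root is the highest ancestor that also shows a logged error; that rule, the IONode type, and the error_logged flag are all hypothetical and are not taken from the PSM itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IONode:
    name: str
    parent: Optional["IONode"] = None
    error_logged: bool = False   # hypothetical flag read during diagnosis


def find_error_root(reported: IONode) -> IONode:
    """Start at the driver-reported node and move up while the parent also shows an error."""
    root = reported
    while root.parent is not None and root.parent.error_logged:
        root = root.parent
    return root
```

The function only inspects nodes and never modifies them, in keeping with the point that no recovery is attempted at this stage.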
Synchronization (suspension) Phase: During this phase, the core PSM attempts to clear logged errors