Executive Summary
In the context of software applications, Reliability, Availability, and Serviceability (RAS) mean that failures in the underlying processes and hardware components must not interrupt overall system operation. A system failure adversely affects service transactions, and performance suffers from the downtime of the failed system. Recovering from the failure early, managing the remaining working components of the system, or both, with minimal impact on business services, is the key to an optimized system that meets the RAS criteria.
In computer systems, PCI™ failures constitute a significant percentage of errors. The PCI Error Recovery (PCI ER) feature enables the detection of PCI bus parity errors, isolation of the failed I/O path, and recovery of cards from errors. Enabling the PCI ER feature avoids a system crash, decreases system downtime, and supports single-system high availability.
On
Problem Statement
Without PCI/PCI EXPRESS® (PCIe®) error recovery, I/O paths operate in Hardfail mode. In this mode, a PCI I/O error, a rope error, or a PCIe error causes an MCA on a PIO read and brings the system down. With the PCI ER feature, PCI I/O paths can be set to SoftFail mode if the platform and all the adapter drivers support this feature.
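The condition for SoftFail mode can be illustrated with a minimal sketch. This is not the operating system's implementation: the names PathMode, Driver, select_path_mode, and on_pci_error are hypothetical, and the outcome strings only paraphrase the behavior described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class PathMode(Enum):
    HARDFAIL = "hardfail"
    SOFTFAIL = "softfail"


@dataclass
class Driver:
    """Hypothetical stand-in for an adapter driver on an I/O path."""
    name: str
    supports_error_recovery: bool


def select_path_mode(platform_supports_er: bool,
                     drivers: Sequence[Driver]) -> PathMode:
    """Return SOFTFAIL only if the platform and every driver support PCI ER."""
    if platform_supports_er and all(d.supports_error_recovery for d in drivers):
        return PathMode.SOFTFAIL
    # A single driver without support keeps the path in Hardfail mode.
    return PathMode.HARDFAIL


def on_pci_error(mode: PathMode) -> str:
    """Describe, in words, the outcome of a PCI error for a given mode."""
    if mode is PathMode.HARDFAIL:
        return "system down (MCA)"
    return "I/O path isolated; card recovery attempted"


if __name__ == "__main__":
    # Illustrative driver names; they do not refer to real drivers.
    drivers = [Driver("lan_driver", True), Driver("storage_driver", False)]
    mode = select_path_mode(True, drivers)
    print(mode.value, "->", on_pci_error(mode))
```

In this sketch the requirement is all-or-nothing: one driver that lacks error recovery support leaves the whole path in Hardfail mode, which mirrors the "platform and all the adapter drivers" condition.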
Historical Evolution of PCI/PCI-Express Error Recovery (ER)
On systems running
On systems running
On systems running
Why PCI/PCI-Express Error Recovery
Interruptions to online transaction processing systems and enterprise resource planning services, caused by a failed application, system, or hardware component, can be costly and disruptive. The impact of service downtime continues to grow as companies move toward a
Figure 1 and Figure 2 compare the system behavior when a PCI error occurs, with and without the error recovery feature.