Executive Summary

In the context of software applications, Reliability, Availability, and Serviceability (RAS) means that failures in the underlying processes and hardware components must not interrupt overall system operation. When a system fails, service transactions are adversely affected, and performance suffers because of the downtime of the failed system. Recovering from the failure early, managing the remaining working components of the system, or both, with minimal impact on business services, is the key to an optimized system that meets the RAS criteria.

In computer systems, PCI™ failures constitute a significant percentage of errors. The PCI Error Recovery (PCI ER) feature enables the detection of PCI bus parity errors, isolation of the failed I/O path, and recovery of cards from errors. Enabling the PCI ER feature avoids a system crash, decreases system downtime, and supports single-system high availability.

On legacy systems running the HP-UX 11i v2 OS, user intervention is required to attempt recovery from PCI errors. The olrad(1M) command and the Attention Button can be used to recover and restore the slot, card, and driver to a usable state without taking the system down. In contrast, HP-UX 11i v3 can automatically recover from PCI errors and restore the slot and driver without user intervention.

Problem Statement

Without PCI/PCI EXPRESS® (PCIe®) error recovery, I/O paths operate in Hardfail mode. In this mode, a PCI I/O error, a rope error, or a PCIe error causes a machine check abort (MCA) on a PIO read and brings the system down. With the PCI ER feature, PCI I/O paths can be set to SoftFail mode, provided that the platform and all the adapter drivers support this feature.
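The condition for SoftFail mode can be sketched as a small illustrative model. This is not HP-UX code or an HP-UX interface: the names `PathMode`, `AdapterDriver`, and `select_path_mode` are hypothetical and exist only to show that one driver without PCI ER support keeps the path in Hardfail mode.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class PathMode(Enum):
    """Hypothetical labels for the two I/O path modes named in the text."""
    HARDFAIL = "Hardfail"
    SOFTFAIL = "SoftFail"


@dataclass(frozen=True)
class AdapterDriver:
    """Hypothetical record describing one adapter driver on an I/O path."""
    name: str
    supports_error_recovery: bool


def select_path_mode(platform_supports_er: bool,
                     drivers: Iterable[AdapterDriver]) -> PathMode:
    """Return SOFTFAIL only if the platform and every driver support PCI ER.

    Assumption of this sketch: a path with no drivers is governed by
    platform support alone, because all() of an empty sequence is True.
    """
    if platform_supports_er and all(d.supports_error_recovery for d in drivers):
        return PathMode.SOFTFAIL
    return PathMode.HARDFAIL


if __name__ == "__main__":
    # Hypothetical driver names, used for illustration only.
    drivers = [
        AdapterDriver("lan_driver", supports_error_recovery=True),
        AdapterDriver("fc_driver", supports_error_recovery=False),
    ]
    print(select_path_mode(True, drivers).value)
```

In this model the second driver lacks error recovery support, so the path stays in Hardfail mode even though the platform supports the feature.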

Historical Evolution of PCI/PCI-Express Error Recovery (ER)

On systems running HP-UX 11i v1, PCI bus errors are handled and the cards are recovered manually. This feature was shipped as a site-specific patch.

On systems running HP-UX 11i v2, PCI bus errors are handled and the cards are recovered manually. The feature was shipped as an optional product bundle (PCIErrorHandling) on the SupportPack media, starting with the AR0806 release.

On systems running HP-UX 11i v3, PCI bus errors are handled and the cards are recovered automatically. This feature is part of the Base Operating Environment.

Why PCI/PCI-Express Error Recovery

Interruptions in online transaction processing systems and enterprise resource planning services, caused by a failed application, system, or hardware component, can be costly and disruptive. The impact of service downtime continues to grow as companies move toward a real-time business model. Moreover, as companies become more connected and response times shorten, the cost of service downtime continues to increase. For these reasons, businesses invest large amounts of money in maintaining the RAS of servers in the IT infrastructure. The PCI ER feature can drastically reduce the system downtime caused by PCI errors.

Figure 1 and Figure 2 compare the system behavior when a PCI error occurs, with and without the error recovery feature.
