Dynamic CPU Deallocation

The processors are continuously monitored for errors such as L2 cache ECC errors. When a predefined error threshold is met, an error log with warning severity and threshold exceeded status is returned to AIX. At the same time, the service processor marks the CPU for deconfiguration at the next boot. In the meantime, AIX will attempt to migrate all resources associated with that processor (tasks, interrupts, etc.) to another processor, and then stop the failing processor.

Dynamic CPU deallocation is active only in systems with more than two processors. Device drivers and kernel extensions that are common to multi-processor and uni-processor systems would otherwise switch to uni-processor mode, with unpredictable results.
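The sequence described above can be sketched as a small model. This is only an illustration, not AIX or service processor code: the `Cpu` class, the `report_recovered_error` function, the threshold value of 3, and the "move tasks to the least-loaded CPU" policy are all assumptions made for the example. The sketch also assumes the "more than two processors" rule is checked against the processors currently online.

```python
from dataclasses import dataclass, field

ERROR_THRESHOLD = 3        # hypothetical value; the text does not state the real threshold
MIN_CPUS_FOR_DEALLOC = 3   # "more than two processors"


@dataclass
class Cpu:
    cpu_id: int
    recovered_errors: int = 0
    online: bool = True
    deconfigure_at_next_boot: bool = False
    tasks: list = field(default_factory=list)


def report_recovered_error(cpus, cpu_id):
    """Count one recovered error; return True if the CPU was stopped at runtime."""
    cpu = next(c for c in cpus if c.cpu_id == cpu_id)
    cpu.recovered_errors += 1
    if cpu.recovered_errors < ERROR_THRESHOLD:
        return False

    # Threshold met: the CPU is marked for deconfiguration at the next boot,
    # whether or not it can also be stopped right now.
    cpu.deconfigure_at_next_boot = True

    online = [c for c in cpus if c.online]
    if not cpu.online or len(online) < MIN_CPUS_FOR_DEALLOC:
        return False  # dynamic deallocation is not active on this system

    # Migrate work to another processor (assumed policy: fewest tasks), then stop.
    target = min((c for c in online if c is not cpu), key=lambda c: len(c.tasks))
    target.tasks.extend(cpu.tasks)
    cpu.tasks.clear()
    cpu.online = False
    return True
```

On a two-processor system the same call would still set `deconfigure_at_next_boot` but leave the processor running, which mirrors the restriction described above.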

Persistent CPU and Memory Deconfiguration

CPUs and memory modules with a failure history are marked bad to prevent them from being configured on subsequent boots. This history is kept in the VPD [3] records on the FRU [4], so the information moves physically with the FRU: replacing the FRU clears the history from the system, and the record stays with the failed FRU when it is returned to IBM. A CPU or memory module is marked bad when:

It fails BIST [5]/POST [6] testing during boot (as determined by the service processor).

It causes a machine check or check stop during runtime and the failure can be isolated specifically to that CPU or memory module (as determined by the service processor).

It reaches a threshold of recovered failures (for example, the ECC-correctable L2 cache errors described under Dynamic CPU Deallocation) that results in a predictive call-out (as determined by the service processor).

During CEC initialization, the service processor checks the VPD values and does not configure CPUs or memory that are marked bad, much in the same way that it would deconfigure them for BIST/POST failures.
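The boot-time check described above can be pictured with a short sketch. It is a simplified model, not service processor firmware: the `Fru` class, the `marked_bad` flag standing in for the VPD record, and the `passes_selftest` callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fru:
    location: str             # hypothetical location label
    kind: str                 # "cpu" or "memory"
    marked_bad: bool = False  # stands in for the failure history kept in the FRU's VPD


def configure_at_boot(frus, passes_selftest):
    """Return the FRUs to configure, skipping any whose record says 'bad'.

    passes_selftest(fru) is a caller-supplied stand-in for BIST/POST testing.
    """
    configured = []
    for fru in frus:
        if fru.marked_bad:
            continue                  # failure history from an earlier boot or runtime
        if not passes_selftest(fru):
            fru.marked_bad = True     # recorded on the FRU itself, so it persists
            continue
        configured.append(fru)
    return configured
```

Because the flag is stored on the `Fru` object rather than in a system-wide table, swapping in a new object models a FRU replacement: the new part starts with a clean record, and the old one keeps its history.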

I/O Expansion (RIO) Recovery

The RIO interface supports packet retry: if it receives no acknowledgment or a bad response, it automatically resends the packet until a time-out threshold is reached.
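A retry loop of this kind might look like the following sketch. RIO retry is done in hardware; the function name `send_with_retry`, the `send_once` callback, and the 50 ms default time-out are assumptions chosen only to illustrate the control flow.

```python
import time


def send_with_retry(send_once, packet, timeout_s=0.05, clock=time.monotonic):
    """Resend packet until it is acknowledged or timeout_s has elapsed.

    send_once(packet) returns True for a good acknowledgment and False for
    no acknowledgment or a bad response. Returns the number of attempts made
    on success and raises TimeoutError once the time-out threshold is reached.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        if send_once(packet):
            return attempts
        if clock() >= deadline:
            raise TimeoutError(f"no good response after {attempts} attempts")
```

The `clock` parameter exists so the time-out behaviour can be exercised with a fake clock instead of real delays.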

RIO also supports a closed loop topology configuration, which is required for RS/6000 products. RIO hubs will automatically attempt to reroute packets through the alternate RIO port if a successful transmission cannot be completed (for example, the retry threshold is exceeded) through the primary port. Therefore, no single link failure in the RIO loop will cause the system to go down, although the failure will be reported for deferred maintenance.
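The rerouting behaviour can be sketched as a failover across the two directions of the loop. As before, this is an illustrative model rather than the hub logic itself: `send_via_loop`, the `(name, send_once)` port pairs, and the retry limit of 3 are hypothetical.

```python
def send_via_loop(ports, packet, max_retries=3):
    """Try each RIO port in order, primary first, then the alternate.

    ports is a list of (name, send_once) pairs, where send_once(packet)
    returns True on a successful transmission. Returns (port_name, reports),
    where reports lists link failures to be handled as deferred maintenance.
    """
    reports = []
    for name, send_once in ports:
        for _ in range(max_retries):
            if send_once(packet):
                return name, reports
        # Retry threshold exceeded on this port: report it and try the next one.
        reports.append(f"link failure on {name} port: deferred maintenance")
    raise ConnectionError("packet could not be delivered on any RIO port")
```

With a failed primary link and a working alternate, the call returns the alternate port's name together with one deferred-maintenance report, so a single link failure does not stop delivery.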

PCI Bus Error Recovery

As described in the PCI slot section, every slot is connected through a PCI-to-PCI bridge chip to a primary PCI bus; thereby, each slot is logically and physically isolated onto its own individual PCI bus. This fact provides a special error

[3] Vital Product Data (VPD)

[4] Field Replaceable Unit (FRU)

[5] Built-in self-test (BIST)

[6] Power-on self-test (POST)

14 RS/6000 7025 Model F80 Technical Overview
