IBM F80 manual Dynamic CPU Deallocation, Persistent CPU and Memory Deconfiguration


Dynamic CPU Deallocation

The processors are continuously monitored for errors such as L2 cache ECC errors. When a predefined error threshold is met, an error log with warning severity and threshold exceeded status is returned to AIX. At the same time, the service processor marks the CPU for deconfiguration at the next boot. In the meantime, AIX will attempt to migrate all resources associated with that processor (tasks, interrupts, etc.) to another processor, and then stop the failing processor.

Dynamic CPU deallocation is only active in systems with more than two processors. Device drivers and kernel extensions that are common to multi-processor and uni-processor systems would otherwise change to uni-processor mode if deallocation left a single processor running, with unpredictable results.
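The flow described above can be sketched as a small model. This is only an illustration, not AIX or service processor code: the `Cpu` class, the `report_recovered_error` function, and the threshold value of 3 are all assumptions made for the example, and the sketch counts running processors when applying the "more than two processors" rule.

```python
from dataclasses import dataclass, field

# Hypothetical value: the source does not state the real error threshold.
ERROR_THRESHOLD = 3


@dataclass
class Cpu:
    cpu_id: int
    recovered_errors: int = 0
    running: bool = True
    deconfigure_at_next_boot: bool = False
    tasks: list = field(default_factory=list)


def report_recovered_error(cpus, cpu_id):
    """Count one recovered error on a CPU; return True if that CPU was stopped."""
    cpu = next(c for c in cpus if c.cpu_id == cpu_id)
    cpu.recovered_errors += 1
    if cpu.recovered_errors < ERROR_THRESHOLD:
        return False

    # Service processor role: mark the CPU for deconfiguration at the next boot.
    cpu.deconfigure_at_next_boot = True

    # Operating system role: dynamic deallocation is only active with more
    # than two processors (assumption: counted here as running processors).
    active = [c for c in cpus if c.running]
    if not cpu.running or len(active) <= 2:
        return False

    # Migrate the failing CPU's work to another processor, then stop it.
    target = next(c for c in active if c is not cpu)
    target.tasks.extend(cpu.tasks)
    cpu.tasks.clear()
    cpu.running = False
    return True
```

On a three-processor model the third error stops the failing CPU and moves its tasks. On a two-processor model the CPU is only marked for deconfiguration at the next boot and keeps running.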

Persistent CPU and Memory Deconfiguration

CPUs and memory modules with a failure history are marked bad to prevent them from being configured on subsequent boots. This history is kept in the Vital Product Data (VPD) records on the Field Replaceable Unit (FRU), so the information moves physically with the FRU: replacing the FRU clears the history from the system, and the history stays with the failed FRU when it is returned to IBM. A CPU or memory module is marked bad when:

It fails built-in self-test (BIST) or power-on self-test (POST) testing during boot (as determined by the service processor).

It causes a machine check or check stop during runtime and the failure can be isolated specifically to that CPU or memory module (as determined by the service processor).

It reaches a threshold of recovered failures (for example, the ECC-correctable L2 cache errors described under Dynamic CPU Deallocation) that results in a predictive call-out (as determined by the service processor).

During CEC initialization, the service processor checks the VPD values and does not configure CPUs or memory that are marked bad, much in the same way that it would deconfigure them for BIST/POST failures.
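A minimal sketch of this boot-time check follows. It is a model under stated assumptions, not service processor firmware: the `Fru` class, the `marked_bad` flag standing in for the VPD failure history, and the `configure_at_boot` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fru:
    name: str
    # Stands in for the failure history kept in the FRU's own VPD records.
    marked_bad: bool = False


def configure_at_boot(frus, passes_self_test):
    """Return the names of the FRUs that get configured for this boot.

    passes_self_test is a caller-supplied function modelling BIST/POST.
    """
    configured = []
    for fru in frus:
        if fru.marked_bad:
            # Persistent deconfiguration: a past failure keeps the FRU out.
            continue
        if not passes_self_test(fru):
            # A self-test failure marks the FRU bad for later boots as well.
            fru.marked_bad = True
            continue
        configured.append(fru.name)
    return configured
```

Because the flag lives on the `Fru` object, swapping in a new `Fru` instance models replacing the part: the replacement starts with no failure history and is configured again.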

I/O Expansion (RIO) Recovery

The RIO interface supports packet retry: if a packet receives no acknowledgment or a bad response, the interface automatically resends it until a time-out threshold is reached.
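The retry behavior can be illustrated with a short sketch. RIO retry is implemented in hardware, so this is only a model: the `send_with_retry` function, the `"ack"` reply value, and the use of an attempt count in place of the time-out threshold are assumptions made for the example.

```python
def send_with_retry(send_once, packet, max_attempts=4):
    """Resend a packet until it is acknowledged or the attempts run out.

    send_once is a caller-supplied function that returns "ack" on success;
    any other value models a missing acknowledgment or a bad response.
    max_attempts is a hypothetical stand-in for the time-out threshold.
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        if send_once(packet) == "ack":
            return attempt
    raise TimeoutError(f"no good acknowledgment after {max_attempts} attempts")
```

A caller that sees `TimeoutError` knows the threshold was exceeded on that link and can treat the link as failed.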

RIO also supports a closed-loop topology configuration, which is required for RS/6000 products. RIO hubs will automatically attempt to reroute packets through the alternate RIO port if a transmission cannot be completed through the primary port (for example, because the retry threshold is exceeded). Therefore, no single link failure in the RIO loop will cause the system to go down, although the failure will be reported for deferred maintenance.
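The rerouting rule can be modelled in a few lines. This is a sketch under assumptions, not the RIO hub logic: the `transmit` function, the two caller-supplied port functions, and the `retry_threshold` value are hypothetical.

```python
def transmit(packet, primary_send, alternate_send, retry_threshold=3):
    """Try the primary port first, then reroute through the alternate port.

    Each *_send function returns True when the packet is delivered.
    Returns (port_used, deferred_maintenance), where deferred_maintenance
    is True if the primary link failed and should be reported for repair.
    """
    for port_name, send in (("primary", primary_send), ("alternate", alternate_send)):
        for _ in range(retry_threshold):
            if send(packet):
                return port_name, port_name == "alternate"
    # Both directions around the loop failed: more than a single link fault.
    raise ConnectionError("packet could not be delivered through either RIO port")
```

With a failed primary link, `transmit` still delivers the packet and flags the failure, which mirrors the claim that a single link failure does not bring the system down.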

PCI Bus Error Recovery

As described in the PCI slot section, every slot is connected through a PCI-to-PCI bridge chip to a primary PCI bus; thereby, each slot is logically and physically isolated onto its own individual PCI bus. This fact provides a special error

VPD: Vital Product Data

FRU: Field Replaceable Unit

BIST: Built-in self-test

POST: Power-on self-test

RS/6000 7025 Model F80 Technical Overview
