IBM 755 manual Memory error correction extensions, Redundancy for array self-healing


Memory error correction extensions

The memory has single-bit-error correction and double-bit-error detection ECC circuitry. The ECC code is also designed so that the failure of any single memory module within an ECC word can be corrected, provided no other fault is present.
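The single-bit-correct, double-bit-detect behavior described above can be illustrated with a small extended Hamming(8,4) code. This is only a teaching sketch: the announcement does not state which ECC code the memory uses, the function names `encode` and `decode` are hypothetical, and the sketch does not model the whole-module correction capability.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1, 2 and 4 hold the
    Hamming parity bits; indices 3, 5, 6 and 7 hold the data bits.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the word is uncorrectable."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits (0 if valid).
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 1 means an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit error: the syndrome names the flipped position
        # (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    d = [code[3], code[5], code[6], code[7]]
    return status, sum(bit << i for i, bit in enumerate(d))
```

Flipping any one bit of `encode(n)` and passing the result to `decode` yields status `"corrected"` with the original value; flipping any two bits yields `"uncorrectable"`.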

Memory protection features include scrubbing to detect errors, a means to request the deallocation of memory pages when a pattern of correctable errors is detected, and signaling the deallocation of a logical memory block when an error occurs that the ECC code cannot correct.
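The policy in the preceding paragraph can be sketched as a small bookkeeping class. The class name `ScrubMonitor`, the threshold value, and the returned action strings are all assumptions made for illustration; the announcement does not specify what counts as a "pattern" of correctable errors.

```python
from collections import Counter

CE_THRESHOLD = 3  # hypothetical: correctable errors per page before action


class ScrubMonitor:
    """Tracks scrub results and decides which deallocation to request."""

    def __init__(self, threshold=CE_THRESHOLD):
        self.threshold = threshold
        self.ce_counts = Counter()       # correctable errors seen per page
        self.deallocated_pages = set()
        self.deallocated_blocks = set()

    def report(self, page, block, correctable):
        """Record one error found by scrubbing and return the action taken."""
        if not correctable:
            # ECC could not correct the error: give up the whole
            # logical memory block that contains it.
            self.deallocated_blocks.add(block)
            return "deallocate_block"

        self.ce_counts[page] += 1
        if (self.ce_counts[page] >= self.threshold
                and page not in self.deallocated_pages):
            # Repeated correctable errors on one page: ask for the
            # page to be retired before the fault gets worse.
            self.deallocated_pages.add(page)
            return "request_page_deallocation"
        return "corrected"
```

With the default threshold, the third correctable error reported for the same page returns `"request_page_deallocation"`, while a single uncorrectable error immediately returns `"deallocate_block"`.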

Redundancy for array self-healing

Although the most likely failure event in a processor is a soft single-bit error in one of its caches, other events can occur, and they need to be distinguished from one another. For caches and their directories, hardware and firmware keep track of whether errors are being corrected beyond a threshold. If the threshold is exceeded, a deferred repair error log is created.

Caches and directories on the POWER7 chip are manufactured with spare bits in their arrays that can be accessed via programmable steering logic to replace faulty bits in the respective arrays. This is analogous to the redundant bit steering employed in main storage, a mechanism that is designed to help avoid physical repair and that is also implemented in POWER7 systems. The steering logic is activated during processor initialization and is initiated by the built-in self-test (BIST) at power-on time.
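The idea of steering a faulty bit to a spare can be modeled in a few lines. This is a toy software model, not a description of the POWER7 hardware: the class `SteeredArray`, the `stuck` fault-injection table, and the two-pattern BIST-like pass are all assumptions made for illustration.

```python
class SteeredArray:
    """One array row with `width` logical bits plus spare physical cells."""

    def __init__(self, width, spares=1):
        self.width = width
        self.cells = [0] * (width + spares)  # spare cells sit at the end
        self.stuck = {}    # fault injection: physical index -> stuck value
        self.steer = {}    # steering map: logical bit -> spare physical index
        self.free_spares = list(range(width, width + spares))

    def _phys(self, bit):
        # Steered bits are redirected to their spare cell.
        return self.steer.get(bit, bit)

    def write(self, bits):
        for i, b in enumerate(bits):
            self.cells[self._phys(i)] = b

    def read(self):
        out = []
        for i in range(self.width):
            p = self._phys(i)
            out.append(self.stuck.get(p, self.cells[p]))
        return out

    def self_test(self):
        """BIST-like pass: write all-0 and all-1, steer any bit that fails.

        Returns the list of logical bit positions that were repaired.
        """
        repaired = []
        for pattern in (0, 1):
            self.write([pattern] * self.width)
            for i, b in enumerate(self.read()):
                if b != pattern and i not in self.steer:
                    if not self.free_spares:
                        raise RuntimeError("no spare bits left")
                    self.steer[i] = self.free_spares.pop(0)
                    repaired.append(i)
        return repaired
```

After injecting a stuck-at fault with `array.stuck[2] = 1` and calling `array.self_test()`, reads and writes of logical bit 2 go to the spare cell, so the array returns exactly what was written.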

When the number of correctable cache errors exceeds a set threshold, systems using the POWER7 processor invoke a dynamic cache line delete function, which enables them to stop using the bad cache line and eliminates exposure to greater problems.

Fault monitoring functions

When a POWER7-based system is powered on, BIST and POST (power-on self-test) check processor, cache, memory, and associated hardware required for proper booting of the operating system. If a noncritical error is detected or if the errors occur in resources that can be removed from the system configuration, the restarting process is designed to proceed to completion. The errors are logged in the system nonvolatile RAM (NVRAM).

Disk drive fault tracking is designed to alert the system administrator of an impending disk drive failure before it impacts customer operation.

Mutual surveillance

The Service Processor monitors the operation of the firmware during the boot process, and also monitors the Hypervisor for termination. The Hypervisor monitors the Service Processor and will perform a reset/reload if it detects the loss of the Service Processor. If the reset/reload does not correct the problem with the Service Processor, the Hypervisor will notify the operating system and the operating system can take appropriate action, including calling for service.
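The escalation path on the Hypervisor side (detect loss, reset/reload, then notify the operating system) can be sketched as a single supervision step. The function name and the three callables passed to it are hypothetical stand-ins; the announcement does not describe how loss of the Service Processor is actually detected.

```python
def supervise_service_processor(is_alive, reset_reload, notify_os):
    """Run one supervision step and return the resulting state.

    is_alive:     callable returning True if the Service Processor responds
    reset_reload: callable that attempts a reset/reload
    notify_os:    callable taking a message for the operating system
    """
    if is_alive():
        return "healthy"

    # Loss detected: try a reset/reload first.
    reset_reload()
    if is_alive():
        return "recovered"

    # Reset/reload did not help: hand the decision to the operating
    # system, which may call for service.
    notify_os("Service Processor unresponsive after reset/reload")
    return "escalated"
```

Passing the checks and actions in as callables keeps the escalation order visible and lets it be exercised with simple stand-ins.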

Environmental monitoring functions

POWER7-based servers include a range of environmental monitoring functions:

Temperature monitoring warns the system administrator of potential environmental-related problems by monitoring the air inlet temperature. When the inlet temperature rises above a warning threshold, the system initiates an orderly shutdown. When the temperature exceeds the critical level, or if the temperature remains above the warning level for too long, the system will shut down immediately.

Fan speed is controlled by monitoring actual temperatures on critical components and adjusting accordingly. If internal component temperatures reach critical levels, the system will shut down immediately, regardless of fan speed. When a

IBM United States Hardware Announcement 110-008

IBM is a registered trademark of International Business Machines Corporation
