Self-healing, N+1 redundancy | IBM DS8000 specification

generate Early Power-Off Warning (EPOW) events. Critical events (for example, a Class 5 AC power loss) trigger appropriate signals from hardware to the affected components to prevent any data loss without operating system or firmware involvement. Non-critical environmental events are logged and reported using Event Scan. The operating system cannot program or access the temperature threshold using the SP.

Temperature monitoring is also performed. If the ambient temperature goes above a preset operating range, then the rotation speed of the cooling fans can be increased. Temperature monitoring also warns the internal microcode of potential environment-related problems. An orderly system shutdown will occur when the operating temperature exceeds a critical level.
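The escalation described above (normal operation, faster fans, then an orderly shutdown) can be sketched as a simple threshold check. This is only an illustration: the threshold values and the names `OPERATING_MAX_C`, `CRITICAL_C`, and `thermal_action` are assumptions, since the source gives no temperatures and does not describe the firmware interface.

```python
from enum import Enum


class Action(Enum):
    NORMAL = "normal"
    INCREASE_FAN_SPEED = "increase_fan_speed"
    ORDERLY_SHUTDOWN = "orderly_shutdown"


# Hypothetical thresholds in degrees Celsius; the source does not state values.
OPERATING_MAX_C = 35.0
CRITICAL_C = 50.0


def thermal_action(temperature_c: float) -> Action:
    """Map a temperature reading to the response described in the text."""
    if temperature_c > CRITICAL_C:
        # Above the critical level: shut the system down in an orderly way.
        return Action.ORDERLY_SHUTDOWN
    if temperature_c > OPERATING_MAX_C:
        # Above the preset operating range: spin the cooling fans faster.
        return Action.INCREASE_FAN_SPEED
    return Action.NORMAL
```

The critical check comes first so that a reading above both thresholds always results in a shutdown rather than only a fan-speed change.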

Voltage monitoring provides warning and an orderly system shutdown when the voltage is out of operational specification.

Self-healing

For a system to be self-healing, it must be able to recover from a failing component by first detecting and isolating the failed component. It should then be able to take it offline, fix or isolate it, and then reintroduce the fixed or replaced component into service without any application disruption. Examples include:

Bit steering to redundant memory in the event of a failed memory module to keep the server operational

Bit scattering, thus allowing for error correction and continued operation in the presence of a complete chip failure (Chipkill™ recovery)

Single-bit error correction using ECC without reaching error thresholds for main, L2, and L3 cache memory

L3 cache line deletes extended from 2 to 10 for additional self-healing

ECC extended to inter-chip connections on fabric and processor bus

Memory scrubbing to help prevent soft-error memory faults

Dynamic processor deallocation

Memory reliability, fault tolerance, and integrity

The p5 570 uses Error Checking and Correcting (ECC) circuitry for system memory to correct single-bit memory failures and to detect double-bit memory failures. Detecting double-bit failures helps maintain data integrity. Furthermore, the memory chips are organized so that the failure of any specific memory module affects only a single bit within a four-bit ECC word (bit-scattering), which allows error correction and continued operation in the presence of a complete chip failure (Chipkill recovery).

The memory DIMMs also use memory scrubbing and thresholding to determine when spare memory modules within each bank of memory should replace modules that have exceeded their error-count threshold (dynamic bit-steering). Memory scrubbing is the process of reading the contents of memory during idle time and passing the data through the ECC logic to check for and correct any single-bit errors that have accumulated. This is a hardware function on the memory controller chip and does not influence normal system memory performance.
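The thresholding behavior can be sketched as per-module error counters that trigger a swap to a spare. This is a software model for illustration only: the real function runs in hardware on the memory controller, and the names `Bank`, `scrub_pass`, and the value of `ERROR_THRESHOLD` are assumptions not taken from the source.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical limit on corrected errors per module; the source gives no value.
ERROR_THRESHOLD = 3


@dataclass
class Bank:
    active: list[str]                      # modules currently in service
    spares: list[str]                      # unused spare modules
    error_counts: dict[str, int] = field(default_factory=dict)

    def record_corrected_error(self, module: str) -> Optional[str]:
        """Count one corrected error; return the spare steered in, if any."""
        count = self.error_counts.get(module, 0) + 1
        self.error_counts[module] = count
        if count > ERROR_THRESHOLD and module in self.active and self.spares:
            spare = self.spares.pop(0)
            self.active[self.active.index(module)] = spare
            return spare
        return None


def scrub_pass(bank: Bank, corrected: list[str]) -> list[tuple[str, str]]:
    """Process one scrub pass.

    `corrected` names the module behind each single-bit error that the ECC
    logic fixed during the pass. Returns the (failing, spare) swaps made.
    """
    swaps = []
    for module in corrected:
        spare = bank.record_corrected_error(module)
        if spare is not None:
            swaps.append((module, spare))
    return swaps
```

For example, with `Bank(active=["m0", "m1"], spares=["s0"])`, three corrected errors on `m0` cause no change, and the fourth exceeds the threshold and replaces `m0` with `s0`.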

N+1 redundancy

The use of redundant parts, specifically the following ones, allows the p5 570 to remain operational with full resources:

Redundant spare memory bits in L1, L2, L3, and main memory

Redundant fans

Redundant power supplies
