operating system has lost control. Mutual surveillance also enables the operating system to monitor service processor activity and to request a service processor repair action if necessary.

Environmental monitoring

Environmental monitoring related to power, fans, and temperature is performed by the System Power Control Network (SPCN). Critical and non-critical environmental conditions generate Early Power-Off Warning (EPOW) events. Critical events (for example, a Class 5 AC power loss) trigger appropriate signals from hardware to the affected components to prevent data loss without operating system or firmware involvement. Non-critical environmental events are logged and reported using Event Scan.

The operating system cannot program or access the temperature threshold using the service processor (SP).

EPOW events can trigger the following actions:

• Temperature monitoring increases the fan rotation speed when the ambient temperature is above the preset operating range.

• Temperature monitoring warns the system administrator of potential environment-related problems. It also performs an orderly system shutdown when the operating temperature exceeds a critical level.

• Voltage monitoring provides a warning and an orderly system shutdown when the voltage is out of operational specification.
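The mapping from environmental readings to the actions listed above can be sketched as follows. This is only an illustration: the threshold values, the `Reading` type, and the action names are assumptions made for the example, not values or interfaces taken from the SPCN or the service processor.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
AMBIENT_MAX_C = 35.0          # top of the preset operating range
CRITICAL_TEMP_C = 45.0        # level at which an orderly shutdown starts
VOLTAGE_RANGE = (11.4, 12.6)  # allowed range for a nominal 12 V rail


@dataclass
class Reading:
    ambient_c: float
    voltage: float


def epow_actions(reading: Reading) -> list[str]:
    """Return the actions a monitor would take for one sensor reading."""
    actions = []
    if reading.ambient_c > CRITICAL_TEMP_C:
        # Critical temperature: warn, then shut down in an orderly way.
        actions += ["warn_administrator", "orderly_shutdown"]
    elif reading.ambient_c > AMBIENT_MAX_C:
        # Above the operating range: speed up the fans and warn.
        actions += ["increase_fan_speed", "warn_administrator"]

    low, high = VOLTAGE_RANGE
    if not low <= reading.voltage <= high:
        # Out-of-specification voltage: warn and shut down, without
        # repeating an action that is already scheduled.
        for action in ("warn_administrator", "orderly_shutdown"):
            if action not in actions:
                actions.append(action)
    return actions
```

A reading inside both ranges yields an empty list, so the monitor takes no action.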

3.2.4 Self-healing

For a system to be self-healing, it must be able to recover from a failing component by first detecting and isolating the failed component, taking it offline, fixing or isolating it, and reintroducing the fixed or replacement component into service without any application disruption. Examples include:

• Bit steering to redundant memory in the event of a failed memory module to keep the server operational

• Bit-scattering, thus allowing for error correction and continued operation in the presence of a complete chip failure (Chipkill™ recovery)

• Single-bit error correction using ECC without reaching error thresholds for main, L2, and L3 cache memory

• L3 cache line deletes extended from 2 to 10 for additional self-healing

• ECC extended to inter-chip connections on fabric and processor bus

• Memory scrubbing to help prevent soft-error memory faults

• Dynamic processor deallocation, in which a deallocated processor can be replaced by an unused CoD processor to keep the system operational
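The detect, isolate, and replace sequence can be illustrated with a small sketch of error thresholding combined with steering to a spare. The `MemoryBank` class, the module names, and the threshold value are hypothetical; the source does not state the thresholds the firmware actually uses.

```python
from collections import Counter

ERROR_THRESHOLD = 3  # hypothetical; the real threshold is not given in the source


class MemoryBank:
    """Toy model of a memory bank with active modules and spares."""

    def __init__(self, modules, spares):
        self.active = list(modules)
        self.spares = list(spares)
        self.corrected = Counter()  # corrected-error count per module

    def record_corrected_error(self, module):
        """Count a corrected error found by scrubbing.

        Returns the spare that replaced the module once its error count
        exceeds the threshold, or None if no replacement happened.
        """
        if module not in self.active:
            return None
        self.corrected[module] += 1
        if self.corrected[module] <= ERROR_THRESHOLD or not self.spares:
            return None
        spare = self.spares.pop(0)
        # Steer traffic to the spare; the failing module goes offline.
        self.active[self.active.index(module)] = spare
        return spare
```

In this model the application keeps running throughout: the failing module is only swapped out of the `active` list after its corrected-error count exceeds the threshold.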

Memory reliability, fault tolerance, and integrity

The p5-570 uses Error Checking and Correcting (ECC) circuitry for system memory to correct single-bit memory failures and to detect double-bit memory failures. Detection of double-bit memory failures helps maintain data integrity. Furthermore, the memory chips are organized such that the failure of any specific memory module affects only a single bit within a four-bit ECC word (bit-scattering), thus allowing for error correction and continued operation in the presence of a complete chip failure (Chipkill recovery). The memory DIMMs also use memory scrubbing and thresholding to determine when spare memory modules within each bank of memory should be used to replace modules that have exceeded their error-count threshold.
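The principle behind single-bit correction, double-bit detection, and bit-scattering can be shown with a small example. The sketch below uses an extended Hamming code with 4 data bits and 4 check bits as a stand-in; it is not the ECC layout of the p5-570 memory subsystem, and the function names and the model of one chip per bit position are assumptions made for illustration.

```python
def encode(nibble: int) -> list[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    c[1..7] follow the Hamming(7,4) layout (parity at positions 1, 2, 4);
    c[0] is an overall parity bit that enables double-error detection.
    """
    c = [0] * 8
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    for pos in range(1, 8):
        c[0] ^= c[pos]
    return c


def decode(word: list[int]):
    """Return (nibble, status); nibble is None if the word is uncorrectable."""
    c = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = 0
    for bit in c:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: treat as a single-bit error.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: double-bit error.
        return None, "uncorrectable"
    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return nibble, status


def fail_chip(words: list[list[int]], chip: int) -> list[list[int]]:
    """Model a whole-chip failure when chip i stores bit i of every codeword."""
    return [[b ^ 1 if i == chip else b for i, b in enumerate(w)] for w in words]
```

Because each chip contributes only one bit to any codeword, calling `fail_chip` on a list of encoded words leaves every word with a single-bit error, which `decode` can correct. Flipping two bits in the same codeword is reported as uncorrectable instead of being silently miscorrected.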

Chapter 3. Capacity on Demand, RAS, and manageability 57


P5 570 specifications

The IBM P5 570 is a high-performance server that was designed for enterprise-scale computing, offering a blend of advanced technologies and a flexible architecture. Launched as part of IBM's Power5 server line, the P5 570 stands out for its robust processing capabilities and extensive scalability, making it a preferred choice for businesses requiring reliable and efficient computing solutions.

At the heart of the P5 570 is the IBM POWER5 processor, which employs simultaneous multi-threading (SMT) technology. This allows each core to run two hardware threads at once, which can improve throughput for workloads that are well suited to multi-threading. The server scales up to 16 POWER5 processor cores, supporting demanding applications that range from databases to complex enterprise resource planning (ERP) systems.

The P5 570 architecture supports a wide range of memory configurations, with a maximum memory capacity of up to 512 GB. Utilizing IBM’s proprietary Chip Memory technology, it can deliver high bandwidth and low latency, significantly enhancing performance for memory-intensive applications. Furthermore, the integrated memory controller architecture optimizes memory access, ensuring that critical workloads run smoothly.

Scalability is a key characteristic of the P5 570, with the ability to expand processing power and memory capacity as an organization’s needs grow. The server supports various operating systems, including AIX, Linux, and IBM i, which provides flexibility for diverse IT environments. This versatility ensures that companies can run their preferred applications without the need for substantial system overhauls.

In terms of storage, the P5 570 utilizes advanced RAID technology and supports a variety of disk configurations, ensuring that data integrity and availability are maintained. Coupled with built-in security features, such as the IBM Trusted Foundation, which establishes a secure boot environment, the P5 570 offers a reliable platform for mission-critical workloads.

Finally, the IBM P5 570 is designed for high availability and redundancy. Features like hot-swappable components and advanced error detection and recovery mechanisms minimize downtime, making it a dependable choice for businesses that operate around the clock. Combined with its powerful hardware and versatile software support, the IBM P5 570 remains a formidable player in the high-performance server arena.