Investment Protection and Expansion, High Availability

http://techsupport.services.ibm.com/rs6k/fixes.html

If you have problems downloading the latest maintenance level, ask your IBM Business Partner or IBM representative.

Investment Protection and Expansion

The following sections discuss how configurations, upgrades, and design features help you lower your cost of ownership.

High Availability

System reliability is further hardened by the HACMP clustering solution, which is available across the entire range of RS/6000 servers. HACMP exploits redundancy between server resources to keep applications available when a component fails. The Model F80 is available in a high-availability cluster solution package named the HA-F80. This solution consists of the following components:

• Two Model 7025-F80 Enterprise Servers

• AIX Version 4.3.3 operating system (unlimited user license), or later

• HACMP 4.3.1 cluster software, or later

• One 7133-T40 SSA disk subsystem with at least four disk drives

• All necessary redundant hardware and cables

This solution is sold at a price lower than the sum of its parts. Ask your IBM Business Partner or IBM representative for further information.

Reliability, Availability, and Serviceability (RAS) Features

Some RAS features, such as redundant power supplies and N+1 hot-plug fans, have already been discussed. Additional topics are covered in the following sections.

Error Recovery for Caches and Memory

The RS64 III processor L1 cache, the L2 cache, the system busses, and the memory are protected by error correction code (ECC) logic. For the L2 cache and the memory, the ECC logic provides single-bit error correction and double-bit error detection. All recovered error events are reported by an attention interrupt to the service processor, where they are monitored for threshold conditions.
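To illustrate what single-bit correction with double-bit detection means, the following is a minimal sketch using a textbook extended Hamming(8,4) code. It is not the code used in the F80 hardware, which the text above does not specify; the 4-bit data width and the names `encode` and `decode` are assumptions chosen only to keep the example small.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SEC-DED codeword (list of bits).

    Index 0 holds the overall parity bit; indexes 1..7 are the classic
    Hamming positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd total parity: exactly one bit flipped. The syndrome is its
        # position (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even total parity but nonzero syndrome: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                            # simulate a single-bit fault
    print(decode(word))
```

In this sketch, a "corrected" result corresponds to the kind of recovered error event that would be reported for threshold monitoring, while "uncorrectable" corresponds to a detected double-bit error.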

The standard memory card has single-error-correct, double-error-detect ECC circuitry to correct single-bit memory failures. The double-bit detection helps maintain data integrity by detecting and reporting multiple errors beyond what the ECC circuitry can correct. In many cases (for example, when using DIMMs with 18 DRAM chips and when memory is configured in quads), memory chips are organized so that the failure of any specific memory module affects only a single bit within an ECC word (bit scattering). This allows error correction and continued operation even in the presence of a complete chip failure (chip kill recovery).
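The effect of bit scattering can be sketched by counting how many bits of each ECC word are lost when one chip fails. The geometry below (4 words of 8 bits spread over 8 chips) is purely hypothetical and much smaller than real memory; the layout functions are illustrative names, not a description of the F80 memory card.

```python
WORDS = 4            # ECC words sharing the same set of chips (hypothetical)
BITS_PER_WORD = 8    # bits per ECC word (hypothetical; real words are wider)
CHIPS = 8            # number of memory chips
BITS_PER_CHIP = WORDS * BITS_PER_WORD // CHIPS


def scattered_layout():
    """Bit scattering: chip c holds bit c of every word."""
    return {(w, b): b for w in range(WORDS) for b in range(BITS_PER_WORD)}


def packed_layout():
    """Counter-example: each chip holds adjacent bits of the same word."""
    layout = {}
    for w in range(WORDS):
        for b in range(BITS_PER_WORD):
            linear = w * BITS_PER_WORD + b
            layout[(w, b)] = linear // BITS_PER_CHIP
    return layout


def bad_bits_per_word(layout, failed_chip):
    """Count the bits of each word that live on the failed chip."""
    counts = [0] * WORDS
    for (w, _bit), chip in layout.items():
        if chip == failed_chip:
            counts[w] += 1
    return counts


def survives_any_chip_failure(layout):
    """True if every word loses at most one bit, whichever chip fails."""
    return all(max(bad_bits_per_word(layout, c)) <= 1 for c in range(CHIPS))


if __name__ == "__main__":
    print("scattered:", bad_bits_per_word(scattered_layout(), failed_chip=0))
    print("packed:   ", bad_bits_per_word(packed_layout(), failed_chip=0))
```

With the scattered layout, a whole-chip failure costs each word a single bit, which single-bit ECC can correct. With the packed layout, the same failure removes several bits from one word, which exceeds what single-bit correction can repair.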

Another feature, memory scrubbing, is a built-in hardware function that performs continuous background reads of data from memory, checking for correctable errors. Correctable errors are corrected and rewritten to memory, and a threshold counter is maintained that signals the service processor with a special attention when the threshold is exceeded.
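The scrubbing loop described above can be modeled in a few lines. This is a software sketch of a hardware function, under stated assumptions: the class `ScrubbedMemory` and its fields are hypothetical, and the `expected` list stands in for whatever value real ECC logic would reconstruct from the check bits.

```python
class ScrubbedMemory:
    """Toy memory with a background scrubber and a correctable-error threshold."""

    def __init__(self, size, threshold):
        self.cells = [0] * size      # raw contents, may hold soft errors
        self.expected = [0] * size   # stand-in for the ECC-reconstructed value
        self.threshold = threshold   # corrections tolerated before signaling
        self.corrected = 0           # running count of corrected errors
        self.attention = False       # models the attention to the service processor

    def write(self, addr, value):
        self.cells[addr] = value
        self.expected[addr] = value

    def inject_error(self, addr, bit):
        """Flip one stored bit to simulate a soft error."""
        self.cells[addr] ^= 1 << bit

    def scrub_pass(self):
        """Read every cell once; fix single-bit errors and rewrite them."""
        for addr, raw in enumerate(self.cells):
            diff = raw ^ self.expected[addr]
            single_bit = diff != 0 and (diff & (diff - 1)) == 0
            if single_bit:
                self.cells[addr] = self.expected[addr]   # corrected and rewritten
                self.corrected += 1
                if self.corrected > self.threshold:      # threshold exceeded
                    self.attention = True


if __name__ == "__main__":
    mem = ScrubbedMemory(size=8, threshold=2)
    for addr in range(8):
        mem.write(addr, addr * 17)
    mem.inject_error(1, 0)
    mem.scrub_pass()
    print(mem.corrected, mem.attention)
```

Real hardware runs this kind of pass continuously; here each call to `scrub_pass` represents one sweep over memory, and the counter persists across sweeps so that repeated corrections eventually raise the attention flag.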

IBM RS/6000 7025 Model F80 Server 13
