(dynamic
3.2.5 N+1 redundancy
The use of redundant parts allows the system to remain operational with full resources:
Redundant spare memory bits in L1, L2, L3, and main memory
Redundant fans
Redundant power supplies
3.2.6 Fault masking
If corrections and retries succeed and do not exceed threshold limits, the system remains operational with full resources and no client or IBM customer engineer intervention is required:
CEC bus retry and recovery
ECC Chipkill soft error
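The retry behavior described above can be illustrated with a minimal sketch. It is not the firmware's actual recovery logic: the names `TransientError` and `run_with_retry` and the retry limit of 3 are hypothetical, chosen only to show how a transient fault stays masked as long as the retry count remains within its limit.

```python
class TransientError(Exception):
    """Hypothetical stand-in for a recoverable (soft) error."""


def run_with_retry(operation, max_retries=3):
    """Call operation(); retry on TransientError up to max_retries times.

    Returns (result, retries_used). If the limit is exceeded, the error
    is no longer masked and is re-raised to the caller.
    """
    retries = 0
    while True:
        try:
            return operation(), retries
        except TransientError:
            if retries >= max_retries:
                raise  # limit exceeded: the fault becomes visible
            retries += 1
```

A caller that receives a result never sees the transient faults; only the retry count records that they occurred.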
3.2.7 Resource deallocation
If recoverable errors exceed threshold limits, resources can be deallocated with the system remaining operational, allowing deferred maintenance at a convenient time.
Dynamic or persistent deallocation
Dynamic deallocation of potentially failing components is nondisruptive, allowing the system to continue to run.
Dynamic deallocation functions include:
Processor
L3 cache line delete
Partial L2 cache deallocation
For dynamic processor deallocation, the service processor performs a predictive failure analysis based on any recoverable processor errors that have been recorded. If these transient errors exceed a defined threshold, the event is logged and the processor is deallocated from the system while the operating system continues to run. This feature (named CPU Guard) enables maintenance to be deferred until a suitable time. Processor deallocation can occur only if there are sufficient functional processors (at least two).
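The threshold decision described above can be sketched as follows. This is an illustration only, not the service processor's implementation: the class `ProcessorPool`, the threshold value of 3, and the log format are assumptions. The "at least two" condition is interpreted here as requiring two active processors before one can be removed.

```python
from dataclasses import dataclass, field

ERROR_THRESHOLD = 3       # hypothetical; the real threshold is firmware-defined
MIN_FUNCTIONAL_CPUS = 2   # from the text: at least two functional processors


@dataclass
class ProcessorPool:
    active: set
    error_counts: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def record_recoverable_error(self, cpu):
        """Count one recoverable error; return True if cpu was deallocated."""
        if cpu not in self.active:
            return False
        self.error_counts[cpu] = self.error_counts.get(cpu, 0) + 1
        if self.error_counts[cpu] <= ERROR_THRESHOLD:
            return False  # within threshold: no action needed
        # Threshold exceeded: log the event first, then try to deallocate.
        self.log.append(f"cpu{cpu}: recoverable errors exceeded threshold")
        if len(self.active) < MIN_FUNCTIONAL_CPUS:
            return False  # too few processors left to remove one
        self.active.remove(cpu)
        return True
```

With two active processors, the fourth recoverable error on one of them removes it from the pool. The remaining processor is never deallocated, however many errors it logs.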
To verify whether CPU Guard has been enabled, run the following command:
lsattr -El sys0
If CPU Guard is enabled, the output will be similar to:
cpuguard        enable        CPU Guard        True
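For scripted checks, the same test can be automated. The sketch below assumes an AIX host where `cpuguard` is an attribute of the `sys0` device and where `lsattr -E` prints rows in the order attribute, value, description, user-settable flag. The function names `parse_cpuguard` and `cpuguard_enabled` are hypothetical.

```python
import subprocess


def parse_cpuguard(lsattr_output):
    """Return the cpuguard value (for example 'enable') or None if absent."""
    for line in lsattr_output.splitlines():
        # Tolerate '|' column separators that some renderings add.
        fields = line.replace("|", " ").split()
        if len(fields) >= 2 and fields[0] == "cpuguard":
            return fields[1]
    return None


def cpuguard_enabled():
    """Query sys0 on an AIX host (assumption) and report the setting."""
    result = subprocess.run(
        ["lsattr", "-El", "sys0", "-a", "cpuguard"],
        capture_output=True, text=True, check=True,
    )
    return parse_cpuguard(result.stdout) == "enable"
```

Keeping the parsing separate from the command invocation allows the parser to be exercised against captured output on any machine.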