RAS Design Philosophy
Realization of a
Generally, in order to achieve reliability and availability on an open server, clustering would be implemented. However, clustering comes with a price tag. To keep costs at a minimum, the Express5800/1000 series servers were designed to achieve a high level of reliability and availability, but within a single server.
The Express5800/1000 series server’s powerful RAS features were developed through the pursuit of dependable server technology.
Continuous operations throughout failures; minimize the spread of failures; and smooth recovery after failures were goals set forth which lead to implementation of technologies such as memory mirroring, increased redundancy of intricate components, and modularization. Through these technologies a mainframe level of continuous operation was achieved.
Clustering
|
| Reliability | Availability | Serviceability |
Mainflame | Center | No chipset on the center plane |
| |
Level |
| |||
plane |
| |||
|
|
|
|
Improved system availability
Improved reliability and availability as a stand alone server
Dependable Server Technology
Continuous operations through failures
Redundant components, error prediction and error correction allows for continuous operation
Minimized spread of failures
Technology to minimize the effects of hardware failures on the system. Reduction of performance degradation and
Smooth recovery after failures
Ability to replace failed components without
shutting down operations
|
| ECC protection of main | Partial chipset degradation/ |
|
|
| Chipset | data paths Intricate error | Hot Pluggable* | 4 | |
| detectionof the high- | Dynamic recovery |
| ||
|
| speed interconnects |
|
| |
|
|
|
|
| |
|
|
| Duplexed*1 | Hot Pluggable*4 | |
| Clock |
| 16 processor domain | ||
|
|
| segmentation*2 |
|
|
Conventional | Core I/O |
| Core I/O Relief | Hot Pluggable*4 | |
|
|
|
|
| |
open server |
|
|
|
|
|
PCI card |
|
| Hot Pluggable*4 | ||
Level |
|
| |||
| Memory | ECC protection | Memory |
|
|
| SDDC Memory | Mirroring*1 |
|
| |
|
|
|
|
|
|
| CPU | Intel® Cache Safe |
|
|
|
| L3 cache | Technology*3 |
|
|
|
| Power |
| N+1 Redundant | Hot Pluggable*4 | |
|
| Two independent | |||
|
|
| power sources |
|
|
PC Server | HDD |
| Software RAID | Hot Pluggable*4 | |
Level |
|
| Hardware RAID |
|
|
*1 Available only on the 1320Xf/1160Xf
*2 Available only on the 1320Xf
*3 Intel® technology designed to avoid cache based failures
*4 Replacement of failed component without shutting down other partitions.
The
The framework for hardware, firmware and OS error handling
The
MCA provides a 3 stage error handling mechanism – hardware, firmware, and operating system. In the first stage, the CPU and chipset attempt to handle errors through ECC (Error Correcting Code) and parity protection. If the error can not be handled by the hardware, it is then passed to the second stage, where the firmware attempts to resolve the issue. In the third stage, if the error can not be handled by the first two stages, the operating system runs recovery procedures based on the error report and error log that was received. In the event of a critical error, the system will automatically reset, to significantly reduce the possibility of a system failure.
Application Layer
Operating System
The OS logs the error, and then starts the recovery process
Firmware
Seamlessly handles the error
Hardware
CPU and chipset ECC and parity protection
The Firmware and OS aid in the correction of complex platform errors to restore the system Error details are logged, and then a report flow is defined for the OS
Detects and corrects a wide range of hardware errors for main data structures
6