Mainframe-class RAS Features

RAS Design Philosophy

Realization of a mainframe-class continuous operation through the pursuit of reliability and availability in a single server construct

Generally, in order to achieve reliability and availability on an open server, clustering would be implemented. However, clustering comes with a price tag. To keep costs at a minimum, the Express5800/1000 series servers were designed to achieve a high level of reliability and availability, but within a single server.

The Express5800/1000 series server’s powerful RAS features were developed through the pursuit of dependable server technology.

Continuous operations throughout failures; minimize the spread of failures; and smooth recovery after failures were goals set forth which lead to implementation of technologies such as memory mirroring, increased redundancy of intricate components, and modularization. Through these technologies a mainframe level of continuous operation was achieved.

Clustering

 

 

Reliability

Availability

Serviceability

Mainflame

Center

No chipset on the center plane

 

Level

 

plane

 

 

 

 

 

Improved system availability

Improved reliability and availability as a stand alone server

Dependable Server Technology

Continuous operations through failures

Redundant components, error prediction and error correction allows for continuous operation

Minimized spread of failures

Technology to minimize the effects of hardware failures on the system. Reduction of performance degradation and multi-node shutdown

Smooth recovery after failures

Ability to replace failed components without

shutting down operations

 

 

ECC protection of main

Partial chipset degradation/

 

 

 

Chipset

data paths Intricate error

Hot Pluggable*

4

 

detectionof the high-

Dynamic recovery

 

 

 

speed interconnects

 

 

 

 

 

 

 

 

 

 

Duplexed*1

Hot Pluggable*4

 

Clock

 

16 processor domain

 

 

 

segmentation*2

 

 

Conventional

Core I/O

 

Core I/O Relief

Hot Pluggable*4

 

 

 

 

 

open server

 

 

 

 

 

PCI card

 

 

Hot Pluggable*4

Level

 

 

 

Memory

ECC protection

Memory

 

 

 

SDDC Memory

Mirroring*1

 

 

 

 

 

 

 

 

 

CPU

Intel® Cache Safe

 

 

 

 

L3 cache

Technology*3

 

 

 

 

Power

 

N+1 Redundant

Hot Pluggable*4

 

 

Two independent

 

 

 

power sources

 

 

PC Server

HDD

 

Software RAID

Hot Pluggable*4

Level

 

 

Hardware RAID

 

 

*1 Available only on the 1320Xf/1160Xf

*2 Available only on the 1320Xf

*3 Intel® technology designed to avoid cache based failures

*4 Replacement of failed component without shutting down other partitions.

The Dual-Core Intel® Itanium® processor MCA (Machine Check Architecture)

The framework for hardware, firmware and OS error handling

The Dual-Core Intel® Itanium® processor, designed for high-end enterprise servers, not only excels in performance, but is also abundant in RAS features. At the core of the processor’s RAS feature set, is the error handling framework, called MCA.

MCA provides a 3 stage error handling mechanism – hardware, firmware, and operating system. In the first stage, the CPU and chipset attempt to handle errors through ECC (Error Correcting Code) and parity protection. If the error can not be handled by the hardware, it is then passed to the second stage, where the firmware attempts to resolve the issue. In the third stage, if the error can not be handled by the first two stages, the operating system runs recovery procedures based on the error report and error log that was received. In the event of a critical error, the system will automatically reset, to significantly reduce the possibility of a system failure.

Application Layer

Operating System

The OS logs the error, and then starts the recovery process

Firmware

Seamlessly handles the error

Hardware

CPU and chipset ECC and parity protection

The Firmware and OS aid in the correction of complex platform errors to restore the system Error details are logged, and then a report flow is defined for the OS

Detects and corrects a wide range of hardware errors for main data structures

6

Page 6
Image 6
NEC 5800/1000 manual RAS Design Philosophy, Framework for hardware, firmware and OS error handling