RAS Design Philosophy

Mainframe-class RAS Features

Mainframe-class continuous operation, achieved by pursuing reliability and availability within a single server

Reliability and availability on open servers are generally achieved through clustering, but clustering adds cost. To keep costs to a minimum, the Express5800/1000 series servers were designed to achieve a high level of reliability and availability within a single server.

The Express5800/1000 series server’s powerful RAS features were developed through the pursuit of dependable server technology.

Three goals were set: continuous operation through failures, minimal spread of failures, and smooth recovery after failures. These goals led to the implementation of technologies such as memory mirroring, increased redundancy of intricate components, and modularization. Through these technologies, a mainframe level of continuous operation was achieved.

Clustering: improved system availability

Dependable Server Technology: improved reliability and availability as a stand-alone server

Continuous operation through failures

Redundant components, error prediction, and error correction allow for continuous operation.

Minimized spread of failures

Technology that minimizes the effects of hardware failures on the system, reducing performance degradation and multi-node shutdown.

Smooth recovery after failures

Ability to replace failed components without shutting down operations.

RAS features by component. Rows run from mainframe level (top) through conventional open server level to PC server level (bottom).

Component	Reliability	Availability	Serviceability
Center plane	No chipset on the center plane	-	-
Chipset	ECC protection of main data paths; intricate error detection of the high-speed interconnects	Partial chipset degradation / dynamic recovery	Hot pluggable*4
Clock	-	Duplexed*1; 16-processor domain segmentation*2	Hot pluggable*4
Core I/O	-	Core I/O relief	Hot pluggable*4
PCI card	-	-	Hot pluggable*4
Memory	ECC protection; SDDC memory	Memory mirroring*1	-
CPU L3 cache	Intel® Cache Safe Technology*3	-	-
Power	-	N+1 redundant; two independent power sources	Hot pluggable*4
HDD	-	Software RAID; hardware RAID	Hot pluggable*4

*1 Available only on the 1320Xf/1160Xf

*2 Available only on the 1320Xf

*3 Intel® technology designed to avoid cache-based failures

*4 Replacement of failed component without shutting down other partitions.
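The memory mirroring entry in the feature table can be illustrated with a toy model. This is only a sketch of the general idea (every write goes to two copies, and a read falls back to the mirror when the primary copy is bad). The class `MirroredMemory` and its methods are hypothetical names chosen for illustration and do not describe how the Express5800/1000 chipset implements mirroring.

```python
class MirroredMemory:
    """Toy model of mirroring: two copies of every cell (hypothetical, illustrative only)."""

    def __init__(self, size):
        self.primary = [0] * size
        self.mirror = [0] * size
        self.failed = set()  # addresses whose primary copy is marked bad

    def write(self, addr, value):
        # Every write is applied to both copies so they stay identical.
        self.primary[addr] = value
        self.mirror[addr] = value

    def mark_primary_failed(self, addr):
        # Simulate an uncorrectable error in the primary copy.
        self.failed.add(addr)
        self.primary[addr] = None

    def read(self, addr):
        # Serve the read from the mirror if the primary copy is known bad.
        if addr in self.failed:
            return self.mirror[addr]
        return self.primary[addr]


mem = MirroredMemory(8)
mem.write(3, 0xAB)
mem.mark_primary_failed(3)
print(hex(mem.read(3)))  # prints 0xab: the mirror still holds the data
```

The point of the sketch is that a failure in one copy does not interrupt reads, which is the sense in which mirroring supports continuous operation through failures.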

The Dual-Core Intel® Itanium® processor MCA (Machine Check Architecture)

The framework for hardware, firmware and OS error handling

The Dual-Core Intel® Itanium® processor, designed for high-end enterprise servers, not only excels in performance but also offers abundant RAS features. At the core of the processor’s RAS feature set is the error-handling framework called MCA.

MCA provides a three-stage error-handling mechanism: hardware, firmware, and operating system. In the first stage, the CPU and chipset attempt to handle errors through ECC (Error Correcting Code) and parity protection. If the hardware cannot handle the error, it is passed to the second stage, where the firmware attempts to resolve the issue. If the first two stages cannot handle the error, it reaches the third stage, where the operating system runs recovery procedures based on the error report and error log it received. In the event of a critical error, the system automatically resets, which significantly reduces the possibility of a system failure.
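The escalation order described above can be sketched as a simple chain of handlers. This is a minimal illustration under stated assumptions, not the actual MCA interface: the `Severity` values, the `MachineCheck` record, and the stage functions are hypothetical names, and each stage here "handles" an error purely by matching its severity.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    # Hypothetical categories for illustration; not the Itanium specification's terms.
    CORRECTABLE = 1      # hardware can fix it (for example via ECC)
    FW_RECOVERABLE = 2   # firmware can resolve it
    OS_RECOVERABLE = 3   # the OS must run recovery procedures
    FATAL = 4            # no stage can recover; the system resets


@dataclass
class MachineCheck:
    source: str
    severity: Severity
    log: list = field(default_factory=list)


def hardware_stage(err):
    err.log.append("hardware: ECC/parity check")
    return err.severity is Severity.CORRECTABLE


def firmware_stage(err):
    err.log.append("firmware: attempting recovery")
    return err.severity is Severity.FW_RECOVERABLE


def os_stage(err):
    err.log.append("os: logging error and running recovery")
    return err.severity is Severity.OS_RECOVERABLE


def handle(err):
    """Pass the error up the stages; return the name of the stage that handled it."""
    stages = (("hardware", hardware_stage),
              ("firmware", firmware_stage),
              ("os", os_stage))
    for name, stage in stages:
        if stage(err):
            return name
    # No stage could handle the error: model the automatic system reset.
    err.log.append("system: reset")
    return "reset"


print(handle(MachineCheck("memory", Severity.CORRECTABLE)))  # prints hardware
```

Each stage records what it tried in `err.log`, so an error that reaches the OS stage carries the history of the earlier attempts with it.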

MCA error-handling layers (top to bottom):

Application Layer

Operating System: logs the error, and then starts the recovery process.

Firmware: seamlessly handles the error.

Hardware: CPU and chipset ECC and parity protection. Detects and corrects a wide range of hardware errors for main data structures.

The firmware and OS aid in the correction of complex platform errors to restore the system. Error details are logged, and then a report flow is defined for the OS.
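To show what "detects and corrects" means for an error-correcting code, the following sketch uses the textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. This code is chosen only because it is short. It is an assumption for illustration and is not the ECC scheme used by the processor, chipset, or SDDC memory in these servers. The function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] as the 7-bit codeword p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was detected."""
    c = list(codeword)
    # Recompute each parity check; together they spell out the bad position in binary.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single-bit hardware error at position 5
print(hamming74_decode(word))  # prints ([1, 0, 1, 1], 5)
```

A plain parity bit can only detect a single-bit error. The extra check bits here also locate the error, which is what lets the hardware stage correct it without involving firmware or the OS.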

NEC 5800/1000 manual RAS Design Philosophy, Framework for hardware, firmware and OS error handling

Models: 5800/1000