Memory Mirroring

Continuous operation even in the event of a non-correctable memory error

The Express5800/1000 series server supports high-level memory RAS features to ensure that the server can rapidly detect memory errors, reduce multi-bit errors and continually operate even in the event of memory chip or memory controller failures. Memory scan, memory chip sparing (SDDC*) and memory scrubbing are examples of those features.

A memory scan is run on all loaded memory modules at each OS

boot. If the system detects a memory failure, the failed component is immediately isolated and detached from the system preventing possible downtime during business operations.

Chip sparing (SDDC*) memory is a memory system loaded with several DRAM chips that can correct errors at the chip level. If a failure were to occur in the memory, the error can be corrected immediately to allow for continuous operation.

CPU

CPU

 

CPU

CPU

Cell

Memory

 

Memory

 

Controller

I/F

 

Controller

 

 

Memory

 

Memory

Mirror

 

I/F

 

Controller

 

 

 

 

 

Memory

Mirror

Memory

 

 

I/F

Controller

 

 

Memory

 

Memory

 

 

I/F

 

Controller

 

Memory

Image

Data 0

 

Data 2

 

 

 

 

 

 

Data 1

 

Data 3

 

 

 

 

 

 

Data 0

 

Data 2

 

 

 

 

 

 

Data 1

 

Data 3

 

 

 

Memory scrubbing checks memory content regularly (every few milliseconds) during operation without affecting performance. When an error is detected, it is corrected and then reported. The scrubbing function is effective in detecting errors in a timely manner which ultimately results in the reduction of multi-bit errors.

Memory mirroring takes place continuously, where the same data is written onto 2 separate memory blocks instead of 1 (available only on the 1160Xf and 1320Xf). In the event of a non-correctable error, due to the fact that the data exists on two independent blocks, operations are able to continue without interruption.

 

 

Unit of degradation

Components covered by

Components covered by

on the Express5800/

the memory mirroring

the standard chip sparing

1000 Series

 

This construct allows for continuous operation through all non- correctablememory errors, not limited to the memory themselves, but also in the memory interfaces and the in memory controllers.

* Single Device Data Correction

Partial Chipset degradation

Avoid multi-partition shutdowns resulting from chipset failures

In certain instances when multiple server partitions share a common crossbar controller, effects of a single partition failure may result in a multi-partition shutdown. To resolve this issue, the Express5800/1000 series servers have been designed to allow for the partial degradation of chipsets.

Within each of the LSI chips, which make up the chipset, multiple LSI sub-units exist. These sub-units are connected to other sub- units located on separate LSI chips. The combined sub-units together make up single partition. If an error were to occur on an LSI sub-unit, that sub-unit alone can be degradated to isolate the failure to a single partition, thus preventing the failure to spread to other partitions.

Furthermore, the downed partition can automatically reboot itself, after isolating the failed subsystem, to resume operations in a degradated mode without the intervention of a system administrator. This is made possible, on the Express5800/1000 series servers, by the redundant paths between the Cells and the IO.

0

Cell 0

Partial degradation

Failure

Unit

SubUnit

Crossbar

Sub

Sub

Controller

A

Unit

Unit

 

PCIBox

0

01 n specifies the partition number

Sub-units within the chipset Sub Additional sub-sets exist in Unit actuality

1

Cell 1

Sub

Sub

ControllerCrossbar

Unit

Unit

Sub

Sub

B

Unit

Unit

 

PCIBox

1

0Failure occurs at the sub-unit of the crossbar controller.

Partition 0 is shutdown so that the failed component can be isolated. Partition 0 is rebooted

1Not affected

7

Page 7
Image 7
NEC 5800/1000 manual Memory Mirroring, Partial Chipset degradation