Functional Architecture

Intel® Server Board SE7520JR2

3.3.6.5 DIMM Sparing Function

To provide a more fault-tolerant system, the Intel E7520 MCH includes specialized hardware to support fail-over to a spare DIMM device in the event that a primary DIMM in use exceeds a specified threshold of runtime errors. One of the DIMMs installed per channel, at least as large as every other DIMM installed, is not used but kept in reserve. In the event of significant failures in a particular DIMM, its data and that of its corresponding partner in the other channel (if applicable) are copied over time to the spare DIMM(s) held in reserve. When all the data has been copied, the reserve DIMM(s) are put into service and the failing DIMM is removed from service. Only one sparing cycle is supported. If this feature is not enabled, all DIMMs are visible in the normal address space.

Note: The DIMM Sparing feature requires that the spare DIMM be at least the size of the largest primary DIMM in use.
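The size rule above can be illustrated with a minimal sketch. The function name plan_sparing and the list-of-sizes input are hypothetical and used for illustration only. The sketch assumes the largest DIMM in a channel is the one held in reserve, which satisfies the size requirement, and that a channel needs at least two DIMMs for sparing to be meaningful.

```python
def plan_sparing(dimm_sizes_mb):
    """Split one channel's DIMM sizes (in MB) into (spare_mb, primaries_mb).

    Assumption: the largest DIMM is reserved as the spare, so the spare is
    always at least the size of the largest primary DIMM in use.
    """
    if len(dimm_sizes_mb) < 2:
        # Assumed requirement: one primary plus one spare at minimum.
        raise ValueError("sparing needs at least one primary and one spare DIMM")
    ordered = sorted(dimm_sizes_mb)
    spare_mb = ordered[-1]       # largest DIMM is held in reserve
    primaries_mb = ordered[:-1]  # remaining DIMMs stay in the address space
    return spare_mb, primaries_mb


def visible_capacity_mb(dimm_sizes_mb):
    """Capacity left in the normal address space of one channel with sparing on."""
    _, primaries_mb = plan_sparing(dimm_sizes_mb)
    return sum(primaries_mb)
```

For example, a channel populated with 1024 MB, 2048 MB, and 1024 MB DIMMs would, under these assumptions, reserve the 2048 MB DIMM and leave 2048 MB visible.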

Hardware additions for this feature include a tracking register per DIMM to maintain a history of error occurrence, and a programmable register to hold the fail-over error threshold level. The operational model is straightforward: if the fail-over threshold register is set to a non-zero value, the feature is enabled, and if the count of errors on any DIMM exceeds that value, fail-over will commence. The tracking registers themselves are implemented as “leaky buckets,” such that they do not contain an absolute cumulative count of all errors since power-on; rather, they contain an aggregate count of the number of errors received over a running time period. The “drip rate” of the bucket is selectable by software, so it is possible to set the threshold to a value that will never be reached by a “healthy” memory subsystem experiencing the rate of errors expected for the size and type of memory devices in use.
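The leaky-bucket behavior can be modeled in a few lines of software. The following is a sketch only, not a description of the MCH registers: the class name, the field names, and the "one count leaked per drip interval" model are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LeakyBucketCounter:
    """Toy model of one per-DIMM error tracking register (hypothetical names)."""
    threshold: int       # 0 disables the feature, as described in the text
    drip_interval: int   # ticks per one-count leak (assumed drip model)
    count: int = 0
    _ticks: int = 0

    def record_error(self) -> bool:
        """Count one error; return True if fail-over should commence."""
        self.count += 1
        return self.threshold_exceeded()

    def tick(self, ticks: int = 1) -> None:
        """Advance time; the bucket leaks one count per drip_interval ticks."""
        self._ticks += ticks
        leaked, self._ticks = divmod(self._ticks, self.drip_interval)
        self.count = max(0, self.count - leaked)

    def threshold_exceeded(self) -> bool:
        """Enabled only for a non-zero threshold; trips when count exceeds it."""
        return self.threshold != 0 and self.count > self.threshold
```

In this model, a DIMM that sees one error per drip interval never accumulates a count, while a burst of errors arriving faster than the bucket drains crosses the threshold.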

The fail-over mechanism is slightly more complex. Once fail-over has been initiated, the MCH must execute every write twice: once to the primary DIMM and once to the spare. The MCH will also begin tracking the progress of its built-in memory scrub engine. Once the scrub engine has covered every location in the primary DIMM, the duplicate write function will have copied every data location to the spare. At that point, the MCH can switch the spare into primary use and take the failing DIMM off-line.
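The copy sequence can be simulated as follows. This is a simplified sketch under stated assumptions: DIMMs are modeled as Python lists, the class and method names are hypothetical, and the scrub engine is assumed to write each location back after reading it, so that the duplicate write function carries the data to the spare.

```python
class SparingCopyModel:
    """Toy model of the duplicate-write copy from a primary DIMM to a spare."""

    def __init__(self, primary):
        self.primary = list(primary)
        self.spare = [0] * len(self.primary)
        self.scrub_pos = 0             # next location the scrub engine visits
        self.copying = False           # True while every write is duplicated
        self.spare_in_service = False  # True once the spare replaces the primary

    def start_failover(self):
        """Begin duplicating writes and tracking scrub progress from location 0."""
        self.copying = True
        self.scrub_pos = 0

    def write(self, addr, value):
        if self.spare_in_service:
            self.spare[addr] = value   # failing DIMM is off-line
            return
        self.primary[addr] = value
        if self.copying:
            self.spare[addr] = value   # second write of the duplicated pair

    def read(self, addr):
        return self.spare[addr] if self.spare_in_service else self.primary[addr]

    def scrub_step(self):
        """Scrub one location; its write-back is duplicated to the spare."""
        if not self.copying:
            return
        if self.scrub_pos < len(self.primary):
            self.write(self.scrub_pos, self.primary[self.scrub_pos])
            self.scrub_pos += 1
        if self.scrub_pos >= len(self.primary):
            self.copying = False
            self.spare_in_service = True  # switch the spare into primary use
```

Ordinary writes issued during the copy land in both devices, so the spare is consistent regardless of whether the scrub engine has already passed the written location.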

Note that this entire mechanism requires no software support once it has been programmed and enabled, until the threshold detection has been triggered to request a data copy. Hardware will detect the threshold initiating fail-over, and escalate the occurrence of that event as directed (signal an SMI, generate an interrupt, or wait to be discovered via polling). Whatever software routine responds to the threshold detection must select a victim DIMM (in case multiple DIMMs have crossed the threshold prior to sparing invocation) and initiate the memory copy. Hardware will automatically isolate the “failed” DIMM once the copy has completed. The data copy is accomplished by address aliasing within the DDR control interface; it therefore requires neither reprogramming of the DRAM row boundary (DRB) registers nor notification to the operating system that anything has occurred in memory.
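The victim selection step performed by the responding software routine might look like the sketch below. The selection policy is not specified in this document; choosing the DIMM with the highest error count, with ties broken by label order, is an assumption. The function name and the DIMM labels in the usage note are hypothetical.

```python
def select_victim(error_counts, threshold):
    """Return the label of the DIMM to spare out, or None if none qualifies.

    error_counts maps a DIMM label to its current tracking-register count.
    Assumed policy: pick the highest count above the threshold; ties go to
    the label that sorts first.
    """
    if threshold == 0:
        return None  # a zero threshold means the feature is disabled
    over = {dimm: n for dimm, n in error_counts.items() if n > threshold}
    if not over:
        return None
    # max() returns the first maximal element, so sorting the labels first
    # makes the tie-break deterministic.
    return max(sorted(over), key=over.get)
```

With counts of 2, 9, and 7 on three DIMMs labeled "1A", "2A", and "3A" and a threshold of 5, this sketch selects "2A". Because only one sparing cycle is supported, the routine would be expected to run at most once per boot.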

The memory mirroring feature and DIMM sparing are mutually exclusive; only one may be activated during initialization. The selected feature must remain enabled until the next power cycle. There is no provision in hardware to switch from one feature to the other without rebooting, nor is there a provision to “back out” of a feature once enabled without a full reboot.


Revision 1.0

 

C78844-002