6 Data Integrity and Error Handling

6.1 Integrity

This chapter describes the errors the chipset can detect and how they are handled. Error handling requires catching the error, containing it, notifying the system, and then recovering or restarting the system. Different platforms have different requirements for error handling. A server is most interested in containment: it wants bad data to be stopped before it reaches the network or the disk. Workstations with graphics, on the other hand, may be less interested in containment. If the screen blips for one frame and the blip is gone in the next frame, the error is transient and may not even be noticed.

The 460GX chipset will attempt to accommodate both philosophies. It will allow certain errors to be masked off, or turned into simple interrupts instead of fatal errors. Fatal errors are those that require a reboot, e.g., BINIT#. Some errors will always be fatal, such as protocol errors or cases where the chipset has lost synchronization of queues or events. The user (OEM or operating system) can decide the behavior for data errors. These may be treated as fatal, for maximum containment, or they may simply be reported as an interrupt while the system continues as best it can. If the data is moving to graphics, an error may go unnoticed. It is also possible that bad data entering memory is never used, and therefore never shows up as an error to any user.

Errors will not be individually maskable. In general there are only two modes: aggressive and non-aggressive. In aggressive mode, every error (parity, protocol, queue management) will be considered fatal and lead to a BINIT#. In non-aggressive mode, many errors will be reported as interrupts and will not cause BINIT#. Even in non-aggressive mode, when the chipset encounters an error that leaves it unable to handle a transaction, or the chips appear to be out of sync with each other, it will assert BINIT#.
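The two modes can be pictured as a small decision function. The following is only an illustrative sketch: the names `ErrorClass`, `Response`, `ALWAYS_FATAL`, and `respond` are hypothetical and do not correspond to any 460GX register or field, and the three error classes are a simplification of the categories named in the text.

```python
from enum import Enum, auto


class ErrorClass(Enum):
    """Hypothetical grouping of the error kinds named in the text."""
    DATA_PARITY = auto()   # data error (parity or ECC)
    PROTOCOL = auto()      # bus protocol violation
    QUEUE_SYNC = auto()    # chipset lost synchronization of queues or events


class Response(Enum):
    BINIT = auto()         # fatal: bus initialization, system must restart
    INTERRUPT = auto()     # reported, system continues as best it can


# Assumption: these classes stay fatal regardless of the selected mode.
ALWAYS_FATAL = {ErrorClass.PROTOCOL, ErrorClass.QUEUE_SYNC}


def respond(error: ErrorClass, aggressive: bool) -> Response:
    """Return the modeled chipset response to an error in the given mode."""
    if aggressive or error in ALWAYS_FATAL:
        return Response.BINIT
    return Response.INTERRUPT
```

In this model only data errors change behavior between the two modes, which matches the statement that the user decides the behavior for data errors while protocol and synchronization errors are always fatal.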

The chipset will report errors at the point of use rather than at the point of generation. Both the processor and the chipset may ‘poison’ data. If the processor has an internal cache error, it may write out the data with bad ECC. If the chipset detects bad parity on I/O data, it will poison the data as it is passed along. In both cases the data will be put in memory with bad ECC. If the data is never used, no error is reported. If it is used, the error is detected at that point.
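The report-at-use behavior can be modeled with a toy memory that stores a poison mark alongside each value. This is a conceptual sketch only: the class `PoisonedMemory` and its methods are hypothetical, and a boolean flag stands in for the bad ECC that real hardware would store.

```python
class PoisonedMemory:
    """Toy memory in which bad data is stored silently and reported on read."""

    def __init__(self):
        self._cells = {}      # address -> (data, poisoned)
        self.reported = []    # addresses at which an error was reported

    def write(self, addr, data, good=True):
        # Bad data is stored with a poison mark; nothing is reported here.
        self._cells[addr] = (data, not good)

    def read(self, addr):
        data, poisoned = self._cells[addr]
        if poisoned:
            # The error surfaces only when the data is actually consumed.
            self.reported.append(addr)
        return data, poisoned


mem = PoisonedMemory()
mem.write(0x100, 0xAB, good=False)   # poisoned write, no report yet
mem.write(0x200, 0xCD)               # clean write
mem.read(0x200)                      # clean read, still no report
```

If address `0x100` is later overwritten with good data before anything reads it, the model never reports an error, which mirrors the case of bad data that is never used.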

The 460GX chipset will isolate error reporting as close to the source of the error as possible. In some cases the error can be isolated to a failing DRAM or PCI card; in others, only to a PCI bus or Expander port.

6.1.1 System Bus

The 460GX chipset provides ECC generation on data delivered to the system bus, and ECC checking of data accepted from the system bus. Single-bit errors are corrected. Data with a multi-bit error is written into the DRAMs with bad ECC (poisoned data) or sent to I/O with bad parity.
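The distinction between a correctable single-bit error and an uncorrectable multi-bit error can be illustrated with a small single-error-correct, double-error-detect code. The sketch below uses a toy extended Hamming code over 4 data bits; this section does not specify the code the chipset uses on the system bus, so this is not the chipset's ECC, and all names (`encode`, `decode`, `DATA_POSITIONS`) are hypothetical.

```python
from functools import reduce
from operator import xor

# Bit 0 is the overall parity bit; bits 1..7 are Hamming(7,4) positions,
# with check bits at positions 1, 2, 4 and data at the positions below.
DATA_POSITIONS = (3, 5, 6, 7)


def _syndrome(bits):
    """XOR of the positions (1..7) whose bit is set; 0 for a valid word."""
    return reduce(xor, (i for i in range(1, 8) if bits[i]), 0)


def encode(nibble):
    """Encode a 4-bit value into an 8-bit SEC-DED codeword (list of bits)."""
    bits = [0] * 8
    for k, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> k) & 1
    s = _syndrome(bits)
    for p in (1, 2, 4):
        bits[p] = 1 if s & p else 0   # cancel the data syndrome
    bits[0] = sum(bits[1:]) & 1       # make overall parity even
    return bits


def decode(bits):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    s = _syndrome(bits)
    odd = sum(bits) & 1
    if odd:
        # Odd overall parity: one bit flipped. The syndrome names the bit;
        # a syndrome of 0 means the overall parity bit itself flipped.
        bits[s] ^= 1
        status = "corrected"
    elif s:
        # Even parity with a nonzero syndrome: two bits flipped.
        status = "uncorrectable"
    else:
        status = "ok"
    nibble = sum(bits[pos] << k for k, pos in enumerate(DATA_POSITIONS))
    return nibble, status
```

In terms of the text, a `"corrected"` result corresponds to the single-bit case, while an `"uncorrectable"` result corresponds to the case where the data would be forwarded as poisoned rather than silently used.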

Parity bits are generated and checked independently for the system bus address lines, the system bus request group, and the system bus response group. Errors typically result in the assertion of BINIT#.
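Independent checking per signal group can be sketched as follows. The example assumes even parity and uses made-up group names and values; the actual parity polarity and signal grouping are not given in this section, so treat `parity` and `check_groups` as hypothetical helpers.

```python
def parity(value: int) -> int:
    """Even-parity bit for value: 1 if the number of set bits is odd."""
    return bin(value).count("1") & 1


def check_groups(groups):
    """Return the names of groups whose received parity bit does not match.

    groups maps a group name to a (value, received_parity_bit) pair.
    Each group is checked on its own, so one failure does not mask another.
    """
    return [name for name, (value, p) in groups.items() if parity(value) != p]


sample = {
    "address": (0b1011, 1),    # three bits set, parity bit 1: good
    "request": (0b0110, 0),    # two bits set, parity bit 0: good
    "response": (0b0001, 0),   # one bit set, parity bit 0: mismatch
}
failing = check_groups(sample)
```

Because each group carries its own parity bit, a mismatch identifies which group was corrupted, not just that some error occurred.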

A variety of system bus protocol errors are also detected, and will result in assertion of BINIT#.

The first instance of a bus error is logged with its address and error type. Additional status flags indicate that subsequent errors occurred.
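A first-error log of this kind behaves like a latch: the first error captures its details, and later errors only set a flag. The sketch below models that behavior; `BusErrorLog`, its fields, and the `clear` method are hypothetical and do not describe the actual 460GX register layout or how software clears it.

```python
class BusErrorLog:
    """Latches the first error's details; later errors only set a flag."""

    def __init__(self):
        self.first = None          # (address, error_type) of the first error
        self.more_errors = False   # set if any error arrives after the first

    def record(self, address, error_type):
        if self.first is None:
            self.first = (address, error_type)
        else:
            # Details of later errors are not kept, only the fact they occurred.
            self.more_errors = True

    def clear(self):
        # Assumption: software re-arms the log after reading it.
        self.first = None
        self.more_errors = False


log = BusErrorLog()
log.record(0x1000, "address parity")
log.record(0x2000, "protocol")
```

After these two calls the log still holds the details of the error at `0x1000`, with `more_errors` set. Keeping the first error is useful for diagnosis because later errors are often consequences of the first one.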

For I/O accesses, good ECC is always generated for data with no parity errors. Data with bad parity is poisoned with bad ECC as it is returned to the processor.

Intel® 460GX Chipset Software Developer’s Manual
