Configuring and Deconfiguring Processors or Memory

All failures that crash the system with a machine check or check stop, even if intermittent, are reported as a diagnostic callout for service repair. To prevent the recurrence of intermittent problems and improve the availability of the system until a scheduled maintenance window, processors and memory modules with a failure history are marked "bad" to prevent their being configured on subsequent boots.

A processor or memory module is marked "bad" under the following circumstances:

• A processor or memory module fails built-in self test (BIST) or power-on self test (POST) testing during boot (as determined by the Service Processor).

• A processor or memory module causes a machine check or check stop during runtime, and the failure can be isolated specifically to that processor or memory module (as determined by the processor runtime diagnostics in the Service Processor).

• A processor or memory module reaches a threshold of recovered failures that results in a predictive callout (as determined by the processor runtime diagnostics in the Service Processor).

During boot time, the Service Processor does not configure processors or memory modules that are marked "bad," much in the same way that it would deconfigure them for BIST/POST failures.
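The marking rules and the boot-time behavior described above can be pictured with a small model. This is only an illustrative sketch, not Service Processor firmware: the names Unit, Reason, mark_bad, record_recovered_error, and units_to_configure are hypothetical, and the threshold value of 3 is an assumption chosen for the example (the guide does not state the actual threshold).

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class Reason(Enum):
    """Why a unit was marked "bad" (mirrors the three circumstances above)."""
    SELF_TEST_FAILURE = auto()     # failed BIST or POST during boot
    ISOLATED_CHECK_STOP = auto()   # machine check / check stop isolated to this unit
    PREDICTIVE_THRESHOLD = auto()  # recovered-failure threshold reached


@dataclass
class Unit:
    """A processor or memory module (hypothetical model)."""
    name: str
    recovered_errors: int = 0
    bad_reason: Optional[Reason] = None


# Assumed value for illustration only; the real threshold is not documented here.
HYPOTHETICAL_THRESHOLD = 3


def mark_bad(unit: Unit, reason: Reason) -> None:
    """Record that a unit has a failure history."""
    unit.bad_reason = reason


def record_recovered_error(unit: Unit, threshold: int = HYPOTHETICAL_THRESHOLD) -> None:
    """Count a recovered failure; mark the unit once the threshold is reached."""
    unit.recovered_errors += 1
    if unit.bad_reason is None and unit.recovered_errors >= threshold:
        mark_bad(unit, Reason.PREDICTIVE_THRESHOLD)


def units_to_configure(units: List[Unit]) -> List[Unit]:
    """At boot, configure only the units that are not marked "bad"."""
    return [u for u in units if u.bad_reason is None]


units = [Unit("cpu0"), Unit("cpu1"), Unit("dimm0")]
mark_bad(units[1], Reason.SELF_TEST_FAILURE)
for _ in range(HYPOTHETICAL_THRESHOLD):
    record_recovered_error(units[2])
configured_names = [u.name for u in units_to_configure(units)]
```

In this model, cpu1 (self-test failure) and dimm0 (threshold reached) are left out at the next boot, and only cpu0 is configured.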

If a processor is deconfigured, the processor remains offline for subsequent reboots until the faulty processor is replaced. The Repeat Gard function also gives users the option of manually deconfiguring a processor, or re-enabling a previously deconfigured processor. For information on how to configure or deconfigure a processor, see the Processor Configuration/Deconfiguration Menu on page 46.

You can enable or disable CPU Repeat Gard or Memory Repeat Gard using the Processor Configuration/Deconfiguration Menu, which is a submenu under the System Information Menu.

Run-Time CPU Deconfiguration (CPU Gard)

L1 instruction cache recoverable errors, L1 data cache correctable errors, and L2 cache correctable errors are monitored by the processor runtime diagnostics (PRD) code running in the Service Processor. When a predefined error threshold is met, an error log with warning severity and threshold exceeded status is returned to AIX. At the same time, PRD marks the CPU for deconfiguration at the next boot. AIX will

 

70 RS/6000 Enterprise Server Model H80 Series User's Guide
