IBM pSeries manual Errpt command, HMC error logging, Multiple versions of MPI libraries

Page 21

On the HMC GUI, select Service Applications -> Service Focal Point -> Select Serviceable Events.

5.7 errpt command

On AIX 5L, the errpt command lists a summary of system error messages. Some of the HPS subsystem errors are collected by errpt. To find out if you have hardware errors, you can either run the errpt command, or you can run the dsh command from the CSM manager:

dsh errpt | grep " 0223" | grep sysplanar0 (The value 0223 is the month and day, in MMDD form.)

You can also look at /var/adm/sni/sni_errpt_capture on the LPAR that is reporting the error.

If you see any errors from sni in the errpt listing, check the sni logs for more specific information. The HPS logs are found in a set of directories under the /var/adm/sni directory.
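The date and resource filter described in this section can be wrapped in a small shell helper so that the date does not have to be typed by hand. This is a minimal sketch, not part of the original guide: the function name filter_errpt is hypothetical, and it assumes the usual errpt summary layout in which the timestamp field begins with the month and day (MMDDhhmmyy).

```shell
#!/bin/sh
# Hypothetical helper (not from the guide): keep only errpt summary lines
# for one day and one resource.
# Assumption: the timestamp column starts with MMDD and is preceded by a space.

filter_errpt() {
    # $1 = month and day as MMDD, $2 = resource name (for example sysplanar0)
    # Reads errpt or "dsh errpt" output on standard input.
    grep " $1" | grep "$2"
}

# Possible use from the CSM management server (shown as a comment only):
#   dsh errpt | filter_errpt "$(date +%m%d)" sysplanar0
```

Because the filter only reads standard input, the same function can be pointed at a saved copy of the errpt output as well as at live dsh output.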

5.8 HMC error logging

The HMC records errors in the /var/hsc/log directory. Here is an example of a command to check for cyclic redundancy check (CRC) errors in the FNM_Recov.log file:

grep -i evtsum FNM_Recov.log | grep -i crc
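When only the number of CRC-related event summaries matters, the same two-stage search can be turned into a count. The following is a sketch under stated assumptions: the function name count_crc_events is hypothetical, and nothing is assumed about the log format beyond the two search strings used in this section.

```shell
#!/bin/sh
# Hypothetical helper (not from the guide): count the lines in an FNM log
# that mention both "evtsum" and "crc", ignoring case.

count_crc_events() {
    # $1 = path to the log file, for example /var/hsc/log/FNM_Recov.log
    grep -i evtsum "$1" | grep -ic crc
}

# Possible use on the HMC (shown as a comment only):
#   count_crc_events /var/hsc/log/FNM_Recov.log
```

A count of 0 means no line matched both strings. It does not by itself show that the links are healthy.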

In general, if Service Focal Point is working properly, you should not need to check the low-level FNM logs such as the FNM_Recov file. However, for completeness, these are additional FNM logs on the HMC:

FNM_Comm.log

FNM_Ice.log

FNM_Init.log

FNM_Route.log

Another debug command you can run on the HMC is lsswtopol -n 1 -p $PLANE_NUMBER. For example, run the following command to check the link status for plane 0:

lsswtopol -n 1 -p0

If the lsswtopol command calls out links as "service required," but these links do not show up in Service Focal Point, contact IBM service.

5.9 Multiple versions of MPI libraries

One common problem on clustered systems is having different MPI library levels on various nodes. This can occur when a node is down for service while an upgrade is made, or when there are multiple versions of the libraries for each node and the links are broken. To check the library levels across a large system, use the following dsh commands:

For LAPI libraries: dsh sum /opt/rsct/lapi/lib/liblapi_r.a (or run with MP_INFOLEVEL=2)
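On a large system, the checksum output is easier to read when nodes are grouped by checksum value, because any node at a different library level then stands out. The sketch below is illustrative only: the function name group_by_checksum is hypothetical, and it assumes that each dsh output line has the form "hostname: checksum blocks path".

```shell
#!/bin/sh
# Hypothetical helper (not from the guide): group node names by the checksum
# reported by "dsh sum <library>".
# Assumption: each input line looks like "hostname: checksum blocks path".

group_by_checksum() {
    awk '{
        sub(/:$/, "", $1)                  # strip the colon dsh adds after the host name
        nodes[$2] = nodes[$2] " " $1       # collect the hosts that share this checksum
    }
    END {
        for (c in nodes) print c ":" nodes[c]
    }' | sort
}

# Possible use from the CSM management server (shown as a comment only):
#   dsh sum /opt/rsct/lapi/lib/liblapi_r.a | group_by_checksum
```

If every node is at the same level, the output is a single line. More than one line means at least one node has a different copy of the library.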

pshpstuningguidewp040105.doc
