IBM pSeries manual Mpinfolevel, Packets dropped in the switch hardware

Page 28

MAC WOF (2F870): Bit: 1

[. . .]

 

5.12.4 Packets dropped in the switch hardware

If a packet is dropped within the switch hardware itself (for example, when traversing the link between two switch chips), evidence of the packet drop is on the HMC, where the switch Federation Network Manager (FNM) runs. You can run /opt/hsc/bin/fnm.snap to create a snap archive in /var/hsc/log (for example, /var/hsc/log/c704hmc1.2004-11-19.12.50.33.snap.tar.gz).

The FNM code handles errors associated with packet drops in the switch. Running the fnm.snap command requires root access or properly configured authentication. In the snap data, check the FNM_Recov.* logs for switch errors. If a particular type of error reaches a threshold in the hardware, reporting for that type of error might be disabled, and later packet loss might go unreported. For this reason, when you are looking for packet loss, it is generally a good idea to restart the FNM code so that error reporting is reset.
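As a rough illustration of the log check described above, the following Python sketch lists the FNM_Recov.* files inside a snap archive and reads their lines. It assumes only that the archive is a gzip-compressed tar file (as the .snap.tar.gz name suggests) and that the logs are plain text; the function names are hypothetical, and the internal layout and line format of the logs are not specified here, so the sketch does not try to interpret individual entries.

```python
import fnmatch
import tarfile
from pathlib import PurePosixPath


def find_recov_logs(archive_path):
    """Return sorted names of archive members whose file name matches FNM_Recov.*"""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile()
            and fnmatch.fnmatch(PurePosixPath(member.name).name, "FNM_Recov.*")
        )


def read_log_lines(archive_path, member_name):
    """Return the text lines of one archive member, decoded leniently."""
    with tarfile.open(archive_path, "r:gz") as archive:
        with archive.extractfile(member_name) as handle:
            # Assumption: the logs are text; undecodable bytes are replaced.
            return handle.read().decode("utf-8", errors="replace").splitlines()
```

A typical use would be to copy the archive from /var/hsc/log to a workstation, call find_recov_logs on it, and then print the result of read_log_lines for each name returned.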

5.13 MP_INFOLEVEL

You can get additional information from an MPI job by setting the MP_INFOLEVEL variable to 2. In addition, if you set the MP_LABELIO variable to yes, each line of output is labeled with the ID of the task that produced it, so you can get information for each task. Here is an example of the output using these settings:

INFO: 0031-364 Contacting LoadLeveler to set and query information for interactive job
INFO: 0031-380 LoadLeveler step ID is test_mach1.customer.com.2507.0
INFO: 0031-118 Host test_mach1.customer.com requested for task 0
INFO: 0031-118 Host test_mach2.customer.com requested for task 1
INFO: 0031-119 Host test_mach1.customer.com allocated for task 0
INFO: 0031-120 Host address 10.10.10.1 allocated for task 0
INFO: 0031-377 Using sn1 for MPI euidevice for task 0
INFO: 0031-119 Host test_mach2.customer.com allocated for task 1
INFO: 0031-120 Host address 10.10.10.2 allocated for task 1
INFO: 0031-377 Using sn1 for MPI euidevice for task 1
1:INFO: 0031-724 Executing program: <spark-thread-bind.lp>
0:INFO: 0031-724 Executing program: <spark-thread-bind.lp>
1:LAPI version #7.9 2004/11/05 1.144 src/rsct/lapi/lapi.c, lapi, rsct_rir2, rir20446a 32bit(us) library compiled on Wed Nov 10 06:44:38 2004
1:LAPI is using lightweight lock.
1:Bulk Transfer is enabled.
1:Shared memory not used on this node due to sole task running.
1:The LAPI lock is used for the job
0:INFO: 0031-619 32bit(us) MPCI shared object was compiled at Tue Nov 9 12:36:54 2004
0:LAPI version #7.9 2004/11/05 1.144 src/rsct/lapi/lapi.c, lapi, rsct_rir2, rir20446a 32bit(us) library compiled on Wed Nov 10 06:44:38 2004
0:LAPI is using lightweight lock.
0:Bulk Transfer is enabled.
0:Shared memory not used on this node due to sole task running.
0:The LAPI lock is used for the job
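The settings and the labeled output described in this section lend themselves to simple scripting. The Python sketch below is one possible approach, not part of the original guide: poe_debug_env builds an environment with MP_INFOLEVEL=2 and MP_LABELIO=yes, and parse_line splits a captured output line into its task label, INFO message code, and text. The function names are hypothetical, and the parser assumes that each message occupies one line, that a task label is a decimal number followed by a colon, and that message codes have the form NNNN-NNN as in the example output. A line of program output that happens to start with digits and a colon would be misread as labeled.

```python
import os
import re

# Optional "<task>:" label (present when MP_LABELIO=yes), then an optional
# "INFO: <code> " prefix, then the remaining message text.
LINE_RE = re.compile(
    r"^(?:(?P<task>\d+):)?(?:INFO: (?P<code>\d{4}-\d{3}) )?(?P<text>.*)$"
)


def poe_debug_env(base_env=None):
    """Return a copy of the environment with the two variables from this section set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MP_INFOLEVEL"] = "2"
    env["MP_LABELIO"] = "yes"
    return env


def parse_line(line):
    """Split one output line into (task or None, message code or None, text)."""
    match = LINE_RE.match(line.rstrip("\n"))
    task = match.group("task")
    return (
        int(task) if task is not None else None,
        match.group("code"),
        match.group("text"),
    )


def group_by_task(lines):
    """Collect (code, text) pairs per task; unlabeled lines go under the key None."""
    grouped = {}
    for line in lines:
        task, code, text = parse_line(line)
        grouped.setdefault(task, []).append((code, text))
    return grouped
```

To use it, pass the dictionary from poe_debug_env() as the env argument of subprocess.run together with your own job launch command, capture the output, and feed its lines to group_by_task to compare what each task reported.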

pshpstuningguidewp040105.doc
