MAC WOF

(2F870): Bit: 1

[. . .]

 

5.12.4 Packets dropped in the switch hardware

If a packet is dropped within the switch hardware itself (for example, when traversing the link between two switch chips), evidence of the packet drop is on the HMC, where the switch Federation Network Manager (FNM) runs. You can run /opt/hsc/bin/fnm.snap to create a snap archive in /var/hsc/log (for example, /var/hsc/log/c704hmc1.2004-11-19.12.50.33.snap.tar.gz).

The FNM code handles errors associated with packet drops in the switch. To run the fnm.snap command (/opt/hsc/bin/fnm.snap), you must have root access or set up proper authentication. In the snap data, check the FNM_Recov.* logs for switch errors. If a particular type of error has reached its threshold in the hardware, reporting for that type of error might be disabled, so packet loss might go unreported. For this reason, when you are looking for packet loss, it is a good idea to restart the FNM code to ensure that error reporting is reset.
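The log check described above can be scripted once a snap archive exists. The following is only a minimal sketch: it assumes the archive is an ordinary gzip-compressed tar file (as the .snap.tar.gz name suggests) and that the recovery logs can be found by matching the file name against FNM_Recov.*. The function names and the directory layout inside the archive are hypothetical, not taken from the FNM code.

```python
import fnmatch
import posixpath
import tarfile


def find_recov_logs(archive_path, pattern="FNM_Recov.*"):
    """Return sorted names of regular files in the archive whose
    base name matches the given pattern (assumed: FNM_Recov.*)."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile()
            and fnmatch.fnmatchcase(posixpath.basename(member.name), pattern)
        )


def read_member_text(archive_path, member_name):
    """Return the contents of one archive member as text.
    Bytes that are not valid UTF-8 are replaced, not rejected."""
    with tarfile.open(archive_path, "r:gz") as archive:
        handle = archive.extractfile(member_name)
        if handle is None:
            raise ValueError(f"{member_name} is not a regular file")
        with handle:
            return handle.read().decode("utf-8", errors="replace")
```

For example, find_recov_logs("/var/hsc/log/c704hmc1.2004-11-19.12.50.33.snap.tar.gz") would list the candidate logs, and read_member_text can then be called on each name to inspect it for switch errors.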

5.13 MP_INFOLEVEL

You can get additional information from an MPI job by setting the MP_INFOLEVEL variable to 2. In addition, if you set the MP_LABELIO variable to yes, each line of task output is prefixed with the ID of the task that produced it, so you can get information for each task. Here is an example of the output using these settings:

INFO: 0031-364 Contacting LoadLeveler to set and query information for interactive job
INFO: 0031-380 LoadLeveler step ID is test_mach1.customer.com.2507.0
INFO: 0031-118 Host test_mach1.customer.com requested for task 0
INFO: 0031-118 Host test_mach2.customer.com requested for task 1
INFO: 0031-119 Host test_mach1.customer.com allocated for task 0
INFO: 0031-120 Host address 10.10.10.1 allocated for task 0
INFO: 0031-377 Using sn1 for MPI euidevice for task 0
INFO: 0031-119 Host test_mach2.customer.com allocated for task 1
INFO: 0031-120 Host address 10.10.10.2 allocated for task 1
INFO: 0031-377 Using sn1 for MPI euidevice for task 1
1:INFO: 0031-724 Executing program: <spark-thread-bind.lp>
0:INFO: 0031-724 Executing program: <spark-thread-bind.lp>
1:LAPI version #7.9 2004/11/05 1.144 src/rsct/lapi/lapi.c, lapi, rsct_rir2, rir20446a 32bit(us) library compiled on Wed Nov 10 06:44:38 2004
1:LAPI is using lightweight lock.
1:Bulk Transfer is enabled.
1:Shared memory not used on this node due to sole task running.
1:The LAPI lock is used for the job
0:INFO: 0031-619 32bit(us) MPCI shared object was compiled at Tue Nov 9 12:36:54 2004
0:LAPI version #7.9 2004/11/05 1.144 src/rsct/lapi/lapi.c, lapi, rsct_rir2, rir20446a 32bit(us) library compiled on Wed Nov 10 06:44:38 2004
0:LAPI is using lightweight lock.
0:Bulk Transfer is enabled.
0:Shared memory not used on this node due to sole task running.
0:The LAPI lock is used for the job
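Because the labeled lines from different tasks are interleaved, it can help to separate them by task before reading them. The following is a minimal sketch, not part of the product: the helper names infolevel_env and split_by_task are hypothetical, and the parser assumes that a task label is a run of digits followed by a colon at the start of a line, as in the output above. Lines without such a label are collected under the key None.

```python
import re
from collections import defaultdict

# Assumed label format: "<task id>:" at the start of the line.
TASK_LABEL = re.compile(r"^(\d+):(.*)$")


def infolevel_env(base_env):
    """Return a copy of base_env with the two settings from this section."""
    env = dict(base_env)
    env["MP_INFOLEVEL"] = "2"
    env["MP_LABELIO"] = "yes"
    return env


def split_by_task(output_text):
    """Group output lines by task ID; unlabeled lines go under None."""
    by_task = defaultdict(list)
    for line in output_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = TASK_LABEL.match(line)
        if match:
            by_task[int(match.group(1))].append(match.group(2))
        else:
            by_task[None].append(line)
    return dict(by_task)


if __name__ == "__main__":
    sample = (
        "INFO: 0031-118 Host test_mach1.customer.com requested for task 0\n"
        "1:LAPI is using lightweight lock.\n"
        "0:Bulk Transfer is enabled.\n"
        "1:Bulk Transfer is enabled.\n"
    )
    for task, lines in split_by_task(sample).items():
        print(task, lines)
```

The dictionary returned by infolevel_env(os.environ) could be passed as the environment of the process that launches the job, and the captured output then handed to split_by_task.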

pshpstuningguidewp040105.doc

Page 28
