Scali MPI Connect Release 4.4 Users Guide 60
C-11.1 Troubleshooting 3rd-party DAT providers
The only requirements are that the libraries have the proper permissions for shared objects,
and that /etc/dat.conf is formatted according to the standard.
All available devices are listed with the scanet command.
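The two requirements above can be checked with a small script. The following is a minimal
sketch, not part of Scali MPI Connect: the function name check_dat_conf is made up for this
illustration, and it assumes the usual DAT registry layout in which the fifth
whitespace-separated field of each non-comment line names the provider library.

```shell
# check_dat_conf [FILE]
# Hypothetical helper. Assumes field 5 of each entry is the provider library.
check_dat_conf() {
    conf=${1:-/etc/dat.conf}
    rc=0
    # Read the first five fields of every line; the remainder goes to "rest".
    while read -r name version threading default lib rest; do
        # Skip blank lines and comments.
        case "$name" in ''|'#'*) continue ;; esac
        if [ -z "$lib" ]; then
            echo "$name: malformed entry (no library field)"
            rc=1
            continue
        fi
        case "$lib" in
            /*) # Absolute path: the file must exist and be readable.
                if [ -r "$lib" ]; then
                    echo "$name: $lib ok"
                else
                    echo "$name: $lib missing or not readable"
                    rc=1
                fi ;;
            *)  # Bare library name: left to the dynamic linker to resolve.
                echo "$name: $lib (resolved via the linker search path)" ;;
        esac
    done < "$conf"
    return $rc
}
```

Called without an argument, the function inspects /etc/dat.conf and returns a non-zero
status if any entry looks broken.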
C-11.2 Troubleshooting the GM provider
The GM provider provides a network device for each Myrinet card installed on the node, named
gm0, gm1, etc.
To verify that the gm0 device is operational, run an MPI test job on two or more of the nodes
in question:
user% mpimon -networks gm0 /opt/scali/examples/bin/bandwidth -- node1 node2
If the gm0 device fails, the MPI job should fail with a "[ 1] No valid network connection from
1 to 0" message.
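When the test job is run across many nodes, it can help to filter its output for this
message. The following is a minimal sketch; the function name report_network_failures is
made up for this illustration, and the sketch assumes the message has exactly the wording
quoted above.

```shell
# report_network_failures
# Hypothetical helper: reads job output on stdin and prints one line per
# "No valid network connection" message found in it.
report_network_failures() {
    sed -n 's/.*No valid network connection from \([0-9][0-9]*\) to \([0-9][0-9]*\).*/rank \1 cannot reach rank \2/p'
}
```

For example, the output of the test job can be piped through it, with stderr merged in
because the stream used for the message is not assumed here:

user% mpimon -networks gm0 /opt/scali/examples/bin/bandwidth -- node1 node2 2>&1 | report_network_failures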
First of all, keep in mind that the GM source must be obtained from Myricom and compiled on
your nodes. Scali provides the ScaGMbuilder package to do the job for you; the README and
RELEASE_NOTES files (under /opt/scali/doc/ScaGMbuilder) describe the procedure.
If you have just installed your cluster, upgraded the GM source, or replaced the kernel, a
compilation of GM is run (this takes about 10 minutes) and may still be in progress. Verify
that the GM binary package is installed with:
root# rpm -q gm
This should report whether the package is installed or not.
The (re)build process requires that compiler tools and kernel source are installed on all nodes.
Verify that the "gm" kernel module is loaded by running lsmod(8) on the compute node in
question.
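The lsmod check can be scripted so that it matches only a module named exactly "gm" and
not, say, a module whose name merely starts with those letters. The following is a minimal
sketch; the function name gm_module_loaded is made up for this illustration.

```shell
# gm_module_loaded
# Hypothetical helper: reads lsmod output on stdin and succeeds only if
# the first column (the module name) of some line is exactly "gm".
gm_module_loaded() {
    awk '$1 == "gm" { found = 1 } END { exit !found }'
}
```

A possible use on the compute node in question:

root# lsmod | gm_module_loaded && echo "gm module is loaded"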
To verify that GM is operational, running
root# /opt/gm/bin/gm_board_info
is enough; you should see all the nodes on your GM network listed. (This command
must be run on a node with a Myrinet card installed!)
A simple cause of failure is that /opt/gm/lib is not listed in /etc/ld.so.conf and/or that
ldconfig has not been run; if this is the case, you will get an "unable to find libgm.so"
error message.
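The check for the /opt/gm/lib entry can be scripted. The following is a minimal sketch; the
function name gm_lib_configured is made up for this illustration, and it inspects a single
file only, so it does not follow any include directives that file may contain.

```shell
# gm_lib_configured [FILE]
# Hypothetical helper: succeeds if /opt/gm/lib appears as an entry of its
# own in FILE (default /etc/ld.so.conf). Included files are not followed.
gm_lib_configured() {
    conf=${1:-/etc/ld.so.conf}
    # Match the directory as a whole line, allowing a trailing slash
    # and surrounding whitespace.
    grep -Eq '^[[:space:]]*/opt/gm/lib/?[[:space:]]*$' "$conf"
}
```

If the function fails, add the directory to /etc/ld.so.conf and run ldconfig as root.
Afterwards, libgm.so should show up in the linker cache:

root# ldconfig -p | grep libgm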