C-11.1 Troubleshooting third-party DAT providers
The only requirements are that the libraries have the proper permissions for shared objects and that /etc/dat.conf is formatted according to the standard.
All available devices are listed with the scanet command.
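The two requirements above can be spot-checked with a small shell sketch. The function name check_dat_libs is hypothetical, and the sketch assumes the provider library is the fifth whitespace-separated field of each /etc/dat.conf entry, as in the commonly documented DAT registry layout; check this against the standard your providers follow. Only absolute library paths are tested for readability; bare library names are left to the dynamic loader.

```shell
# Hypothetical helper: list each provider in a dat.conf-style file and
# report whether its library (ASSUMED to be field 5) is a readable file.
check_dat_libs() {
    conf="${1:-/etc/dat.conf}"
    # Skip blank lines and comment lines starting with '#'.
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$conf" |
    while read -r name _api _thread _default lib _rest; do
        case "$lib" in
            "") status=malformed ;;          # fewer than five fields
            /*) if [ -r "$lib" ]; then status=ok; else status=missing; fi ;;
            *)  status=loader-resolved ;;    # bare name: left to the dynamic loader
        esac
        printf '%s %s %s\n' "$name" "$lib" "$status"
    done
}

# Usage:
#   check_dat_libs /etc/dat.conf
```

Each output line gives the device name, the library field, and a status, so an entry marked missing or malformed points at the line of /etc/dat.conf to inspect.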
C-11.2 Troubleshooting the GM provider
The GM provider provides a network device for each Myrinet card installed on the node, named gm0, gm1, and so on.
To verify that the gm0 device is operational, run an MPI test job on two or more of the nodes in question:
user% mpimon
If the gm0 device fails, the MPI job should fail with a "[ 1] No valid network connection from 1 to 0" message.
First of all, keep in mind that the GM source must be obtained from Myricom and compiled on your nodes. Scali provides the ScaGMbuilder package to do the job for you; the README and RELEASE_NOTES files (under /opt/scali/doc/ScaGMbuilder) describe the procedure.
If you have just installed your cluster, upgraded the GM source, or replaced the kernel, GM is recompiled; this takes about 10 minutes and may still be in progress. Verify that the GM binary is installed with:
root# rpm
This should report whether the package is installed or not.
The (re)build process requires that the compiler tools and the kernel source are installed on all nodes.
Verify that the "gm" kernel module is loaded by running lsmod(8) on the compute node in question.
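The module check can be scripted as in the following sketch. The helper name module_loaded is hypothetical; it reads lsmod-style output on standard input and relies only on the module name being the first column.

```shell
# Hypothetical helper: succeed if the named module appears in lsmod-style
# output read from standard input (module name is the first column).
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Usage on a compute node:
#   lsmod | module_loaded gm && echo "gm loaded" || echo "gm NOT loaded"
```

Matching the whole first column, rather than grepping for "gm", avoids false positives from other modules whose names merely contain those letters.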
To verify that GM is operational, running
root# /opt/gm/bin/gm_board_info
is enough; you should see all the nodes on your GM network listed. (This command must be run on a node with a Myrinet card installed.)
A simple cause of failure is that /opt/gm/lib is not listed in /etc/ld.so.conf, or that ldconfig has not been run since it was added. In that case you will get an "unable to find libgm.so" error message.
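A minimal sketch of this check follows. The helper name libdir_listed is hypothetical. It looks only for the directory on a line of its own in the given file and does not follow any include directives that /etc/ld.so.conf may contain on newer systems.

```shell
# Hypothetical helper: succeed if directory $2 is listed on a line of its
# own in the ld.so.conf-style file $1 (comments, surrounding blanks, and a
# trailing slash are ignored).
libdir_listed() {
    awk -v dir="${2%/}" '
        { sub(/#.*/, ""); gsub(/^[ \t]+|[ \t]+$/, ""); sub(/\/$/, "") }
        $0 == dir { found = 1 }
        END { exit !found }
    ' "$1"
}

# Usage (as root):
#   libdir_listed /etc/ld.so.conf /opt/gm/lib || echo /opt/gm/lib >> /etc/ld.so.conf
#   ldconfig                    # rebuild the loader cache
#   ldconfig -p | grep libgm    # confirm libgm.so is now in the cache
```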
Scali MPI Connect Release 4.4 Users Guide