Scali MPI Connect Release 4.4 Users Guide 55
B-1.2 Why can I not start mpid?
mpid opens a socket and assigns a predefined mpid port number to the end point (see
/etc/services for more information). If mpid terminates abnormally, the mpid port number
cannot be reused until a system-defined timer has expired. To resolve:
Use netstat -a | grep mpid to observe when the socket is released. Once the socket is
released, restart mpid.
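The wait for the socket to be released can also be scripted. The following is a minimal sketch, not part of SMC: it assumes the mpid port is a TCP port, and the helper names port_is_free and wait_until_free are hypothetical. It probes the port by trying to bind to it, which fails with "address already in use" while the system timer is still holding the port.

```python
import errno
import socket
import time


def port_is_free(port, host=""):
    """Return True if a TCP socket can currently be bound to (host, port).

    SO_REUSEADDR is deliberately not set, so a port that is still held
    by the system after an abnormal termination is reported as busy.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            # Other errors (for example missing privileges) are not
            # evidence that the port is busy, so pass them on.
            raise
    return True


def wait_until_free(port, host="", timeout=300.0, interval=5.0):
    """Poll until the port can be bound or the timeout (seconds) expires."""
    deadline = time.monotonic() + timeout
    while True:
        if port_is_free(port, host):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Assumed usage: look up the port by the service name in /etc/services.
# The entry name "mpid" and the protocol "tcp" are assumptions.
#   port = socket.getservbyname("mpid", "tcp")
#   if wait_until_free(port):
#       print("socket released, mpid can be restarted")
```

Binding to a port below 1024 normally requires root privileges, so the probe may need to run as the same user that starts mpid.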
B-1.2.1 Bad clean up
• A previous SMC run has not terminated properly.
Check for MPI processes on the nodes using /opt/scali/bin/scaps.
Use /opt/scali/sbin/scidle
Use /opt/scali/bin/scash to check for leftover shared memory segments on all nodes
(ipcs for Solaris and Linux).
Note: core dumping takes time.
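The check for leftover shared memory segments can be partly automated by filtering the output of ipcs -m. The sketch below is an illustration only: it assumes the column layout printed by the Linux ipcs (key, shmid, owner, perms, bytes, nattch, status), and it treats a segment with no attached processes as a cleanup candidate, which is a heuristic rather than an SMC rule. The function names are hypothetical.

```python
def parse_ipcs_shm(text):
    """Parse the text printed by `ipcs -m` into a list of dictionaries.

    Only data rows are kept: they start with a hexadecimal key ("0x...")
    and carry at least the six columns up to nattch.
    """
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("0x"):
            continue  # banner, column header or blank line
        segments.append({
            "key": fields[0],
            "shmid": int(fields[1]),
            "owner": fields[2],
            "bytes": int(fields[4]),
            "nattch": int(fields[5]),
        })
    return segments


def leftover_segments(text, owner):
    """Return the segments owned by `owner` that no process is attached to."""
    return [
        seg for seg in parse_ipcs_shm(text)
        if seg["owner"] == owner and seg["nattch"] == 0
    ]
```

The text to parse can be captured on each node, for example with subprocess.run(["ipcs", "-m"], capture_output=True, text=True).stdout. Review the reported segments before removing anything.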
B-1.2.2 Space overflow
• The application has requested too many SCI or shared memory resources.
The mpimon pool-size specifications are too large, and must be reduced.
B-1.3 Why does my program terminate abnormally?
B-1.3.1 Core dump
• The application core dumps.
Use a debugger to locate the point of violation. The application may need to be recompiled
to include symbolic debug information (-g for most compilers).
Define SCAMPI_INSTALL_SIGSEGV_HANDLER=1 and attach to the failing process with the
debugger.
B-1.4 General problems
• Are you reasonably certain that your algorithms are MPI safe?
Check that every send has a matching receive.
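One way to make this check concrete is to log each point-to-point call and count sends against receives afterwards. The sketch below assumes a hypothetical trace format, a list of (kind, source, dest, tag) tuples written by the application itself; SMC does not produce such a trace. It also ignores wildcard receives and communicators, so it is only a first approximation.

```python
from collections import Counter


def unmatched_operations(trace):
    """Return the (source, dest, tag) triples whose sends and receives differ.

    A positive value means sends without a matching receive,
    a negative value means receives without a matching send.
    """
    balance = Counter()
    for kind, source, dest, tag in trace:
        if kind == "send":
            balance[(source, dest, tag)] += 1
        elif kind == "recv":
            balance[(source, dest, tag)] -= 1
        else:
            raise ValueError(f"unknown operation kind: {kind!r}")
    return {key: count for key, count in balance.items() if count != 0}


# Example: rank 1 sends to rank 0 with tag 9, but rank 0 never receives it.
trace = [
    ("send", 0, 1, 7),
    ("recv", 0, 1, 7),
    ("send", 1, 0, 9),
]
print(unmatched_operations(trace))  # {(1, 0, 9): 1}
```

An empty result does not prove that the program is MPI safe, since ordering problems and buffering assumptions can still cause a hang.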
• The program just hangs
If the application has a large degree of asynchronicity, try increasing the channel-size.
Further information is available in “Communication buffer adaption” on page 47:
If the communication behaviour of the application is known, explicitly providing buffer size
settings to mpimon that match the requirements of the application will in most cases
improve performance.
Example: An application sends only 900-byte messages. Set channel_inline_threshold to 964
(64 added for alignment) and increase the channel-size significantly (32-128 k). Set
eager_size to 1k and eager_count high (16 or more). If all messages can be buffered, the
transporter-{size, count} can be set to low values to reduce shared memory consumption.
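The arithmetic behind the 964 in the example can be written down as a small helper. This is a sketch under the assumption that the 64 bytes added for alignment apply in the same way to other fixed message sizes, which the example suggests but does not state. The function name inline_threshold is hypothetical.

```python
ALIGNMENT_PAD = 64  # bytes added for alignment, taken from the example


def inline_threshold(message_bytes, pad=ALIGNMENT_PAD):
    """Return a channel_inline_threshold value for a fixed message size."""
    if message_bytes < 0:
        raise ValueError("message size must not be negative")
    return message_bytes + pad


print(inline_threshold(900))  # 964
```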
• The program terminates without an error message
Investigate the core file, or rerun the program in a debugger.