B-1.2 Why can I not start mpid?

mpid opens a socket and assigns a predefined mpid port number to the end point (see /etc/services for more information). If mpid terminates abnormally, the mpid port number cannot be reused until a system-defined timer has expired. To resolve:

• Use netstat -a | grep mpid to observe when the socket is released. Once the socket is released, restart mpid.
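The check above can be scripted instead of being repeated by hand. The following is a minimal sketch, assuming a POSIX shell and that the mpid service name shows up in the netstat -a output while the port is still held. The function name wait_for_mpid_socket and its two arguments are hypothetical and are not part of Scali MPI Connect.

```shell
# Hypothetical helper, not part of Scali MPI Connect.
# Polls `netstat -a` until no line mentions mpid, or until the poll budget is used up.
#   $1 = seconds to sleep between polls (default 5)
#   $2 = maximum number of polls (default 60)
wait_for_mpid_socket() {
    interval=${1:-5}
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        # grep -q succeeds while an mpid entry is still listed
        if ! netstat -a 2>/dev/null | grep -q mpid; then
            echo "mpid socket released"
            return 0
        fi
        tries=$((tries - 1))
        sleep "$interval"
    done
    echo "mpid socket still in use" >&2
    return 1
}
```

Calling wait_for_mpid_socket 5 60 polls every five seconds for up to five minutes. When it returns successfully, mpid can be restarted in the usual way for the installation.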

B-1.2.1 Bad clean up

A previous SMC run has not terminated properly.

• Check for MPI processes on the nodes using /opt/scali/bin/scaps.

• Use /opt/scali/sbin/scidle.

• Use /opt/scali/bin/scash to check for leftover shared memory segments on all nodes (ipcs for Solaris and Linux).

Note: core dumping takes time.
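On a single node, leftover shared memory segments can be picked out of the ipcs listing with a small filter. The sketch below assumes the Linux column layout of ipcs -m (key, shmid, owner, perms, bytes, nattch, status). The Solaris layout differs and would need different field numbers. The function name orphaned_shm_ids is hypothetical.

```shell
# Hypothetical helper: reads Linux `ipcs -m` output on stdin and prints the shmid
# of every segment owned by user $1 that has no attached processes (nattch = 0).
# Assumes the Linux column order: key shmid owner perms bytes nattch status.
orphaned_shm_ids() {
    awk -v user="$1" '$1 ~ /^0x/ && $3 == user && $6 == 0 { print $2 }'
}
```

A possible use is ipcs -m | orphaned_shm_ids "$(id -un)". Each reported ID can then be inspected and, if it really belongs to a dead run, removed with ipcrm -m followed by the ID.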

B-1.2.2 Space overflow

The application has requested too many SCI or shared memory resources.

• The mpimon pool-size specifications are too large and must be reduced.

B-1.3 Why does my program terminate abnormally?

B-1.3.1 Core dump

The application core dumps.

• Use a debugger to locate the point of violation. The application may need to be recompiled to include symbolic debug information (-g for most compilers).

• Define SCAMPI_INSTALL_SIGSEGV_HANDLER=1 and attach to the failing process with the debugger.
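The two steps above might look as follows in a POSIX shell. This is a sketch under assumptions: gdb stands in for whichever debugger is installed, the helper name attach_debugger is hypothetical, and the variable is assumed to reach the application processes when it is exported in the shell that launches the job.

```shell
# Enable the handler named in the text before launching the application.
export SCAMPI_INSTALL_SIGSEGV_HANDLER=1

# Hypothetical helper: attach gdb to the failing process, given its PID.
# Assumes gdb is the debugger in use; any debugger that can attach to a PID works.
attach_debugger() {
    if [ -z "${1:-}" ]; then
        echo "usage: attach_debugger PID" >&2
        return 2
    fi
    gdb -p "$1"
}
```

The PID of the failing process has to be looked up on the node where it runs, for example with ps, before calling attach_debugger with it.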

B-1.4 General problems

Are you reasonably certain that your algorithms are MPI safe?

• Check that every send has a matching receive.

The program just hangs

• If the application has a large degree of asynchronicity, try increasing the channel-size. Further information is available in “Communication buffer adaption” on page 47: If the communication behaviour of the application is known, explicitly providing buffer size settings to mpimon that match the requirements of the application will in most cases improve performance. Example: for an application sending only 900-byte messages, set channel_inline_threshold to 964 (64 added for alignment) and increase the channel-size significantly (32-128 k). Set eager_size to 1k and eager_count high (16 or more). If all messages can be buffered, the transporter-{size, count} can be set to low values to reduce shared memory consumption.
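The threshold arithmetic in the example (message size plus 64 bytes for alignment) can be captured in a one-line helper. This is only a sketch of that calculation. The function name inline_threshold is hypothetical, and how the resulting value is passed to mpimon is not shown here.

```shell
# Hypothetical helper: suggest a channel_inline_threshold for a fixed message size,
# following the example in the text (message size in bytes plus 64 for alignment).
inline_threshold() {
    echo $(( $1 + 64 ))
}
```

For the 900-byte case in the example, inline_threshold 900 prints 964.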

The program terminates without an error message

• Investigate the core file, or rerun the program in a debugger.

Scali MPI Connect Release 4.4 Users Guide
