Section: 3.4 Suspending and resuming jobs
<proc>: all (default), none, or MPI-process number(s).
-part <part> Use nodes from partition <part>.
-q Keep quiet, no mpimon printout.
-t Test mode, no MPI program is started.
<params> Parameters not recognized are passed on to mpimon.
3.4 Suspending and resuming jobs
From time to time it is convenient to be able to suspend regular jobs running on a cluster in
order to allow a critical, perhaps real-time, job to use the cluster. When using Scali MPI
Connect to run parallel applications, a job can be suspended, yielding the cluster to other jobs,
by sending a SIGUSR1 or SIGTSTP signal to the mpimon process representing the job.
Assuming that the process identifier of this mpimon is <PID>, the user interface for this is:
user% kill -USR1 <PID>
or
user% kill -TSTP <PID>
Similarly, the suspended job can be resumed by sending it a SIGUSR2 or SIGCONT signal, i.e.,
user% kill -USR2 <PID>
or
user% kill -CONT <PID>
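For convenience, the suspend and resume steps can be combined into a short command sequence. The following is a minimal sketch, assuming the job's mpimon can be located with pgrep on the node where it was started; the pattern used to find the process is illustrative only.
user% PID=$(pgrep -u $USER -o mpimon)   # oldest mpimon owned by the current user
user% kill -TSTP $PID                   # suspend the job
user% kill -CONT $PID                   # resume the job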
3.5 Running with dynamic interconnect failover capabilities
If a runtime failure occurs on a high-speed interconnect, ScaMPI has the ability to perform an
interconnect failover and continue running on a secondary network device. This high availability
feature is part of the Scali MPI Connect/HA product, which requires a separately priced license.
Once this license is installed, you may enable the failover functionality by setting the
environment variable SCAMPI_FAILOVER_MODE to 1, or by using the mpimon command line
argument -failover_mode.
Currently, the Scali MPI Infiniband (ib0), Myrinet (gm0) and all DAT-based drivers are
supported; SCI is not supported. Note also that the combination of failover and TFDR is not
supported in this version of Scali MPI Connect.
Some failures will not result in an explicit error value propagating to Scali MPI. Scali MPI handles
this by treating a lack of progress within a specified time as a failure. You may alter this time
by setting the environment variable SCAMPI_FAILOVER_TIMEOUT to the desired number of
seconds.
Failures will be logged using the standard syslog mechanism.
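As a minimal sketch of how failover might be enabled for a run, the commands below set the environment variables described above before starting the job. The program name ./myprogram and the node specification are placeholders for your own application and nodes, and the same effect can be obtained by passing the -failover_mode argument to mpimon instead of setting SCAMPI_FAILOVER_MODE.
user% export SCAMPI_FAILOVER_MODE=1
user% export SCAMPI_FAILOVER_TIMEOUT=60
user% mpimon ./myprogram -- node1 2 node2 2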
3.6 Running with TCP error detection - TFDR
Errors may occur when transferring data from the network card to memory. When the TCP
stack is offloaded to hardware, this may result in actual data errors. Using the wrapper script
tfdrmpimon (Transmission Failure Detection and Retransmit), Scali MPI will handle such
errors by adding an extra checksum and retransmitting the data if an error occurs. This high
availability feature is part of the Scali MPI Connect/HA product, which requires a separate
license.
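As a minimal sketch, and assuming the tfdrmpimon wrapper accepts the same arguments as mpimon, a job can be started with TFDR enabled simply by substituting the wrapper script for mpimon; the program name and node specification below are placeholders.
user% tfdrmpimon ./myprogram -- node1 2 node2 2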