
Or, logged on to this node, run psiadmin, which also starts up the ParaStation daemon psid. See Section 6.1, “Problem: psiadmin returns error”, for more details.

Check the logfile /var/log/messages on this node for error messages. Verify that all nodes have an identical configuration (/etc/parastation.conf).

6.3. Problem: cannot start parallel task

Problem: a parallel task cannot be launched and an error is reported:

PSI: PSI_createPartition: Resource temporarily unavailable

Check for available nodes and active parallel tasks. Check for user or group restrictions.

If the error

PSI: dospawn: spawn to node 1 failed.

PSE: Could not spawn './mpi_latency' process 1, error = Bad file descriptor.

is reported, check if the current directory holding the program mpi_latency is accessible on all nodes. Verify that the program is executable on all nodes.
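
The two checks above can be scripted and run on each node in turn. The following is a minimal Python sketch of such a check. The function name check_program and the messages it returns are hypothetical and not part of ParaStation; only the program name mpi_latency comes from the error message above.

```python
import os


def check_program(directory, program):
    """Return a list of problems found for `program` inside `directory`.

    An empty list means the directory is accessible and the program is a
    regular file that the current user may execute.
    """
    problems = []
    # The directory must exist on this node.
    if not os.path.isdir(directory):
        problems.append("directory missing: " + directory)
        return problems
    # The current user needs read and search permission on the directory.
    if not os.access(directory, os.R_OK | os.X_OK):
        problems.append("directory not accessible: " + directory)
        return problems
    path = os.path.join(directory, program)
    if not os.path.isfile(path):
        problems.append("program missing: " + path)
    elif not os.access(path, os.X_OK):
        problems.append("program not executable: " + path)
    return problems
```

Calling check_program(os.getcwd(), "mpi_latency") on every node, as the user who starts the task, should return an empty list everywhere. Any other result names the node-local cause.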

6.4. Problem: bad performance

Verify that the proper interconnect and/or transport is used: check for environment variables controlling transport (see Section 5.8, “Controlling ParaStation5 communication paths” and ps_environment(5)).

Watch protocol counters, e.g. counters indicating timeouts, retries, errors or other bad conditions. For p4sock, check recv_net_data and recv_user. See Section 5.2, “ParaStation5 protocol p4sock”.
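
Counters are usually easiest to interpret as the difference between two readings taken before and after a test run. The following minimal Python sketch illustrates that idea. It assumes the counters have already been read into text with one "name value" pair per line. That format is an assumption made for illustration and not the documented output of any ParaStation tool. Only the counter names recv_net_data and recv_user are taken from this section.

```python
def parse_counters(text):
    """Parse lines of the assumed form 'name value' into a dict of ints.

    Lines that do not match this form are skipped.
    """
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        try:
            counters[name] = int(value)
        except ValueError:
            continue
    return counters


def counter_deltas(before, after):
    """Return {name: after - before} for every counter that changed."""
    deltas = {}
    for name, new_value in after.items():
        change = new_value - before.get(name, 0)
        if change != 0:
            deltas[name] = change
    return deltas
```

A counter that indicates timeouts, retries or errors and keeps growing between two readings is a likely place to start looking.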

Look for a crystal ball!

Or contact <support@par-tec.com>.

6.5. Problem: different groups of nodes are seen as up or down

Problem: depending on the node on which psiadmin is run, different groups of nodes are seen as "up" or "down".

Check for identical configuration on each node, e.g. compare the configuration file /etc/parastation.conf on each node.
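
With more than a few nodes, comparing checksums is quicker than comparing the files by eye. The following minimal Python sketch groups nodes by the SHA-256 digest of their configuration file. It assumes the contents of /etc/parastation.conf have already been collected from each node by some other means. The function name group_by_config and the node names used below are hypothetical.

```python
import hashlib
from collections import defaultdict


def group_by_config(configs):
    """Group node names by the SHA-256 digest of their configuration file.

    `configs` maps a node name to the bytes of its /etc/parastation.conf.
    More than one group in the result means the nodes disagree.
    """
    groups = defaultdict(list)
    for node, content in configs.items():
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(node)
    return [sorted(nodes) for nodes in groups.values()]
```

If the result contains a single group, all nodes share the same configuration. With several groups, the smallest one usually holds the nodes whose file has drifted.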

6.6. Problem: cannot start process on frontend

Problem: starting a job is canceled with the error message:

Connecting client 139.27.166.22:44784 (rank 6) failed : Network is unreachable

PSIlogger: Child with rank 12 exited with status 1.


ParaStation5 Administrator's Guide
