Troubleshooting Cluster Test

Table 4 Cluster Test Troubleshooting Guide

Symptom | How to diagnose | Possible solution

Symptom: A test terminates right away.

How to diagnose: Check the message in the output window or terminal. Two messages are typical:

    Cannot check out license

    ssh: connect to host 192.168.1.X port 22: No route to host.

Possible solution: For "Cannot check out license", the Platform MPI license has expired. Get a new license and copy it to /opt/hpmpi/licenses.

License failures also occur when the date and time on the head node are not set correctly. This often happens on fresh-from-the-factory machines. Set the date and time with the date command. See date(1) for more information.

License failures can also occur because the dates on the compute nodes are not consistent with the date on the head node. To fix this, select Tools→Sync Node Times.

For the ssh "No route to host" message, the admin network connection to node 192.168.1.X can't be established. Check the Ethernet cable and restart the network daemon on that node.
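The message-to-cause mapping in this entry can be sketched as a small triage helper. This is only an illustration, not part of Cluster Test: the function name classify_failure and the labels it returns are hypothetical, and the matched substrings are taken from the two messages quoted above.

```python
# Hypothetical triage helper; not a Cluster Test tool.
# The matched substrings come from the messages quoted in the table entry;
# the returned labels are illustrative names chosen for this sketch.

def classify_failure(message: str) -> str:
    """Map a captured error message to the likely cause named in the table."""
    text = message.lower()
    if "cannot check out license" in text:
        # Expired license, or wrong or inconsistent dates on the nodes.
        return "license"
    if "ssh: connect to host" in text and "no route to host" in text:
        # The admin network connection to the node cannot be established.
        return "admin-network"
    return "unknown"


if __name__ == "__main__":
    print(classify_failure("Cannot check out license"))
```

A result of "license" points to the license and date checks; "admin-network" points to the Ethernet cable and the network daemon on the affected node.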

 

 

 

Symptom: CrissCross test fails to complete.

How to diagnose: Check the message in the output window or terminal:

    Mpirun: one or more remote shell commands exited with non-zero status which may indicate a remote access problem.

Use the checkic command to find out which nodes have a broken interconnect.

Possible solution: The interconnect between nodes can't be established. You might have a bad cable or a bad interconnect PCI card (InfiniBand), or the driver might not be loaded. Restart the network daemon or openibd on the node having the problem.

Symptom: CrissCross test: a node responds with lower bandwidth than the other nodes.

How to diagnose: Check the interconnect cable and the link LED on the PCI card. Check the firmware of the interconnect PCI card. Use the diagnostics software that comes with the interconnect switch to diagnose the switch.

Possible solution: Replace the interconnect cable, the interconnect PCI card, or both. Update the card firmware. Reseat the line cards on the interconnect switch. Update the switch firmware.
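One way to spot a node with lower bandwidth than the others is to compare each node against the median of all nodes. The sketch below assumes the per-node bandwidth figures have already been collected into a dictionary. The function name slow_nodes, the 0.8 cutoff, and the sample values are illustrative assumptions, not values from Cluster Test.

```python
from statistics import median


def slow_nodes(bandwidth_by_node, min_fraction=0.8):
    """Return the sorted names of nodes below min_fraction of the median.

    bandwidth_by_node maps a node name to its measured bandwidth (any unit,
    as long as it is the same for every node). The 0.8 default is an
    arbitrary assumption; tune it for the interconnect in use.
    """
    if not bandwidth_by_node:
        return []
    typical = median(bandwidth_by_node.values())
    return sorted(
        name
        for name, bandwidth in bandwidth_by_node.items()
        if bandwidth < min_fraction * typical
    )


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = {"node1": 940.0, "node2": 935.0, "node3": 610.0, "node4": 950.0}
    print(slow_nodes(sample))
```

Nodes returned by this check are the ones to inspect first for cable, card, or firmware problems.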

 

 

 

 

 

 

Symptom: Test4 fails to complete.

How to diagnose: Did the CrissCross test complete successfully? Does any node shut itself down during Test4? Observe the Performance Monitor to see if any node drops off or has no activity on the interconnect. See "The performance monitor" (page 33).

Possible solution: If CrissCross did not complete successfully, follow the hints above for troubleshooting the CrissCross test. A node that shuts itself down suggests a heat-related problem: check whether all fans on that node are running at the expected speeds, and replace the fans if they are not. You might need to replace bad nodes.

Symptom: Linpack can't start on a node.

How to diagnose: Check the system date on that node. If the date is far off the current date, Linpack can't start because the hpmpi license might be treated as expired.

Possible solution: Set the system date to the current date.
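The date check can be sketched as a comparison of each node's clock against a reference time such as the head node's. How the node times are collected is left out here. The function name drifted_nodes and the one-day tolerance are assumptions made for illustration; the source does not say how far off a date must be before the license check fails.

```python
from datetime import datetime, timedelta


def drifted_nodes(node_times, reference_time, tolerance=timedelta(days=1)):
    """Return the sorted names of nodes whose clock is far off the reference.

    node_times maps a node name to the datetime read from that node.
    The one-day tolerance is an assumed value, not one given by the source.
    """
    return sorted(
        name
        for name, node_time in node_times.items()
        if abs(node_time - reference_time) > tolerance
    )


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    head = datetime(2012, 6, 1, 12, 0, 0)
    readings = {
        "node1": datetime(2012, 6, 1, 12, 0, 5),
        "node2": datetime(2009, 1, 1, 0, 0, 0),  # factory-default style date
    }
    print(drifted_nodes(readings, head))
```

Any node reported by this check is a candidate for resetting the system date before rerunning Linpack.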

 

 

 

 

 

Symptom: A node shuts itself down during the Linpack test.

How to diagnose: Heat related.

Possible solution: Check the fans on that node. Replace the node.
