Troubleshooting Cluster Test

Table 4 Cluster Test Troubleshooting Guide

Each entry below gives the symptom, how to diagnose it, and a possible solution.

Symptom: A test terminates right away.

How to diagnose: Check the message on the output window or terminal.

Message: Cannot check out license

Possible solution: The Platform MPI license has expired. Get a new license and copy it to /opt/hpmpi/licenses.

The date and time on the head node are not set correctly. This often happens on fresh-from-the-factory machines. Set the date and time with the date command. See date(1) for more information.

License failures can also occur because the dates on the compute nodes are not consistent with the date on the head node. To fix this, select Tools→Sync Node Times.

Message: ssh: connect to host 192.168.1.X port 22: No route to host.

Possible solution: The admin network connection to node 192.168.1.X can’t be established. Check the Ethernet cable. Restart the network daemon on that node.
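The mapping from the two messages above to their likely causes can be sketched in a few lines of Python. This is only an illustration: the function name, the table of fragments, and the hint wording are hypothetical and are not part of Cluster Test.

```python
# Illustrative only: maps message fragments quoted in the troubleshooting
# table to the first thing to check. Names here are hypothetical.
KNOWN_MESSAGES = [
    ("Cannot check out license",
     "Platform MPI license problem: check license expiry and the date "
     "and time on the head node and compute nodes."),
    ("No route to host",
     "Admin network problem: check the Ethernet cable and restart the "
     "network daemon on that node."),
]


def classify_message(line):
    """Return a hint for a known failure message, or None if unrecognized."""
    lowered = line.lower()
    for fragment, hint in KNOWN_MESSAGES:
        if fragment.lower() in lowered:
            return hint
    return None


if __name__ == "__main__":
    sample = "ssh: connect to host 192.168.1.X port 22: No route to host."
    print(classify_message(sample))
```

Matching on a short fragment rather than the whole line keeps the check working when the node address in the message changes.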

Symptom: CrissCross test fails to complete.

How to diagnose: Check the message on the output window or terminal:

Mpirun: one or more remote shell commands exited with non-zero status which may indicate a remote access problem.

Possible solution: Interconnect between nodes can't be established:

Use the checkic command to find out which nodes have a broken interconnect.

You might have a bad cable or bad Interconnect PCI card (InfiniBand, or driver not loaded).

Restart the network daemon or openibd on the node having the problem.

Symptom: CrissCross test: a node responds with lower bandwidth than the others.

How to diagnose: Check the interconnect cable and the link LED on the PCI card.

Possible solution: Replace the interconnect cable, the interconnect PCI card, or both.

How to diagnose: Check firmware of the Interconnect PCI card.

Possible solution: Update card firmware.

How to diagnose: Use diagnostics software that comes with the interconnect switch to diagnose the switch.

Possible solution: Reseat the line cards on the interconnect switch.

Update switch firmware.
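One way to decide which node counts as having lower bandwidth than the others is to compare each node against the median of all nodes. The sketch below assumes the per-node bandwidth figures have already been collected from the test output; the function name, the 10% tolerance, and the sample figures are all hypothetical.

```python
from statistics import median


def slow_nodes(bandwidth_by_node, tolerance=0.10):
    """Return nodes whose bandwidth is more than `tolerance` below the median.

    bandwidth_by_node: dict of node name -> measured bandwidth (any unit,
    as long as it is the same for every node).
    tolerance: hypothetical threshold; 0.10 means 10% below the median.
    """
    if not bandwidth_by_node:
        return []
    typical = median(bandwidth_by_node.values())
    cutoff = typical * (1.0 - tolerance)
    return sorted(node for node, bw in bandwidth_by_node.items() if bw < cutoff)


if __name__ == "__main__":
    # Made-up figures for illustration only, not measured results.
    sample = {"node1": 1500.0, "node2": 1490.0, "node3": 1100.0, "node4": 1510.0}
    print(slow_nodes(sample))
```

The median is used instead of the mean so that a single very slow node does not drag the reference value down with it.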

 

 

Symptom: Test4 fails to complete.

How to diagnose: Did the CrissCross test complete successfully?

Does any node shut itself down during the Test4 test?

Observe the Performance Monitor to see if any node drops off or has no activity on the interconnect. See “The performance monitor” (page 33).

Possible solution: Follow the hints above for troubleshooting the CrissCross test if CrissCross did not complete successfully.

Heat related problem: check to see if all fans on the shut down node are running at expected speeds. If not, replace fans on that node.

You might need to replace bad nodes.

Symptom: Linpack can’t start on a node.

How to diagnose: Check the system date on that node. If the date is far off the current date, Linpack can’t start because the hpmpi license might appear to have expired.

Possible solution: Set the system date to the current date.

Symptom: A node shuts itself down during the Linpack test.

How to diagnose: Heat related.

Possible solution: Check fans on that node.

Replace the node.
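The date check for Linpack startup failures amounts to comparing each node's clock with the head node's clock. A minimal sketch follows, assuming the timestamps have already been gathered (for example by running date on each node). The function name, the five-minute tolerance, and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def nodes_with_clock_skew(head_time, node_times, max_skew=timedelta(minutes=5)):
    """Return {node: skew} for nodes whose clock differs from the head
    node's clock by more than max_skew (a hypothetical tolerance)."""
    skewed = {}
    for node, node_time in node_times.items():
        skew = abs(node_time - head_time)
        if skew > max_skew:
            skewed[node] = skew
    return skewed


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    head = datetime(2024, 1, 15, 12, 0, 0)
    nodes = {
        "node1": datetime(2024, 1, 15, 12, 0, 30),
        "node2": datetime(2000, 1, 1, 0, 0, 0),  # date far off the current date
    }
    for node, skew in nodes_with_clock_skew(head, nodes).items():
        print(node, "is off by", skew)
```

Any node reported here is a candidate for having its date reset before Linpack is run again.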
