Troubleshooting Cluster Test
Table 4 Cluster Test Troubleshooting Guide
Symptom | How to diagnose | Possible solution |
A test terminates right away. | Check the message on the output window or terminal: • Cannot check out license • ssh: connect to host 192.168.1.X port 22: No route to host. | • The Platform MPI license has expired. Get a new license and copy it to /opt/hpmpi/licenses. • The date and time on the head node are not set correctly. This often happens in … License failures can also occur because the dates on the compute nodes are not consistent with the date on the head node. To fix this, select Tools→Sync Node Times. • The admin network connection to node 192.168.1.X can’t be established. Check the Ethernet cable. Restart the network daemon on that node. |
CrissCross test fails to complete. | Check the message on the output window or terminal: • Mpirun: one or more remote shell commands exited with … | Interconnect between nodes can't be established: • Use the checkic command to find out which nodes have a broken interconnect. • You might have a bad cable, a bad interconnect PCI card (for example, InfiniBand), or a driver that is not loaded. • Restart the network daemon or openibd on the node having the problem. |
CrissCross test: a node responds with lower bandwidth than the others. | • Check the interconnect cable and the link LED on the PCI card. • Check the firmware of the interconnect PCI card. • Use the diagnostics software that comes with the interconnect switch to diagnose the switch. | • Replace the interconnect cable, the interconnect PCI card, or both. • Update the card firmware. • Reseat the line cards on the interconnect switch. • Update the switch firmware. |
Test4 fails to complete. | • Did the CrissCross test complete successfully? • Does any node shut itself down during the Test4 test? • Observe the Performance Monitor to see if any node drops off or has no activity on the interconnect. See “The performance monitor” (page 33). | • Follow the hints above for troubleshooting the CrissCross test if CrissCross did not complete successfully. • Heat-related problem: check whether all fans on the node that shut down are running at the expected speeds. If not, replace the fans on that node. • You might need to replace bad nodes. |
Linpack can’t start on a node. | Check the system date on that node. If the date is far from the current date, Linpack can’t start because the hpmpi license might appear to have expired. | Set the system date to the current date. |
A node shuts itself down during the Linpack test. | Heat-related problem. | • Check the fans on that node. • Replace the node. |
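The message checks in Table 4 can be sketched as a small lookup, shown here as an illustration only. The helper below is not part of Cluster Test: the names KNOWN_MESSAGES and triage are hypothetical, and the pairing of each message fragment with one cause and one fix is an assumption drawn from the rows where that message appears in the table.

```python
# Hypothetical helper, not part of Cluster Test. It pairs message fragments
# quoted in Table 4 with the causes and fixes listed in the same table rows.
# The one-to-one pairing is an assumption made for this sketch.
KNOWN_MESSAGES = [
    (
        "Cannot check out license",
        "Platform MPI license expired, or node dates disagree with the head node",
        "Copy a new license to /opt/hpmpi/licenses, or select Tools > Sync Node Times",
    ),
    (
        "No route to host",
        "Admin network connection to the node cannot be established",
        "Check the Ethernet cable and restart the network daemon on that node",
    ),
    (
        "one or more remote shell commands exited with",
        "Interconnect between nodes cannot be established",
        "Run checkic, check the cable and PCI card, restart the network daemon or openibd",
    ),
]


def triage(output):
    """Return a list of (cause, fix) pairs for each known fragment in output.

    Matching is a case-insensitive substring search, so the surrounding
    text (host address, port, prefix) does not need to match exactly.
    """
    lowered = output.lower()
    return [
        (cause, fix)
        for fragment, cause, fix in KNOWN_MESSAGES
        if fragment.lower() in lowered
    ]


if __name__ == "__main__":
    sample = "ssh: connect to host 192.168.1.X port 22: No route to host."
    for cause, fix in triage(sample):
        print(cause)
        print(fix)
```

Text copied from the output window or terminal can be passed to triage as a single string. An empty list means that none of the fragments listed in Table 4 were found, not that the cluster is healthy.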
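Because license failures can come from compute node dates that are not consistent with the head node, it can help to compare the clocks before rerunning a test. The following is a minimal sketch under stated assumptions, not a Cluster Test feature: the function names are hypothetical, the 60-second tolerance is an arbitrary value chosen for illustration, and read_epoch assumes passwordless ssh to each node and a date command that supports +%s.

```python
import subprocess


def read_epoch(host, timeout_s=10):
    """Read a node's clock as seconds since the epoch.

    Assumption: passwordless ssh to the node works and the remote
    date command supports the +%s format.
    """
    result = subprocess.run(
        ["ssh", host, "date", "+%s"],
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_s,
    )
    return int(result.stdout.strip())


def nodes_out_of_sync(head_epoch, node_epochs, tolerance_s=60):
    """Return the sorted names of nodes whose clock differs from the
    head node by more than tolerance_s seconds.

    node_epochs maps a node name to its clock in epoch seconds.
    The default tolerance is an illustrative value, not a documented limit.
    """
    return sorted(
        name
        for name, epoch in node_epochs.items()
        if abs(epoch - head_epoch) > tolerance_s
    )
```

A typical use would be to call read_epoch for the head node and for each compute node, pass the results to nodes_out_of_sync, and then correct the listed nodes with Tools→Sync Node Times.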