Troubleshooting Cluster Test
Table 4 Cluster Test Troubleshooting Guide
Symptom | How to diagnose | Possible solution
A test terminates right away. | Check the message on the output window or terminal: | • The Platform MPI license has expired. Get a new license and copy it to /opt/hpmpi/licenses.
| • Cannot check out license | • The date and time on the head node are not set correctly. This often happens in [...] and time with the date command. See date(1) for more information. License failures can also occur because the dates on the compute nodes are not consistent with the date on the head node. To fix this, select Tools→Sync Node Times.
| • ssh: connect to host 192.168.1.X port 22: No route to host. | • The admin network connection to node 192.168.1.X can't be established. Check the Ethernet cable. Restart the network daemon on that node.
CrissCross test fails to complete. | Check the message on the output window or terminal: | The interconnect between nodes can't be established:
| • Mpirun: one or more remote shell commands exited with [...] which may indicate a remote access problem. | • You might have a bad cable or a bad interconnect PCI card (InfiniBand), or the driver might not be loaded.
| • Use the checkic command to find out which nodes have a broken interconnect. | • Restart the network daemon or openibd on the node having the problem.
CrissCross test: a node responds with lower bandwidth than the others. | • Check the interconnect cable and the link LED on the PCI card. | • Replace the interconnect cable, the interconnect PCI card, or both.
| • Check the firmware of the interconnect PCI card. | • Update the card firmware.
| • Use the diagnostics software that comes with the interconnect switch to diagnose the switch. | • Reseat the line cards on the interconnect switch.
| | • Update the switch firmware.
Test4 fails to complete. | • Did the CrissCross test complete successfully? | • Follow the hints above for troubleshooting the CrissCross test if CrissCross did not complete successfully.
| • Does any node shut itself down during the Test4 test? | • Heat-related problem: check whether all fans on the node that shut down are running at the expected speeds. If not, replace the fans on that node.
| • Observe the Performance Monitor to see if any node drops off or has no activity on the interconnect. See “The performance [...] | • You might need to replace bad nodes.
Linpack can't start on a node. | Check the system date on that node. If the date is far off the current date, Linpack can't start because the hpmpi license might appear to have expired. | Set the system date to the current date.
A node shuts itself down during the Linpack test. | Heat-related. | • Check the fans on that node.
| | • Replace the node.
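The error messages quoted in Table 4 can be matched against captured test output to point at the relevant row. The following is a minimal sketch, not part of Cluster Test itself: the `HINTS` list and the `diagnose` function are hypothetical names introduced for illustration, and the hint text simply paraphrases the table.

```python
from typing import List

# Hypothetical lookup: message fragments quoted in Table 4, each paired with
# a paraphrase of the table's suggested fix. Not an official Cluster Test API.
HINTS = [
    ("Cannot check out license",
     "License problem: check license expiry and the head and compute node dates."),
    ("No route to host",
     "Admin network problem: check the Ethernet cable and restart the network daemon."),
    ("remote shell commands exited with",
     "Interconnect or remote access problem: run checkic, then check cable, card, and driver."),
]


def diagnose(output: str) -> List[str]:
    """Return the hints whose message fragment appears in the test output."""
    lowered = output.lower()  # match case-insensitively
    return [hint for fragment, hint in HINTS if fragment.lower() in lowered]


if __name__ == "__main__":
    sample = "ssh: connect to host 192.168.1.X port 22: No route to host."
    for hint in diagnose(sample):
        print(hint)
```

Matching on short fragments rather than whole lines keeps the lookup tolerant of node addresses and prefixes that vary between runs.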