HP Cluster Test Software manual Troubleshooting Cluster Test, Cluster Test Troubleshooting Guide

Troubleshooting Cluster Test

Table 4 Cluster Test Troubleshooting Guide

Columns: Symptom, How to diagnose, Possible solution

Symptom: A test terminates right away.

How to diagnose: Check the message in the output window or terminal:
    Cannot check out license
    ssh: connect to host 192.168.1.X port 22: No route to host.

Possible solution, for the message "Cannot check out license":
    The Platform MPI license has expired. Get a new license and copy it to /opt/hpmpi/licenses.
    The date and time on the head node are not set correctly. This often happens on fresh-from-the-factory machines. Set the date and time with the date command. See date(1) for more information.
    License failures can also occur because the dates on the compute nodes are not consistent with the date on the head node. To fix this, select Tools→Sync Node Times.

Possible solution, for the message "No route to host":
    The admin network connection to node 192.168.1.X can't be established. Check the Ethernet cable. Restart the network daemon on that node.
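The date consistency check described above can be sketched in a few lines of Python. This is only an illustration, not part of Cluster Test: the function name find_skewed_nodes, the five-minute tolerance, and the sample readings are all assumptions, and collecting the actual clock readings from the nodes is left out.

```python
from datetime import datetime, timedelta


def find_skewed_nodes(head_time, node_times, tolerance=timedelta(minutes=5)):
    """Return the nodes whose clock differs from the head node by more than tolerance.

    head_time  -- datetime read on the head node
    node_times -- dict mapping node name to the datetime read on that node
    tolerance  -- allowed difference (hypothetical default, not an HP value)
    """
    skewed = []
    for node, node_time in sorted(node_times.items()):
        if abs(node_time - head_time) > tolerance:
            skewed.append(node)
    return skewed


# Hypothetical readings: node2 still has a factory-default date.
head = datetime(2012, 1, 15, 12, 0, 0)
readings = {
    "node1": datetime(2012, 1, 15, 12, 0, 30),
    "node2": datetime(2000, 1, 1, 0, 0, 0),
    "node3": datetime(2012, 1, 15, 11, 58, 0),
}
print(find_skewed_nodes(head, readings))  # ['node2']
```

Any node reported by such a check is a candidate for Tools→Sync Node Times or for setting the date manually with the date command.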

Symptom: CrissCross test fails to complete.

How to diagnose: Check the message in the output window or terminal:
    Mpirun: one or more remote shell commands exited with non-zero status which may indicate a remote access problem.

Possible solution: The interconnect between nodes can't be established.
    Use the checkic command to find out which nodes have a broken interconnect.
    You might have a bad cable or a bad interconnect PCI card (InfiniBand, or driver not loaded).
    Restart the network daemon or openibd on the node that has the problem.
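When test output is saved to a file, the message lookup in the rows above can be automated. The following Python sketch matches saved output against the three messages quoted in this table. The function name suggest_causes and the category labels are hypothetical; only the matched message fragments come from the table.

```python
# Message fragments quoted in the troubleshooting table, each paired with a
# short label for the area to investigate (labels are illustrative only).
KNOWN_MESSAGES = [
    ("Cannot check out license", "license or system date"),
    ("No route to host", "admin network"),
    ("remote shell commands exited with non-zero status", "interconnect"),
]


def suggest_causes(output_text):
    """Return the labels whose message fragment appears in output_text."""
    lowered = output_text.lower()
    return [label for fragment, label in KNOWN_MESSAGES
            if fragment.lower() in lowered]


sample = "ssh: connect to host 192.168.1.X port 22: No route to host."
print(suggest_causes(sample))  # ['admin network']
```

An empty result means that none of the known messages was found, so the output needs to be read manually.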

Symptom: CrissCross test: a node shows lower bandwidth than the other nodes.

How to diagnose: Check the interconnect cable and the link LED on the PCI card.
Possible solution: Replace the interconnect cable, the interconnect PCI card, or both.

How to diagnose: Check the firmware of the interconnect PCI card.
Possible solution: Update the card firmware.

How to diagnose: Use the diagnostics software that comes with the interconnect switch to diagnose the switch.
Possible solution: Reseat the line cards on the interconnect switch. Update the switch firmware.
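One way to decide which node has lower bandwidth than the others is to compare each node against the median of all nodes. The Python sketch below illustrates the idea. It does not parse real CrissCross output: the input dictionary, the 0.8 threshold, and the sample values are assumptions chosen for the example.

```python
from statistics import median


def find_slow_nodes(bandwidth_by_node, fraction=0.8):
    """Return the nodes whose bandwidth is below fraction * median bandwidth.

    bandwidth_by_node -- dict mapping node name to measured bandwidth
                         (any unit, as long as it is the same for all nodes)
    fraction          -- hypothetical cutoff; tune it for the cluster at hand
    """
    if not bandwidth_by_node:
        return []
    typical = median(bandwidth_by_node.values())
    return sorted(node for node, bw in bandwidth_by_node.items()
                  if bw < fraction * typical)


# Hypothetical measurements: node3 is well below the rest.
sample_bandwidth = {"node1": 3100.0, "node2": 3080.0,
                    "node3": 1500.0, "node4": 3120.0}
print(find_slow_nodes(sample_bandwidth))  # ['node3']
```

The median is used rather than the mean so that a single very slow node does not pull the reference value down.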

 

 

Symptom: Test4 fails to complete.

How to diagnose:
    Did the CrissCross test complete successfully?
    Does any node shut itself down during the Test4 test?
    Observe the Performance Monitor to see if any node drops off or has no activity on the interconnect. See “The performance monitor” (page 33).

Possible solution:
    If CrissCross did not complete successfully, follow the hints above for troubleshooting the CrissCross test.
    Heat-related problem: check whether all fans on the node that shut down are running at the expected speeds. If not, replace the fans on that node.
    You might need to replace bad nodes.

Symptom: Linpack can't start on a node.

How to diagnose: Check the system date on that node. If the date is far from the current date, Linpack can't start because the hpmpi license might be treated as expired.

Possible solution: Set the system date to the current date.

Symptom: A node shuts itself down during the Linpack test.

How to diagnose: Heat-related problem.

Possible solution:
    Check the fans on that node.
    Replace the node.
