HP Cluster Test Software manual Running accelerator tests, GPU detection, Verify

Page 23

NOTE: Remove all files when you are finished testing with accelerator test.

Running accelerator tests

GPU detection

When you start testnodes.pl -gpu, a test is launched to check all nodes for the presence of accelerator cards (GPUs). If any GPUs are detected and they respond to communication, the node is marked by adding /g<number of GPUs> to the node name in the nodes window. In the example below, each node has three detected and responsive GPUs.

You should compare the number of GPUs indicated in the nodes monitoring window to the actual number of GPUs for each node. Any discrepancies indicate a problem with GPUs on that node. It might be helpful to run the Verify test, described below, to get more information about problem nodes. Additional information on the nodes monitoring window is available at “The nodes monitoring window” (page 15).
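The comparison described above can be sketched in Python. This is a minimal sketch and not part of Cluster Test. It assumes the node labels have been copied by hand from the nodes window as strings of the form name/g<count> (for example n21/g3), and that the expected counts come from your own hardware inventory. The function name gpu_discrepancies and both input structures are hypothetical.

```python
import re

# Assumed label format: "<node>/g<count>", e.g. "n21/g3".
# A label without a "/g" suffix is treated as a node with no responsive GPUs.
LABEL_RE = re.compile(r"^(?P<node>[^/\s]+)(?:/g(?P<count>\d+))?$")


def gpu_discrepancies(labels, expected):
    """Return {node: (expected, detected)} for nodes whose GPU counts differ."""
    detected = {}
    for label in labels:
        match = LABEL_RE.match(label.strip())
        if match is None:
            raise ValueError(f"unrecognized node label: {label!r}")
        detected[match["node"]] = int(match["count"] or 0)

    problems = {}
    for node, want in expected.items():
        got = detected.get(node, 0)
        if got != want:
            problems[node] = (want, got)
    return problems


if __name__ == "__main__":
    labels = ["n21/g3", "n22/g2", "n23"]        # as read from the nodes window
    expected = {"n21": 3, "n22": 3, "n23": 0}   # from your own inventory
    for node, (want, got) in gpu_discrepancies(labels, expected).items():
        print(f"{node}: expected {want} GPUs, detected {got}")
```

Any node the sketch reports is a candidate for the Verify test.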

IMPORTANT: For all the accelerator tests, only nodes with detected GPUs should be selected. Deselect any nodes that do not have GPUs.

Verify

The Verify test is similar to the GPU detection that runs when testnodes.pl -gpu starts. Each selected node is tested for the presence of GPUs using lspci, and each GPU found is then queried. The test report shows the accelerators detected for each node and whether communication with each GPU was successful. If a GPU is installed on a node but not detected, reseat the GPU and repeat the test. An example test report is shown below.

----------------
n21
----------------
**The lspci command shows that there are 3 GPGPUs installed on node
**All 3 GPGPUs appear to be functional on this node

GPU   Model         Video BIOS       Link Speed   Width   Bus ID
0     Tesla S2050   70.00.2f.00.03   5GT/s,       x16,    06.00.0
1     Tesla S2050   70.00.2f.00.03   5GT/s,       x16,    14.00.0
2     Tesla S2050   70.00.2f.00.03   5GT/s,       x16,    11.00.0
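The lspci presence check can be approximated by hand when a node needs a closer look. The following Python sketch is not the Verify test's own implementation. It assumes NVIDIA GPUs appear in lspci output as lines containing "NVIDIA" together with "3D controller" or "VGA compatible controller". The sample text is illustrative only, and the function names are hypothetical.

```python
import subprocess

# Illustrative lspci-style lines; real device strings will differ.
SAMPLE_LSPCI = """\
00:00.0 Host bridge: Example Vendor Host Bridge
06:00.0 3D controller: NVIDIA Corporation Example Device
11:00.0 3D controller: NVIDIA Corporation Example Device
14:00.0 3D controller: NVIDIA Corporation Example Device
"""


def count_nvidia_gpus(lspci_text):
    """Count lspci lines that look like NVIDIA GPU functions."""
    count = 0
    for line in lspci_text.splitlines():
        lowered = line.lower()
        if "nvidia" not in lowered:
            continue
        # Assumed rule: GPUs show up under one of these two device classes;
        # other NVIDIA functions (for example audio) are skipped.
        if "3d controller" in lowered or "vga compatible controller" in lowered:
            count += 1
    return count


def local_gpu_count():
    """Run lspci on the local node and count NVIDIA GPU entries."""
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    return count_nvidia_gpus(result.stdout)


if __name__ == "__main__":
    print(count_nvidia_gpus(SAMPLE_LSPCI))
```

A count lower than the number of installed cards corresponds to the "installed but not detected" case, for which the remedy is to reseat the GPU and repeat the test.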

To use the Verify test report:

Make sure all GPUs are listed for each node.

Verify the Model numbers.

Verify the Video BIOS.

The Link Speed is reported as 2.5, 5, or UNKNOWN. A report of 5 or UNKNOWN indicates that the GPU is running at Gen2 speed and is acceptable. A value of 2.5 might indicate that the GPU is not properly configured. However, this test is timing sensitive, so retest any nodes that report 2.5. If the test consistently reports 2.5 for a GPU, reseat that GPU and repeat the test. If all the GPUs report 2.5, there might be a BIOS setting error.
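The Link Speed guidance above can be expressed as a small decision helper. This Python sketch assumes the Link Speed fields have been copied from one or more Verify reports for a single node, and it treats "consistently" as "in every one of at least two runs", which is an assumption made here and not a rule stated by the test. The function names are hypothetical.

```python
def classify_link_speed(value):
    """Map one reported Link Speed field to 'ok' or 'retest'."""
    # Reports print values such as "5GT/s," so drop the comma and the unit.
    cleaned = value.strip().rstrip(",").upper().replace("GT/S", "").strip()
    if cleaned in ("5", "UNKNOWN"):
        return "ok"        # Gen2 speed, acceptable
    if cleaned == "2.5":
        return "retest"    # the test is timing sensitive, so retest first
    raise ValueError(f"unexpected Link Speed value: {value!r}")


def advise(runs):
    """Suggest an action for one node.

    runs is a list of test runs; each run is a list of Link Speed strings,
    one per GPU, in the same GPU order.
    """
    per_gpu = list(zip(*runs))  # per_gpu[i] holds GPU i's speeds across runs
    slow = [
        all(classify_link_speed(s) == "retest" for s in speeds)
        for speeds in per_gpu
    ]
    if not any(slow):
        return "ok"
    if len(runs) < 2:
        return "retest"                 # a single 2.5 reading is not conclusive
    if all(slow):
        return "check BIOS settings"    # every GPU is consistently at 2.5
    indexes = [i for i, is_slow in enumerate(slow) if is_slow]
    return "reseat GPU " + ", ".join(str(i) for i in indexes)


if __name__ == "__main__":
    first = ["5GT/s,", "2.5GT/s,", "5GT/s,"]
    second = ["5GT/s,", "2.5GT/s,", "5GT/s,"]
    print(advise([first]))
    print(advise([first, second]))
```

With one run the helper only asks for a retest. It suggests reseating or a BIOS check only after repeated runs agree.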
