NOTE: Remove all files when you are finished testing with the accelerator test.

Running accelerator tests

GPU detection

When you start testnodes.pl -gpu, a test is launched to check all nodes for the presence of accelerator cards (GPUs). If any GPUs are detected and respond to communication, the node is marked by adding /g<number of GPUs> to the node name in the nodes window. In the example below, each node has three detected and responsive GPUs.

You should compare the number of GPUs indicated in the nodes monitoring window to the actual number of GPUs for each node. Any discrepancies indicate a problem with GPUs on that node. It might be helpful to run the Verify test, described below, to get more information about problem nodes. Additional information on the nodes monitoring window is available at “The nodes monitoring window” (page 15).
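The comparison described above can also be scripted. The following is a minimal sketch, assuming the node labels have been copied out of the nodes window as plain strings of the form n21/g3 and that you keep your own table of expected GPU counts. The function names, the label pattern, and the sample data are illustrative assumptions, not part of Cluster Test.

```python
import re

# Assumed label format: "<node name>/g<GPU count>", for example "n21/g3".
# A node without the "/g..." suffix is treated as having no detected GPUs.
LABEL_RE = re.compile(r"^(?P<node>[^/\s]+)(?:/g(?P<gpus>\d+))?$")


def detected_gpus(label):
    """Return (node name, detected GPU count) for one node label."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized node label: {label!r}")
    return match.group("node"), int(match.group("gpus") or 0)


def find_discrepancies(labels, expected):
    """Map node name -> (expected, detected) for nodes whose counts differ."""
    problems = {}
    for label in labels:
        node, found = detected_gpus(label)
        want = expected.get(node, 0)
        if found != want:
            problems[node] = (want, found)
    return problems


if __name__ == "__main__":
    # Hypothetical data: n22 should have three GPUs but only two responded.
    labels = ["n21/g3", "n22/g2", "n23"]
    expected = {"n21": 3, "n22": 3, "n23": 0}
    print(find_discrepancies(labels, expected))
```

Any node returned by find_discrepancies is a candidate for the Verify test.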

IMPORTANT: For all the accelerator tests, only nodes with detected GPUs should be selected. Deselect any nodes that do not have GPUs.

Verify

The Verify test is similar to the GPU detection run on testnodes.pl -gpu startup. Each selected node is tested for the presence of GPUs using lspci, and each GPU found is then queried. The test report shows the accelerators detected for each node and whether communication with each GPU was successful. If a GPU is installed on a node but not detected, reseat the GPU and repeat the test. An example test report is shown below.
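The manual does not describe how the Verify test interprets the lspci output. As a rough illustration of the detection step only, the sketch below counts lines in lspci text that look like NVIDIA display or 3D controllers. The matching rule, the function name, and the sample lines are assumptions made for this example, and other GPU vendors are not handled.

```python
def count_nvidia_gpus(lspci_output):
    """Count lspci lines that look like NVIDIA display or 3D controllers."""
    gpu_classes = ("vga compatible controller", "3d controller")
    count = 0
    for line in lspci_output.splitlines():
        lowered = line.lower()
        # Older PCI ID databases spell the vendor "nVidia", so compare in lowercase.
        if "nvidia" in lowered and any(cls in lowered for cls in gpu_classes):
            count += 1
    return count


# Illustrative text in the usual "bus:device.function class: vendor device" layout.
# These lines are made up for the example, not captured from a real node.
SAMPLE = """\
00:00.0 Host bridge: Example Corporation I/O Hub
06:00.0 3D controller: NVIDIA Corporation Example GPU
11:00.0 3D controller: NVIDIA Corporation Example GPU
14:00.0 3D controller: NVIDIA Corporation Example GPU
"""

if __name__ == "__main__":
    print(count_nvidia_gpus(SAMPLE))
```

On a real node, the text could come from running lspci and capturing its standard output. A count lower than the number of installed cards corresponds to the "installed but not detected" case above.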

----------------
n21
----------------
**The lspci command shows that there are 3 GPGPUs installed on node
**All 3 GPGPUs appear to be functional on this node

GPU  Model        Video BIOS      Link Speed  Width  Bus ID
0    Tesla S2050  70.00.2f.00.03  5GT/s,      x16,   06.00.0
1    Tesla S2050  70.00.2f.00.03  5GT/s,      x16,   14.00.0
2    Tesla S2050  70.00.2f.00.03  5GT/s,      x16,   11.00.0

To use the Verify test report:

Make sure all GPUs are listed for each node.

Verify the Model numbers.

Verify the Video BIOS.

The Link Speed is reported as 2.5, 5, or UNKNOWN. A report of 5 or UNKNOWN indicates the GPU is running at Gen2 speed and is acceptable. A value of 2.5 might indicate the GPU is not properly configured. However, this test is timing sensitive, so retest any nodes reporting 2.5. If the test consistently reports 2.5, reseat the GPU and repeat the test. If all the GPUs report 2.5, there might be a BIOS setting error.
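The Link Speed guidance above amounts to a small decision rule, sketched here in Python. The function names and the returned labels are hypothetical, and the manual does not say how many repeated 2.5 results count as "consistently"; the min_runs threshold of 3 is an assumption you should adjust.

```python
ACCEPTABLE = {"5", "UNKNOWN"}  # Gen2 speed, or a reading the manual treats as acceptable
SLOW = "2.5"                   # possible misconfiguration


def normalize(reading):
    """Reduce a report field such as '5GT/s,' to '5', '2.5', or 'UNKNOWN'."""
    value = reading.strip().rstrip(",").upper()
    if value.endswith("GT/S"):
        value = value[:-4]
    return value.strip()


def assess_link_speed(readings, min_runs=3):
    """Suggest an action for one GPU from its Link Speed readings, oldest first.

    Returns 'ok', 'retest', or 'reseat'. min_runs is an assumed threshold for
    what counts as a consistent 2.5 result.
    """
    values = [normalize(r) for r in readings]
    if not values:
        raise ValueError("at least one reading is required")
    for value in values:
        if value not in ACCEPTABLE and value != SLOW:
            raise ValueError(f"unexpected Link Speed value: {value!r}")
    if values[-1] in ACCEPTABLE:
        return "ok"
    if len(values) >= min_runs and all(v == SLOW for v in values):
        return "reseat"
    return "retest"


def bios_setting_suspected(latest_readings):
    """Return True when every GPU in the given set most recently reported 2.5."""
    values = [normalize(r) for r in latest_readings]
    return bool(values) and all(v == SLOW for v in values)


if __name__ == "__main__":
    print(assess_link_speed(["5GT/s,"]))
    print(assess_link_speed(["2.5"]))
    print(assess_link_speed(["2.5", "2.5", "2.5"]))
    print(bios_setting_suspected(["2.5", "2.5", "2.5"]))
```

The rule treats the most recent reading as decisive because the test is timing sensitive: a single 2.5 followed by a passing retest is reported as ok.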

HP Cluster Test Software manual Running accelerator tests, GPU detection, Verify