Profiling with Scali MPI Connect, Example

Chapter 4	Profiling with Scali MPI Connect

The Scali MPI communication library has a number of built-in timing and trace facilities. These features are built into the run-time version of the library, so no recompiling or linking against extra libraries is needed. All MPI calls can be timed, traced, or both. A number of environment variables control this functionality. In addition, an implied barrier call can be inserted automatically before all collective MPI calls. Together, these facilities can give detailed insight into application performance.

The trace and timing facilities are enabled through environment variables, which can either be set and exported or set on the command line just before running mpimon.

There are different tools available that can be useful to detect and analyze the cause of performance bottlenecks:

•Built-in proprietary trace and profiling tools provided with SMC

•Commercial tools that collect information during the run, then postprocess it and present the results afterwards, such as Vampir from Pallas GmbH. See http://www.pallas.de for more information.

The main difference between these tools is that the SMC built-in tools can be used with an existing binary while the other tools require reloading with extra libraries.

The powerful run time facilities Scali MPI Connect trace and Scali MPI Connect timing can be used to monitor and keep track of MPI calls and their characteristics. The various trace and timing options can yield many different views of an application's use of MPI. Common to most of these logs is the massive amount of data, which can be overwhelming, especially when running with many processes and using trace and timing concurrently.

The second part shows the timing of these different MPI calls. The reported time is the sum over all MPI calls in all MPI processes, so with many MPI processes it can look unrealistically high. It nevertheless reflects the total time spent in all MPI calls. When benchmarking focuses on timing rather than on tracing MPI calls, the timing functionality is more appropriate: tracing introduces some overhead and increases the total wall clock run time of the application, whereas timing is relatively lightweight and can be used to time the application for performance benchmarking.
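The arithmetic behind the "unrealistically high" figure can be sketched in a few lines of plain C. This is only an illustration: the function names and the idea of holding per-process times in an array are assumptions made for the example, and none of it is part of the SMC library or its output format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: add up the time (in seconds) that each MPI
 * process spent inside MPI calls. The result is a sum over processes,
 * so it can exceed the wall clock run time of the job. */
double total_mpi_time(const double *per_rank_seconds, size_t n_ranks)
{
    assert(per_rank_seconds != NULL || n_ranks == 0);
    double total = 0.0;
    for (size_t i = 0; i < n_ranks; ++i)
        total += per_rank_seconds[i];
    return total;
}

/* Hypothetical helper: average MPI time per process, which is easier
 * to compare against the wall clock run time than the summed figure. */
double mean_mpi_time_per_rank(const double *per_rank_seconds, size_t n_ranks)
{
    assert(n_ranks > 0);
    return total_mpi_time(per_rank_seconds, n_ranks) / (double)n_ranks;
}
```

For instance, if four processes each spend 2 seconds in MPI calls, the summed figure is 8 seconds, while the average per process is 2 seconds.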

4.1 Example

To illustrate the potential of tracing and timing with Scali MPI Connect, consider the code fragment below (full source reproduced in A-2).

int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    /* read image from file */

    /* broadcast to all nodes */
    MPI_Bcast( &my_count, 1, MPI_INT, 0, MPI_COMM_WORLD );

    /* scatter the image */
    MPI_Scatter( pixels, my_count, MPI_UNSIGNED_CHAR,
                 recvbuf, my_count, MPI_UNSIGNED_CHAR,
                 0, MPI_COMM_WORLD );

    /* sum the squares of the pixels in the sub-image */
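The fragment breaks off at the sum-of-squares step. The following is a rough sketch of what such a local computation could look like on each process after the scatter. It is not the code from A-2: the function name, the parameter names, and the choice of accumulator type are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sum the squares of `count` 8-bit pixel values,
 * e.g. the sub-image a process received in its receive buffer.
 * A wide accumulator is used so that large sub-images do not overflow. */
unsigned long long sum_of_squares(const unsigned char *buf, size_t count)
{
    assert(buf != NULL || count == 0);
    unsigned long long sum = 0;
    for (size_t i = 0; i < count; ++i) {
        unsigned long long p = buf[i];  /* widen before squaring */
        sum += p * p;
    }
    return sum;
}
```

Under these assumptions, each process would call the helper on its own receive buffer, and the per-process results could then be combined with a collective reduction.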


Scali MPI Connect Release 4.4 Users Guide