Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

9.5 Structuring Code with Prefetch Instructions to Hide Memory Latency

Optimization

When utilizing prefetch instructions, attend to:

The time allotted (latency) for data to reach the processor between issuing a prefetch instruction and using the data.

Structuring the code to best take advantage of prefetching.

Application

This optimization applies to:

32-bit software

64-bit software

Rationale

Prefetch instructions bring the cache line containing a specified memory location into the processor cache. (For more information on prefetch instructions, see “Prefetch Instructions” on page 104.) Prefetching hides the main memory load latency, which is typically on the order of a hundred or more processor clock cycles.
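As a minimal sketch of the idea, the following C function requests data a fixed distance ahead of the element it is currently using. It assumes a GCC- or Clang-compatible compiler that provides the __builtin_prefetch builtin; the function name sum_with_prefetch and the PREFETCH_DISTANCE value are hypothetical and chosen only for illustration, not taken from this guide.

```c
#include <stddef.h>

/* Hypothetical tuning value: how many elements ahead of the current
   element to prefetch. With 8-byte doubles and a 64-byte cache line,
   64 elements corresponds to 8 cache lines ahead. */
#define PREFETCH_DISTANCE 64

/* Sum n doubles, issuing a prefetch for data that will be needed
   PREFETCH_DISTANCE iterations later. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Assumption: GCC/Clang builtin. Arguments are the address,
               0 for a read access, and 3 for high temporal locality.
               The call is only a hint and does not change the result. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

This sketch issues one prefetch per element for simplicity. Because each prefetch brings in a whole cache line, tuned code would normally unroll the loop so that only one prefetch is issued per cache line, and would choose the prefetch distance from the measured memory latency.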

There are two types of loops:

Loop type            Description

Memory-limited       Data can be processed and requested faster than it can be fetched from memory.

Processor-limited    Data can be requested and brought into the processor before it is needed because considerable processing occurs during each unrolled loop iteration.
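The distinction between the two loop types can be illustrated with two C loops over the same array. This is a sketch only: the function names sum_elements and sum_polynomials and the degree parameter are hypothetical, and which category a given loop actually falls into depends on the processor, the memory system, and the size of the data.

```c
#include <stddef.h>

/* Likely memory-limited on arrays much larger than the cache:
   only one addition is performed per 8-byte element loaded. */
double sum_elements(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}

/* Likely processor-limited for a large degree: each element loaded
   feeds a chain of `degree` dependent multiply-add steps (Horner
   evaluation of 1 + x + x^2 + ... + x^degree). */
double sum_polynomials(const double *a, size_t n, int degree)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x = a[i];
        double p = 1.0;
        for (int k = 0; k < degree; k++) {
            p = p * x + 1.0;
        }
        total += p;
    }
    return total;
}
```

In the first loop, the processor can consume data faster than memory can deliver it. In the second loop, the work per element leaves time for the next data to arrive before it is needed, provided it is requested early enough.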

The following example illustrates the importance of these considerations. It multiplies a double-precision 32 × 32 matrix A by a transposed double-precision 32 × 32 matrix, B^T, and returns the result in another transposed double-precision 32 × 32 matrix, C^T. (B and C are transposed so that their elements can be accessed efficiently, because matrices in the C programming language are stored in row-major format. Doing the transposition in advance reduces the problem of matrix multiplication to one of computing several dot products, one for each element of the result matrix, C^T. This “dotting” operation is implemented as the sum of pair-wise products of the elements of two equal-length vectors.) For this example, assume the processor clock speed is 2 GHz and the memory latency is 60 ns. In this example, the rows of matrix A are repeatedly


Optimizing with SIMD Instructions

Chapter 9
