25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

“dotted” with a column of BT. Once this is done, the rows of matrix A are “dotted” with the next column of BT, and the process is repeated through all the columns of BT.

From a performance standpoint, there are several caveats to recognize, as follows:

• Once all the rows of A have been multiplied with the first column of B, all the rows of A are in the cache, and subsequent accesses to them do not cause cache misses.

• The rows of BT are brought into the cache by “dotting” the first four rows of A with each row of BT in the Ctr_row_num for-loop.

• The elements of CT are not initially in the cache, so every time a new set of four rows of A is “dotted” with a new row of BT, the processor has to wait for CT to arrive in the cache before the results can be written.
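The access pattern being discussed can be sketched in C. This is a minimal scalar illustration with hypothetical names (`matmul_bt`), not the guide's SIMD routine: each row of A is “dotted” with a row of BT (that is, a column of B), producing one element of CT.

```c
#include <stddef.h>

#define N 32  /* matrix order */

/* Sketch (not the guide's code): compute CT = (A * B)T by "dotting"
 * rows of A with rows of BT.  CT[j][i] is the dot product of row i of A
 * with row j of BT; both operands are traversed with unit stride. */
void matmul_bt(double A[N][N], double BT[N][N], double CT[N][N])
{
    for (size_t j = 0; j < N; j++) {      /* rows of BT (columns of B) */
        for (size_t i = 0; i < N; i++) {  /* rows of A */
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += A[i][k] * BT[j][k];
            CT[j][i] = sum;
        }
    }
}
```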

You can address the last two caveats by prefetching to improve performance. However, to efficiently exploit prefetching, you must structure the code to issue the prefetch instructions such that:

• Enough time is provided for memory requests sent out through prefetch instructions to bring data into the processor’s cache before the data is needed.

• The loops containing the prefetch instructions are ordered to issue sufficient prefetch instructions to fetch all the pertinent data.
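One way to structure the loop along these lines can be sketched in C. This is our own illustration, not the guide's code: `dot_with_prefetch` is a hypothetical helper, and `__builtin_prefetch` (a GCC/Clang builtin that emits the AMD64 prefetch instruction) stands in for the guide's assembly.

```c
#include <stddef.h>

#define N 32  /* matrix order */

/* Sketch: while "dotting" the current row of BT, each of four passes
 * dots eight rows of A and issues one prefetch for one 64-byte line of
 * the NEXT row of BT.  Four passes issue the four prefetches needed to
 * cover the next 256-byte (32-double) row well before it is used. */
void dot_with_prefetch(double A[N][N], double BT[N][N],
                       double CT[N][N], size_t j /* current BT row */)
{
    for (size_t pass = 0; pass < 4; pass++) {
        if (j + 1 < N)  /* one cache line (8 doubles) of the next row */
            __builtin_prefetch(&BT[j + 1][pass * 8]);
        for (size_t i = pass * 8; i < pass * 8 + 8; i++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += A[i][k] * BT[j][k];
            CT[j][i] = sum;
        }
    }
}
```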

The matrix order of 32 is not a coincidence. A double-precision number consists of 8 bytes, and prefetch instructions bring data into the processor in chunks called cache lines, each consisting of 64 bytes (or eight double-precision numbers). We therefore need to issue four prefetch instructions to prefetch a row of BT. Consequently, when multiplying all 32 rows of A with a particular column of B, we want to arrange the for-loop that cycles through the rows of A so that it is repeated four times. To achieve this, we need to dot eight rows of A with a row of BT every time we pass through the Ctr_row_num for-loop. Additionally, “dotting” eight rows of A with a row of BT produces eight doubles of CT (that is, a full cache line).
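The cache-line arithmetic behind these numbers can be made explicit; the helper names below are ours, not the guide's.

```c
/* Hypothetical helpers spelling out the cache-line arithmetic. */

/* Cache lines (and hence prefetch instructions) per row of BT. */
int prefetches_per_bt_row(int order)
{
    int row_bytes = order * 8;  /* 8 bytes per double */
    return row_bytes / 64;      /* 64-byte cache lines */
}

/* Passes through the rows of A when eight rows are dotted per pass. */
int passes_over_a(int order, int rows_per_pass)
{
    return order / rows_per_pass;
}
```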

Assume it takes 60 ns to retrieve data from memory; then we must ensure that at least this much time elapses between issuing a prefetch instruction and the processor loading that data into its registers. The dot product of eight rows of A with a row of BT consists of 512 floating-point operations (dotting a single row of A with a row of BT requires 32 additions and 32 multiplications). The AMD Athlon, AMD Athlon 64, and AMD Opteron processors are capable of performing a maximum of two floating-point operations per clock cycle; therefore, it takes the processor no fewer than 256 clock cycles to process each Ctr_row_num for-loop. Choosing a matrix order of 32 is convenient for these reasons:

• All three matrices (A, BT, and CT) can fit into the processor’s 64-Kbyte L1 data cache.

• On a 2-GHz processor running at full floating-point utilization, 128 ns elapse during the 256 clock cycles, considerably more than the 60 ns needed to retrieve the data from memory.
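Both conditions can be checked with a little arithmetic; the helper names below are hypothetical.

```c
/* Hypothetical helpers checking the two conditions above. */

/* Bytes of L1 data cache occupied by A, BT, and CT together. */
int l1_footprint_bytes(int order)
{
    return 3 * order * order * 8;  /* three matrices of 8-byte doubles */
}

/* Nanoseconds of computation per Ctr_row_num pass: flops divided by
 * flops-per-cycle gives cycles; dividing by GHz converts to ns. */
double pass_time_ns(int flops, double flops_per_cycle, double ghz)
{
    return flops / flops_per_cycle / ghz;
}
```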

Chapter 9

Optimizing with SIMD Instructions
