Chapter 9 Optimizing with SIMD Instructions 201
Software Optimization Guide for AMD64 Processors
25112 Rev. 3.06 September 2005
“dotted” with a column of BT. Once this is done, the rows of matrix A are “dotted” with the next
column of BT, and the process is repeated through all the columns of BT.
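The scheme described above can be sketched in C. This is an illustrative sketch only, not the guide's actual listing; the names matmul_bt, a, bt, and ct are assumptions. Each element of CT is the dot product of a row of A with a row of BT (that is, a column of B):

```c
#include <assert.h>

#define ORDER 32  /* matrix order used throughout this section */

/* Illustrative sketch: compute ct = transpose(a * b), where b is supplied
 * already transposed as bt. Element ct[j][i] is the dot product of row i
 * of a with row j of bt (i.e., column j of b). */
void matmul_bt(double a[ORDER][ORDER],
               double bt[ORDER][ORDER],
               double ct[ORDER][ORDER])
{
    for (int j = 0; j < ORDER; j++)           /* each row of bt (column of b) */
        for (int i = 0; i < ORDER; i++) {     /* each row of a */
            double sum = 0.0;
            for (int k = 0; k < ORDER; k++)
                sum += a[i][k] * bt[j][k];    /* "dot" row i of a with row j of bt */
            ct[j][i] = sum;                   /* ct holds the transposed result */
        }
}
```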
From a performance standpoint, there are several caveats to recognize, as follows:
• Once all the rows of A have been multiplied with the first column of B, all the rows of A are in the cache, and subsequent accesses to them do not cause cache misses.
• The rows of BT are brought into the cache by “dotting” the first four rows of A with each row of BT in the Ctr_row_num for-loop.
• The elements of CT are not initially in the cache, and every time a new set of four rows of A is “dotted” with a new row of BT, the processor has to wait for CT to arrive in the cache before the results can be written.
You can address the last two caveats by prefetching to improve performance. However, to efficiently exploit prefetching, you must structure the code to issue the prefetch instructions such that:
• Enough time is provided for the memory requests sent out by the prefetch instructions to bring data into the processor’s cache before the data is needed.
• The loops containing the prefetch instructions are ordered to issue sufficient prefetch instructions to fetch all the pertinent data.
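As a minimal illustration of both requirements, consider a hypothetical array-sum loop (not from the guide): prefetching a fixed distance ahead provides the lead time, and issuing the prefetch inside the same loop that consumes the data ensures every pertinent cache line is requested. `__builtin_prefetch` is the GCC/Clang intrinsic that emits a prefetch instruction:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch: PF_DIST and sum_with_prefetch are hypothetical names.
 * Prefetch 64 doubles (eight 64-byte cache lines) ahead of the element
 * currently being read. */
#define PF_DIST 64

double sum_with_prefetch(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Request data well before it is needed. Production code would
         * issue one prefetch per 64-byte cache line (every 8 doubles)
         * rather than one per element. */
        if (i + PF_DIST < n)
            __builtin_prefetch(&x[i + PF_DIST], 0, 3);
        s += x[i];
    }
    return s;
}
```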
The matrix order of 32 is not a coincidence. A double-precision number consists of 8 bytes. Prefetch
instructions bring memory into the processor in chunks called cache lines consisting of 64 bytes (or
eight double-precision numbers). We need to issue four prefetch instructions to prefetch a row of BT.
Consequently, when multiplying all 32 rows of A with a particular column of B, we want to arrange
the for-loop that cycles through the rows of A such that it is repeated four times. To achieve this, we
need to dot eight rows of A with a row of BT every time we pass through the Ctr_row_num for-loop.
Additionally, “dotting” eight rows of A with a row of BT produces eight doubles of CT (that is, a full
cache line).
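Putting these pieces together, one possible restructuring is sketched below. This is an assumption-laden sketch, not the guide's listing: `__builtin_prefetch` (a GCC/Clang intrinsic) stands in for the prefetch instruction, and the names are illustrative. Each pass dots eight rows of A with the current row of BT, so the four passes per column issue the four prefetches covering the next row of BT, and each pass writes one full cache line of CT:

```c
#include <assert.h>

#define ORDER 32   /* 32 doubles per row = four 64-byte cache lines */

/* Sketch: dot 8 rows of a with row j of bt per pass. blk takes the values
 * 0, 8, 16, 24, so over the four passes all four cache lines of the next
 * row of bt (and of the next row of ct, soon to be written) are requested. */
void matmul_bt_prefetch(double a[ORDER][ORDER],
                        double bt[ORDER][ORDER],
                        double ct[ORDER][ORDER])
{
    for (int j = 0; j < ORDER; j++) {              /* row of bt = column of b */
        for (int blk = 0; blk < ORDER; blk += 8) { /* four passes of 8 rows */
            if (j + 1 < ORDER) {
                __builtin_prefetch(&bt[j + 1][blk], 0, 3);  /* next bt row (read) */
                __builtin_prefetch(&ct[j + 1][blk], 1, 3);  /* next ct row (write) */
            }
            for (int i = blk; i < blk + 8; i++) {  /* 8 rows of a */
                double sum = 0.0;
                for (int k = 0; k < ORDER; k++)    /* 32 multiplies + 32 adds */
                    sum += a[i][k] * bt[j][k];
                ct[j][i] = sum;  /* 8 doubles: one cache line of ct per pass */
            }
        }
    }
}
```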
Assume it takes 60 ns to retrieve data from memory; then we must ensure that at least this much time
elapses between issuing the prefetch instruction and the processor loading that data into its registers.
The dot-product of eight rows of A with a row of BT consists of 512 floating-point operations (dotting
a single row of A with a row of BT consists of 32 additions and 32 multiplications). The
AMD Athlon, AMD Athlon 64, and AMD Opteron processors are capable of performing a maximum
of two floating-point operations per clock cycle; therefore, it takes the processor no less than
256 clock cycles to process each Ctr_row_num for-loop.
Choosing a matrix order of 32 is convenient for these reasons:
• All three matrices A, BT, and CT can fit into the processor’s 64-Kbyte L1 data cache.
• On a 2-GHz processor running at full floating-point utilization, 128 ns elapse during the
256 clock cycles, considerably more than the 60 ns needed to retrieve the data from memory.
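The arithmetic behind these two points can be checked directly. The constants below restate the section's stated assumptions (60-ns memory latency, 2-GHz clock, peak of two floating-point operations per cycle); the names are illustrative:

```c
#include <assert.h>

/* Worked check of the numbers in this section. */
enum {
    ORDER          = 32,
    FLOPS_PER_PASS = 8 * (32 + 32),         /* 8 rows x (32 mul + 32 add) = 512 */
    MIN_CYCLES     = FLOPS_PER_PASS / 2,    /* 256 at 2 FLOPs per cycle */
    MATRIX_BYTES   = 3 * ORDER * ORDER * 8  /* A, BT, CT as 8-byte doubles */
};

/* Minimum time for one Ctr_row_num pass on a processor running at ghz GHz:
 * one cycle lasts (1 / ghz) ns, so MIN_CYCLES cycles last MIN_CYCLES / ghz ns. */
double pass_time_ns(double ghz)
{
    return MIN_CYCLES / ghz;
}
```

At 2 GHz this gives 128 ns per pass, comfortably above the assumed 60-ns memory latency, and the three matrices occupy 24 Kbytes, well under the 64-Kbyte L1 data cache.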