
In general, unrolling loops improves performance by giving the processor opportunities to work on data for the next loop iteration while it waits for the result of an operation from the previous iteration. The reciprocal_sqrt_1xloop loop performs the reciprocation and square root on the remaining elements that do not form a full segment of 16 floating-point values. The previous function is the only example in this chapter that handles a vector stream of arbitrary length num_points; this is done to conserve space, but every example in this chapter can be modified in a similar manner for general use.
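To make that structure concrete, the following is a minimal C sketch using SSE intrinsics (the guide's own listings are in assembly). The names recip_sqrt, x, r, and num_points are illustrative, and the full-precision sqrt-and-divide sequence merely stands in for whatever reciprocal square-root method the actual listing uses. The sketch shows a 4x-unrolled main loop that consumes 16 floats (64 bytes) per iteration, followed by a remainder loop, in the role of reciprocal_sqrt_1xloop, for the elements that do not form a full 16-float segment.

/* Sketch only: 4x-unrolled SSE main loop plus a remainder loop.
 * x and r are assumed to be 16-byte aligned so MOVAPS-style aligned
 * loads and stores (_mm_load_ps/_mm_store_ps) are legal. */
#include <stddef.h>
#include <math.h>
#include <xmmintrin.h>   /* SSE intrinsics */

void recip_sqrt(const float *x, float *r, size_t num_points)
{
    const __m128 one = _mm_set1_ps(1.0f);
    size_t i = 0;
    size_t full = num_points & ~(size_t)15;   /* largest multiple of 16 */

    /* Unrolled loop: four aligned 16-byte loads per iteration (64 bytes). */
    for (; i < full; i += 16) {
        __m128 a0 = _mm_load_ps(x + i);
        __m128 a1 = _mm_load_ps(x + i + 4);
        __m128 a2 = _mm_load_ps(x + i + 8);
        __m128 a3 = _mm_load_ps(x + i + 12);
        /* 1/sqrt(x) via full-precision square root and divide. */
        _mm_store_ps(r + i,      _mm_div_ps(one, _mm_sqrt_ps(a0)));
        _mm_store_ps(r + i + 4,  _mm_div_ps(one, _mm_sqrt_ps(a1)));
        _mm_store_ps(r + i + 8,  _mm_div_ps(one, _mm_sqrt_ps(a2)));
        _mm_store_ps(r + i + 12, _mm_div_ps(one, _mm_sqrt_ps(a3)));
    }

    /* Remainder loop (the role of reciprocal_sqrt_1xloop): handles the
     * leftover elements that do not fill a full 16-float segment. */
    for (; i < num_points; i++)
        r[i] = 1.0f / sqrtf(x[i]);
}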

Additionally, the previous SSE function makes use of the PREFETCHNTA instruction to hide memory latency. The unrolled loop reciprocal_sqrt_4xloop was chosen to work on 64 bytes of data per iteration, which is the size of one cache line (the quantum of data brought into the processor’s cache by a memory access when the data does not already reside there). The prefetch instructs the processor to load the floating-point operands of the reciprocal and square-root operations four loop iterations in advance. While the processor works on the next three iterations, the data for the fourth iteration is brought into the cache, so the processor does not have to wait for the operands of the aligned SSE move instruction MOVAPS to arrive from memory before operating on them. This type of memory optimization can be very useful in gaming and high-performance computing, where data sets are unlikely to reside in the processor’s cache. For example, in a simulation involving a million vertices or atoms whose coordinates require 12 bytes of storage per vertex, the data occupies roughly 12 Mbytes, far more than the processor’s cache can hold.
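The placement of the prefetch can be sketched in the same C-intrinsics style. In this hedged example, each iteration consumes one 64-byte cache line, and PREFETCHNTA (issued through the _mm_prefetch intrinsic with the non-temporal hint) requests the line that will be needed four iterations, or 256 bytes, ahead; the prefetch distance, function name, and pointer names are illustrative rather than taken from the guide's listing.

/* Sketch only: the unrolled loop from the previous example with a
 * non-temporal prefetch issued four iterations (256 bytes) ahead. */
#include <stddef.h>
#include <xmmintrin.h>

void recip_sqrt_prefetch(const float *x, float *r, size_t num_points)
{
    const __m128 one = _mm_set1_ps(1.0f);
    size_t full = num_points & ~(size_t)15;

    for (size_t i = 0; i < full; i += 16) {
        /* Request the cache line needed four iterations from now.
         * PREFETCHNTA never faults, so running past the end of x is safe. */
        _mm_prefetch((const char *)(x + i) + 256, _MM_HINT_NTA);

        __m128 a0 = _mm_load_ps(x + i);        /* aligned MOVAPS loads */
        __m128 a1 = _mm_load_ps(x + i + 4);
        __m128 a2 = _mm_load_ps(x + i + 8);
        __m128 a3 = _mm_load_ps(x + i + 12);

        _mm_store_ps(r + i,      _mm_div_ps(one, _mm_sqrt_ps(a0)));
        _mm_store_ps(r + i + 4,  _mm_div_ps(one, _mm_sqrt_ps(a1)));
        _mm_store_ps(r + i + 8,  _mm_div_ps(one, _mm_sqrt_ps(a2)));
        _mm_store_ps(r + i + 12, _mm_div_ps(one, _mm_sqrt_ps(a3)));
    }
    /* Remainder elements would be handled as in the previous sketch. */
}

The non-temporal hint fits this access pattern because each element is read once and not reused, so there is little benefit to keeping the prefetched lines resident in the cache at the expense of other data.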
