Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

5.13 Memory Copy

Optimization

For a very fast general-purpose memory copy routine, call the libc memcpy() function included with the Microsoft or gcc tools. This function features optimizations for all block sizes and alignments.
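As a minimal sketch of this recommendation, the C fragment below simply delegates a block copy to the library memcpy(). The wrapper name copy_block is hypothetical and exists only for illustration; it assumes the source and destination buffers do not overlap, as memcpy() itself requires.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper (not part of the guide): copies n bytes between
   non-overlapping buffers by calling the library memcpy(), leaving the
   choice of strategy for size and alignment to the library. */
void *copy_block(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    return memcpy(dst, src, n);   /* returns dst */
}
```

A caller would typically write copy_block(dst, src, sizeof src) for a fixed-size buffer. In practice, calling memcpy() directly is equivalent; the wrapper only adds a debug-time pointer check.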

Application

This optimization applies to:

32-bit software

64-bit software

Rationale

The memcpy() routines included with recent compilers from Microsoft and gcc feature optimizations for all block sizes and alignments for AMD Athlon 64 and AMD Opteron processors.

Copying Small Data Structures

Use inline assembly code to copy a small data structure in cache. Use an unrolled series of MOV instructions. Alternate loads and stores in a load/store/load/store sequence, or use a load/load/store/store sequence for even better performance. Align the destination (and source) if possible.

Example 1

The following 64-bit example copies 18 bytes of data:

;rsi = source
;rdi = destination
mov  r8, [rsi]        ; 8 bytes of source
mov  r9, [rsi+8]      ; next 8 bytes of source
mov  [rdi], r8        ; write 8 bytes
mov  [rdi+8], r9      ; write next 8
mov  r8w, [rsi+16]    ; read two bytes "r8 word"
mov  [rdi+16], r8w    ; write the last 2 bytes
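For readers who work in C rather than assembly, the same 18-byte copy can be sketched as follows. This is an illustration under stated assumptions, not the guide's method. The function name copy18 is hypothetical, the buffers are assumed not to overlap, and the compiler is free to reorder or merge these accesses, so the load/load/store/store order shown here is not guaranteed in the generated code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper mirroring Example 1: copies exactly 18 bytes as
   two 8-byte loads, two 8-byte stores, then one 2-byte load and store.
   Fixed-size memcpy() calls are used for the individual accesses so the
   code stays valid C for any alignment. */
void copy18(void *dst, const void *src)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    uint64_t a, b;
    uint16_t c;

    assert(dst != NULL && src != NULL);

    memcpy(&a, s, 8);        /* load bytes 0..7    */
    memcpy(&b, s + 8, 8);    /* load bytes 8..15   */
    memcpy(d, &a, 8);        /* store bytes 0..7   */
    memcpy(d + 8, &b, 8);    /* store bytes 8..15  */
    memcpy(&c, s + 16, 2);   /* load bytes 16..17  */
    memcpy(d + 16, &c, 2);   /* store bytes 16..17 */
}
```

Optimizing compilers commonly lower each fixed-size memcpy() to a single move, but that is a property of the compiler and its settings, not something this sketch enforces.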

Example 2

The following example illustrates how to copy blocks of 32 bytes and larger, in cache. This code performs best when the source and destination addresses are 8-byte aligned. Align the destination


Cache and Memory Optimizations

Chapter 5
