25112 Rev. 3.06 September 2005

Software Optimization Guide for AMD64 Processors

8.3 Repeated String Instructions

Optimization

Avoid using the REP prefix when performing string operations, especially when copying blocks of memory.

Rationale

In general, using the REP prefix to repeatedly perform string instructions is less efficient than other methods, especially when copying blocks of memory. For a discussion of alternative memory-copy methods, see “Memory Copy” on page 120.

Latency of Repeated String Instructions

Table 6 shows the latency of repeated string instructions on the AMD Athlon 64 and AMD Opteron processors.

Table 6 lists the latencies for DF = 0 (increment) and DF = 1 (decrement), where DF is the direction flag. These latencies assume aligned memory operands. Note that for MOVS and STOS, the overhead portion of the latency increases significantly when DF = 1; however, such uses are less common. To determine the latency, evaluate the formula and round the result up to the nearest integer.

Table 6. Latency of Repeated String Instructions

                 Number of Cycles
Instruction      When ECX = 0     When ECX = c [1], DF = 0     When ECX = c [1], DF = 1
rep movs         11               15 + (1 * c)                 25 + (4/3 * c)
rep stos         11               14 + (1 * c)                 24 + (1 * c)
rep lods         11               15 + (2 * c)                 15 + (2 * c)
rep scas         11               15 + (5/2 * c)               15 + (5/2 * c)
rep cmps         11               16 + (10/3 * c)              16 + (10/3 * c)

Note:
1. c > 0
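
As an illustration of how the Table 6 formulas can be evaluated, the following C sketch computes the latency for a given count and direction flag and rounds the result up to the nearest integer. The names rep_formula, rep_latency, table6, and rep_latency_cycles are hypothetical and do not come from this guide; the coefficients are copied from Table 6 and, like the table, assume aligned memory operands.

```c
#include <assert.h>
#include <stdint.h>

/* Coefficients of one Table 6 formula: overhead + (num/den) * c. */
struct rep_formula {
    uint64_t overhead;
    uint64_t num;
    uint64_t den;
};

/* One row of Table 6 (illustrative layout, not part of the guide). */
struct rep_latency {
    const char *name;
    struct rep_formula df0;  /* DF = 0 (increment) */
    struct rep_formula df1;  /* DF = 1 (decrement) */
};

static const struct rep_latency table6[] = {
    { "rep movs", { 15,  1, 1 }, { 25,  4, 3 } },
    { "rep stos", { 14,  1, 1 }, { 24,  1, 1 } },
    { "rep lods", { 15,  2, 1 }, { 15,  2, 1 } },
    { "rep scas", { 15,  5, 2 }, { 15,  5, 2 } },
    { "rep cmps", { 16, 10, 3 }, { 16, 10, 3 } },
};

/* Latency in cycles for a count c (the value in ECX) and direction flag df. */
uint64_t rep_latency_cycles(const struct rep_latency *row, uint32_t c, int df)
{
    assert(row != 0);
    assert(df == 0 || df == 1);

    if (c == 0)
        return 11;  /* Table 6, "When ECX = 0" column */

    const struct rep_formula *f = df ? &row->df1 : &row->df0;

    /* Integer ceiling of (num * c) / den; the overhead is already an integer. */
    return f->overhead + (f->num * c + f->den - 1) / f->den;
}
```

For example, rep_latency_cycles(&table6[0], 100, 1) evaluates 25 + (4/3 * 100) for rep movs with DF = 1 and rounds up to 159 cycles, compared with 115 cycles for the same count with DF = 0.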


Guidelines for Repeated String Instructions

The following sections contain guidelines for carefully scheduling VectorPath repeated string instructions to help achieve good performance.

