Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

8.2 Alternative Code for Multiplying by a Constant

Optimization

Devise instruction sequences with lower latency to accomplish multiplication by certain constant multipliers.

Rationale

A 32-bit integer multiplied by a constant has a latency of 3 cycles; a 64-bit integer multiplied by a constant has a latency of 4 cycles. For certain constant multipliers, instruction sequences can be devised that accomplish the multiplication with lower latency. Because the AMD Athlon 64 and AMD Opteron processors contain only one integer multiplier but three integer execution units, the replacement code can provide better throughput as well.

Most replacement sequences require the use of an additional temporary register, thus increasing register pressure. If register pressure in a piece of code that performs integer multiplication with a constant is already high, it could be better for the overall performance of that code to use the IMUL instruction instead of the replacement code. Similarly, replacement sequences with low latency but containing many instructions may negatively influence decode bandwidth as compared to the IMUL instruction. In general, replacement sequences containing more than four instructions are not recommended.

The following code samples are designed so that the original source register receives the final result. Other sequences are possible if the result is in a different register. Sequences that do not require a temporary register are favored over ones requiring a temporary register, even if the latency is higher. Arithmetic-logic-unit operations are preferred over shifts to keep code size small. Similarly, both arithmetic-logic-unit operations and shifts are favored over the LEA instruction.

There are improvements in the AMD Athlon 64 and AMD Opteron processors’ multiplier over that of previous x86 processors. For this reason, when doing 32-bit multiplication, only use the alternative sequence if the alternative sequence has a latency that is less than or equal to 2 cycles. For 64-bit multiplication, only use the alternative sequence if the alternative sequence has a latency that is less than or equal to 3 cycles.
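The selection rule above can be sketched as a small predicate. This is only an illustration: the function name `use_replacement` and its parameters are hypothetical, and the sketch covers just the latency thresholds and the four-instruction guideline, not register pressure or decode bandwidth.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of any AMD tool or library.
 * operand_bits:            32 or 64, the width of the multiplication
 * seq_latency_cycles:      latency of the candidate replacement sequence
 * seq_instruction_count:   number of instructions in that sequence
 * Returns true when the guidelines in this section favor the replacement
 * sequence over IMUL. Register pressure is deliberately not modelled. */
bool use_replacement(int operand_bits,
                     int seq_latency_cycles,
                     int seq_instruction_count)
{
    /* 32-bit: latency must be <= 2 cycles; 64-bit: <= 3 cycles. */
    int max_latency = (operand_bits == 64) ? 3 : 2;

    /* Sequences of more than four instructions are not recommended. */
    if (seq_instruction_count > 4) {
        return false;
    }
    return seq_latency_cycles <= max_latency;
}
```

For example, a one-instruction, 2-cycle sequence qualifies for a 32-bit multiply, while a 3-cycle sequence qualifies only for a 64-bit multiply.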

Examples

by 2:    add reg1, reg1             ; 1 cycle

by 3:    lea reg1, [reg1+reg1*2]    ; 2 cycles

by 4:    shl reg1, 2                ; 1 cycle

by 5:    lea reg1, [reg1+reg1*4]    ; 2 cycles
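As a rough check of the arithmetic behind the four sequences above, the following C sketch models each one on a 32-bit unsigned value. The function names are hypothetical, and the C code only mirrors the computation each instruction performs; it says nothing about the instructions a compiler will actually emit or about their latency.

```c
#include <stdint.h>

/* Each function models one replacement sequence on a 32-bit register.
 * Unsigned arithmetic wraps modulo 2^32, matching register truncation. */

/* add reg1, reg1 */
uint32_t mul_by_2(uint32_t x) { return x + x; }

/* lea reg1, [reg1+reg1*2] */
uint32_t mul_by_3(uint32_t x) { return x + (x << 1); }

/* shl reg1, 2 */
uint32_t mul_by_4(uint32_t x) { return x << 2; }

/* lea reg1, [reg1+reg1*4] */
uint32_t mul_by_5(uint32_t x) { return x + (x << 2); }
```

Each function returns the same value as the corresponding multiplication, including when the product overflows 32 bits, because both forms wrap modulo 2^32.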


Integer Optimizations

Chapter 8
