Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

6.1 Density of Branches

Optimization

When possible, align branches such that they do not cross a 16-byte boundary.

Application

This optimization applies to:

32-bit software

64-bit software

Rationale

The AMD Athlon™ 64 and AMD Opteron™ processors have the capability to cache branch-prediction history for a maximum of three near branches (CALL, JMP, conditional branches, or returns) per 16-byte fetch window. A branch instruction that crosses a 16-byte boundary is counted in the second 16-byte window. Due to architectural restrictions, a branch that is split across a 16-byte boundary cannot dispatch with any other instructions when it is predicted taken. Perform this alignment by rearranging code; it is not beneficial to align branches using padding sequences.
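The window rule above comes down to simple address arithmetic. The following is a minimal sketch, not part of the guide: the function names are hypothetical, and it assumes you already know each branch's start address and encoded length (for example, from a disassembly listing).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Size of the fetch window described in the text. */
#define FETCH_WINDOW_BYTES 16u

/* Index of the 16-byte window that contains a given byte address. */
uint64_t window_index(uint64_t address)
{
    return address / FETCH_WINDOW_BYTES;
}

/* True when an instruction of `length` bytes starting at `start`
 * has its first and last bytes in different 16-byte windows. */
bool crosses_window(uint64_t start, unsigned length)
{
    assert(length > 0);
    return window_index(start) != window_index(start + length - 1);
}

/* Window in which a branch is counted. Per the text, a branch that
 * crosses a boundary is counted in the second window, which is the
 * window holding its last byte. */
uint64_t counted_window(uint64_t start, unsigned length)
{
    assert(length > 0);
    return window_index(start + length - 1);
}
```

For instance, a 2-byte jcc rel8 at address 0x100F occupies bytes 0x100F and 0x1010, so it crosses a boundary and is counted in the window that starts at 0x1010. The same instruction at 0x100E stays inside the window that starts at 0x1000.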

The following branches are limited to three per 16-byte window:

jcc rel8
jcc rel32
jmp rel8
jmp rel32
jmp reg
jmp WORD PTR
jmp DWORD PTR
call rel16
call r/m16
call rel32
call r/m32
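One way to check a code listing against the three-per-window limit is to tally branches by the window in which each is counted. The sketch below is illustrative only: `struct branch_site` and `count_dense_windows` are hypothetical names, and the input is assumed to be sorted by start address and to contain only branch forms from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FETCH_WINDOW_BYTES 16u
#define MAX_BRANCHES_PER_WINDOW 3u

/* Hypothetical record for one near branch in a code listing. */
struct branch_site {
    uint64_t start;   /* address of the first byte */
    unsigned length;  /* encoded length in bytes */
};

/* Returns how many 16-byte windows hold more than three branches.
 * `sites` must be sorted by start address. A branch that crosses a
 * boundary is attributed to the window containing its last byte. */
size_t count_dense_windows(const struct branch_site *sites, size_t n)
{
    size_t dense = 0;
    uint64_t current = 0;
    unsigned in_current = 0;

    for (size_t i = 0; i < n; i++) {
        assert(sites[i].length > 0);
        uint64_t w = (sites[i].start + sites[i].length - 1)
                     / FETCH_WINDOW_BYTES;
        if (in_current == 0 || w != current) {
            current = w;      /* first branch seen in a new window */
            in_current = 0;
        }
        in_current++;
        /* Report each overfull window once, when the fourth branch lands. */
        if (in_current == MAX_BRANCHES_PER_WINDOW + 1)
            dense++;
    }
    return dense;
}
```

A nonzero result marks code worth rearranging. Because the guide advises against padding sequences, the remedy is to move instructions, not to insert filler.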

Coding more than three branches in the same 16-byte code window may lead to conflicts in the branch target buffer. To avoid conflicts in the branch target buffer, space out branches such that three

Branch Optimizations

Chapter 6
