25112  Rev. 3.06  September 2005    Software Optimization Guide for AMD64 Processors

Chapter 5   Cache and Memory Optimizations .......................................... 91

5.1     Memory-Size Mismatches ...................................................... 92
5.2     Natural Alignment of Data Objects ........................................... 95
5.3     Cache-Coherent Nonuniform Memory Access (ccNUMA) ............................ 96
5.4     Multiprocessor Considerations ............................................... 99
5.5     Store-to-Load Forwarding Restrictions ...................................... 100
5.6     Prefetch Instructions ...................................................... 104
5.7     Streaming-Store/Non-Temporal Instructions .................................. 112
5.8     Write-Combining ............................................................ 113
5.9     L1 Data Cache Bank Conflicts ............................................... 114
5.10    Placing Code and Data in the Same 64-Byte Cache Line ....................... 116
5.11    Sorting and Padding C and C++ Structures ................................... 117
5.12    Sorting Local Variables .................................................... 119
5.13    Memory Copy ................................................................ 120
5.14    Stack Considerations ....................................................... 122
5.15    Cache Issues when Writing Instruction Bytes to Memory ...................... 123
5.16    Interleave Loads and Stores ................................................ 124

Chapter 6   Branch Optimizations ................................................... 125

6.1     Density of Branches ........................................................ 126
6.2     Two-Byte Near-Return RET Instruction ....................................... 128
6.3     Branches That Depend on Random Data ........................................ 130
6.4     Pairing CALL and RETURN .................................................... 132
6.5     Recursive Functions ........................................................ 133
6.6     Nonzero Code-Segment Base Values ........................................... 135
6.7     Replacing Branches with Computation ........................................ 136
6.8     The LOOP Instruction ....................................................... 141
6.9     Far Control-Transfer Instructions .......................................... 142

Chapter 7   Scheduling Optimizations ............................................... 143

7.1     Instruction Scheduling by Latency .......................................... 144
7.2     Loop Unrolling ............................................................. 145

Contents    v
