Software Optimization Guide for AMD64 Processors

25112 Rev. 3.06 September 2005

REP string with low variable counts 168
unroll small loops 13

unrolling loops 145

M

memory

dynamic memory allocation 19
pushing memory data 157

MMX™ instructions
PANDN instruction 137
PREFETCHNTA/T0/T1/T2 instructions 105

MOVZX and MOVSX instructions 153
multiplication

by constant 164

multiplies over division, floating-point 238
muxing constructs 136

N

Nonuniform Memory Access 96

O

operands

largest possible operand size, repeated string 168

P

parallelism 35
PF2ID instructions 52
pointers

dereferenced arguments 44
use array-style code instead 10

population-count function 179
prefetch

determining distance 108
multiple 107

PREFETCH and PREFETCHW instructions 104, 106, 108
prototypes 29

R

recursive functions 132

register reads and writes, partial 81
REP prefix 168

S

scalar code translated into 3DNow! code 138
scheduling 144

SHLD instruction 85
SHR instruction 85

single-byte near-return RET instruction (opcode C3h) 128
SSE 193, 355

SSE2 193, 355

stack

alignment considerations 122
store-to-load forwarding 20, 22, 100–103

string instructions 167
structure (struct) 41, 117, 119
subexpressions, explicitly extract common 37
superscalar processor 251

switch statement 25, 28, 33

U

unit-stride access 105, 110

W

write combining 113, 260, 263–264, 266

X

XOR instruction 169

368

Index
