AMD Athlon™ Processor x86 Code Optimization

22007E/0 — November 1999

Step 3

For the first time, the value in each k-bit field is small enough

 

that adding two k-bit fields results in a value that still fits in the

 

k-bit field. Thus the following computation is performed:

 

y = (x + (x >> 4)) & 0x0F0F0F0F

 

The result is four 8-bit fields whose lower half has the desired

 

sum and whose upper half contains "junk" that has to be

 

masked out. In a symbolic form:

 

x

= 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh

 

x >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg

 

sum

= 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ

 

The WWWW, XXXX, YYYY, and ZZZZ values are the

 

interesting sums with each at most 1000b, or 8 decimal.

Step 4

The four 4-bit sums can now be rapidly accumulated by means

of a multiply with a "magic" multiplier. This can be derived from looking at the following chart of partial products:

0p0q0r0s * 01010101 =

:0p0q0r0s

0p:0q0r0s

0p0q:0r0s

0p0q0r:0s

000pxxww:vvuutt0s

Here p, q, r, and s are the 4-bit sums from the previous step, and

vvis the final result in which we are interested. Thus, the final result:

z = (y * 0x01010101) >> 24

Example:

unsigned int popcount(unsigned int v)

{

unsigned int retVal;

 

__asm

{

 

MOV

EAX, [v]

;v

MOV

EDX, EAX

;v

SHR

EAX, 1

;v >> 1

AND

EAX, 055555555h

;(v >> 1) & 0x55555555

SUB

EDX, EAX

;w = v - ((v >> 1) & 0x55555555)

MOV

EAX, EDX

;w

SHR

EDX, 2

;w >> 2

AND

EAX, 033333333h

;w & 0x33333333

AND

EDX, 033333333h

;(w >> 2) & 0x33333333

92

Efficient Implementation of Population Count Function

Page 108
Image 108
AMD x86 manual Bit field. Thus the following computation is performed