Beginnings of a document on random

P4 optimizations and insights

 

Those considering assembly language optimizations on the P4 should first read the Intel manuals at

http://developer.intel.com/design/Pentium4/manuals/ especially the Processor Optimization Reference manual at http://developer.intel.com/design/pentium4/manuals/248966.htm.

 

Each section tries to analyze one small aspect of the P4 architecture.  It is followed by a conclusion or speculation regarding the observed behavior.

 

Each code snipet is timed in a loop of at least 100 iterations.  Memory is aligned on a 4KB page boundary.  The code is then run 200 times and the best timing is reported.  Note my machine is a 1.4 GHz P4 with PC600 memory (theoretical bandwidth of 2.4 GB/sec).

 

Note, the Intel manuals sometimes refer to cache line size as 128 bytes and sometimes as 64 bytes.  In the examples below I’ll refer to a cache line as 64 bytes.  The L1 cache line size is 64 bytes.  The L2 cache is kind of a hybrid 128/64 byte cache.  A cache line fill reads 128 bytes, but my tests show that the L2 cache maintains separate dirty bits for each 64-byte chunk.

 

 

Reading contiguous data

 

This code reads a contiguous block of memory using 4 bytes at a time.  Timings are done on 3 memory size.  4KB will read from the L1 cache only, 64KB will read from the L2 cache only, and 1MB will test reading from main memory.

 

        mem = 4096            ; Read 4KB

        cnt = mem/64          ; Read 64 bytes per iteration

        mov     ecx, cnt

loop1:  mov     eax, [esi]              ; Read one cache line

        mov     eax, [esi+4]             ; 4 bytes at a time

        mov     eax, [esi+8]

        mov     eax, [esi+12]

        mov     eax, [esi+16]

        mov     eax, [esi+20]

        mov     eax, [esi+24]

        mov     eax, [esi+28]

        mov     eax, [esi+32]

        mov     eax, [esi+36]

        mov     eax, [esi+40]

        mov     eax, [esi+44]

        mov     eax, [esi+48]

        mov     eax, [esi+52]

        mov     eax, [esi+56]

        mov     eax, [esi+60]

        lea     esi, [esi+64]            ; Next cache line

        dec     ecx

        jnz     loop1

        lea     esi, [esi-mem]      ; Restore esi

 

Timing for the above on a 1.4GHz P4 with PC600 memory:

Mem

Clocks

Clocks per loop iteration

Bandwidth

Comments

4096 (4KB)

1037

16.2

5.53GB/sec

The expected result

65536 (64KB)

23625

23.07

3.88GB/sec

Any good theories here?

1048576 (1MB)

1180000

72.02

1.24GB/sec

50% of theoretical throughput.  Why?

 

Now using MMX instructions:

 

        mem = 4096                      ; Read 4KB

        cnt = mem/64                    ; 64 bytes per iteration

        mov     ecx, cnt

loop1:  movq    mm1, [esi]               ; Read one cache line

        movq    mm1, [esi+8]             ; 8 bytes at a time

        movq    mm1, [esi+16]

        movq    mm1, [esi+24]

        movq    mm1, [esi+32]

        movq    mm1, [esi+40]

        movq    mm1, [esi+48]

        movq    mm1, [esi+56]

        lea     esi, [esi+64]            ; Next cache line

        dec     ecx

        jnz     loop1

        lea     esi, [esi-mem]

 

Timing for the above on a 1.4GHz P4 with PC600 memory:

Mem

Clocks

Clocks per loop iteration

Bandwidth

Comments

4096 (4KB)

523

8.17

10.96GB/sec

As expected.

65536 (64KB)

15395

15.03

5.96GB/sec

 

1048576 (1MB)

1036299

63.25

1.42GB/sec

Still not theoretical throughput.  Why?

 

Now using XMM instructions:

 

        mem = 4096                      ; Read 4KB

        cnt = mem/64                    ; 64 bytes per iteration

        mov     ecx, cnt

loop1:  movapd  xmm1, [esi]               ; Read one cache line

        movapd  xmm1, [esi+16]       ; 16 bytes at a time

        movapd  xmm1, [esi+32]

        movapd  xmm1, [esi+48]

        lea     esi, [esi+64]            ; Next cache line

        dec     ecx

        jnz     loop1

        lea     esi, [esi-mem]

 

Timing for the above on a 1.4GHz P4 with PC600 memory:

Mem

Clocks

Clocks per loop iteration

Bandwidth

Comments

4096 (4KB)

268

4.19

21.40GB/sec

 

65536 (64KB)

7766

7.58

11.81GB/sec

 

1048576 (1MB)

817776

49.91

1.80GB/sec

Still not theoretical throughput.  Why?

 

Conclusions:  The P4 is unable to hide all the latencies when reading from the L2 cache.  This is somewhat surprising, as it should be getting near perfect branch prediction.  How does Intel justify claims of 44.8GB/sec for L2 cache bandwidth?

 

One would hope that the P4 would get closer to its 2.4GB/sec peak bandwidth.  Why isn’t it?

 

 

Now we will see if using the SSE2 prefetch instructions can get us closer to the 2.4GB/sec theoretical bandwidth.  We’ll also see how much as a performance penalty prefetch incurs when reading from the L2 cache.

 

        mem = 16*16*4096                ; Read 1MB

        pages = mem/4096                ; Number of 4KB pages

        mov     ecx, pages

        sub     ebx, ebx

loop0:  mov     eax, [esi+4096]         ; Preload next page TLB

loop1:  movapd  xmm1, [esi]             ; Read two cache lines

        movapd  xmm1, [esi+16]          ; 16 bytes at a time

        movapd  xmm1, [esi+32]

        movapd  xmm1, [esi+48]

        movapd  xmm1, [esi+64]

        movapd  xmm1, [esi+80]

        movapd  xmm1, [esi+96]

        movapd  xmm1, [esi+112]

        prefetcht1 [esi+4096]           ; Preload lines in next page

        lea     esi, [esi+128]          ; Next cache lines

        add     bl, 256/32              ; 32 iterations per page

        jnc     loop1

        dec     ecx

        jnz     loop0

        lea     esi, [esi-mem]

 

Timing for the above on a 1.4GHz P4 with PC600 memory:

Mem

Clocks

Clocks per 64 byte cache line

Bandwidth

Comments

65536 (64KB)

8602

8.40

10.67GB/sec

Unnecessary prefetch causes some overhead

1048576 (1MB)

737952

45.04

1.99GB/sec

Still not theoretical throughput.  Why?

 

Conclusions: The prefetch instruction caused a modest 11% overhead (8602 vs. 7766 clocks) when not reading from memory.  This overhead may well disappear in real-world code where the loop is actually doing real work.

 

Reading non-contiguous data

 

This example reads an entire block of memory, but in a non-contiguous fashion.

 

        mem = 4096                      ; Read 4KB

        cnt = mem/(4*128)               ; 128 bytes per iteration

        dist = 64

        sub     ebx, ebx

        mov     ecx, cnt

loop1:  movapd  xmm1, [esi+0*dist]      ; Read 8 cache lines

        movapd  xmm1, [esi+1*dist]

        movapd  xmm1, [esi+2*dist]

        movapd  xmm1, [esi+3*dist]

        movapd  xmm1, [esi+4*dist]

        movapd  xmm1, [esi+5*dist]

        movapd  xmm1, [esi+6*dist]

        movapd  xmm1, [esi+7*dist]

        lea     esi, [esi+16]           ; Same 8 cache lines

        add     bl, 256/4           ; 4 inner loop iterations

        jnc     loop1

        lea     esi, [esi-4*16+8*dist]  ; Next set of 8 cache lines

        dec     ecx

        jnz     loop1

        lea     esi, [esi-mem]

 

Timing for the above on a 1.4GHz P4 with PC600 memory:

Mem

Clocks

Clocks per 64 byte cache line

Bandwidth

Comments

4096 (4KB)

268

4.19

21.40GB/sec

Same as contiguous

65536 (64KB)

5900

5.76

15.55GB/sec

Much faster!

1048576 (1MB)

818193

49.94

1.80GB/sec

Same as contiguous

 

Conclusions:  This is very interesting.  If your application’s data fits in the L2 cache, grouping your L1 cache misses together increases your throughput.  You may want to try preloading L2 cache lines into the L1 cache in an earlier iteration.  For example, adding the line

        mov     eax, [esi+512]          ; Preload cache line into L1

in the contiguous read code speeds up the 64KB timing.

 

If the prefetchnt1 SSE2 instruction works as advertised, the trick of grouping L2 cache misses together should also benefit the 1MB read example.

 

Writing contiguous data

 

        mem = 4096                      ; Write 4KB

        cnt = mem/64                    ; 64 bytes per iteration

        mov     ecx, cnt

loop1:  movapd  [esi], xmm1           ; Write one cache line

        movapd  [esi+16], xmm1      ; 16 bytes at a time

        movapd  [esi+32], xmm1

        movapd  [esi+48], xmm1

        lea     esi, [esi+64]            ; Next cache line

        dec     ecx

        jnz     loop1

        lea     esi, [esi-mem]

 

Timing for the above on a 1.4GHz P4 with PC600 memory:

Mem

Clocks

Clocks per 64 byte cache line

Bandwidth

Comments

4096 (4KB)

902

14.09

6.36GB/sec

Slower than reading

65536 (64KB)

14372

14.04

6.38GB/sec

Slower than reading.

1048576 (1MB)

1768230

107.92

0.83GB/sec

 

 

Conclusions:  The write-through L1 cache is a killer here.  It takes 14 clocks to write the L1 cache line back to the L2 cache.  This places a limit on how fast you can code an application that reads and writes

only from the L1 cache.  To keep the CPU fully working you must find 14 clocks of work to do for each 64 bytes your program writes.

 

The main memory benchmark is also interesting.  Obviously, the processor needs to read the cache line from memory before doing the write operation.  The theoretical maximum bandwidth is thus 1.2GB/sec.  This example achieves only 66% of theoretical, while the contiguous read example achieves 75% of theoretical.  Why?

 

Now let’s try it using the movntpd instruction.  This should avoid the reading of the cache line prior to writing the cache line.

 

        mem = 4096                      ; Write 4KB

        cnt = mem/64                    ; 64 bytes per iteration

        mov     ecx, cnt

loop1:  movntpd  [esi], xmm1           ; Write one cache line

        movntpd  [esi+16], xmm1      ; 16 bytes at a time

        movntpd  [esi+32], xmm1

        movntpd  [esi+48], xmm1

        lea     esi, [esi+64]            ; Next cache line

        dec     ecx

        jnz     loop1

        lea     esi, [esi-mem]

 

Timing for the above on a 1.4GHz P4 with PC600 memory:

Mem

Clocks

Clocks per 64 byte cache line

Bandwidth

Comments

65536 (64KB)

45052

44.00

2.03GB/sec

Slower, but see conclusions.

1048576 (1MB)

722035

44.07

2.03GB/sec

 

 

Conclusions:  At first glance, the L2 cache test run looks slower than the movapd case.  This is misleading.  Using the movapd instruction left the L2 full of dirty cache lines that must one day be written to main memory.  What’s worse is the cache lines written to the L2 cache may have ousted other cache lines that might have been used in the future.  These dirty cache lines will slow down future cache line reads (the cache line will need to be written before the new cache line can be read in).  So, use movntpd for those cases where you truly will not be using the data in the future.

 

The main memory benchmark is also interesting.  I can find no reason why this should be faster than the read contiguous data case, but it it.  Why?

 

Writing non-contiguous data

 

        mem = 4096                      ; Write 4KB

        cnt = mem/(4*128)               ; 128 bytes per iteration

        dist = 64

        sub     ebx, ebx

        mov     ecx, cnt

loop1:  movapd  [esi+0*dist], xmm1      ; Write 8 cache lines

        movapd  [esi+1*dist], xmm1

        movapd  [esi+2*dist], xmm1

        movapd  [esi+3*dist], xmm1

        movapd  [esi+4*dist], xmm1

        movapd  [esi+5*dist], xmm1

        movapd  [esi+6*dist], xmm1

        movapd  [esi+7*dist], xmm1

        lea     esi, [esi+16]           ; Same cache lines

        add     bl, 256/4           ; 4 inner loop iterations

        jnc     loop1

        lea     esi, [esi-4*16+8*dist]  ; Next set of 8 cache lines

        dec     ecx

        jnz     loop1

        lea     esi, [esi-mem]

 

Timing for the above on a 1.4GHz P4 with PC600 memory:

Mem

Clocks

Clocks per iteration

Bandwidth

Comments

4096 (4KB)

2845

88.25

2.02GB/sec

Slower than reading

65536 (64KB)

45516

88.25

2.02GB/sec

Slower than reading.

1048576 (1MB)

1957866

239.00

0.75GB/sec

Slower than reading.  Why?

 

Conclusions:  Group your writes together!!!  Failure to do so reduces throughput by 66%.  Here each cache-line write takes 11 clocks instead of  14 in the contiguous case.  This test below will tell us if write-combining is helping to some degree.

 

Now we’ll do the same as above only writing to 16 bytes in each cache line.

 

        mem = 4096                      ; Write 4KB

        cnt = mem/(8*64)                ; 8 cache lines each iteration

        dist = 64

        mov     ecx, cnt

loop1:  movapd  [esi+0*dist], xmm1      ; Write 8 cache lines

        movapd  [esi+1*dist], xmm1

        movapd  [esi+2*dist], xmm1

        movapd  [esi+3*dist], xmm1

        movapd  [esi+4*dist], xmm1

        movapd  [esi+5*dist], xmm1

        movapd  [esi+6*dist], xmm1

        movapd  [esi+7*dist], xmm1

        lea     esi, [esi+8*dist]       ; Next set of 8 cache lines

        dec     ecx

        jnz     loop1

        lea     esi, [esi-mem]

 

Timing for the above on a 1.4GHz P4 with PC600 memory:

Mem

Clocks

Clocks per iteration

Comments

4096 (4KB)

706

88.25

Same as timing above

 

Conclusions:  This indicates that each write takes 11 clocks.  Write-combining was of no value in the previous benchmark.  I speculate the 14 clocks in the contiguous write case is broken down as follows, 11 for the write to L2 cache and 1 clock for each of the 3 movapd write-combinings that took place.

 

Cost of a TLB miss

 

This code reads one cache line from each 4KB page.  The L1 and L2 cache line collisions are minimized so that we can analyze the cost of a TLB miss.

 

        tlbs = 32

        cnt = tlbs/8

        mov     ecx, cnt

        sub     eax, eax

loop1:  movapd  xmm1, [esi]             ; Read from a 4KB page

        lea     esi, [esi+4096+64]      ; Different line, next page

        add     al, 256/8               ; 8 inner loops - for

        jnc     loop1                    ; perfect branch prediction

        dec     ecx

        jnz     loop1

        lea     esi, [esi-tlbs*(4096+64)]

 

Timing for the above on a 1.4GHz P4 with PC600 memory:

Tlbs

Clocks

Clocks per TLB

Comments

32

82

2.56

 

64

147

2.3

Timings vary widely

72

357

4.95

TLB miss penalty starting to kick in

128

2608

20.38

Should be no TLB hits

 

Conclusion:  A TLB miss where the page table entry is in the L2 cache costs 18 clocks.

 

Real world programs – reading and writing contiguous data

 

The above examples are nice, but most applications read data, operate on it, and write it back.

 

        mem = 4096                      ; Work on 4KB

        cnt = mem/64                    ; 64 bytes per iteration

        mov     ecx, cnt

loop1:  movapd  xmm0, [esi]             ; Read one cache line

        movapd  xmm1, [esi+16]

        movapd  xmm2, [esi+32]

        movapd  xmm3, [esi+48]

        subpd   xmm0, xmm0              ; Operate on the data

        pxor    xmm1, xmm1

        subpd   xmm2, xmm2

        pxor    xmm3, xmm3

        movapd  [esi], xmm0             ; Write the cache line

        movapd  [esi+16], xmm1

        movapd  [esi+32], xmm2

        movapd  [esi+48], xmm3

        lea     esi, [esi+64]           ; Next cache line

        dec     ecx

        jnz     loop1

        lea     esi, [esi-mem]

 

Timing for the above on a 1.4GHz P4 with PC600 memory:

Mem

Clocks

Clocks per 64 byte cache line

Bandwidth

Comments

4096 (4KB)

901

14.07

6.36GB/sec

Limited by write-through to L2 speed

65536 (64KB)

24669

24.09

3.72GB/sec

 

1048576 (1MB)

1893992

115.60

0.78GB/sec

 

 

Conclusions:  We need a good explanation for the 24 clock number in the L2 cache timing.  Ideas?

 

If you want to write a program that is CPU bound instead of memory bandwidth bound, you must perform at least 116 clocks of ALU/FPU instructions for every 64 bytes read!  Obviously, it is very difficult to do this for applications whose data does not fit in the L2 cache.

 

Now let’s try replacing the last 4 movapd instructions with movntpd:

Mem

Clocks

Clocks per 64 byte cache line

Bandwidth

Comments

1048576 (1MB)

1822550

111.24

0.81GB/sec

 

 

Conclusion:  Not much difference.  This could be within our margin or error in timing or it could indicate that there is a slight penalty for reading into a dirty cache line as opposed to writing dirty cache lines immediately.

 

In coding one application, I thought it would be advantageous to use the clflush command to write out dirty cache lines during a period where the program was CPU bound (a kind of prefetcht1 instruction in reverse!)  This turned out to be a losing move.  Adding a “clflush [esi]” instruction after the 4 write instructions balloons the timing to 2625967 clocks.  Avoid clflush at all costs.

 

 

Now what happens when we read from one area of memory and write to another:

 

        mem = 1048576                   ; Work on 1MB

        cnt = mem/64                    ; 64 bytes per iteration

        mov     ecx, cnt

loop1:  movapd  xmm0, [esi]             ; Read one cache line

        movapd  xmm1, [esi+16]

        movapd  xmm2, [esi+32]

        movapd  xmm3, [esi+48]

        subpd   xmm0, xmm0              ; Operate on the data

        pxor    xmm1, xmm1

        subpd   xmm2, xmm2

        pxor    xmm3, xmm3

        movntpd  [esi+mem], xmm0        ; Write the cache line

        movntpd  [esi+mem+16], xmm1

        movntpd  [esi+mem+32], xmm2

        movntpd  [esi+mem+48], xmm3

        lea     esi, [esi+64]           ; Next cache line

        dec     ecx

        jnz     loop1

        lea     esi, [esi-mem]

 

Timing for the above on a 1.4GHz P4 with PC600 memory:

Mem

Clocks

Clocks per 64 byte cache line

Bandwidth

Comments

1048576 (1MB)

1680771

102.59

0.87GB/sec

 

 

Conclusions:  Why would this be 10% faster than writing to the cache line just read in?

 

Tricks with SSE floating point exponents

 

In my FFT macros, I often need to multiply an XMM register by two.  At first I thought of two options:

            ADDPD    xmm, xmm            ;; Add register to itself

Or

            XMM_TWO  DQ     2.0, 2.0

            MULPD    xmm, XMM_TWO        ;; Multiply by a global variable

I later thought of a third option:

            XMM_INC_EXP DB     1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0

            PADDB    xmm, XMM_INC_EXP   ;; Add 1 to each exponent!

 

In my case the ADDPD option was unattractive because I was already doing a lot of ADDPDs and they can only be issued every other clock cycle.  The PADDB instruction is an attractive alternative to the MULPD instruction in that it has a shorter latency.  Note:  I haven’t tested the above yet!

 

There are probably other opportunities for such optimizations.