P4 optimizations and
insights
Those considering assembly language optimizations on the P4 should first read the Intel manuals at
http://developer.intel.com/design/Pentium4/manuals/ especially the Processor Optimization Reference manual at http://developer.intel.com/design/pentium4/manuals/248966.htm.
Each section tries to analyze one small aspect of the P4 architecture. It is followed by a conclusion or speculation regarding the observed behavior.
Each code snipet is timed in a loop of at least 100 iterations. Memory is aligned on a 4KB page boundary. The code is then run 200 times and the best timing is reported. Note my machine is a 1.4 GHz P4 with PC600 memory (theoretical bandwidth of 2.4 GB/sec).
Note, the Intel manuals sometimes refer to cache line size as 128 bytes and sometimes as 64 bytes. In the examples below I’ll refer to a cache line as 64 bytes. The L1 cache line size is 64 bytes. The L2 cache is kind of a hybrid 128/64 byte cache. A cache line fill reads 128 bytes, but my tests show that the L2 cache maintains separate dirty bits for each 64-byte chunk.
This code reads a contiguous block of memory using 4 bytes at a time. Timings are done on 3 memory size. 4KB will read from the L1 cache only, 64KB will read from the L2 cache only, and 1MB will test reading from main memory.
mem =
4096 ; Read 4KB
cnt =
mem/64 ; Read 64 bytes per
iteration
mov ecx, cnt
loop1: mov eax, [esi] ; Read one cache line
mov eax, [esi+4] ; 4 bytes at a time
mov eax, [esi+8]
mov eax, [esi+12]
mov eax, [esi+16]
mov eax, [esi+20]
mov eax, [esi+24]
mov eax, [esi+28]
mov eax, [esi+32]
mov eax, [esi+36]
mov eax, [esi+40]
mov
eax, [esi+44]
mov eax, [esi+48]
mov eax, [esi+52]
mov eax, [esi+56]
mov eax, [esi+60]
lea esi, [esi+64] ; Next cache line
dec ecx
jnz loop1
lea esi, [esi-mem] ; Restore esi
Timing for the above on a 1.4GHz P4 with PC600 memory:
|
Mem |
Clocks |
Clocks per loop iteration |
Bandwidth |
Comments |
|
4096 (4KB) |
1037 |
16.2 |
5.53GB/sec |
The expected result |
|
65536 (64KB) |
23625 |
23.07 |
3.88GB/sec |
Any good theories here? |
|
1048576 (1MB) |
1180000 |
72.02 |
1.24GB/sec |
50% of theoretical throughput. Why? |
Now using MMX instructions:
mem = 4096 ; Read 4KB
cnt = mem/64 ; 64 bytes per iteration
mov ecx, cnt
loop1: movq
mm1, [esi] ;
Read one cache line
movq mm1, [esi+8] ;
8 bytes at a time
movq mm1, [esi+16]
movq mm1, [esi+24]
movq mm1, [esi+32]
movq mm1, [esi+40]
movq mm1, [esi+48]
movq mm1, [esi+56]
lea esi, [esi+64] ;
Next cache line
dec ecx
jnz loop1
lea esi, [esi-mem]
Timing for the above on a 1.4GHz P4 with PC600 memory:
|
Mem |
Clocks |
Clocks per loop iteration |
Bandwidth |
Comments |
|
4096 (4KB) |
523 |
8.17 |
10.96GB/sec |
As expected. |
|
65536 (64KB) |
15395 |
15.03 |
5.96GB/sec |
|
|
1048576 (1MB) |
1036299 |
63.25 |
1.42GB/sec |
Still not theoretical throughput. Why? |
Now using XMM instructions:
mem = 4096 ; Read 4KB
cnt = mem/64 ; 64 bytes per iteration
mov ecx, cnt
loop1: movapd
xmm1, [esi] ;
Read one cache line
movapd xmm1, [esi+16] ; 16 bytes at a time
movapd xmm1, [esi+32]
movapd xmm1, [esi+48]
lea esi, [esi+64] ;
Next cache line
dec ecx
jnz loop1
lea esi, [esi-mem]
Timing for the above on a 1.4GHz P4 with PC600 memory:
|
Mem |
Clocks |
Clocks per loop iteration |
Bandwidth |
Comments |
|
4096 (4KB) |
268 |
4.19 |
21.40GB/sec |
|
|
65536 (64KB) |
7766 |
7.58 |
11.81GB/sec |
|
|
1048576 (1MB) |
817776 |
49.91 |
1.80GB/sec |
Still not theoretical throughput. Why? |
Conclusions: The P4
is unable to hide all the latencies when reading from the L2 cache. This is somewhat surprising, as it should be
getting near perfect branch prediction.
How does Intel justify claims of 44.8GB/sec for L2 cache
bandwidth?
One would hope that the P4 would get closer to its 2.4GB/sec
peak bandwidth. Why isn’t it?
Now we will see if using the SSE2 prefetch instructions can get us closer to the 2.4GB/sec theoretical bandwidth. We’ll also see how much as a performance penalty prefetch incurs when reading from the L2 cache.
mem = 16*16*4096 ; Read 1MB
pages = mem/4096 ; Number of 4KB pages
mov ecx, pages
sub ebx, ebx
loop0: mov
eax, [esi+4096] ; Preload
next page TLB
loop1: movapd
xmm1, [esi] ; Read
two cache lines
movapd xmm1, [esi+16] ;
16 bytes at a time
movapd xmm1, [esi+32]
movapd xmm1, [esi+48]
movapd xmm1, [esi+64]
movapd xmm1, [esi+80]
movapd xmm1, [esi+96]
movapd xmm1, [esi+112]
prefetcht1 [esi+4096] ; Preload lines in next page
lea esi, [esi+128]
; Next cache lines
add bl, 256/32
; 32 iterations per page
jnc loop1
dec ecx
jnz loop0
lea esi, [esi-mem]
Timing for the above on a 1.4GHz P4 with PC600 memory:
|
Mem |
Clocks |
Clocks per 64 byte cache line |
Bandwidth |
Comments |
|
65536 (64KB) |
8602 |
8.40 |
10.67GB/sec |
Unnecessary prefetch causes some overhead |
|
1048576 (1MB) |
737952 |
45.04 |
1.99GB/sec |
Still not theoretical throughput. Why? |
Conclusions: The prefetch instruction caused a modest 11% overhead (8602 vs. 7766 clocks) when not reading from memory. This overhead may well disappear in real-world code where the loop is actually doing real work.
This example reads an entire block of memory, but in a non-contiguous fashion.
mem = 4096 ; Read 4KB
cnt = mem/(4*128) ; 128 bytes per iteration
dist = 64
sub ebx, ebx
mov ecx, cnt
loop1: movapd
xmm1, [esi+0*dist] ; Read 8
cache lines
movapd xmm1, [esi+1*dist]
movapd xmm1, [esi+2*dist]
movapd xmm1, [esi+3*dist]
movapd xmm1, [esi+4*dist]
movapd xmm1, [esi+5*dist]
movapd xmm1, [esi+6*dist]
movapd xmm1, [esi+7*dist]
lea esi, [esi+16]
; Same 8 cache lines
add bl, 256/4 ;
4 inner loop iterations
jnc loop1
lea esi, [esi-4*16+8*dist]
; Next set of 8 cache lines
dec ecx
jnz loop1
lea esi, [esi-mem]
Timing for the above on a 1.4GHz P4 with PC600 memory:
|
Mem |
Clocks |
Clocks per 64 byte cache line |
Bandwidth |
Comments |
|
4096 (4KB) |
268 |
4.19 |
21.40GB/sec |
Same as contiguous |
|
65536 (64KB) |
5900 |
5.76 |
15.55GB/sec |
Much faster! |
|
1048576 (1MB) |
818193 |
49.94 |
1.80GB/sec |
Same as contiguous |
Conclusions: This is very interesting. If your application’s data fits in the L2 cache, grouping your L1 cache misses together increases your throughput. You may want to try preloading L2 cache lines into the L1 cache in an earlier iteration. For example, adding the line
mov eax, [esi+512]
; Preload cache line into L1
in the contiguous read code speeds up the 64KB timing.
If the prefetchnt1 SSE2 instruction works as advertised, the trick of grouping L2 cache misses together should also benefit the 1MB read example.
mem = 4096 ; Write 4KB
cnt = mem/64 ; 64 bytes per iteration
mov ecx, cnt
loop1: movapd
[esi], xmm1 ; Write one cache line
movapd [esi+16], xmm1 ; 16 bytes at a time
movapd [esi+32], xmm1
movapd [esi+48], xmm1
lea esi, [esi+64] ;
Next cache line
dec ecx
jnz loop1
lea esi, [esi-mem]
Timing for the above on a 1.4GHz P4 with PC600 memory:
|
Mem |
Clocks |
Clocks per 64 byte cache line |
Bandwidth |
Comments |
|
4096 (4KB) |
902 |
14.09 |
6.36GB/sec |
Slower than reading |
|
65536 (64KB) |
14372 |
14.04 |
6.38GB/sec |
Slower than reading. |
|
1048576 (1MB) |
1768230 |
107.92 |
0.83GB/sec |
|
Conclusions: The write-through L1 cache is a killer here. It takes 14 clocks to write the L1 cache line back to the L2 cache. This places a limit on how fast you can code an application that reads and writes
only from the L1 cache. To keep the CPU fully working you must find 14 clocks of work to do for each 64 bytes your program writes.
The main memory benchmark is also interesting. Obviously, the processor needs to read the
cache line from memory before doing the write operation. The theoretical maximum bandwidth is thus
1.2GB/sec. This example achieves only
66% of theoretical, while the contiguous read example achieves 75% of
theoretical. Why?
Now let’s try it using the movntpd instruction. This should avoid the reading of the cache line prior to writing the cache line.
mem = 4096 ; Write 4KB
cnt = mem/64 ; 64 bytes per iteration
mov ecx, cnt
loop1: movntpd
[esi], xmm1 ; Write one cache line
movntpd [esi+16], xmm1 ; 16 bytes at a time
movntpd [esi+32], xmm1
movntpd [esi+48], xmm1
lea esi, [esi+64] ;
Next cache line
dec ecx
jnz loop1
lea esi, [esi-mem]
Timing for the above on a 1.4GHz P4 with PC600 memory:
|
Mem |
Clocks |
Clocks per 64 byte cache line |
Bandwidth |
Comments |
|
65536 (64KB) |
45052 |
44.00 |
2.03GB/sec |
Slower, but see conclusions. |
|
1048576 (1MB) |
722035 |
44.07 |
2.03GB/sec |
|
Conclusions: At first glance, the L2 cache test run looks slower than the movapd case. This is misleading. Using the movapd instruction left the L2 full of dirty cache lines that must one day be written to main memory. What’s worse is the cache lines written to the L2 cache may have ousted other cache lines that might have been used in the future. These dirty cache lines will slow down future cache line reads (the cache line will need to be written before the new cache line can be read in). So, use movntpd for those cases where you truly will not be using the data in the future.
The main memory benchmark is also interesting. I can find no reason why this should be
faster than the read contiguous data case, but it it. Why?
mem = 4096 ; Write 4KB
cnt = mem/(4*128) ; 128 bytes per iteration
dist = 64
sub ebx, ebx
mov ecx, cnt
loop1: movapd
[esi+0*dist], xmm1 ; Write 8
cache lines
movapd [esi+1*dist],
xmm1
movapd [esi+2*dist], xmm1
movapd [esi+3*dist], xmm1
movapd [esi+4*dist], xmm1
movapd [esi+5*dist], xmm1
movapd [esi+6*dist], xmm1
movapd [esi+7*dist], xmm1
lea
esi, [esi+16] ; Same cache lines
add bl, 256/4 ;
4 inner loop iterations
jnc loop1
lea esi, [esi-4*16+8*dist]
; Next set of 8 cache lines
dec ecx
jnz loop1
lea esi, [esi-mem]
Timing for the above on a 1.4GHz P4 with PC600 memory:
|
Mem |
Clocks |
Clocks per iteration |
Bandwidth |
Comments |
|
4096 (4KB) |
2845 |
88.25 |
2.02GB/sec |
Slower than reading |
|
65536 (64KB) |
45516 |
88.25 |
2.02GB/sec |
Slower than reading. |
|
1048576 (1MB) |
1957866 |
239.00 |
0.75GB/sec |
Slower than reading. Why? |
Conclusions: Group your writes together!!! Failure to do so reduces throughput by 66%. Here each cache-line write takes 11 clocks instead of 14 in the contiguous case. This test below will tell us if write-combining is helping to some degree.
Now we’ll do the same as above only writing to 16 bytes in each cache line.
mem = 4096 ; Write 4KB
cnt = mem/(8*64) ; 8 cache lines each iteration
dist = 64
mov ecx, cnt
loop1: movapd
[esi+0*dist], xmm1 ; Write 8
cache lines
movapd [esi+1*dist], xmm1
movapd [esi+2*dist], xmm1
movapd [esi+3*dist], xmm1
movapd [esi+4*dist], xmm1
movapd [esi+5*dist], xmm1
movapd [esi+6*dist], xmm1
movapd [esi+7*dist], xmm1
lea esi, [esi+8*dist]
; Next set of 8 cache lines
dec ecx
jnz loop1
lea esi, [esi-mem]
Timing for the above on a 1.4GHz P4 with PC600 memory:
|
Mem |
Clocks |
Clocks per iteration |
Comments |
|
4096 (4KB) |
706 |
88.25 |
Same as timing above |
Conclusions: This indicates that each write takes 11 clocks. Write-combining was of no value in the previous benchmark. I speculate the 14 clocks in the contiguous write case is broken down as follows, 11 for the write to L2 cache and 1 clock for each of the 3 movapd write-combinings that took place.
This code reads one cache line from each 4KB page. The L1 and L2 cache line collisions are minimized so that we can analyze the cost of a TLB miss.
tlbs = 32
cnt = tlbs/8
mov ecx, cnt
sub eax, eax
loop1: movapd
xmm1, [esi] ; Read
from a 4KB page
lea esi, [esi+4096+64]
; Different line, next page
add al, 256/8
; 8 inner loops - for
jnc loop1 ;
perfect branch prediction
dec ecx
jnz loop1
lea esi, [esi-tlbs*(4096+64)]
Timing for the above on a 1.4GHz P4 with PC600 memory:
|
Tlbs |
Clocks |
Clocks per TLB |
Comments |
|
32 |
82 |
2.56 |
|
|
64 |
147 |
2.3 |
Timings vary widely |
|
72 |
357 |
4.95 |
TLB miss penalty starting to kick in |
|
128 |
2608 |
20.38 |
Should be no TLB hits |
Conclusion: A TLB miss where the page table entry is in the L2 cache costs 18 clocks.
The above examples are nice, but most applications read data, operate on it, and write it back.
mem = 4096 ; Work on 4KB
cnt = mem/64 ; 64 bytes per iteration
mov ecx, cnt
loop1: movapd
xmm0, [esi] ; Read
one cache line
movapd xmm1, [esi+16]
movapd xmm2, [esi+32]
movapd xmm3, [esi+48]
subpd xmm0, xmm0 ;
Operate on the data
pxor xmm1, xmm1
subpd xmm2, xmm2
pxor xmm3, xmm3
movapd [esi], xmm0 ;
Write the cache line
movapd [esi+16], xmm1
movapd [esi+32], xmm2
movapd [esi+48], xmm3
lea esi, [esi+64]
; Next cache line
dec ecx
jnz loop1
lea esi, [esi-mem]
Timing for the above on a 1.4GHz P4 with PC600 memory:
|
Mem |
Clocks |
Clocks per 64 byte cache line |
Bandwidth |
Comments |
|
4096 (4KB) |
901 |
14.07 |
6.36GB/sec |
Limited by write-through to L2 speed |
|
65536 (64KB) |
24669 |
24.09 |
3.72GB/sec |
|
|
1048576 (1MB) |
1893992 |
115.60 |
0.78GB/sec |
|
Conclusions: We need
a good explanation for the 24 clock number in the L2 cache timing. Ideas?
If you want to write a program that is CPU bound instead of memory bandwidth bound, you must perform at least 116 clocks of ALU/FPU instructions for every 64 bytes read! Obviously, it is very difficult to do this for applications whose data does not fit in the L2 cache.
Now let’s try replacing the last 4 movapd instructions with movntpd:
|
Mem |
Clocks |
Clocks per 64 byte cache line |
Bandwidth |
Comments |
|
1048576 (1MB) |
1822550 |
111.24 |
0.81GB/sec |
|
Conclusion: Not much difference. This could be within our margin or error in timing or it could indicate that there is a slight penalty for reading into a dirty cache line as opposed to writing dirty cache lines immediately.
In coding one application, I thought it would be
advantageous to use the clflush command to write out dirty cache lines during a
period where the program was CPU bound (a kind of prefetcht1 instruction in
reverse!) This turned out to be a
losing move. Adding a “clflush [esi]”
instruction after the 4 write instructions balloons the timing to 2625967
clocks. Avoid clflush at all costs.
Now what happens when we read from one area of memory and write to another:
mem = 1048576 ; Work on 1MB
cnt = mem/64 ; 64 bytes per iteration
mov
ecx, cnt
loop1: movapd
xmm0, [esi] ; Read
one cache line
movapd xmm1, [esi+16]
movapd xmm2, [esi+32]
movapd xmm3, [esi+48]
subpd xmm0, xmm0 ;
Operate on the data
pxor xmm1, xmm1
subpd xmm2, xmm2
pxor xmm3, xmm3
movntpd [esi+mem], xmm0 ;
Write the cache line
movntpd [esi+mem+16], xmm1
movntpd [esi+mem+32], xmm2
movntpd [esi+mem+48], xmm3
lea esi, [esi+64] ; Next cache line
dec ecx
jnz loop1
lea esi, [esi-mem]
Timing for the above on a 1.4GHz P4 with PC600 memory:
|
Mem |
Clocks |
Clocks per 64 byte cache line |
Bandwidth |
Comments |
|
1048576 (1MB) |
1680771 |
102.59 |
0.87GB/sec |
|
Conclusions: Why would this be 10% faster than writing to the cache line just read in?
In my FFT macros, I often need to multiply an XMM register by two. At first I thought of two options:
ADDPD xmm,
xmm ;;
Add register to itself
Or
XMM_TWO DQ 2.0, 2.0
MULPD xmm,
XMM_TWO ;;
Multiply by a global variable
I later thought of a third option:
XMM_INC_EXP DB 1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
PADDB xmm, XMM_INC_EXP ;; Add 1 to each exponent!
In my case the ADDPD option was unattractive because I was already doing a lot of ADDPDs and they can only be issued every other clock cycle. The PADDB instruction is an attractive alternative to the MULPD instruction in that it has a shorter latency. Note: I haven’t tested the above yet!
There are probably other opportunities for such optimizations.