Comment 3 for bug 446480

Daniel J Blueman (danielblueman) wrote :

X86_PAT (Page Attribute Table support) use cases/justification:

- Tania uses her workstation for graphic design and wants optimal performance
 -> she deploys a discrete graphics card, but is disappointed by the performance

- James has a high-end GPGPU he uses for mathematical modelling
 -> he experiences full performance on Fedora, but has to rebuild the Karmic kernel with X86_PAT to get the same performance, and has lost a lot of time investigating

Benchmarks:

- using Karmic 9.10 beta as of 2009-10-09 with the accelerated radeon driver on an ATI Radeon HD 3470 (R600):
$ uname -r
2.6.31-12-generic
$ grep X86_PAT /boot/config-2.6.31-12-generic
# CONFIG_X86_PAT is not set
$ dmesg
[ 7.005248] mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
[ 7.005282] mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
[ 7.198527] mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
[ 7.198562] mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
[ 7.198587] mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
$ cat /proc/mtrr
reg00: base=0x0d0000000 ( 3328MB), size= 256MB, count=1: uncachable
reg01: base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable
reg02: base=0x000000000 ( 0MB), size= 4096MB, count=1: write-back
reg03: base=0x100000000 ( 4096MB), size= 512MB, count=1: write-back
reg04: base=0x120000000 ( 4608MB), size= 256MB, count=1: write-back
$ x11perf -shmputxy500
x11perf - X11 performance program, version 1.2
The X.Org Foundation server version 10603000 on :0.0
from veyron
Fri Oct 9 11:09:46 2009

Sync time adjustment is 0.0530 msecs.

   2400 reps @ 2.6098 msec ( 383.0/sec): ShmPutImage XY 500x500 square
   2400 reps @ 2.6925 msec ( 371.0/sec): ShmPutImage XY 500x500 square
   2400 reps @ 2.6896 msec ( 372.0/sec): ShmPutImage XY 500x500 square
   2400 reps @ 2.6905 msec ( 372.0/sec): ShmPutImage XY 500x500 square
   2400 reps @ 2.6905 msec ( 372.0/sec): ShmPutImage XY 500x500 square
  12000 trep @ 2.6746 msec ( 374.0/sec): ShmPutImage XY 500x500 square

-> results: the BIOS has not specified an MTRR covering the prefetchable PCI BAR for the graphics card. Xorg tries to modify the MTRRs (see dmesg) but fails. 372 reps/s.
-> built a kernel from the same upstream sources with X86_PAT enabled

$ uname -r
2.6.31.3-295c
$ grep X86_PAT /boot/config-2.6.31.3-295c
CONFIG_X86_PAT=y
$ x11perf -shmputxy500
x11perf - X11 performance program, version 1.2
The X.Org Foundation server version 10603000 on :0.0
from veyron
Fri Oct 9 11:42:19 2009

Sync time adjustment is 0.0390 msecs.

  16000 reps @ 0.3667 msec ( 2730.0/sec): ShmPutImage XY 500x500 square
  16000 reps @ 0.3629 msec ( 2760.0/sec): ShmPutImage XY 500x500 square
  16000 reps @ 0.3622 msec ( 2760.0/sec): ShmPutImage XY 500x500 square
  16000 reps @ 0.3623 msec ( 2760.0/sec): ShmPutImage XY 500x500 square
  16000 reps @ 0.3622 msec ( 2760.0/sec): ShmPutImage XY 500x500 square
  80000 trep @ 0.3633 msec ( 2750.0/sec): ShmPutImage XY 500x500 square

-> results: PAT allowed write-combining to coalesce data writes to the radeon command processor and framebuffer. 2760 reps/s.
-> a neck-snapping speedup of roughly 7.4x is observed (2760 vs 372 reps/s)
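
For anyone wanting to recompute the ratio from their own runs, here is a minimal Python sketch that pulls the rate out of the x11perf "trep" summary line of each run. The function name and regular expression are my own for illustration, not part of x11perf. With the two summary lines above (374.0/sec and 2750.0/sec) the ratio is about 7.35; the steady-state per-run rates (372 and 2760) give about 7.42.

```python
import re

# Matches x11perf result lines such as:
#   12000 trep @ 2.6746 msec ( 374.0/sec): ShmPutImage XY 500x500 square
RATE = re.compile(
    r"(\d+)\s+(reps|trep)\s+@\s+([\d.]+)\s+msec\s+\(\s*([\d.]+)/sec\)"
)

def summary_rate(output: str) -> float:
    """Return the ops/sec figure from the 'trep' summary line."""
    for line in output.splitlines():
        m = RATE.search(line)
        if m and m.group(2) == "trep":
            return float(m.group(4))
    raise ValueError("no 'trep' summary line found")

# Summary lines copied from the two runs in this comment.
WITHOUT_PAT = "  12000 trep @ 2.6746 msec ( 374.0/sec): ShmPutImage XY 500x500 square"
WITH_PAT = "  80000 trep @ 0.3633 msec ( 2750.0/sec): ShmPutImage XY 500x500 square"

speedup = summary_rate(WITH_PAT) / summary_rate(WITHOUT_PAT)

if __name__ == "__main__":
    print(f"speedup from summary lines: {speedup:.2f}x")
```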

Without PAT, even when the BIOS sets up the MTRRs correctly for the framebuffer, only direct writes to the framebuffer get the desired write-combining behaviour. Drivers remain unable to mark command queue pages in other PCI BARs as write-combining, so a command or block of data can only be written by the processor 16 bytes at a time rather than in efficient 64-byte blocks, and control over when to flush (via wmb) isn't available.
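
To check on another machine which MTRRs overlap a given BAR, something like the following rough Python sketch may help. It parses text in the /proc/mtrr format shown earlier and lists the entries overlapping an address range. The helper names (parse_mtrr, overlapping) are hypothetical, and the parser assumes sizes are reported in KB or MB as in the listing above. For the range named in the dmesg messages (base d0000000, size 10000000) it finds reg00 (uncachable) and reg02 (write-back), and no write-combining entry.

```python
import re
from typing import List, NamedTuple

class Mtrr(NamedTuple):
    reg: int
    base: int    # start address in bytes
    size: int    # length in bytes
    mtype: str   # e.g. "write-back", "uncachable", "write-combining"

# Matches lines such as:
#   reg00: base=0x0d0000000 ( 3328MB), size= 256MB, count=1: uncachable
LINE = re.compile(
    r"reg(\d+):\s*base=0x([0-9a-fA-F]+)\s*\(\s*\d+MB\),\s*"
    r"size=\s*(\d+)([KM])B,\s*count=\d+:\s*(\S.*)"
)
UNIT = {"K": 1 << 10, "M": 1 << 20}

def parse_mtrr(text: str) -> List[Mtrr]:
    """Parse /proc/mtrr-style text into a list of entries."""
    entries = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            reg, base, num, unit, mtype = m.groups()
            entries.append(
                Mtrr(int(reg), int(base, 16), int(num) * UNIT[unit], mtype.strip())
            )
    return entries

def overlapping(entries: List[Mtrr], base: int, size: int) -> List[Mtrr]:
    """Return the entries whose range intersects [base, base + size)."""
    end = base + size
    return [e for e in entries if e.base < end and base < e.base + e.size]

# The /proc/mtrr listing from this comment.
SAMPLE = """\
reg00: base=0x0d0000000 ( 3328MB), size= 256MB, count=1: uncachable
reg01: base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable
reg02: base=0x000000000 ( 0MB), size= 4096MB, count=1: write-back
reg03: base=0x100000000 ( 4096MB), size= 512MB, count=1: write-back
reg04: base=0x120000000 ( 4608MB), size= 256MB, count=1: write-back
"""

if __name__ == "__main__":
    # Range taken from the "mtrr: type mismatch" dmesg lines.
    for e in overlapping(parse_mtrr(SAMPLE), 0xD0000000, 0x10000000):
        print(f"reg{e.reg:02d}: {e.mtype}")
```

On a live system the text would come from reading /proc/mtrr instead of the SAMPLE string.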