X86_PAT (Page Attribute Table support)

Use cases/justification:

 - Tania uses her workstation for graphic design and wants optimal performance -> she deploys a discrete graphics card, but is disappointed by the performance
 - James has a high-end GPGPU he uses for mathematical modelling -> he gets full performance on Fedora, but has to rebuild the Karmic kernel with X86_PAT to get the same performance, and loses a lot of time investigating

Benchmarks:

 - using Karmic 9.10 beta as of 2009-10-09 with the accelerated radeon driver on an ATI Radeon HD 3470 (R600):

   $ uname -r
   2.6.31-12-generic

   $ grep X86_PAT /boot/config-2.6.31-12-generic
   # CONFIG_X86_PAT is not set

   $ dmesg
   [ 7.005248] mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
   [ 7.005282] mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
   [ 7.198527] mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
   [ 7.198562] mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining
   [ 7.198587] mtrr: type mismatch for d0000000,10000000 old: write-back new: write-combining

   $ cat /proc/mtrr
   reg00: base=0x0d0000000 ( 3328MB), size= 256MB, count=1: uncachable
   reg01: base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable
   reg02: base=0x000000000 ( 0MB), size= 4096MB, count=1: write-back
   reg03: base=0x100000000 ( 4096MB), size= 512MB, count=1: write-back
   reg04: base=0x120000000 ( 4608MB), size= 256MB, count=1: write-back

   $ x11perf -shmputxy500
   x11perf - X11 performance program, version 1.2
   The X.Org Foundation server version 10603000 on :0.0
   from veyron
   Fri Oct 9 11:09:46 2009
   Sync time adjustment is 0.0530 msecs.
   2400 reps @ 2.6098 msec ( 383.0/sec): ShmPutImage XY 500x500 square
   2400 reps @ 2.6925 msec ( 371.0/sec): ShmPutImage XY 500x500 square
   2400 reps @ 2.6896 msec ( 372.0/sec): ShmPutImage XY 500x500 square
   2400 reps @ 2.6905 msec ( 372.0/sec): ShmPutImage XY 500x500 square
   2400 reps @ 2.6905 msec ( 372.0/sec): ShmPutImage XY 500x500 square
   12000 trep @ 2.6746 msec ( 374.0/sec): ShmPutImage XY 500x500 square

   -> results: BIOS has not specified an MTRR covering the prefetchable PCI BAR for the graphics card. Xorg tries to modify the MTRRs (see dmesg) but fails. 372 reps/s.

   -> built kernel using same upstream sources with X86_PAT enabled

   $ uname -r
   2.6.31.3-295c

   $ grep X86_PAT /boot/config-2.6.31.3-295c
   CONFIG_X86_PAT=y

   $ x11perf -shmputxy500
   x11perf - X11 performance program, version 1.2
   The X.Org Foundation server version 10603000 on :0.0
   from veyron
   Fri Oct 9 11:42:19 2009
   Sync time adjustment is 0.0390 msecs.
   16000 reps @ 0.3667 msec ( 2730.0/sec): ShmPutImage XY 500x500 square
   16000 reps @ 0.3629 msec ( 2760.0/sec): ShmPutImage XY 500x500 square
   16000 reps @ 0.3622 msec ( 2760.0/sec): ShmPutImage XY 500x500 square
   16000 reps @ 0.3623 msec ( 2760.0/sec): ShmPutImage XY 500x500 square
   16000 reps @ 0.3622 msec ( 2760.0/sec): ShmPutImage XY 500x500 square
   80000 trep @ 0.3633 msec ( 2750.0/sec): ShmPutImage XY 500x500 square

   -> results: PAT allowed write-combining to coalesce data writes to the radeon command processor and framebuffer. 2760 reps/s.

   -> a neck-snapping speedup of roughly 7.4x (372 -> 2760 reps/s) is observed

Without PAT, even if the BIOS sets up MTRRs correctly for the framebuffer so that direct writes to the framebuffer get the desired behaviour, drivers are unable to mark command queue pages in other PCI BARs as write-combining. A command or block of data can then only be written by the processor 16 bytes at a time, rather than in efficient 64-byte blocks, and control of when to flush (via wmb) isn't available.
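
To see which existing MTRR entries stand in the way of the write-combining request reported in dmesg, the /proc/mtrr listing from the first benchmark can be checked against the range d0000000,10000000. The following is a minimal Python sketch, not a tool shipped with the kernel: the helper names parse_mtrr and overlapping are hypothetical, the listing is pasted in as a string rather than read from the live system, and the script only reports overlaps without modelling how the CPU resolves the effective memory type.

```python
import re

# /proc/mtrr contents copied from the first benchmark above (not read live).
PROC_MTRR = """\
reg00: base=0x0d0000000 ( 3328MB), size= 256MB, count=1: uncachable
reg01: base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable
reg02: base=0x000000000 ( 0MB), size= 4096MB, count=1: write-back
reg03: base=0x100000000 ( 4096MB), size= 512MB, count=1: write-back
reg04: base=0x120000000 ( 4608MB), size= 256MB, count=1: write-back
"""

# Assumed line layout, based on the listing above; sizes may be in KB or MB.
LINE_RE = re.compile(
    r"reg(\d+): base=0x([0-9a-fA-F]+) \(\s*\d+MB\), "
    r"size=\s*(\d+)([KM])B, count=\d+: (.+)"
)
UNIT = {"K": 1 << 10, "M": 1 << 20}


def parse_mtrr(text):
    """Return a list of (register, base, size_in_bytes, type) tuples."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not match the assumed layout
        reg, base, size, unit, mtype = m.groups()
        entries.append((int(reg), int(base, 16), int(size) * UNIT[unit], mtype.strip()))
    return entries


def overlapping(entries, base, size):
    """Return (register, type) for every entry overlapping [base, base + size)."""
    end = base + size
    return [(reg, mtype) for reg, b, s, mtype in entries if b < end and base < b + s]


if __name__ == "__main__":
    # Range taken from the dmesg line "type mismatch for d0000000,10000000".
    for reg, mtype in overlapping(parse_mtrr(PROC_MTRR), 0xD0000000, 0x10000000):
        print(f"reg{reg:02d}: {mtype}")
```

For the listing above, the range overlaps reg00 (uncachable) and reg02 (write-back), which is consistent with the "old: write-back" part of the dmesg message.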