Created attachment 451152
Provide info on the Evo channel in debugfs

I was able to do some debugging on this issue today, and have a mostly reproducible test case (~90-95%) on my laptop (also using an external monitor). Here's nouveau's spiel about the card on boot of the 2.6.34.7-56.fc13.x86_64 kernel.

[drm] nouveau 0000:01:00.0: Detected an NV50 generation card (0x086900a2)
[drm] nouveau 0000:01:00.0: Attempting to load BIOS image from PRAMIN
[drm] nouveau 0000:01:00.0: ... appears to be valid
[drm] nouveau 0000:01:00.0: BIT BIOS found
[drm] nouveau 0000:01:00.0: Bios version 60.86.68.00
[drm] nouveau 0000:01:00.0: TMDS table revision 2.0 not currently supported
[drm] nouveau 0000:01:00.0: Found Display Configuration Block version 4.0
[drm] nouveau 0000:01:00.0: Raw DCB entry 0: 01000323 00010034
[drm] nouveau 0000:01:00.0: Raw DCB entry 1: 02011300 00000028
[drm] nouveau 0000:01:00.0: Raw DCB entry 2: 02022312 00010010
[drm] nouveau 0000:01:00.0: Raw DCB entry 3: 010333f1 00c0c070
[drm] nouveau 0000:01:00.0: Raw DCB entry 4: 0000000e 00000000
[drm] nouveau 0000:01:00.0: DCB connector table: VHER 0x40 5 14 2
[drm] nouveau 0000:01:00.0:   0: 0x00000041: type 0x41 idx 0 tag 0xff
[drm] nouveau 0000:01:00.0: unknown type, using 0x40
[drm] nouveau 0000:01:00.0:   1: 0x00000100: type 0x00 idx 1 tag 0xff
[drm] nouveau 0000:01:00.0:   2: 0x00001255: type 0x55 idx 2 tag 0x07
[drm] nouveau 0000:01:00.0: unknown type, using 0x31
[drm] nouveau 0000:01:00.0:   3: 0x00000310: type 0x10 idx 3 tag 0xff
[drm] nouveau 0000:01:00.0:   4: 0x00000311: type 0x11 idx 4 tag 0xff
[drm] nouveau 0000:01:00.0:   5: 0x00000313: type 0x13 idx 5 tag 0xff
[drm] nouveau 0000:01:00.0: Parsing VBIOS init table 0 at offset 0xC11C
[drm] nouveau 0000:01:00.0: Parsing VBIOS init table 1 at offset 0xC486
[drm] nouveau 0000:01:00.0: Parsing VBIOS init table 2 at offset 0xD0C7
[drm] nouveau 0000:01:00.0: Parsing VBIOS init table 3 at offset 0xD1B9
[drm] nouveau 0000:01:00.0: Parsing VBIOS init table 4 at offset 0xD3B3
[drm] nouveau 0000:01:00.0: Parsing VBIOS init table at offset 0xD418
[drm] nouveau 0000:01:00.0: 0xD418: Condition still not met after 20ms, skipping following opcodes
[drm] nouveau 0000:01:00.0: 0xB1D6: parsing output script 0
[drm] nouveau 0000:01:00.0: 0xB34C: parsing output script 0
[drm] nouveau 0000:01:00.0: 0xA228: parsing output script 0
[drm] nouveau 0000:01:00.0: Detected 256MiB VRAM
[drm] nouveau 0000:01:00.0: 512 MiB GART (aperture)
[drm] nouveau 0000:01:00.0: DCB encoder 1 unknown
[drm] nouveau 0000:01:00.0: TV-1 has no encoders, removing
[drm] nouveau 0000:01:00.0: gpio tag 0xff not found
[drm] nouveau 0000:01:00.0: gpio tag 0xff not found
[drm] nouveau 0000:01:00.0: Allocating FIFO number 1
[drm] nouveau 0000:01:00.0: nouveau_channel_alloc: initialised FIFO 1
[drm] nouveau 0000:01:00.0: allocated 1920x1200 fb: 0x40230000, bo ffff8800790a2400
fbcon: nouveaufb (fb0) is primary device
[drm] nouveau 0000:01:00.0: 0xB083: parsing clock script 0
fb0: nouveaufb frame buffer device
[drm] Initialized nouveau 0.0.16 20090420 for 0000:01:00.0 on minor 0
[drm] nouveau 0000:01:00.0: 0xB199: parsing clock script 1
[drm] nouveau 0000:01:00.0: 0x1085: parsing clock script 0
[drm] nouveau 0000:01:00.0: Allocating FIFO number 2
[drm] nouveau 0000:01:00.0: nouveau_channel_alloc: initialised FIFO 2


I find that as we rewind the Evo's push buffer, we somehow trigger a display error, giving the message we've come to hate:

[drm] nouveau 0000:01:00.0: EvoCh 0 Mthd 0x0000 Data 0x00000400 (0x0002 0x01)

So far, this is closely correlated with the driver setting the ring to jump to offset 0 and waiting for the GPU's get location to come out of the SKIPS (line 297 in nouveau.git's nouveau_dma.c -- commit 0100828245). I have never seen the error message without a wrap of the Evo ring, and have seen some 30+ occur in close proximity to one.
I instrumented the code, and we usually wrap at about index 0x3f9 or 0x3fb, and evo->dma.cur is 0x27 when the message is printed. It's usually nv50_cursor_show() or nv50_cursor_hide() that is running at the time.
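
For anyone trying to follow the wrap described above, here is a minimal user-space sketch of a push buffer being rewound. It is a model, not the driver code: RING_WORDS (a 0x400-word ring, guessed from the wrap indices near 0x3f9), SKIPS (standing in for NOUVEAU_DMA_SKIPS), the JUMP_TO_0 encoding, and the ring_model/ring_reserve/ring_push names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_WORDS 0x400u      /* assumption: 4 KiB push buffer of 32-bit words */
#define SKIPS      8u          /* assumption: stand-in for NOUVEAU_DMA_SKIPS */
#define JUMP_TO_0  0x20000000u /* assumption: placeholder "jump to offset 0" word */

struct ring_model {
    uint32_t buf[RING_WORDS];
    uint32_t cur;    /* next word the CPU will write */
    uint32_t put;    /* last offset published to the GPU */
    uint32_t wraps;  /* number of rewinds so far */
};

static void ring_init(struct ring_model *r)
{
    memset(r, 0, sizeof(*r));
    r->cur = r->put = SKIPS;
}

/* Make room for n words. Returns 1 if the ring had to be rewound, else 0. */
static int ring_reserve(struct ring_model *r, uint32_t n)
{
    assert(n < RING_WORDS - SKIPS);   /* request must fit in an empty ring */
    if (r->cur + n < RING_WORDS)      /* fits, leaving a word for the jump */
        return 0;
    r->buf[r->cur] = JUMP_TO_0;       /* send the GPU back to the start */
    r->cur = r->put = SKIPS;          /* CPU resumes just past the SKIPS */
    r->wraps++;
    return 1;
}

/* Write n dummy words after reserving space for them. */
static void ring_push(struct ring_model *r, uint32_t n)
{
    ring_reserve(r, n);
    for (uint32_t i = 0; i < n; i++)
        r->buf[r->cur++] = 0;
}
```

In this model, starting from cur = 0x3f9 and pushing an (arbitrarily chosen) 0x1f-word request forces one rewind and leaves cur at 0x27. The request size is purely illustrative and is not a claim about what the cursor code actually submits.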

The quickest way for me to reproduce this has been to boot, log into GNOME, open a few terminal windows, and watch /sys/kernel/debug/dri/0/channels/evo. I find that locking/unlocking the screen with the blank screen saver adds about 0x120 to the current location each iteration. Once I'm close to a wrap, I move the mouse between the windows, which advances cur by about 0x20 each time until it wraps.

Once we get a wrap and the error message, the GPU's get location for this channel stays at 0 forever. We'll continue to put commands in the ring, and once we fill it again, we'll start timing out in READ_GET(), which causes the stuttering mouse movements.
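
The timeout behaviour can be sketched the same way. The snippet below is a hypothetical stand-in for the wait at the second wrap: it polls a get location until it moves past the SKIPS or a poll budget runs out. SKIPS, MAX_POLLS, wait_get_past_skips, and the two reader callbacks are invented for this illustration and are not the driver's real READ_GET() logic, whose timeout is assumed here to be a simple poll count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKIPS     8u       /* assumption: stand-in for NOUVEAU_DMA_SKIPS */
#define MAX_POLLS 100000u  /* assumption: arbitrary poll budget before giving up */

typedef uint32_t (*read_get_fn)(void *ctx);

/*
 * Poll the get location until it is past the SKIPS region.
 * Returns the get value on success, or -1 if the budget is exhausted.
 * *polls_out receives the number of polls performed.
 */
static int wait_get_past_skips(read_get_fn read_get, void *ctx,
                               uint32_t *polls_out)
{
    assert(read_get != NULL && polls_out != NULL);
    for (uint32_t polls = 1; polls <= MAX_POLLS; polls++) {
        uint32_t get = read_get(ctx);
        if (get > SKIPS) {
            *polls_out = polls;
            return (int)get;
        }
    }
    *polls_out = MAX_POLLS;
    return -1;  /* get never left the SKIPS: the "timed out" case */
}

/* Models the failure: the GPU's get location is stuck at 0. */
static uint32_t stuck_get(void *ctx)
{
    (void)ctx;
    return 0;
}

/* Models a healthy channel: get advances by one word per poll. */
static uint32_t advancing_get(void *ctx)
{
    uint32_t *get = ctx;
    return (*get)++;
}
```

With stuck_get the wait burns its whole budget on every call, which is consistent with the stutter once the ring has filled; with advancing_get it returns as soon as get passes the SKIPS.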

I've gone about as far as I can, I think. I currently don't have the knowledge to figure out why we get the display error on most wraps but not all. It is perhaps timing related -- using nouveau_bo_rd32() on the push buffer to verify contents after we do the WRITE_PUT(NOUVEAU_DMA_SKIPS) made it 3 wraps before it died once, but then it died on the first wrap the next boot.

If you guys can give me a pointer as to where to look next, I'm game. Until then, I hope this helps.