AGP GPUs driven as PCI ones (when AGP is disabled at kernel build time) are known to fail on AMD K8, K10 and Intel Kentsfield platforms

Bug #1902981 reported by Thomas Debesse
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

This bug tracks issues specific to AGP GPUs running as PCI devices (when AGP support is disabled at kernel build time). Unless proven otherwise, it is believed that fixing #1902795 (PCI GPU support being broken) may not fix all issues for AGP GPUs running as PCI ones (more to come on that topic).

See related bugs:

- https://bugs.launchpad.net/bugs/1902795
> PCI graphics broken on AMD K8/K10 platform (while it works on Intel) verified from Linux 4.4 to 5.10-rc1

- https://bugs.launchpad.net/bugs/1899304
> AGP disablement leaves GPUs without working alternative (PCI fallback is broken), makes very-capable ATI TeraScale GPUs unusable

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1902981

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Thomas Debesse (illwieckz) wrote : Re: AGP GPU on PCI mode (when AGP is disabled at kernel build time) known to fail on K8 and K10 platforms

As a reminder, this is a dmesg captured when running an ATI Radeon HD 4670 AGP on a K10 host on Linux 5.9 (vanilla).

The ATI Radeon HD 4670 AGP (RV730 XT) is a very capable TeraScale GPU, supporting OpenGL 3.3 (DirectX 10 on Windows) and OpenCL 1.0, and featuring an HDMI output and 1GB of VRAM. The host is also very capable: an AMD Phenom II quad-core CPU with 16GB of RAM.

To check whether its performance matches 2020 expectations, I entered it (running Ubuntu 20.04) in the 2020 Xonotic Defrag World Championship, which is currently running (https://xdwc.teichisma.info/), and some players told me this hardware may be better than the hardware they compete with. In fact, competitive games like Xonotic run at 144 fps at 1920×1080 resolution.

The last kernel able to drive this GPU on Ubuntu 20.04 LTS is 5.4.0-47-generic; 5.4.0-48-generic is believed to have backported the AGP disablement from 5.9-rc1 (ba806f9).
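A quick way to see how a given kernel was built is to inspect its config file. The sketch below is only an illustration: it assumes the build-time switch in question is the generic `CONFIG_AGP` option (this report does not name the option), and that the running kernel's config is installed as `/boot/config-<release>`, as Ubuntu kernels usually do. The function names are made up for this example.

```python
import platform


def agp_config_state(config_text):
    """Classify CONFIG_AGP in the text of a kernel config file.

    Assumption: CONFIG_AGP is the option of interest; the bug report
    itself does not say which option was changed.
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        if line == "CONFIG_AGP=y":
            return "built-in"
        if line == "CONFIG_AGP=m":
            return "module"
        if line == "# CONFIG_AGP is not set":
            return "not set"
    # The option does not appear at all in the file.
    return "absent"


def running_kernel_config_path():
    """Usual location of the running kernel's config on Ubuntu (assumed)."""
    return "/boot/config-" + platform.release()
```

On an affected machine, something like `agp_config_state(open(running_kernel_config_path()).read())` would then report how that kernel was configured.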

So, when running the 5.4.0-48-generic kernel from the Ubuntu repositories or, as here, a vanilla 5.9 kernel I compiled myself, the interesting parts of the dmesg log are:

```
[ 5.242322] [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
[ 5.242359] radeon 0000:01:00.0: disabling GPU acceleration
```

and:

```
[ 34.558889] trying to bind memory to uninitialized GART !
[ 34.559048] WARNING: CPU: 1 PID: 2516 at drivers/gpu/drm/radeon/radeon_gart.c:299 radeon_gart_bind+0xdf/0xf0 [radeon]
[ 34.559050] Modules linked in: zram snd_usb_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_usbmidi_lib snd_hda_core snd_hwdep snd_pcm snd_seq_midi kvm_amd snd_seq_midi_event ccp joydev kvm snd_seq snd_rawmidi input_leds snd_timer snd_seq_device snd soundcore k10temp mac_hid serio_raw binfmt_misc sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear uas usb_storage hid_generic usbhid hid radeon i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm psmouse forcedeth i2c_nforce2
[ 34.559107] CPU: 1 PID: 2516 Comm: gnome-shell Not tainted 5.9.0 #1
[ 34.559109] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AM2NF3-VSTA, BIOS P3.20 10/09/2009
[ 34.559178] RIP: 0010:radeon_gart_bind+0xdf/0xf0 [radeon]
[ 34.559184] Code: 00 48 89 ef 48 8b 40 60 e8 0e 2f 44 df 31 c0 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c7 38 6f 6b c0 e8 23 0c 6d de <0f> 0b b8 ea ff ff ff eb dc 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[ 34.559187] RSP: 0018:ffffc030838f7a28 EFLAGS: 00010282
[ 34.559191] RAX: 0000000000000000 RBX: ffffa0cf6b88eb80 RCX: 0000000000000027
[ 34.559193] RDX: 0000000000000027 RSI: 0000000000000086 RDI: ffffa0cf6fc98d08
[ 34.559196] RBP: ffffc030838f7b28 R08: ffffa0cf6fc98d00 R09: 0000000000000004
[ 34.559198] R10: 0000000000000000 R11: 0000000000000001 R12: ffffc030838f7b28
[ 34.559201] R13: ffffa0cf6a622868 R14: ffffa0cf6c7cc6e8 R15: ffffc030838f7b28
[ 34.559204] FS: 00007f46ae245cc0(0000) GS:ffffa0cf6fc80000(0000) knlGS:0000000000000000
[ 34.559207] CS: ...
```

Thomas Debesse (illwieckz) wrote :

When applying patch from https://bugs.launchpad.net/bugs/1902795

- https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1902795/+attachment/5431335/+files/0001-drm-radeon-make-all-PCI-GPUs-use-32bits-DMA-bit-mask.patch

which reduces (but does not completely fix) the issues faced with PCI GPUs on K8 and K10 hosts by setting the DMA bit mask to 32 bits for all PCI GPUs, we can see that what is fixed on PCI GPUs is not fixed on AGP-as-PCI GPUs (and there are even more errors before that):

```
[ 5.242322] [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
```

Things go so wrong that we don't even see the other errors that are expected after that:

```
[ 5.242359] radeon 0000:01:00.0: disabling GPU acceleration
```

```
[ 34.558889] trying to bind memory to uninitialized GART !
```

Instead, the kernel loops before reaching those errors, trying desperately to pass this r600_ring_test step.

But before the r600_ring_test failure message is printed, more and newer issues (ring 0 being stalled and GPU lockups) occur with AGP-as-PCI GPUs that are never seen with PCI-native GPUs, especially considering that PCI GPUs can at least pass the r600_ring_test with the patch.

Also, after the r600_ring_test failure message, instead of getting the message saying that GPU acceleration is disabled, we get a message about r600 startup failing on resume, which is new.

This is why it is believed that fixing PCI GPUs may not be enough to fix AGP GPUs running as PCI ones when AGP is disabled at kernel build time.
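To compare a PCI-native run with an AGP-as-PCI run, the messages discussed above can be counted in a saved dmesg log. The following is a minimal sketch, not a diagnostic tool: the signature names are invented for this example, the patterns are simply substrings of the messages quoted in this report, and the sample lines are copied from those logs.

```python
import re
from collections import Counter

# Substrings taken from the dmesg messages quoted in this report.
# The dictionary keys are arbitrary labels chosen for this sketch.
SIGNATURES = {
    "ring_test_failed": re.compile(r"ring 0 test failed"),
    "accel_disabled": re.compile(r"disabling GPU acceleration"),
    "gart_uninitialized": re.compile(r"trying to bind memory to uninitialized GART"),
    "ring_stalled": re.compile(r"ring 0 stalled for more than \d+msec"),
    "gpu_lockup": re.compile(r"GPU lockup"),
    "resume_failed": re.compile(r"r600 startup failed on resume"),
}


def count_signatures(dmesg_text):
    """Count how many lines of the log match each known signature."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


SAMPLE = """\
[ 45.763336] radeon 0000:01:00.0: ring 0 stalled for more than 10256msec
[ 45.763349] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000001 last fence id 0x0000000000000002 on ring 0)
[ 68.374095] [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
[ 68.374176] [drm:rv770_resume [radeon]] *ERROR* r600 startup failed on resume
"""

for name, hits in sorted(count_signatures(SAMPLE).items()):
    print(name, hits)
```

In this sample, `accel_disabled` and `gart_uninitialized` never show up while `ring_stalled`, `gpu_lockup` and `resume_failed` do, which is the pattern described above for AGP-as-PCI GPUs with the patch applied.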

Here are the issues that are only seen with AGP-as-PCI GPUs, occurring before and after the r600_ring_test failure message:

```
[ 45.763336] radeon 0000:01:00.0: ring 0 stalled for more than 10256msec
[ 45.763349] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000001 last fence id 0x0000000000000002 on ring 0)
[ 46.275324] radeon 0000:01:00.0: ring 0 stalled for more than 10768msec
[ 46.275335] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000001 last fence id 0x0000000000000002 on ring 0)
[ 46.787322] radeon 0000:01:00.0: ring 0 stalled for more than 11280msec
[ 46.787332] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000001 last fence id 0x0000000000000002 on ring 0)
[ 47.299336] radeon 0000:01:00.0: ring 0 stalled for more than 11792msec
[ 47.299346] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000001 last fence id 0x0000000000000002 on ring 0)
[ 47.811320] radeon 0000:01:00.0: ring 0 stalled for more than 12304msec
[ 47.811332] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000001 last fence id 0x0000000000000002 on ring 0)
[ 48.323331] radeon 0000:01:00.0: ring 0 stalled for more than 12816msec
[ 48.323344] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000001 last fence id 0x0000000000000002 on ring 0)
[ 48.835307] radeon 0000:01:00.0: ring 0 stalled for more than 13328msec
[ 48.835318] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000001 last fence id 0x0000000000000002 on ring 0)
[ 49.347328] radeon 0000:01:00.0: ring 0 stalled for more than...
```

Thomas Debesse (illwieckz) wrote :

To get a better picture of the performance of such a top-of-the-line AGP GPU, when comparing it to other GPUs on the Unvanquished GPU compatibility matrix: https://wiki.unvanquished.net/wiki/GPU_compatibility_matrix

we can see the ATI Radeon HD 4670 AGP (RV730 XT, TeraScale 1) performs:

- better than the PCI Express ATI Radeon HD 7450 from Q1 2012 (RV910, Caicos, TeraScale 2),
- like the mobile Nvidia GeForce GT 740M from Q2 2013 with nvidia driver (NVE7, GK107M, Kepler),
- like the mobile Quadro K1100M from Q3 2013 with nvidia driver (NVE7, GK107GLM, Kepler),
- like the integrated Intel HD 4600 from Q1 2014 (i7-4810MQ, Haswell, Gen7 GT2),
- like the integrated Intel HD 520 from Q3 2015 (i3-6100U, Skylake, Gen9 GT2),
- like the PCI Express GeForce GTX 1050 Ti from Q4 2016 when running the nouveau driver (Pascal).

On the Nvidia side, outperforming this GPU on Linux with the free and open source nouveau driver requires at least a GeForce GTX 1060 from 2016 (NV136, GP106-300-A1, Pascal).

Intel users may have had to wait for the UHD 600 series (2016) to outperform this ATI AGP GPU. To this day, the first verified Intel GPU known to outperform it is the UHD 620 from Q3 2019.

Thomas Debesse (illwieckz) wrote :

It looks like comment #3 has been truncated; the interesting part of the dmesg log that is missing is:

```
[ 66.755306] radeon 0000:01:00.0: ring 0 stalled for more than 31248msec
[ 66.755317] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000001 last fence id 0x0000000000000002 on ring 0)
[ 66.840372] radeon 0000:01:00.0: Saved 25 dwords of commands on ring 0.
[ 66.840402] radeon 0000:01:00.0: GPU softreset: 0x00000019
[ 66.840408] radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0xA27034A1
[ 66.840414] radeon 0000:01:00.0: R_008014_GRBM_STATUS2 = 0x00000102
[ 66.840419] radeon 0000:01:00.0: R_000E50_SRBM_STATUS = 0x200028C0
[ 66.840424] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x04000000
[ 66.840429] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00010100
[ 66.840434] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00008C80
[ 66.840438] radeon 0000:01:00.0: R_008680_CP_STAT = 0x808182E7
[ 66.840443] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
[ 67.364934] radeon 0000:01:00.0: Wait for MC idle timedout !
[ 67.364940] radeon 0000:01:00.0: R_008020_GRBM_SOFT_RESET=0x00007F6B
[ 67.365005] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[ 67.367106] radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0x00003028
[ 67.367110] radeon 0000:01:00.0: R_008014_GRBM_STATUS2 = 0x00000002
[ 67.367114] radeon 0000:01:00.0: R_000E50_SRBM_STATUS = 0x200028C0
[ 67.367118] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
[ 67.367122] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
[ 67.367126] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000
[ 67.367130] radeon 0000:01:00.0: R_008680_CP_STAT = 0x00000000
[ 67.367134] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
[ 67.367152] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[ 67.842179] radeon 0000:01:00.0: Wait for MC idle timedout !
[ 68.068765] radeon 0000:01:00.0: Wait for MC idle timedout !
[ 68.082273] [drm] PCIE GART of 1024M enabled (table at 0x000000000014C000).
[ 68.082448] radeon 0000:01:00.0: WB enabled
[ 68.082454] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00
[ 68.082459] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c
[ 68.088977] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x000000000005c598
[ 68.374095] [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
[ 68.374176] [drm:rv770_resume [radeon]] *ERROR* r600 startup failed on resume
```

This is what happens when applying the patch that forces a 32-bit DMA bit mask on PCI devices.
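The log above prints the same status registers twice, once before and once after the soft reset. A small sketch such as the one below can show which values actually changed. It makes no attempt to interpret individual bits, the helper names are invented for this example, and the sample is a subset of the lines quoted above.

```python
import re

# Matches lines such as "R_008010_GRBM_STATUS = 0xA27034A1".
REG_LINE = re.compile(r"(R_[0-9A-F]{6}_\w+)\s*=\s*(0x[0-9A-Fa-f]+)")


def split_dumps(dmesg_text):
    """Return (before, after): first and second value logged per register."""
    before, after = {}, {}
    for name, value in REG_LINE.findall(dmesg_text):
        target = before if name not in before else after
        # setdefault keeps only the first value seen in each dump.
        target.setdefault(name, int(value, 16))
    return before, after


def changed_registers(before, after):
    """Registers present in both dumps whose value differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


SAMPLE = """\
[ 66.840408] radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0xA27034A1
[ 66.840419] radeon 0000:01:00.0: R_000E50_SRBM_STATUS = 0x200028C0
[ 66.840438] radeon 0000:01:00.0: R_008680_CP_STAT = 0x808182E7
[ 67.367106] radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0x00003028
[ 67.367114] radeon 0000:01:00.0: R_000E50_SRBM_STATUS = 0x200028C0
[ 67.367130] radeon 0000:01:00.0: R_008680_CP_STAT = 0x00000000
"""

before, after = split_dumps(SAMPLE)
for name, (old, new) in sorted(changed_registers(before, after).items()):
    print(f"{name}: {old:#010x} -> {new:#010x}")
```

On this sample, `R_008010_GRBM_STATUS` and `R_008680_CP_STAT` change across the reset while `R_000E50_SRBM_STATUS` keeps the same value.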

Thomas Debesse (illwieckz) wrote :

On a side note, because we see a clear difference in behaviour when applying the PCI patch, we can assume the driver takes the `rdev->flags & RADEON_IS_PCI` branch rather than the `rdev->flags & RADEON_IS_AGP` one when running an AGP GPU with AGP disabled in the kernel at build time.
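The reasoning can be modelled with a toy sketch. Everything here is an assumption made for illustration: the flag values are invented (the real ones live in the radeon driver headers), the 40-bit default and the 32-bit AGP case are not stated in this report, and the real driver also takes the chip family into account. Only the idea that the patch gives every PCI GPU a 32-bit mask comes from the comments above.

```python
# Hypothetical flag bits for illustration only, not the driver's values.
IS_AGP = 1 << 0
IS_PCI = 1 << 1


def bus_flags(card_is_agp, agp_enabled):
    """Model the assumption made above: with AGP disabled, an AGP card
    is no longer flagged as AGP and is handled as a PCI device."""
    if card_is_agp and agp_enabled:
        return IS_AGP
    return IS_PCI


def dma_bits(flags, patched):
    """Pick a DMA mask width in this toy model.

    patched=True models the patch from bug 1902795 as described in this
    report (32 bits for every PCI GPU). The other widths are assumptions.
    """
    if flags & IS_AGP:
        return 32
    if (flags & IS_PCI) and patched:
        return 32
    return 40


# An AGP card with AGP disabled: the patch changes the outcome, which is
# only possible if the PCI branch is the one being taken.
flags = bus_flags(card_is_agp=True, agp_enabled=False)
print(dma_bits(flags, patched=False), dma_bits(flags, patched=True))
```

In this model the patch has no effect on a card that is still flagged as AGP, so an observable change after patching points to the PCI branch.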

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: amd64 focal kernel-bug
summary: - AGP GPU on PCI mode (when AGP is disabled at kernel build time) known to
- fail on K8 and K10 platforms
+ AGP GPUs driven as PCI ones (when AGP is disabled at kernel build time)
+ are known to fail on K8 and K10 platforms
Thomas Debesse (illwieckz) wrote :

As reported here: https://lkml.org/lkml/2021/5/13/752

The bug was also reproduced on an Intel Kentsfield platform (Core 2 Quad Q6600 with VIA PT880/VT82xx) with R300 and TeraScale GPUs.

summary: AGP GPUs driven as PCI ones (when AGP is disabled at kernel build time)
- are known to fail on K8 and K10 platforms
+ are known to fail on AMD K8, K10 and Intel Kentsfield platforms