PCI graphics broken on AMD K8/K10/Piledriver platform (while it works on Intel) verified from Linux 4.4 to 5.10-rc2

Bug #1902795 reported by Thomas Debesse
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| linux (Ubuntu) | Confirmed | Undecided | Unassigned | |

Bug Description

This is an issue I faced before #1899304, but it becomes more critical because of it:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1899304

The big concern is that if AGP is disabled, there is no fallback display option on those platforms.

After discovering that some K8 and K10 computers running AGP cards stopped working with 5.4.0-48 (Ubuntu 20.04 LTS), I tried some PCI cards to find out whether the problems came from AGP or from something else, and I discovered another issue instead.

Note that I'm not talking about PCI express, but good old PCI.

The other issue I found is that PCI graphics on the AMD K8/K10 platform has been broken for years. This probably went unnoticed because the same hardware works on Intel platforms and such cards are not very common, so the chance of meeting all the requirements to reproduce the bug is not that high.

To make the test meaningful, I used two PCI devices from two different makers, both not that old: they both support OpenGL 3.3 and have 512MB of VRAM, and one of them even has HDMI.

- PCI ATI Radeon HD 4350 (RV710, Terascale 1), HDMI + DVI-I + VGA
- PCI Nvidia Geforce 8400 GS rev.2 (NV98, Tesla 1.0), DVI-I + VGA

I ran tests on four computers:

- K8 PCIe based: Dell Optiplex 740 motherboard with AMD Athlon 64 X2 CPU (dual core), Nvidia C51 bridge, 6GB DDR2 667MHz, PCIe + PCI
- K10 AGP based: ASRock AM2NF3-VSTA motherboard with AMD Phenom II X4 970 CPU (quad core), Nvidia nForce3 bridge, 16GB DDR2 800MHz, AGP + PCI
- K8 AGP based: MSI MS-6702E motherboard with AMD Athlon 64 3200+ CPU (single core), VIA K8T800Pro, VT8237/8251 bridge, 3GB DDR 400MHz, AGP + PCI
- Intel PCIe based: Lenovo ThinkCentre M58 motherboard with Pentium E5200 CPU (dual core), Intel 82801 PCI Bridge, 1GB DDR3 1066MHz, PCIe + PCI

Both PCI GPUs work on the Intel-based computer, with performance that looks reasonable for PCI cards. You can find real-life test results here (look for “PCI”):
https://wiki.unvanquished.net/wiki/GPU_compatibility_matrix

I tested two Ubuntu versions and multiple kernels:

- Ubuntu 20.04 Focal LTS, Linux 5.4.0-48-generic
- Ubuntu 20.04 Focal LTS, Linux 5.4.0-47-generic
- Ubuntu 16.04 Xenial LTS, Linux 4.15.0-118-generic
- Ubuntu 16.04 Xenial LTS, Linux 4.8.0-36-generic
- Ubuntu 16.04 Xenial LTS, Linux 4.4.0-190-generic

All of these configurations fail with both PCI GPUs on the AMD K8/K10 platforms.

I have some logs and screenshots, which I will attach.

Revision history for this message
Thomas Debesse (illwieckz) wrote :

Hmm, there were minor mistakes in the host list; here is the fixed version:

- K10 AGP based: ASRock AM2NF3-VSTA motherboard with AMD Phenom II X4 970 CPU (quad core), Nvidia nForce3 bridge, 16GB DDR2 800MHz, AGP + PCI
- K8 PCIe based: Dell Optiplex 740 motherboard with AMD Athlon 64 X2 CPU (dual core), Nvidia C51 bridge, 6GB DDR2 667MHz, PCIe + PCI
- K8 AGP based: MSI MS-6702E motherboard with AMD Athlon 64 3200+ CPU (single core), VIA K8T800Pro, VT8237/8251 bridge, 3GB DDR 400MHz, AGP + PCI
- Intel PCIe based: Lenovo ThinkCentre M58 motherboard with Pentium E5200 CPU (dual core), Intel 82801 PCI Bridge, 1GB DDR3 1066MHz, PCIe + PCI

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1902795

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Thomas Debesse (illwieckz) wrote : Re: PCI graphics seems to be broken since years on AMD K8/K10 platform (work on Intel)
Revision history for this message
Thomas Debesse (illwieckz) wrote :
Revision history for this message
Thomas Debesse (illwieckz) wrote :
Revision history for this message
Thomas Debesse (illwieckz) wrote :
Revision history for this message
Thomas Debesse (illwieckz) wrote :

The interesting part on the Nvidia GeForce 8400 GS rev.2 may be:

[ 20.107995] nouveau 0000:03:00.0: DRM: GPU lockup - switching to software fbcon
[ 20.180130] nouveau 0000:03:00.0: [drm] fb0: nouveaudrmfb frame buffer device
[ 20.195263] [drm] Initialized nouveau 1.3.1 20120801 for 0000:03:00.0 on minor 0

You'll find the same lines in the 4.8 (Xenial) and 5.10-rc1 (Focal) dmesg logs, on both the K8 host (the one with an AGP port) and the K10 host (which also has an AGP port).

I forgot to mention that I also tested the following configuration and reproduced the bug with both the Nvidia and ATI PCI cards:

Ubuntu 20.04 Focal LTS Linux 5.10.0-051000rc1-generic (from mainline Ubuntu PPA).

Revision history for this message
Thomas Debesse (illwieckz) wrote :
Revision history for this message
Thomas Debesse (illwieckz) wrote :
Revision history for this message
Thomas Debesse (illwieckz) wrote :

I added two screenshots (photos of the screen) of the graphical glitches seen while running the Nvidia GeForce 8400 GS rev.2 PCI card on the K8 AGP and K10 AGP hosts. That is the last thing a user can see (it remains on screen); the desktop never appears. In those cases I retrieved the dmesg logs through SSH.

On the K8 PCIe host, the display goes off as soon as GRUB starts the kernel, so there is absolutely nothing to see. If I'm right, the lockup is so hard that the system does not keep running in the background, and I cannot connect through SSH.

Revision history for this message
Thomas Debesse (illwieckz) wrote :
Revision history for this message
Thomas Debesse (illwieckz) wrote :

I've added a screenshot (photo of the screen) of the GNOME desktop stuck and unresponsive while running the ATI Radeon HD 4350 PCI card on the K8 AGP host.

This is exactly the same symptom I get with ATI Radeon AGP cards on this host with kernel 5.4.0-48-generic and later (while the AGP card worked flawlessly on 5.4.0-47-generic); see #1899304:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1899304

On the K8 PCIe host, the display goes off as soon as GRUB starts the kernel, so there is absolutely nothing to see. If I'm right, the lockup is so hard that the system does not keep running in the background, and I cannot connect through SSH (yes, this is the same symptom I get with the Nvidia PCI card).

I don't have access to a recent enough (GL 3) Nvidia AGP card to compare its symptoms with this GL 3 Nvidia PCI card.

Revision history for this message
Thomas Debesse (illwieckz) wrote :

Here is what may be the interesting part of dmesg when running the ATI Radeon HD 4350 PCI card on the K10 AGP host with Ubuntu 20.04 Focal and the Linux 5.10-rc1 kernel. Note that these messages repeat endlessly and so quickly that the whole journal fills up rapidly (dropping earlier entries):

```
[ 46.802991] trying to bind memory to uninitialized GART !
[ 46.803170] WARNING: CPU: 2 PID: 2610 at drivers/gpu/drm/radeon/radeon_gart.c:297 radeon_gart_bind+0xf1/0x100 [radeon]
[ 46.803173] Modules linked in: zram snd_hda_codec_hdmi binfmt_misc snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_usb_audio snd_hda_core snd_usbmidi_lib snd_hwdep edac_mce_amd mc snd_seq_midi snd_pcm snd_seq_midi_event snd_rawmidi snd_seq joydev snd_seq_device snd_timer input_leds kvm_amd snd ccp soundcore kvm k10temp serio_raw mac_hid sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear radeon i2c_algo_bit ttm drm_kms_helper syscopyarea hid_generic sysfillrect sysimgblt fb_sys_fops uas usbhid cec usb_storage hid rc_core psmouse drm forcedeth i2c_nforce2
[ 46.803296] CPU: 2 PID: 2610 Comm: gnome-shell Not tainted 5.10.0-051000rc1-generic #202010291359
[ 46.803300] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AM2NF3-VSTA, BIOS P3.20 10/09/2009
[ 46.803376] RIP: 0010:radeon_gart_bind+0xf1/0x100 [radeon]
[ 46.803383] Code: 00 4c 89 e7 48 8b 40 60 e8 7c 53 3b cd 31 c0 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 c7 c7 48 1a 75 c0 e8 65 90 f6 cc <0f> 0b b8 ea ff ff ff eb dc 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48
[ 46.803387] RSP: 0018:ffffb0da81f57a38 EFLAGS: 00010282
[ 46.803393] RAX: 0000000000000000 RBX: ffff98801231c6e8 RCX: ffff98832fd18988
[ 46.803396] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff98832fd18980
[ 46.803400] RBP: ffffb0da81f57a70 R08: 0000000000000000 R09: ffffb0da81f57818
[ 46.803403] R10: ffffb0da81f57810 R11: ffffffff8e752ca8 R12: ffff988007a4f580
[ 46.803406] R13: ffffb0da81f57b08 R14: ffff988011b19200 R15: ffff98801231c6e8
[ 46.803411] FS: 00007f6e1d3dbcc0(0000) GS:ffff98832fd00000(0000) knlGS:0000000000000000
[ 46.803415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 46.803418] CR2: 000055b0490a78f8 CR3: 0000000106534000 CR4: 00000000000006e0
[ 46.803421] Call Trace:
[ 46.803501] radeon_ttm_tt_bind+0x7e/0x110 [radeon]
[ 46.803519] ttm_bo_handle_move_mem+0x484/0x4a0 [ttm]
[ 46.803534] ttm_bo_validate+0x137/0x150 [ttm]
[ 46.803552] ttm_bo_init_reserved+0x29f/0x320 [ttm]
[ 46.803567] ttm_bo_init+0x69/0xe0 [ttm]
[ 46.803639] ? radeon_update_memory_usage.isra.0+0x50/0x50 [radeon]
[ 46.803712] radeon_bo_create+0x186/0x200 [radeon]
[ 46.803784] ? radeon_update_memory_usage.isra.0+0x50/0x50 [radeon]
[ 46.803859] radeon_gem_object_create+0xad/0x190 [radeon]
[ 46.803934] ? radeon_gem_pwrite_ioctl+0x30/0x30 [radeon]
[ 46.804090] radeon_gem_create_ioctl+0x69/0x120 [radeon]
[ 46.804182] ? radeon_gem_pwrite_ioctl+0x30/0x30 [radeon]
[ 46.804279] drm_ioctl_kernel+0xae/0xf0 [drm]
[ 46.804353] drm_ioctl+0...
```

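The call trace goes from `radeon_gem_create_ioctl` (called by gnome-shell) down to `radeon_gart_bind`, which may explain why the warning floods the journal: each buffer the compositor creates in system memory ends in a GART bind that is refused (the `b8 ea ff ff ff` bytes in the dumped code load -22, i.e. `-EINVAL`). The Python sketch below is only a toy model of that behaviour; `ToyGart` and `create_gtt_buffer` are hypothetical names, not the radeon implementation.

```python
from dataclasses import dataclass, field

EINVAL = 22  # Linux errno for "Invalid argument"


@dataclass
class ToyGart:
    """Hypothetical GART model: maps GPU page slots to system page addresses."""
    num_pages: int
    ready: bool = False
    table: dict[int, int] = field(default_factory=dict)
    warnings: list[str] = field(default_factory=list)

    def init(self) -> None:
        # Stands in for the table setup done at device init; here it only
        # sets the flag that bind() checks.
        self.ready = True

    def bind(self, offset: int, pages: list[int]) -> int:
        if not self.ready:
            # Same message as in the log above; the bind is refused.
            self.warnings.append("trying to bind memory to uninitialized GART !")
            return -EINVAL
        if offset < 0 or offset + len(pages) > self.num_pages:
            return -EINVAL
        for i, addr in enumerate(pages):
            self.table[offset + i] = addr
        return 0


def create_gtt_buffer(gart: ToyGart, offset: int, pages: list[int]) -> int:
    # Hypothetical stand-in for the ioctl path in the call trace
    # (radeon_gem_create_ioctl -> ... -> radeon_ttm_tt_bind -> radeon_gart_bind):
    # in this model, every buffer placed in system memory ends in a bind.
    return gart.bind(offset, pages)
```

In this model, calling `create_gtt_buffer` 1000 times on a `ToyGart` whose `init()` was never called records 1000 warnings, one per buffer, which mirrors how a busy desktop session can fill the journal.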

summary: - PCI graphics seems to be broken since years on AMD K8/K10 platform (work
- on Intel)
+ PCI graphics broken on AMD K8/K10 platform (while it works on Intel)
+ verified from Linux 4.4 to 5.10-rc1
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Thomas Debesse (illwieckz) wrote : Re: PCI graphics broken on AMD K8/K10 platform (while it works on Intel) verified from Linux 4.4 to 5.10-rc1

I noticed that a similar bug was reported against the 3.2 kernel in 2012:
https://bugzilla.redhat.com/show_bug.cgi?id=785375

At the time, the bug was fixed by switching the PCI DMA bit mask from 40 bits to 32 bits:
https://bugzilla.redhat.com/attachment.cgi?id=603278

The original patch tested against the GPU chip family, but that seems wrong, because we now see the same GPUs working with a 40-bit mask on some Intel platforms and not on some AMD platforms.

This patch makes all PCI GPUs use 32-bit masks. This is expected to be suboptimal on platforms supporting 40-bit DMA masks, but it is the safest option. An alternative would be to test against the platform.
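To make the 40-bit versus 32-bit difference concrete: a 32-bit mask limits the device to the first 4 GiB of bus addresses, a 40-bit mask to the first 1 TiB. The AMD hosts above have up to 16GB of RAM, so with a 40-bit mask some buffers can presumably land above 4 GiB. In this small Python sketch, `dma_bit_mask` mirrors the arithmetic of the kernel's `DMA_BIT_MASK()` macro, while `reachable` and `choose_dma_bits` (and their parameters) are hypothetical helpers modelling the policy described above, not the radeon code.

```python
GIB = 1 << 30


def dma_bit_mask(n: int) -> int:
    # Same arithmetic as the kernel's DMA_BIT_MASK(n): the low n bits set.
    return (1 << 64) - 1 if n == 64 else (1 << n) - 1


def reachable(bus_addr: int, bits: int) -> bool:
    # A device limited to an n-bit DMA mask can only target addresses <= mask.
    return bus_addr <= dma_bit_mask(bits)


def choose_dma_bits(is_pci: bool, old_family: bool, patched: bool) -> int:
    # Hypothetical model of the policy described above, not the driver code:
    # before the patch, a PCI card got a 32-bit mask only for older chip
    # families; with the patch, every PCI card gets one. Others keep 40 bits.
    if is_pci and (patched or old_family):
        return 32
    return 40
```

For example, an address at 5 GiB is reachable with a 40-bit mask (`reachable(5 * GIB, 40)` is `True`) but not with a 32-bit one, which is why the 32-bit mask is the conservative choice for PCI cards.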

This patch is not enough to make PCI GPUs work on K8 and K10 platforms. Also, it only concerns Radeon hardware, while Nvidia hardware is affected on those platforms too (with both the nouveau and nvidia drivers).

This patch is enough to work around this error on ATI PCI devices on K10:

```
[drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
radeon 0000:03:00.0: disabling GPU acceleration
```

It also works around this one, on both ATI PCI devices on K10 and ATI AGP devices on Linux 5.9 (AGP disabled?):

```
trying to bind memory to uninitialized GART !
```
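The `ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)` message comes from a self-test. As far as I understand the radeon r600 code, the driver writes 0xCAFEDEAD into a scratch register, submits a ring command asking the GPU to write 0xDEADBEEF there, then polls until a timeout. Reading back 0xCAFEDEAD therefore means the GPU never executed the command; one possible explanation, consistent with the DMA mask patch fixing it, is that the GPU could not fetch the ring from system memory. The following is a toy Python model of that logic only; `ToyGpu` and `ring_test` are hypothetical names.

```python
SCRATCH_INIT = 0xCAFEDEAD    # marker the CPU writes before the test
SCRATCH_EXPECT = 0xDEADBEEF  # value the ring command asks the GPU to write


class ToyGpu:
    """Hypothetical GPU model: it only executes ring commands it can fetch."""

    def __init__(self, can_fetch_ring: bool) -> None:
        self.can_fetch_ring = can_fetch_ring
        self.scratch = 0

    def execute(self, ring: list[int]) -> None:
        if not self.can_fetch_ring:
            return  # commands never reach the GPU: scratch stays untouched
        for value in ring:
            self.scratch = value  # each toy command writes the scratch register


def ring_test(gpu: ToyGpu, timeout: int = 1000) -> str:
    gpu.scratch = SCRATCH_INIT       # 1. CPU writes a known marker
    gpu.execute([SCRATCH_EXPECT])    # 2. submit one "write scratch" command
    for _ in range(timeout):         # 3. poll until the GPU answers or we give up
        if gpu.scratch == SCRATCH_EXPECT:
            return "ok"
    return f"ring 0 test failed (scratch=0x{gpu.scratch:08X})"
```

With `ToyGpu(can_fetch_ring=False)` the test reports the untouched 0xCAFEDEAD marker, the same value seen in the log.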

This is not enough to fix PCI GPUs on K8 (K8T800) and K10 (nForce3); also, Nvidia PCI GPUs are affected by at least one other bug, which may be shared with the ATI ones.

This is not a call to disable AGP: other bug(s) still leave such AGP hardware unusable once AGP is disabled.

This patch was written against vanilla Linux 5.8 but applies cleanly on 5.9 and 5.10-rc2.

Revision history for this message
Thomas Debesse (illwieckz) wrote :
Revision history for this message
Thomas Debesse (illwieckz) wrote :
Revision history for this message
Thomas Debesse (illwieckz) wrote :

Here is a dmesg log from September, running the Nvidia GeForce 8400 GS rev.2 PCI card with nouveau on the K8 non-AGP host with the Nvidia C51 PCI Express bridge. That time, I at least managed to connect through SSH while there was no display. The dmesg log reports a GPU lockup.

Again, this PCI GPU is known to work with the nouveau driver when plugged into an Intel platform.

Here are what may be the interesting parts:

```
[ 52.256093] nouveau 0000:05:00.0: DRM: core notifier timeout
[ 54.256228] nouveau 0000:05:00.0: DRM: base-0: timeout
[ 54.256455] Console: switching to colour frame buffer device 240x67
[ 54.256542] nouveau 0000:05:00.0: fifo: DMA_PUSHER - ch 1 [DRM] get 0000000000 put 0000000000 ib_get 00000002 ib_put 00000003 state a0000000 (err: IB_EMPTY) push 00406040
[ 54.256633] nouveau 0000:05:00.0: fifo: DMA_PUSHER - ch 1 [DRM] get 0000000000 put 0000000000 ib_get 00000003 ib_put 00000011 state a0000000 (err: IB_EMPTY) push 00406040
[ 54.256687] nouveau 0000:05:00.0: fifo: DMA_PUSHER - ch 1 [DRM] get 0000000000 put 0000000000 ib_get 00000011 ib_put 00000020 state a0000000 (err: IB_EMPTY) push 003020b0
[ 54.256740] nouveau 0000:05:00.0: fifo: DMA_PUSHER - ch 1 [DRM] get 0000000000 put 0000000000 ib_get 00000020 ib_put 00000033 state a0000000 (err: IB_EMPTY) push 00406040
[ 54.256811] nouveau 0000:05:00.0: fifo: DMA_PUSHER - ch 1 [DRM] get 0000000000 put 0000000000 ib_get 00000033 ib_put 00000049 state a0000000 (err: IB_EMPTY) push 003020b0
[ 54.725747] nouveau 0000:05:00.0: DRM: GPU lockup - switching to software fbcon
[ 54.808979] nouveau 0000:05:00.0: fb0: nouveaudrmfb frame buffer device
[ 54.809352] [drm] Initialized nouveau 1.3.1 20120801 for 0000:05:00.0 on minor 1
```

Revision history for this message
Thomas Debesse (illwieckz) wrote :

Here is a dmesg log from September, running the Nvidia GeForce 8400 GS rev.2 PCI card on the K8 non-AGP host with the Nvidia C51 PCI Express bridge. This time the GPU is driven by the proprietary, closed-source nvidia driver.

Again, this PCI GPU is known to work with the nvidia driver when plugged into an Intel platform.

Here are what may be the interesting parts:

```
[ 76.927044] NVRM: GPU at PCI:0000:05:00: GPU-d18ccf5d-6557-e114-0ca8-23449bccf157
[ 76.927050] NVRM: Xid (PCI:0000:05:00): 6, PE0002
[ 78.018458] NVRM: Xid (PCI:0000:05:00): 8, Channel 00000001
[ 80.010804] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 82.017647] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 90.401898] NVRM: Xid (PCI:0000:05:00): 8, Channel 0000007e
[ 92.385614] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 96.321610] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 106.785733] NVRM: Xid (PCI:0000:05:00): 8, Channel 0000007e
[ 108.785676] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 112.785766] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 123.169763] NVRM: Xid (PCI:0000:05:00): 8, Channel 0000007e
[ 125.153618] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 129.153612] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 138.079318] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than pnp 00:06 [mem 0x000ce000-0x000cffff]
[ 138.079726] caller os_map_kernel_space+0x6d/0xb0 [nvidia] mapping multiple BARs
[ 145.569349] NVRM: GPU at PCI:0000:05:00: GPU-d18ccf5d-6557-e114-0ca8-23449bccf157
[ 145.569386] NVRM: Xid (PCI:0000:05:00): 6, PE0002
[ 146.689722] NVRM: Xid (PCI:0000:05:00): 8, Channel 00000001
[ 148.689670] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 150.690247] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 159.009723] NVRM: Xid (PCI:0000:05:00): 8, Channel 0000007e
[ 161.009666] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 165.009757] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 175.393763] NVRM: Xid (PCI:0000:05:00): 8, Channel 0000007e
[ 177.393696] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 181.393788] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 191.777761] NVRM: Xid (PCI:0000:05:00): 8, Channel 0000007e
[ 193.505609] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[ 197.717628] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
```
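The `Xid` lines are the NVIDIA driver's own error reports, and the excerpt shows the same pattern repeating. When comparing logs across hosts or kernels, a tally per Xid number can help. Here is a small Python sketch; the regular expression is written against the line format shown above, `count_xids` is a hypothetical helper, and the meaning of each Xid number (documented by NVIDIA) is not interpreted here.

```python
import re
from collections import Counter

# Matches lines such as:
# [ 78.018458] NVRM: Xid (PCI:0000:05:00): 8, Channel 00000001
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def count_xids(log: str) -> Counter:
    """Count NVRM Xid events per (PCI address, Xid number)."""
    counts: Counter = Counter()
    for line in log.splitlines():
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts
```

On the excerpt above, it finds two Xid 6 and eight Xid 8 events for `0000:05:00`: the same sequence (one Xid 6 followed by four Xid 8) occurring twice.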

Note that there are two things you can ignore in this log, this part:

```
[ 47.301362] NVRM: The NVIDIA GeForce 6150 LE GPU installed in this system is
               NVRM: supported through the NVIDIA 304.xx Legacy drivers. Please
               NVRM: visit http://www.nvi...
```


Revision history for this message
Thomas Debesse (illwieckz) wrote :

Just a reupload of the previously posted patch, fixing some typos.

tags: added: patch
tags: added: amd64 focal
tags: added: kernel-bug
tags: added: xenial
description: updated
Revision history for this message
Thomas Debesse (illwieckz) wrote :

See patch and comments on https://lkml.org/lkml/2020/11/5/307

The patch was rewritten so that the commit message is shorter and the comment is better worded.

Revision history for this message
Thomas Debesse (illwieckz) wrote :

I've reproduced the issue on a Piledriver platform with an AMD FX-9590 CPU, with both the ATI and Nvidia PCI GPUs.

With the ATI GPU, I get the usual symptom: the display freezes while the GNOME desktop is partially loaded. It's possible to open a TTY console by switching consoles, but once back on the graphical desktop, the computer becomes permanently unresponsive. This is similar to what is seen on older hardware like the K10 and K8 platforms.

With the Nvidia GPU, I get garbage on screen (some parts of the GNOME desktop are even recognizable), and the computer is unresponsive. Sometimes the display is lost, then comes back, and it keeps cycling like that. This is the usual symptom seen on older hardware like the K10 and K8 platforms.

summary: - PCI graphics broken on AMD K8/K10 platform (while it works on Intel)
- verified from Linux 4.4 to 5.10-rc1
+ PCI graphics broken on AMD K8/K10/Piledriver platform (while it works on
+ Intel) verified from Linux 4.4 to 5.10-rc1
summary: PCI graphics broken on AMD K8/K10/Piledriver platform (while it works on
- Intel) verified from Linux 4.4 to 5.10-rc1
+ Intel) verified from Linux 4.4 to 5.10-rc2