VI dGPU fails to initialize on Intel platforms w/ 5.14+

Bug #2036742 reported by Paolo Gentili
This bug affects 4 people
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Fix Released  Undecided   Unassigned
Jammy           Confirmed     High        Unassigned
Lunar           Won't Fix     High        Unassigned
Mantic          Confirmed     High        Unassigned

Bug Description

[Impact]

When booting the latest Mantic Desktop daily image (2023-09-20) from USB, nothing is displayed on screen after the initial boot logs. The system is still alive, since _autoinstall_ works as intended. Once the system is provisioned, the problem persists.

It seems related to https://bugs.launchpad.net/ubuntu/+source/linux-firmware/+bug/2029396 .

dmesg attached.

[Test Case]

Live boot Ubuntu Mantic Desktop canary (2023-09-19)

[Where Problems Could Occur]

Dell Optiplex 5090
- Intel Core(TM) i7-11700
- Advanced Micro Devices, Inc. [AMD/ATI] - 1002:699f

Tags: mantic patch
Paolo Gentili (pgentili) wrote :

Linux 6.3

Paolo Gentili (pgentili) wrote :
description: updated
Changed in linux-firmware (Ubuntu):
milestone: none → ubuntu-23.10
Dimitri John Ledkov (xnox) wrote :

[ 4.918050] kernel: UBSAN: array-index-out-of-bounds in /build/linux-IPoq5q/linux-6.5.0/drivers/gpu/drm/amd/amdgpu/../pm/powerplay/hwmgr/smu7_hwmgr.c:3669:4

is not good

Mario Limonciello (superm1) wrote :

> It seems related to https://bugs.launchpad.net/ubuntu/+source/linux-firmware/+bug/2029396 .

I don't believe these to be related. That issue is specifically with navi3x dGPU, your system has a much older dGPU.

Your 6.3 and 6.5 logs both appear to crash similarly. Do you have a point in time when this system worked correctly?

Can you try a few of the mainline kernels to narrow it down? These are the ones I would think are the best candidates:

https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.6-rc2/
https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.1.54/
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.19.17/
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.132/

affects: linux-firmware (Ubuntu) → linux (Ubuntu)
Juerg Haefliger (juergh) wrote :

UBSAN warnings could be a red herring. They've added a compiler flag that complains about flexible arrays if they're declared incorrectly (false positive). Will take a look tomorrow.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2036742

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Mario Limonciello (superm1) wrote : Re: amdgpu crash on Mantic

> [ 5.134271] kernel: [drm:detect_link_and_local_sink [amdgpu]] *ERROR* No EDID read.
> [ 5.322247] kernel: [drm:detect_link_and_local_sink [amdgpu]] *ERROR* No EDID read.
> [ 5.510230] kernel: [drm:detect_link_and_local_sink [amdgpu]] *ERROR* No EDID read.

Is this connected to a KVM? The lack of reading the EDID is concerning.

> UBSAN warnings could be a red herring. They've added a compiler flag that complains about flexible arrays if they're declared incorrectly (false positive). Will take a look tomorrow.

Yeah, I agree they're probably a red herring. The actual issue is that the UVD IP block fails to init due to a timeout:

[ 6.025262] kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd test failed (-110)
[ 6.025511] kernel: [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <uvd_v6_0> failed -110
[ 6.025661] kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
[ 6.025663] kernel: amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
[ 6.025737] kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
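
For anyone comparing several dmesg captures, here is a minimal sketch of pulling the failing IP block and error code out of a log. The function name and the small errno table are hypothetical helpers written for illustration; the only assumptions about the kernel are the log format shown above and the Linux errno values 2 (ENOENT) and 110 (ETIMEDOUT).

```python
import re

# Linux errno values that appear in this thread (illustrative subset only).
LINUX_ERRNO = {2: "ENOENT (no such file)", 110: "ETIMEDOUT (timed out)"}

# Matches lines such as:
#   ... *ERROR* hw_init of IP block <uvd_v6_0> failed -110
INIT_FAILURE = re.compile(r"(sw_init|hw_init) of IP block <(\w+)> failed (-?\d+)")


def find_ip_block_failures(dmesg_text):
    """Return (stage, block, code, meaning) for each failed IP block init."""
    failures = []
    for line in dmesg_text.splitlines():
        match = INIT_FAILURE.search(line)
        if match:
            stage, block = match.group(1), match.group(2)
            code = int(match.group(3))
            meaning = LINUX_ERRNO.get(abs(code), "unknown")
            failures.append((stage, block, code, meaning))
    return failures


sample = (
    "[ 6.025511] kernel: [drm:amdgpu_device_ip_init [amdgpu]] "
    "*ERROR* hw_init of IP block <uvd_v6_0> failed -110"
)
print(find_ip_block_failures(sample))
```

Run against a full dmesg, this should make it easy to see whether a given kernel fails in hw_init with a timeout or earlier in sw_init with a missing file.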

As a potential workaround (this isn't a solution), you might be able to skip the uvd_v6_0 IP block init.
To do this, you need to look up its IP block number, which your logs show:

[ 4.836457] kernel: [drm] add ip block number 0 <vi_common>
[ 4.836458] kernel: [drm] add ip block number 1 <gmc_v8_0>
[ 4.836459] kernel: [drm] add ip block number 2 <tonga_ih>
[ 4.836459] kernel: [drm] add ip block number 3 <gfx_v8_0>
[ 4.836460] kernel: [drm] add ip block number 4 <sdma_v3_0>
[ 4.836461] kernel: [drm] add ip block number 5 <powerplay>
[ 4.836462] kernel: [drm] add ip block number 6 <dm>
[ 4.836462] kernel: [drm] add ip block number 7 <uvd_v6_0>
[ 4.836463] kernel: [drm] add ip block number 8 <vce_v3_0>

Then you can add "amdgpu.ip_block_mask=0xffffff7f" to your kernel command line to skip IP block 7 (uvd_v6_0).
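
The mask is just 0xffffffff with the bit for each skipped block cleared. A minimal sketch of that arithmetic, assuming the "add ip block number" log format above; the helper names are hypothetical and not part of any amdgpu tooling:

```python
import re

# Matches lines such as: [drm] add ip block number 7 <uvd_v6_0>
IP_BLOCK_LINE = re.compile(r"add ip block number (\d+) <(\w+)>")


def parse_ip_blocks(dmesg_text):
    """Map IP block name -> block number from 'add ip block number' lines."""
    return {name: int(num) for num, name in IP_BLOCK_LINE.findall(dmesg_text)}


def ip_block_mask(blocks, skip):
    """Build the kernel parameter string with one bit cleared per skipped block."""
    mask = 0xFFFFFFFF
    for name in skip:
        mask &= ~(1 << blocks[name])
    return f"amdgpu.ip_block_mask=0x{mask & 0xFFFFFFFF:08x}"


log = """\
[ 4.836462] kernel: [drm] add ip block number 7 <uvd_v6_0>
[ 4.836463] kernel: [drm] add ip block number 8 <vce_v3_0>
"""
blocks = parse_ip_blocks(log)
print(ip_block_mask(blocks, ["uvd_v6_0"]))              # amdgpu.ip_block_mask=0xffffff7f
print(ip_block_mask(blocks, ["uvd_v6_0", "vce_v3_0"]))  # amdgpu.ip_block_mask=0xfffffe7f
```

Block numbers can differ between GPUs and kernel versions, so the mask should be derived from the logs of the kernel actually being booted.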

If that helps the issue then it does confirm the out of bounds check is a red herring and the real issue is the uvd stuff. I'd like to see data points for those other kernels I suggested to narrow down when this problem started.

Timo Aaltonen (tjaalton) wrote :

Yes it's a KVM of sorts.. But apparently this did work before, so a rough bisect with mainline builds will be conducted.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Paolo Gentili (pgentili) wrote :

The setup involves a custom KVM indeed, I'll attach the EDID file involved in this setup.

I tested everything as requested, with no luck unfortunately. The working configuration, which I've now replicated, involves the OEM image for Ubuntu 20.04, with which the device was certified.

Please find attached every collected dmesg and the EDID file.

Mario Limonciello (superm1) wrote :

> dmesg-ip-block-mask

[ 6.150330] kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vce0 test failed (-110)
[ 6.150581] kernel: [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <vce_v3_0> failed -110
[ 6.150726] kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
[ 6.150728] kernel: amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
[ 6.150730] kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.

It looks like the vce IP block also fails if you ignore uvd. We can play whack-a-mole on the others (if you clear bit 8 also), but I suspect this means the issue is not in the IP blocks themselves but in higher-level code.

> dmesg-5.15

This one fails differently from all the newer ones. It's failing because of a missing firmware file:

[ 2.842833] kernel: amdgpu 0000:01:00.0: Direct firmware load for amdgpu/polaris12_k_mc.bin failed with error -2
[ 2.842836] kernel: amdgpu: mc: Failed to load firmware "amdgpu/polaris12_k_mc.bin"
[ 2.842841] kernel: [drm:gmc_v8_0_sw_init [amdgpu]] *ERROR* Failed to load mc firmware!
[ 2.843033] kernel: [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v8_0> failed -2

Any idea why that's missing?

I noticed that it also failed for i915, it feels like a totally missing firmware package.

[ 2.743220] kernel: i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=mem
[ 2.744329] kernel: i915 0000:00:02.0: Direct firmware load for i915/rkl_dmc_ver2_03.bin failed with error -2
[ 2.744333] kernel: i915 0000:00:02.0: [drm] Failed to load DMC firmware i915/rkl_dmc_ver2_03.bin. Disabling runtime power management.
[ 2.744334] kernel: i915 0000:00:02.0: [drm] DMC firmware homepage: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915

> dmesg 5.10.0-1034-oem

Can you see if it keeps working with latest 5.10 OEM? 1057 it looks like is newest.

Paolo Gentili (pgentili) wrote :

> Can you see if it keeps working with latest 5.10 OEM? 1057 it looks like is newest.

Yes, it still works with that version.

Mario Limonciello (superm1) wrote :

Thanks for checking. So good to know it hasn't regressed from stable in 5.10. Can you redo your 5.15 test with the firmware in place so we can see if we're OK there or not?

Mario Limonciello (superm1) wrote :

I don't expect it helps your boot issue, but the UBSAN issue will be fixed by this commit.

https://<email address hidden>/T/#me31ff6b88640b03be1a8edfc6fc8878ac78ca6bb

Please redo the test with 5.15.

Paolo Gentili (pgentili) wrote :

> Thanks for checking. So good to know it hasn't regressed from stable in 5.10. Can you redo your 5.15 test with the firmware in place so we can see if we're OK there or not?

Unfortunately still no luck. I booted Focal with 5.10 OEM and then rebooted to Mantic with 5.15.134. The screen freezes at the end of boot logs. Attached dmesg.

Mario Limonciello (superm1) wrote :

Something is really fishy here: the 5.15 test is again missing firmware for i915, amdgpu and ath10k:

[ 2.345119] kernel: i915 0000:00:02.0: Direct firmware load for i915/rkl_dmc_ver2_03.bin failed with error -2
[ 2.345122] kernel: i915 0000:00:02.0: [drm] Failed to load DMC firmware i915/rkl_dmc_ver2_03.bin. Disabling runtime power management.
[ 2.345123] kernel: i915 0000:00:02.0: [drm] DMC firmware homepage: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915

[ 2.361656] kernel: amdgpu 0000:01:00.0: Direct firmware load for amdgpu/polaris12_k_mc.bin failed with error -2
[ 2.361658] kernel: amdgpu: mc: Failed to load firmware "amdgpu/polaris12_k_mc.bin"
[ 2.361662] kernel: [drm:gmc_v8_0_sw_init [amdgpu]] *ERROR* Failed to load mc firmware!
[ 2.361797] kernel: [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* sw_init of IP block <gmc_v8_0> failed -2

[ 2.545915] kernel: ath10k_pci 0000:03:00.0: Failed to find firmware-N.bin (N between 2 and 6) from ath10k/QCA6174/hw3.0: -2
[ 2.545920] kernel: ath10k_pci 0000:03:00.0: could not fetch firmware files (-2)
[ 2.545922] kernel: ath10k_pci 0000:03:00.0: could not probe fw (-2)

I think you might be missing another firmware package. Did it get split up in Mantic?

Paolo Gentili (pgentili) wrote :

On Focal, it also works with 5.14.0-1059. dmesg attached.

Kristijan Žic (kristijan-zic) wrote :

I think I have the same issue but I'm not sure. Please advise if I can test anything and how?

GPU: AMD Radeon RX Vega 64 Liquid
CPU: AMD Threadripper 1900x
DS: Wayland

With the new installer:
The new installer crashes the entire session when it opens the “Connectivity” screen.

---------------------------------------------------

Having installed the OS using the legacy installer:

About 10s to 30s into launching and using any app the screen freezes and then the screen turns off and comes back to display colourful artifacts while the screen is still frozen.

Here’s an example of what happens when I start using App Store in the attachment.

Kristijan Žic (kristijan-zic) wrote :

Here’s an example of what happens when I start using Brave browser in the attachment.

With X it crashes, then restarts gdm and brings me to the login screen.
In any case it’s unusable, as it crashes maybe 10 seconds into opening any app.

Mario Limonciello (superm1) wrote :

This seems like a different bug to me. Can you please open a different report and attach the relevant logs from the journal to it?

Timo Aaltonen (tjaalton) wrote :

the firmware on mantic is zstd compressed, so mainline builds from the past can't load the firmware..

Mario Limonciello (superm1) wrote :

> the firmware on mantic is zstd compressed, so mainline builds from the past can't load the firmware..

As a hack then maybe just clone https://gitlab.com/kernel-firmware/linux-firmware and put everything (uncompressed) in /lib/firmware/updates/ to get by that.

Or run the check on Jammy instead.
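
Before rerunning a test with an older mainline build, it may help to check whether the firmware files named in the failing logs exist uncompressed. A rough sketch, assuming the firmware tree lives under a single root such as /lib/firmware and that compressed files carry a .zst or .xz suffix; the function name is hypothetical:

```python
from pathlib import Path


def firmware_status(root, names):
    """Classify each firmware name as 'uncompressed', 'compressed-only' or 'missing'."""
    status = {}
    for name in names:
        plain = Path(root) / name
        if plain.is_file():
            status[name] = "uncompressed"
        elif any(Path(f"{plain}{ext}").is_file() for ext in (".zst", ".xz")):
            status[name] = "compressed-only"
        else:
            status[name] = "missing"
    return status


# Firmware names taken from the "Direct firmware load ... failed" lines in this bug.
wanted = ["amdgpu/polaris12_k_mc.bin", "i915/rkl_dmc_ver2_03.bin"]
print(firmware_status("/lib/firmware", wanted))
```

A "compressed-only" result would be consistent with the -2 errors seen on kernels that cannot read compressed firmware.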

Paolo Gentili (pgentili) wrote :

I can't install Jammy because I can't see the installer, so I did the test on Focal.

5.13.19 works, 5.14-rc2 fails (same for 5.14.0). Dmesg attached.

Mario Limonciello (superm1) wrote :

To explain some of the differences in the logs:

* UVD is the IP block that fails to init (so no "UVD and UVD ENC initialized successfully").
* Once an IP block fails, next one isn't even tried (so no "VCE initialized successfully")
* kfd doesn't initialize because amdgpu_amdkfd_device_init() won't be called until end of hw_init (all blocks must succeed)
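
The three points above can be sketched as a toy model. This is not the driver code; it is a simplified illustration of the ordering described, with a hypothetical init_device() helper standing in for the real init path:

```python
def init_device(blocks, hw_init):
    """Toy model: init blocks in order, stop at the first failure.

    blocks  -- ordered list of IP block names
    hw_init -- callable taking a block name, returning 0 or a negative errno
    """
    initialized = []
    for name in blocks:
        rc = hw_init(name)
        if rc != 0:
            # Later blocks are never tried, and kfd is not started.
            return {"initialized": initialized, "failed": (name, rc), "kfd_started": False}
        initialized.append(name)
    # kfd init only happens once every block has succeeded.
    return {"initialized": initialized, "failed": None, "kfd_started": True}


blocks = ["vi_common", "gmc_v8_0", "tonga_ih", "gfx_v8_0", "sdma_v3_0",
          "powerplay", "dm", "uvd_v6_0", "vce_v3_0"]

# Simulate the failure from the logs: uvd_v6_0 times out with -110.
result = init_device(blocks, lambda name: -110 if name == "uvd_v6_0" else 0)
print(result["failed"], result["kfd_started"])
print("vce_v3_0" in result["initialized"])
```

In this model vce_v3_0 never shows up as initialized, which matches the missing "VCE initialized successfully" line in the failing logs.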

Between 5.13.19 and 5.14-rc2 there aren't any UVD changes that would cause this.

$ git log --oneline v5.13.19..v5.14-rc2 amdgpu/amdgpu_uvd.c amdgpu/uvd_v6_0.c
09b020bb05a5 Merge tag 'drm-misc-next-2021-06-09' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
d3fae3b3daac dma-buf: drop the _rcu postfix on function names v3
5745d647d556 Merge tag 'amd-drm-next-5.14-2021-06-02' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
66c46621c812 amdgpu: remove unreachable code
ae4c0d7674a7 drm/amdgpu: make sure we unpin the UVD BO
f89f8c6bafd0 drm/amdgpu: Guard against write accesses after device removal
355b60296143 Merge drm/drm-next into drm-misc-next

This tells me it's probably something outside of the direct code causing it. Without seeing further steps into the bisect the biggest suspect in my mind is 0064b0ce85bb.

Can you please try amdgpu.aspm=0?
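
After rebooting, it is worth confirming that the parameter actually reached the kernel. A small sketch that parses a kernel command line string (for example the contents of /proc/cmdline); the helper names are hypothetical, and the parsing is a plain whitespace split that ignores quoted values:

```python
def kernel_params(cmdline):
    """Split a kernel command line into a {key: value} dict (value None for bare flags)."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def aspm_disabled(cmdline):
    """True if amdgpu.aspm=0 is present on the given command line."""
    return kernel_params(cmdline).get("amdgpu.aspm") == "0"


# Example command line; the image path and root device are made up for illustration.
example = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda2 ro quiet splash amdgpu.aspm=0"
print(aspm_disabled(example))
```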

Mario Limonciello (superm1) wrote :

If amdgpu.aspm=0 helps the issue, then can you please test this patch against Mantic? If it helps I'll submit it.

Paolo Gentili (pgentili) wrote :

With `amdgpu.aspm=0` it works fine.

Mario Limonciello (superm1) wrote :

Great, thanks! Let me know if the patch works too.

tags: added: patch
Paolo Gentili (pgentili) wrote :

Ok the patch fixes the problem! I can see a stack trace still but it's probably related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036742/comments/13 ?

Thanks!

Mario Limonciello (superm1) wrote :

Great! Thanks I've posted it for review.

https://lore<email address hidden>/T/#u

Yes the "UBSAN: array-index-out-of-bounds" traces are to be fixed in 6.7. If you want to pull those patches early I can point you at them.

Mario Limonciello (superm1) wrote :

I'd say let's track the out of bounds stuff in this bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2039926

summary: - amdgpu crash on Mantic
+ VI dGPU fails to initialize on Intel platforms w/ 5.14+
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Mario Limonciello (superm1) wrote :

Here's the mainline commit with the fix for your issue:

https://github.com/torvalds/linux/commit/64ffd2f1d00c6235dabe9704bbb0d9ce3e28147f

Changed in linux (Ubuntu Jammy):
status: New → Confirmed
Changed in linux (Ubuntu Lunar):
status: New → Confirmed
Changed in linux (Ubuntu Mantic):
status: New → Confirmed
Changed in linux (Ubuntu Jammy):
importance: Undecided → High
Changed in linux (Ubuntu Lunar):
importance: Undecided → High
Changed in linux (Ubuntu Mantic):
importance: Undecided → High
Brian Murray (brian-murray) wrote :

Ubuntu 23.04 (Lunar Lobster) has reached end of life, so this bug will not be fixed for that specific release.

Changed in linux (Ubuntu Lunar):
status: Confirmed → Won't Fix
Timo Aaltonen (tjaalton)
Changed in linux (Ubuntu):
status: Triaged → Fix Released