Comment 7 for bug 2036742

Mario Limonciello (superm1) wrote: Re: amdgpu crash on Mantic

> [ 5.134271] kernel: [drm:detect_link_and_local_sink [amdgpu]] *ERROR* No EDID read.
> [ 5.322247] kernel: [drm:detect_link_and_local_sink [amdgpu]] *ERROR* No EDID read.
> [ 5.510230] kernel: [drm:detect_link_and_local_sink [amdgpu]] *ERROR* No EDID read.

Is this connected through a KVM switch? The inability to read the EDID is concerning.

> UBSAN warnings could be a red herring. They've added a compiler flag that complains about flexible arrays if they're declared incorrectly (false positive). Will take a look tomorrow.

Yeah, I agree they're probably a red herring. The actual issue is that the UVD IP block fails to initialize due to a timeout (-110 is -ETIMEDOUT):

[ 6.025262] kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd test failed (-110)
[ 6.025511] kernel: [drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <uvd_v6_0> failed -110
[ 6.025661] kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed
[ 6.025663] kernel: amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init
[ 6.025737] kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
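
For context on the flexible-array point: those warnings usually come from the kernel's array-bounds instrumentation tripping over the old pre-C99 one-element trailing-array idiom. A minimal sketch of the pattern (hypothetical struct and function names, not taken from the amdgpu code):

/*
 * Hypothetical illustration of the pattern behind such array-bounds
 * warnings; not from the amdgpu source.
 */
struct old_style {
        int count;
        int data[1];    /* pre-C99 idiom for a variable-length trailing array */
};

struct c99_style {
        int count;
        int data[];     /* C99 flexible array member; no bounds complaint */
};

static int sum_old(const struct old_style *s)
{
        int total = 0;

        /*
         * The old idiom indexes past data[0], but the instrumented build
         * only knows about one element, so i >= 1 gets flagged even when
         * the underlying allocation is large enough -- a false positive.
         */
        for (int i = 0; i < s->count; i++)
                total += s->data[i];
        return total;
}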

As a potential workaround (not a fix), you might be able to skip initialization of the uvd_v6_0 IP block.
To do this, look up its IP block number in your logs:

[ 4.836457] kernel: [drm] add ip block number 0 <vi_common>
[ 4.836458] kernel: [drm] add ip block number 1 <gmc_v8_0>
[ 4.836459] kernel: [drm] add ip block number 2 <tonga_ih>
[ 4.836459] kernel: [drm] add ip block number 3 <gfx_v8_0>
[ 4.836460] kernel: [drm] add ip block number 4 <sdma_v3_0>
[ 4.836461] kernel: [drm] add ip block number 5 <powerplay>
[ 4.836462] kernel: [drm] add ip block number 6 <dm>
[ 4.836462] kernel: [drm] add ip block number 7 <uvd_v6_0>
[ 4.836463] kernel: [drm] add ip block number 8 <vce_v3_0>

Then you can add "amdgpu.ip_block_mask=0xffffff7f" to your kernel command line to skip IP block 7 (uvd_v6_0).
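
For reference, each bit of amdgpu.ip_block_mask corresponds to the IP block with the matching number, so that value is simply all bits set with bit 7 cleared. A quick sketch of the derivation:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* Clear bit 7 to disable IP block 7 (uvd_v6_0) from the list above. */
        uint32_t mask = 0xffffffffu & ~(1u << 7);

        printf("amdgpu.ip_block_mask=0x%08x\n", mask);  /* prints 0xffffff7f */
        return 0;
}

If the block had a different number in your logs, you'd clear that bit instead.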

If that helps, it would confirm the out-of-bounds warnings are a red herring and the real issue is UVD. I'd like to see data points for those other kernels I suggested so we can narrow down when this problem started.