Keeps rebooting with AMD W6400, W6600, and W6800 graphic cards

Bug #2000110 reported by AceLan Kao
This bug affects 1 person
Affects                   Status         Importance  Assigned to  Milestone
HWE Next                  New            Undecided   Unassigned
linux (Ubuntu)            Fix Released   Undecided   AceLan Kao
  Jammy                   Won't Fix      Undecided   Unassigned
  Kinetic                 Fix Released   Undecided   AceLan Kao
  Lunar                   Fix Released   Undecided   AceLan Kao
linux-oem-5.17 (Ubuntu)   Invalid        Undecided   Unassigned
  Jammy                   Fix Released   Undecided   AceLan Kao
  Kinetic                 Invalid        Undecided   Unassigned
  Lunar                   Invalid        Undecided   Unassigned
linux-oem-6.0 (Ubuntu)    Invalid        Undecided   Unassigned
  Jammy                   Fix Released   Undecided   AceLan Kao
  Kinetic                 Invalid        Undecided   Unassigned
  Lunar                   Invalid        Undecided   Unassigned

Bug Description

[Impact]
With AMD W6400, W6600, or W6800 graphics cards, the system keeps rebooting while entering graphics mode.

[Fix]
These 3 commits fix the issue:
a8671493d207 drm/amdgpu: make sure to init common IP before gmc
e3163bc8ffdf drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for vega
dc1d85cb790f drm/amdgpu: move nbio ih_doorbell_range() into ih code for vega

Cherry-picked one more commit, as it fixes e3163bc8ffdfdb ("drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for vega"):
50b0e4d4da09 drm/amdgpu: fix sdma doorbell init ordering on APUs

======
V1.
AMD provides a list of commits to fix this issue
1. drm/amdgpu: Remove ATC L2 access for MMHUB 2.1.x
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20221107&id=d2c4c1569a7d7d5c8f75963bf2d62d7aeac30e2a
2. drm/amdgpu: Don't enable LTR if not supported
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20221107&id=6c20490663553cd7e07d8de8af482012329ab9d6
3. Patch series: fix PCI AER issues
drm/amdgpu: make sure to init common IP before gmc
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20221107&id=a8671493d2074950553da3cf07d1be43185ef6c6
drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for vega
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20221107&id=e3163bc8ffdfdb405e10530b140135b2ee487f89
drm/amdgpu: move nbio ih_doorbell_range() into ih code for vega
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20221107&id=dc1d85cb790f2091eea074cee24a704b2d6c4a06
4. Disable amdgpu runpm; an update will follow later.
https://patchwork.freedesktop.org/patch/507366/

[Test]
Verified on a machine with an AMD graphics card and confirmed that the issue is gone.

[Where problems could occur]
The main idea is to disable BACO on those cards. The fix is quite straightforward, with 2 "Fixes" commits, and it affects only a limited set of cards.

AceLan Kao (acelankao)
tags: added: oem-priority originate-from-1989125 somerville
Changed in linux (Ubuntu Jammy):
status: New → Invalid
status: Invalid → Won't Fix
Changed in linux (Ubuntu Kinetic):
assignee: nobody → AceLan Kao (acelankao)
status: New → In Progress
Changed in linux (Ubuntu Lunar):
assignee: nobody → AceLan Kao (acelankao)
status: New → In Progress
Changed in linux-oem-5.17 (Ubuntu Jammy):
assignee: nobody → AceLan Kao (acelankao)
status: New → In Progress
Changed in linux-oem-5.17 (Ubuntu Kinetic):
status: New → Invalid
Changed in linux-oem-5.17 (Ubuntu Lunar):
status: New → Invalid
Changed in linux-oem-6.0 (Ubuntu Jammy):
assignee: nobody → AceLan Kao (acelankao)
status: New → In Progress
Changed in linux-oem-6.0 (Ubuntu Kinetic):
status: New → Invalid
Changed in linux-oem-6.0 (Ubuntu Lunar):
status: New → Invalid
AceLan Kao (acelankao)
description: updated
AceLan Kao (acelankao)
description: updated
AceLan Kao (acelankao)
description: updated
Timo Aaltonen (tjaalton)
Changed in linux-oem-5.17 (Ubuntu Jammy):
status: In Progress → Fix Committed
Timo Aaltonen (tjaalton)
Changed in linux-oem-6.0 (Ubuntu Jammy):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-oem-6.0/6.0.0-1010.10 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-oem-6.0 verification-needed-jammy
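
The tag change requested by the bot can be sketched as a small helper. This is only an illustration that works on a plain list of tag strings; the function name mark_verified and its parameters are hypothetical and are not part of any Launchpad tooling.

```python
def mark_verified(tags, series="jammy", passed=True):
    """Return a new tag list with the verification tag flipped for one series.

    Hypothetical helper: it only edits a list of strings and does not
    talk to Launchpad.
    """
    needed = f"verification-needed-{series}"
    result = "done" if passed else "failed"
    replacement = f"verification-{result}-{series}"
    if needed not in tags:
        # Nothing to flip; return an unchanged copy.
        return list(tags)
    return [replacement if tag == needed else tag for tag in tags]


# Tags as added by the bot in the message above.
tags = ["kernel-spammed-jammy-linux-oem-6.0", "verification-needed-jammy"]
print(mark_verified(tags))
print(mark_verified(tags, passed=False))
```

The other tags are left untouched, which matches the bot's instruction to change only the 'verification-needed-jammy' tag.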
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-oem-5.17/5.17.0-1026.27 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-oem-5.17
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-oem-5.17 - 5.17.0-1026.27

---------------
linux-oem-5.17 (5.17.0-1026.27) jammy; urgency=medium

  * jammy/linux-oem-5.17: 5.17.0-1026.27 -proposed tracker (LP: #2001046)

  * Keeps rebooting with AMD W6400, W6600, and W6800 graphic cards
    (LP: #2000110)
    - drm/amdgpu: Remove ATC L2 access for MMHUB 2.1.x
    - drm/amd/pm: disable BACO entry/exit completely on several sienna cichlid
      cards
    - drm/amdgpu: disable BACO on special BEIGE_GOBY card
    - drm/amdgpu: disable BACO support on more cards
    - drm/amdgpu: make sure to init common IP before gmc

  * Fix SUT can't displayed after resume from WB/CB with dGFX
    installed(FR:6/10)[RX6300][RX6500] (LP: #1999836)
    - drm/amd/display: No display after resume from WB/CB

  * CVE-2022-4378
    - proc: proc_skip_spaces() shouldn't think it is working on C strings
    - proc: avoid integer type confusion in get_proc_long

 -- Timo Aaltonen <email address hidden> Wed, 04 Jan 2023 11:37:57 +0200

Changed in linux-oem-5.17 (Ubuntu Jammy):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-oem-6.0 - 6.0.0-1010.10

---------------
linux-oem-6.0 (6.0.0-1010.10) jammy; urgency=medium

  * jammy/linux-oem-6.0: 6.0.0-1010.10 -proposed tracker (LP: #2001070)

  * Microphone mute LED not working as expected after pressing the mic mute
    hotkey (LP: #2000909)
    - ALSA: hda/realtek: Apply dual codec fixup for Dell Latitude laptops

  * Soundwire support for the Intel RPL Gen 0C40/0C11 platforms (LP: #2000030)
    - SAUCE: ASoC: Intel: soc-acpi: add configuration for variant of 0C40 product
    - SAUCE: ASoC: Intel: soc-acpi: add configuration for variant of 0C11 product

  * Keeps rebooting with AMD W6400, W6600, and W6800 graphic cards
    (LP: #2000110)
    - drm/amdgpu: disable BACO support on more cards

  * Fix SUT can't displayed after resume from WB/CB with dGFX
    installed(FR:6/10)[RX6300][RX6500] (LP: #1999836)
    - drm/amd/display: No display after resume from WB/CB

  * Soundwire support for the Intel RPL Gen platforms (LP: #1997944)
    - ASoC: Intel: soc-acpi-intel-rpl-match: add rpl_sdca_3_in_1 support
    - ASoC: Intel: sof_sdw: Add support for SKU 0C10 product
    - ASoC: Intel: soc-acpi: add SKU 0C10 SoundWire configuration
    - ASoC: Intel: sof_sdw: Add support for SKU 0C40 product
    - ASoC: Intel: soc-acpi: add SKU 0C40 SoundWire configuration
    - ASoC: Intel: sof_sdw: Add support for SKU 0C4F product
    - ASoC: rt1318: Add RT1318 SDCA vendor-specific driver
    - ASoC: intel: sof_sdw: add rt1318 codec support.
    - ASoC: Intel: sof_sdw: Add support for SKU 0C11 product
    - ASoC: Intel: soc-acpi: add SKU 0C11 SoundWire configuration
    - SAUCE: ASoC: Intel: soc-acpi: update codec addr on 0C11/0C4F product
    - [Config] enable CONFIG_SND_SOC_RT1318_SDW

  * Add additional Mediatek MT7922 BT device ID (LP: #1998885)
    - Bluetooth: btusb: Add a new VID/PID 0489/e0f2 for MT7922

  * Mute/mic LEDs no function on a HP platfrom (LP: #1998882)
    - ALSA: hda/realtek: fix mute/micmute LEDs for a HP ProBook

  * Enable Intel FM350 wwan CCCI driver port logging (LP: #1997686)
    - net: wwan: t7xx: use union to group port type specific data
    - net: wwan: t7xx: Add port for modem logging

  * CVE-2022-4378
    - proc: proc_skip_spaces() shouldn't think it is working on C strings
    - proc: avoid integer type confusion in get_proc_long

  * Add cs35l41 firmware loading support (LP: #1995957)
    - ALSA: hda/realtek: More robust component matching for CS35L41

 -- Timo Aaltonen <email address hidden> Wed, 04 Jan 2023 11:54:42 +0200

Changed in linux-oem-6.0 (Ubuntu Jammy):
status: Fix Committed → Fix Released
Timo Aaltonen (tjaalton)
Changed in linux (Ubuntu Lunar):
status: In Progress → Fix Released
AceLan Kao (acelankao) wrote :

Commit a8671493d207 ("drm/amdgpu: make sure to init common IP before gmc") introduced a regression in the 5.17-oem kernel and has been reverted since 5.17.0-1027.28.

Changed in linux-oem-5.17 (Ubuntu Jammy):
status: Fix Released → In Progress
AceLan Kao (acelankao) wrote :

The fix should contain the 3 commits below. I confirmed that they do not introduce regressions on a Dell Inspiron 3585 with an AMD GPU.
   a8671493d207 drm/amdgpu: make sure to init common IP before gmc
   e3163bc8ffdf drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for vega
   dc1d85cb790f drm/amdgpu: move nbio ih_doorbell_range() into ih code for vega

Kinetic (5.19) has contained those commits since 5.19.0-27.28.

BTW, I can reproduce the regression with only commit a8671493d207 ("drm/amdgpu: make sure to init common IP before gmc") applied. With all 3 commits, the issue is gone and the display works normally.
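
Since a partial backport is what triggered the regression, a quick check that all three commits are present in a backport can be sketched as follows. This is a minimal illustration, not a tool used in this report: the names REQUIRED_COMMITS and missing_commits are hypothetical, and the list of applied IDs is assumed to come from wherever the backport is tracked. The bug description additionally lists 50b0e4d4da09 as a follow-up fix; it is left out here to mirror the three commits named in this comment.

```python
# Commit IDs and subjects as listed in this comment.
REQUIRED_COMMITS = {
    "a8671493d207": "drm/amdgpu: make sure to init common IP before gmc",
    "e3163bc8ffdf": "drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for vega",
    "dc1d85cb790f": "drm/amdgpu: move nbio ih_doorbell_range() into ih code for vega",
}


def missing_commits(applied_ids):
    """Return the required commit IDs that none of the applied IDs match.

    Two IDs match when one is a prefix of the other, so abbreviated and
    full 40-character hashes can be mixed. Empty strings are ignored.
    """
    def same(a, b):
        a, b = a.lower(), b.lower()
        return a.startswith(b) or b.startswith(a)

    applied = [commit for commit in applied_ids if commit]
    return [
        required
        for required in REQUIRED_COMMITS
        if not any(same(required, got) for got in applied)
    ]


# Only the first commit applied: the other two are reported as missing.
print(missing_commits(["a8671493d207"]))

# Full hashes from the linux-next links in the bug description.
print(missing_commits([
    "a8671493d2074950553da3cf07d1be43185ef6c6",
    "e3163bc8ffdfdb405e10530b140135b2ee487f89",
    "dc1d85cb790f2091eea074cee24a704b2d6c4a06",
]))
```

An empty result means every commit in the set is accounted for; anything else names the commits still to be cherry-picked.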

AceLan Kao (acelankao)
description: updated
AceLan Kao (acelankao)
description: updated
AceLan Kao (acelankao)
Changed in linux (Ubuntu Kinetic):
status: In Progress → Fix Released
Timo Aaltonen (tjaalton)
Changed in linux-oem-5.17 (Ubuntu Jammy):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-oem-5.17 - 5.17.0-1028.29

---------------
linux-oem-5.17 (5.17.0-1028.29) jammy; urgency=medium

  * jammy/linux-oem-5.17: 5.17.0-1028.29 -proposed tracker (LP: #2004346)

  * CVE-2023-0045
    - x86/bugs: Flush IBP in ib_prctl_set()

  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/2023.01.30)

  * Keeps rebooting with AMD W6400, W6600, and W6800 graphic cards
    (LP: #2000110)
    - drm/amdgpu: move nbio ih_doorbell_range() into ih code for vega
    - drm/amdgpu: move nbio sdma_doorbell_range() into sdma code for vega
    - drm/amdgpu: make sure to init common IP before gmc
    - drm/amdgpu: fix sdma doorbell init ordering on APUs

  * CVE-2022-47520
    - wifi: wilc1000: validate pairwise and authentication suite offsets

  * Improve arp_ndisc_evict_nocarrier.sh test result processing (LP: #2006546)
    - selftests: net: return non-zero for failures reported in
      arp_ndisc_evict_nocarrier

  * CVE-2022-43750
    - usb: mon: make mmapped memory read only

  * CVE-2023-0461
    - net/ulp: prevent ULP without clone op from entering the LISTEN status
    - net/ulp: use consistent error code when blocking ULP

  * CVE-2022-3565
    - mISDN: fix use-after-free bugs in l1oip timer handlers

  * CVE-2022-36879
    - xfrm: xfrm_policy: fix a possible double xfrm_pols_put() in
      xfrm_bundle_lookup()

  * CVE-2022-20369
    - NFSD: fix use-after-free in __nfs42_ssc_open()

  * arp_ndisc_evict_nocarrier.sh in net from ubuntu_kernel_selftests failed on
    J-oem-5.17 / K (LP: #1968310)
    - selftests: net: fix cleanup_v6() for arp_ndisc_evict_nocarrier

  * CVE-2022-20566
    - Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put

  * Expose built-in trusted and revoked certificates (LP: #1996892)
    - [Packaging] Expose built-in trusted and revoked certificates

 -- Timo Aaltonen <email address hidden> Fri, 10 Feb 2023 12:15:41 +0200

Changed in linux-oem-5.17 (Ubuntu Jammy):
status: Fix Committed → Fix Released
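
To tell whether an installed kernel package already carries the fix, its version can be compared against the versions recorded in this report. The sketch below is a simplification under stated assumptions: FIXED_IN, version_key, and has_fix are hypothetical names, the comparison only handles plain numeric versions such as 5.17.0-1028.29, and it is not a full implementation of Debian version ordering.

```python
import re

# Versions taken from this report. For linux-oem-5.17 the later upload is
# used, because 5.17.0-1027.28 reverted commit a8671493d207.
FIXED_IN = {
    "linux-oem-5.17": "5.17.0-1028.29",
    "linux-oem-6.0": "6.0.0-1010.10",
    "linux": "5.19.0-27.28",  # Kinetic (5.19)
}


def version_key(version):
    """Split a version such as '5.17.0-1028.29' into a tuple of integers."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def has_fix(package, installed_version):
    """Return True if installed_version is at or past the recorded version."""
    return version_key(installed_version) >= version_key(FIXED_IN[package])


print(has_fix("linux-oem-5.17", "5.17.0-1027.28"))
print(has_fix("linux-oem-5.17", "5.17.0-1028.29"))
```

Comparing integer tuples avoids the pitfall of string comparison, where "1010" would sort before "999".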