ThinkPad L14 Gen 3 blank laptop screen on bootup

Bug #2002467 reported by Paul Gear
Affects                    Status        Importance  Assigned to  Milestone
linux (Ubuntu)             Invalid       Undecided   Unassigned
  Jammy                    Invalid       Undecided   Unassigned
  Kinetic                  Invalid       Undecided   Unassigned
  Lunar                    Invalid       Undecided   Unassigned
linux-oem-5.17 (Ubuntu)    Invalid       Undecided   Unassigned
  Jammy                    Fix Released  Undecided   AceLan Kao
  Kinetic                  Invalid       Undecided   Unassigned
  Lunar                    Invalid       Undecided   Unassigned

Bug Description

[Impact]
The following commit in 5.17.0.1026.27 introduces a regression that causes the panel of the ThinkPad L14 Gen 3 to go blank after bootup:
224dffc89a957 drm/amdgpu: make sure to init common IP before gmc

[Fix]
Revert this commit. It appears to depend on other commits that are not present in this kernel, and according to the upstream discussion it has also been reverted from the 5.10 stable kernel.
https://gitlab.freedesktop.org/drm/amd/-/issues/2216

Aaron Ma confirmed that the 6.0 and 6.1 OEM kernels work well with this commit, so there is no need to revert it there.

[Test]
Verified by Aaron Ma and the bug reporter.

[Where problems could occur]
Reverting this commit will re-open the issue where the system keeps rebooting (bug 2000110). I'll check that issue again and submit another solution for it.

================

When booting linux-image-5.17.0-1026-oem version 5.17.0.1026.27 on a Lenovo ThinkPad L14 Gen 3 AMD (model 21C5CTO1WW), the screen goes blank shortly after bootup.

As far as I can tell (the screen blanks pretty quickly, so I'm not 100% sure), the last message before the screen disappears is this one:

Jan 9 07:58:30 work kernel: [ 1.321487] fb0: switching to amdgpu from EFI VGA

Booting linux-image-5.17.0-1025-oem version 5.17.0.1025.26 works fine, so this appears to be a regression in whatever changes there were in the framebuffer/amdgpu drivers between 5.17.0.1025.26 and 5.17.0.1026.27.

I think this should be high priority, because it gives the impression that the system is completely borked, when in actuality it's probably just waiting for me to enter my drive encryption password.

If I type the encryption password, it seems to progress to starting the system (the microphone mute LED comes on as expected), but even X starting does not turn the screen back on. Switching to the frame buffer console with Ctrl-Alt-Fx does nothing.

Paul Gear (paulgear)
description: updated
summary: - Blank screen on bootup
+ ThinkPad L14 Gen 3 blank screen on bootup
Paul Gear (paulgear)
description: updated
summary: - ThinkPad L14 Gen 3 blank screen on bootup
+ ThinkPad L14 Gen 3 blank laptop screen on bootup
description: updated
Timo Aaltonen (tjaalton) wrote :

How did you end up with oem-5.17 kernel?

affects: linux-signed-oem-5.10 (Ubuntu) → linux-oem-5.17 (Ubuntu)
Changed in linux-oem-5.17 (Ubuntu):
status: New → Invalid
Changed in linux-oem-5.17 (Ubuntu Jammy):
status: New → Incomplete
Timo Aaltonen (tjaalton) wrote :

You could try installing linux-oem-22.04b, which installs oem-6.0. That one also has the fix for bug 2000110, but it didn't need other backports. I suspect those backports broke only oem-5.17.

Paul Gear (paulgear) wrote :

I ended up on 5.17 because I installed linux-oem-22.04 myself. (Ubuntu isn't available on this hardware model in my region; it's a self-install.) There was a bug introduced in 5.19 (I'll find the reference when I get a moment) which meant that it wasn't stable on my system, but I can give the latest 6.0 a try to see if it's fixed now.

AceLan Kao (acelankao) wrote :

Hi Paul,

Could you upload the result of "lspci -vvnn" command here?
Thanks.

Paul Gear (paulgear) wrote :

lspci -vvnn output

Paul Gear (paulgear) wrote :

Log from booting 5.17.0-1025-oem - this is the kernel that works correctly.

Paul Gear (paulgear) wrote :

Log from booting 5.17.0-1026-oem - this is the kernel that blanks the screen and is thus unusable. Note the amdgpu errors beginning at line 961 which do not appear in the 5.17.0-1025-oem log.
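
A comparison like the one described above can be scripted. The following is a minimal Python sketch, not the method actually used in this report. It assumes both boots were saved as text logs whose lines look like the syslog line quoted in the bug description ("... kernel: [ 1.321487] ..."); the function names are made up for illustration.

```python
import re

# Assumed line format: "Jan 9 07:58:30 work kernel: [ 1.321487] message".
# The prefix up to and including the "[ seconds ]" timestamp is stripped so
# that identical messages from two different boots compare as equal.
_PREFIX = re.compile(r"^.*?kernel: \[\s*\d+\.\d+\]\s*")


def amdgpu_messages(log_text):
    """Return the amdgpu-related messages of a log, in order, without timestamps."""
    messages = []
    for line in log_text.splitlines():
        body = _PREFIX.sub("", line)
        if "amdgpu" in body:
            messages.append(body)
    return messages


def new_messages(good_log, bad_log):
    """Return amdgpu messages that occur in the bad boot but not in the good one."""
    seen = set(amdgpu_messages(good_log))
    return [m for m in amdgpu_messages(bad_log) if m not in seen]
```

Calling new_messages() with the contents of the 5.17.0-1025-oem log and the 5.17.0-1026-oem log would list only the amdgpu lines unique to the failing boot. Messages that embed changing values such as addresses would still show up as differences, so the output is a starting point for reading the logs rather than a diagnosis.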

Paul Gear (paulgear) wrote :

Log from booting 6.0.0-1010-oem - non-viable due to what appears to be an instance of https://bugs.launchpad.net/ubuntu/kinetic/+source/linux/+bug/1995956 or a related bug. Errors start at line 1202. The 80 minutes which it took for this bug to surface while using Slack is the longest this machine has lasted on any 5.18+ kernel version.

Paul Gear (paulgear) wrote :

@tjaalton: Added logs for 1 working and 2 non-working versions. It seems like there has been quite a lot of churn in the amdgpu driver between 5.17 and 6.0. Hope this helps narrow down the problem commits.

Timo Aaltonen (tjaalton) wrote :

I'm not sure this will get fixed for 5.17 at this point; we're already planning to move to oem-6.1 and EOL 5.17 and 6.0.

Timo Aaltonen (tjaalton) wrote :

note that 6.0.0-1010 has the fix for 1995956

712c02e5d2d3680 drm/amdgpu: workaround for TLB seq race

so if it's buggy, test oem-6.1 and hope it has more goodies

AceLan Kao (acelankao) wrote :

This looks similar to an upstream issue:
https://gitlab.freedesktop.org/drm/amd/-/issues/2220

Comparing what we backported with the kernel versions reported on the upstream bug, I suspect the commit below introduced the regression.

Could you try this test kernel to see if it helps? It only reverts the commit below.
https://people.canonical.com/~acelan/bugs/lp2002467/

commit d2c4c1569a7d7d5c8f75963bf2d62d7aeac30e2a
Author: Lijo Lazar <email address hidden>
Date: Fri Sep 30 10:43:08 2022 +0530

    drm/amdgpu: Remove ATC L2 access for MMHUB 2.1.x

    MMHUB 2.1.x versions don't have ATCL2. Remove accesses to ATCL2 registers.

    Since they are non-existing registers, read access will cause a
    'Completer Abort' and gets reported when AER is enabled with the below patch.
    Tagging with the patch so that this is backported along with it.

    v2: squash in uninitialized warning fix (Nathan Chancellor)

    Fixes: 8795e182b02d ("PCI/portdrv: Don't disable AER reporting in get_port_device_capability()")

    Signed-off-by: Lijo Lazar <email address hidden>
    Reviewed-by: Guchun Chen <email address hidden>
    Signed-off-by: Alex Deucher <email address hidden>
    Cc: <email address hidden>

Paul Gear (paulgear) wrote :

I've confirmed that 6.1.0-1004-oem (the current version of linux-image-oem-22.04c) does not have the screen disabling bug. I'll test for stability in the face of amdgpu protection faults today and see what happens.

AaronMa (mapengyu) wrote :

Hi Acelan,

The first bad commit:

commit 224dffc89a957e2f006de3202d12c09bf348f992 (refs/bisect/bad)
Author: Alex Deucher <email address hidden>
Date: Wed Dec 21 15:26:47 2022 +0800

    drm/amdgpu: make sure to init common IP before gmc

    BugLink: https://launchpad.net/bugs/2000110

    Move common IP init before GMC init so that HDP gets
    remapped before GMC init which uses it.

    This fixes the Unsupported Request error reported through
    AER during driver load. The error happens as a write happens
    to the remap offset before real remapping is done.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=216373

    The error was unnoticed before and got visible because of the commit
    referenced below. This doesn't fix anything in the commit below, rather
    fixes the issue in amdgpu exposed by the commit. The reference is only
    to associate this commit with below one so that both go together.

    Fixes: 8795e182b02d ("PCI/portdrv: Don't disable AER reporting in get_port_device_capability()")

    Acked-by: Christian König <email address hidden>
    Reviewed-by: Lijo Lazar <email address hidden>
    Signed-off-by: Alex Deucher <email address hidden>
    Cc: <email address hidden>
    (cherry picked from commit a8671493d2074950553da3cf07d1be43185ef6c6)
    Signed-off-by: Chia-Lin Kao (AceLan) <email address hidden>

Is this commit necessary for LP: #2000110? If not, it should be reverted.
Otherwise, the following series is needed:
https://lore<email address hidden>/
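
The "refs/bisect/bad" marker in the commit header above indicates that the commit was located with git bisect. The thread does not record the exact procedure, so the following Python sketch only illustrates the underlying binary search. It assumes a list of commits ordered oldest to newest, a last commit that is known to be bad, and a caller-supplied is_bad() check (for example, "boot the build and see whether the panel goes blank"); all names are hypothetical.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumptions: commits are ordered oldest to newest, the last commit is
    bad, and every commit after the first bad one is bad as well.
    """
    if not commits:
        raise ValueError("need at least one commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # the first bad commit is mid or earlier
        else:
            lo = mid + 1    # the first bad commit is after mid
    return commits[lo]
```

Each round halves the remaining range, so the number of test boots grows only with the logarithm of the number of commits between the working and the broken kernel.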

Changed in linux-oem-5.17 (Ubuntu Jammy):
status: Incomplete → Triaged
AceLan Kao (acelankao) wrote :

This commit is required for bug 2000110, and I can't figure out a good solution for this.
"drm/amdgpu: make sure to init common IP before gmc" introduces a regression, described at https://gitlab.freedesktop.org/drm/amd/-/issues/2216, and has been reverted from the 5.10 stable kernel.

Anyway, I'll submit an SRU to revert this commit and try to provide a better solution for bug 2000110.

Paul Gear (paulgear) wrote :

6.1.0-1004-oem worked fine in terms of stability over yesterday's work day, with no amdgpu protection faults, but when I tried to suspend by closing the lid at the end of the work day, I got some very strange behaviour that I've never seen before - see video here: https://photos.app.goo.gl/hkyy9a9U2jJ7GPL59

I'll test your custom 5.17 kernel next, because it seems that the 5.17 train is the only viable one at the moment on this hardware. I reverted to 5.17.0-1025-oem today and everything went fine.

As a point of comparison, I'm updating this bug on my personal laptop, which is a ThinkPad E14 AMD Gen3 - very similar specs to my work laptop, yet this one suspends fine on 6.1.0-1004-oem and has had no amdgpu problems that I know of. I'll add the lspci -nnvv for comparison shortly.

Paul Gear (paulgear) wrote :

lspci -nnvv output on ThinkPad E14 AMD Gen3.

Paul Gear (paulgear) wrote :

Unfortunately, the 5.17.0-2026.27 kernel removing commit d2c4c1569a7d7d5c8f75963bf2d62d7aeac30e2a did not resolve the problem - it exhibits the same display blanking characteristics on the ThinkPad L14 Gen3 AMD as the signed release 5.17.0-1026-oem does. Boot log for this kernel attached.

AceLan Kao (acelankao)
Changed in linux-oem-5.17 (Ubuntu Jammy):
assignee: nobody → AceLan Kao (acelankao)
AceLan Kao (acelankao) wrote :

Hi Paul,

Aaron has found the bad commit which is
224dffc89a957 drm/amdgpu: make sure to init common IP before gmc

I've updated the test kernel at the same location; it now reverts that commit.
https://people.canonical.com/~acelan/bugs/lp2002467/

I'm going to submit an SRU to revert this one. Please help verify it, thanks.

AceLan Kao (acelankao)
description: updated
AceLan Kao (acelankao)
description: updated
Paul Gear (paulgear) wrote :

@acelankao: Thanks very much! Confirmed that your test kernel fixes the screen blanking problem.

AceLan Kao (acelankao) wrote :
description: updated
Changed in linux-oem-5.17 (Ubuntu Jammy):
status: Triaged → In Progress
Timo Aaltonen (tjaalton)
Changed in linux-oem-5.17 (Ubuntu Jammy):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-oem-5.17 - 5.17.0-1027.28

---------------
linux-oem-5.17 (5.17.0-1027.28) jammy; urgency=medium

  * jammy/linux-oem-5.17: 5.17.0-1027.28 -proposed tracker (LP: #2003451)

  * CVE-2022-3545
    - nfp: fix use-after-free in area_cache_get()

  * CVE-2022-42895
    - Bluetooth: L2CAP: Fix attempting to access uninitialized memory

  * ThinkPad L14 Gen 3 blank laptop screen on bootup (LP: #2002467)
    - Revert "drm/amdgpu: make sure to init common IP before gmc"

  * CVE-2023-0179
    - netfilter: nft_payload: incorrect arithmetics when fetching VLAN header bits

  [ Ubuntu: 5.17.0-15.16~22.04.8 ]

  * jammy/linux-hwe-5.17: 5.17.0-15.16~22.04.8 -proposed tracker (LP: #2003452)
  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts
    - debian/dkms-versions -- update from kernel-versions (main/2023.01.02)
  * Revoke & rotate to new signing key (LP: #2002812)
    - [Packaging] Revoke and rotate to new signing key
  * CVE-2022-45934
    - Bluetooth: L2CAP: Fix u8 overflow
  * CVE-2022-42896
    - Bluetooth: L2CAP: Fix accepting connection request for invalid SPSM
    - Bluetooth: L2CAP: Fix l2cap_global_chan_by_psm
  * CVE-2022-4378
    - proc: proc_skip_spaces() shouldn't think it is working on C strings
    - proc: avoid integer type confusion in get_proc_long

 -- Timo Aaltonen <email address hidden> Mon, 23 Jan 2023 12:37:39 +0200

Changed in linux-oem-5.17 (Ubuntu Jammy):
status: Fix Committed → Fix Released
Timo Aaltonen (tjaalton)
Changed in linux (Ubuntu Lunar):
status: New → Invalid
Changed in linux (Ubuntu Kinetic):
status: New → Invalid
Changed in linux (Ubuntu Jammy):
status: New → Invalid
Changed in linux-oem-5.17 (Ubuntu Kinetic):
status: New → Invalid