frequent hangs with latest hirsute kernel (nvidia graphics)

Bug #1939349 reported by Steve Langasek
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

After upgrading to the latest kernel, I am experiencing frequent kernel hangs. The system does not respond to Alt+SysRq, and I have to power it off.

In some cases the hang happens before X has even been launched at boot.

When it hangs, the fan spins up to full speed.

ProblemType: Bug
DistroRelease: Ubuntu 21.04
Package: linux-image-5.11.0-25-generic 5.11.0-25.27
ProcVersionSignature: Ubuntu 5.11.0-25.27-generic 5.11.22
Uname: Linux 5.11.0-25-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.11-0ubuntu65.1
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: vorlon 10452 F.... pulseaudio
 /dev/snd/controlC0: vorlon 10452 F.... pulseaudio
CasperMD5CheckResult: unknown
CurrentDesktop: ubuntu:GNOME
Date: Mon Aug 9 14:15:58 2021
InstallationDate: Installed on 2019-12-23 (594 days ago)
InstallationMedia: Ubuntu 19.10 "Eoan Ermine" - Release amd64 (20191017)
MachineType: LENOVO 20QVCTO1WW
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/BOOT/ubuntu_di672m@/vmlinuz-5.11.0-25-generic root=ZFS=rpool/ROOT/ubuntu_di672m ro quiet splash vt.handoff=1
RelatedPackageVersions:
 linux-restricted-modules-5.11.0-25-generic N/A
 linux-backports-modules-5.11.0-25-generic N/A
 linux-firmware 1.197.2
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 10/17/2019
dmi.bios.release: 1.27
dmi.bios.vendor: LENOVO
dmi.bios.version: N2OET40W (1.27 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20QVCTO1WW
dmi.board.vendor: LENOVO
dmi.board.version: SDK0R32862 WIN
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.ec.firmware.release: 1.19
dmi.modalias: dmi:bvnLENOVO:bvrN2OET40W(1.27):bd10/17/2019:br1.27:efr1.19:svnLENOVO:pn20QVCTO1WW:pvrThinkPadX1Extreme2nd:rvnLENOVO:rn20QVCTO1WW:rvrSDK0R32862WIN:cvnLENOVO:ct10:cvrNone:
dmi.product.family: ThinkPad X1 Extreme 2nd
dmi.product.name: 20QVCTO1WW
dmi.product.sku: LENOVO_MT_20QV_BU_Think_FM_ThinkPad X1 Extreme 2nd
dmi.product.version: ThinkPad X1 Extreme 2nd
dmi.sys.vendor: LENOVO
---
ProblemType: Bug
ApportVersion: 2.20.11-0ubuntu65.1
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: vorlon 8854 F.... pulseaudio
 /dev/snd/controlC0: vorlon 8854 F.... pulseaudio
CasperMD5CheckResult: unknown
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 21.04
InstallationDate: Installed on 2019-12-23 (600 days ago)
InstallationMedia: Ubuntu 19.10 "Eoan Ermine" - Release amd64 (20191017)
MachineType: LENOVO 20QVCTO1WW
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair nvidia_modeset nvidia
Package: linux (not installed)
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/BOOT/ubuntu_di672m@/vmlinuz-5.11.0-25-generic root=ZFS=rpool/ROOT/ubuntu_di672m ro quiet splash vt.handoff=1
ProcVersionSignature: Ubuntu 5.11.0-25.27-generic 5.11.22
RelatedPackageVersions:
 linux-restricted-modules-5.11.0-25-generic N/A
 linux-backports-modules-5.11.0-25-generic N/A
 linux-firmware 1.197.2
Tags: hirsute
Uname: Linux 5.11.0-25-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip libvirt lpadmin lxd plugdev sambashare sbuild src sudo
_MarkForUpload: True
dmi.bios.date: 04/28/2021
dmi.bios.release: 1.39
dmi.bios.vendor: LENOVO
dmi.bios.version: N2OET52W (1.39 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20QVCTO1WW
dmi.board.vendor: LENOVO
dmi.board.version: SDK0R32862 WIN
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.ec.firmware.release: 1.22
dmi.modalias: dmi:bvnLENOVO:bvrN2OET52W(1.39):bd04/28/2021:br1.39:efr1.22:svnLENOVO:pn20QVCTO1WW:pvrThinkPadX1Extreme2nd:rvnLENOVO:rn20QVCTO1WW:rvrSDK0R32862WIN:cvnLENOVO:ct10:cvrNone:
dmi.product.family: ThinkPad X1 Extreme 2nd
dmi.product.name: 20QVCTO1WW
dmi.product.sku: LENOVO_MT_20QV_BU_Think_FM_ThinkPad X1 Extreme 2nd
dmi.product.version: ThinkPad X1 Extreme 2nd
dmi.sys.vendor: LENOVO
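
The two apport dumps above were collected before and after the firmware update, so most fields are identical and only a few (for example `dmi.bios.version`) differ. A minimal sketch for spotting those differences, assuming the simple `Key: value` layout shown above with indented continuation lines; this is not apport's own parser, and the helper names `parse_report` and `diff_reports` are hypothetical:

```python
def parse_report(text):
    """Parse 'Key: value' lines; indented lines continue the previous key."""
    fields = {}
    key = None
    for line in text.splitlines():
        if line.startswith((" ", "\t")) and key is not None:
            # Continuation line, e.g. the rows under AudioDevicesInUse.
            fields[key] = (fields[key] + "\n" + line.strip()).strip()
        elif ":" in line:
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            key = key.strip()
            fields[key] = value.strip()
    return fields


def diff_reports(old, new):
    """Return {key: (old_value, new_value)} for keys whose values differ."""
    a, b = parse_report(old), parse_report(new)
    return {
        k: (a.get(k), b.get(k))
        for k in sorted(a.keys() | b.keys())
        if a.get(k) != b.get(k)
    }


if __name__ == "__main__":
    before = "dmi.bios.version: N2OET40W (1.27 )\ndmi.bios.vendor: LENOVO\n"
    after = "dmi.bios.version: N2OET52W (1.39 )\ndmi.bios.vendor: LENOVO\n"
    for field, (was, now) in diff_reports(before, after).items():
        print(f"{field}: {was!r} -> {now!r}")
```

A key that is missing from one dump shows up with `None` on that side.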

Steve Langasek (vorlon) wrote :
AaronMa (mapengyu) wrote :

Could you try a 5.13 or 5.14 kernel combined with nvidia-driver-465/470?

And please upload the log from when it hangs.

Steve Langasek (vorlon) wrote :

Hi Aaron, could you please clarify exactly which packages at which versions you want me to test?

Stefan Bader (smb) wrote :

I would probably first suggest trying the proposed 5.11.0-26 from this cycle. The other path that might be useful (though I do not know whether it is possible with this hardware) would be to experiment a bit with the nouveau driver. That might help separate whether the problem is in the Nvidia driver or in the kernel.

Generally, fast-spinning fans sound like a spinlock problem. But this could just as well be caused by a crash somewhere with a lock taken, or a hang for other reasons, again while inside a locked section. I suspect Steve has already checked the previous boot logs and did not find any useful message. That makes it hard to put a finger on anything.

AaronMa (mapengyu) wrote :

I could not reproduce this issue on another X1 Extreme 2nd.
Rebooted 10 times, all good.

Verified OS version: Focal
$ uname -a
Linux u-ThinkPad-X1-Extreme-Gen2 5.11.0-25-generic #27~20.04.1-Ubuntu SMP Tue Jul 13 17:41:23 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 460.91.03

The error I found in dmesg is:
[ 14.291240] loop0: detected capacity change from 0 to 226816
[ 14.328998] loop1: detected capacity change from 0 to 231360

It could be related to ZFS RAID. It would be best to back up everything and scan your partitions.

Alberto Milone (albertomilone) wrote :

If you want to try going back to 5.11.0-16-generic, you can install the nvidia-dkms-$flavour package for your NVIDIA driver, and, assuming you are using secure boot, let DKMS enroll its own key.

Andreas (opendreas) wrote :

I'm not sure that I have the same problem.
But on a Dell Precision 7530 with kernel 5.11 I get a black screen and nothing works.
Intel UHD P630 on Xeon E-2186M
Nvidia Quadro P3200

This happens on Ubuntu 21.04 or Ubuntu 20.04 with kernel 5.11.
But I can boot a live USB in failsafe mode with llvmpipe.

Steve Langasek (vorlon) wrote :

I managed to catch a hang while at console on shutdown of -25. Photo attached of the last kernel output before the hang.

Similar behavior was seen with -22 and the current Nvidia driver.

Now booted to -16 to see if the problem follows.

Steve Langasek (vorlon) wrote :

Andreas, please open a separate bug report for your issue.

AaronMa (mapengyu) wrote :

Hi Steve,

Comment #8 suggests that NVIDIA and ACPI are involved.
BIOS N2OET40W is a bit old; would you mind updating it:
https://pcsupport.lenovo.com/us/en/products/laptops-and-netbooks/thinkpad-x-series-laptops/thinkpad-x1-extreme-gen-2/downloads/ds540308-bios-update-utility-bootable-cd-for-windows-10-64-bit-linux-thinkpad-p1-gen-2-types-20qt-20qu-x1-extreme-gen-2-types-20qv-20qw

If there is a regression in 5.11.0-25, it would also be worth checking the mainline 5.11 build to find out whether it is an upstream regression.

Steve Langasek (vorlon) wrote :

I have downgraded to 5.11.0-16 and the problem is still reproducible. I am now going to attempt to downgrade the nvidia driver to confirm that it's not reproducible with the previous version.

Steve Langasek (vorlon) wrote :

Problem has also reproduced, with both 5.11.0-16 and 5.11.0-25, using a downgraded nvidia stack.

Will now look at upgrading the firmware as suggested to see if that helps.

Steve Langasek (vorlon) wrote :

Have upgraded the firmware, but the intermittent hang is still reproducible. I'm attaching updated apport data after the firmware update.

tags: added: apport-collected
description: updated
Steve Langasek (vorlon) wrote : apport information (20 comments, one attachment each)

AlsaInfo.txt, CRDA.txt, CurrentDmesg.txt, IwConfig.txt, Lspci.txt, Lspci-vt.txt, Lsusb.txt, Lsusb-t.txt, Lsusb-v.txt, PaInfo.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcEnviron.txt, ProcInterrupts.txt, ProcModules.txt, PulseList.txt, RfKill.txt, UdevDb.txt, WifiSyslog.txt, acpidump.txt

Kai-Heng Feng (kaihengfeng) wrote :

Is there any log in `/var/lib/systemd/pstore`?
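
A minimal sketch of one way to check that directory, assuming Python 3 is available and that the records are readable by the current user (on some systems this may need root). The helper name `list_pstore_logs` is hypothetical; only the path comes from the question above.

```python
from pathlib import Path


def list_pstore_logs(pstore_dir="/var/lib/systemd/pstore"):
    """Return (path, size_in_bytes) for every regular file under pstore_dir."""
    root = Path(pstore_dir)
    if not root.is_dir():
        # Directory missing: nothing has been archived there.
        return []
    # Walk subdirectories too, in case records are grouped in folders.
    return sorted(
        (str(p), p.stat().st_size) for p in root.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    logs = list_pstore_logs()
    if not logs:
        print("no pstore records found")
    for path, size in logs:
        print(f"{size:>8}  {path}")
```

An empty result only means nothing was saved; it does not rule out a kernel crash, since a hard hardware hang may leave no record at all.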

AaronMa (mapengyu) wrote :

Hi Steve,
Could you try disabling Secure Boot in BIOS?

Steve Langasek (vorlon) wrote : Re: [Bug 1939349] Re: frequent hangs with latest hirsute kernel (nvidia graphics)

On Mon, Aug 16, 2021 at 02:15:25AM -0000, AaronMa wrote:
> Could you try disabling Secure Boot in BIOS?

Would you expect disabling Secure Boot to have an effect here? It should be
out of the way after the kernel has booted.

I have disabled Secure Boot, and the hang is still reproducible.

Steve Langasek (vorlon) wrote :

On Mon, Aug 16, 2021 at 02:05:07AM -0000, Kai-Heng Feng wrote:
> Is there any log in `/var/lib/systemd/pstore`?

No.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Steve Langasek (vorlon) wrote :

Have upgraded to 5.11.0-33-generic from hirsute-proposed; hang still occurs.

Steve Langasek (vorlon) wrote :

I've also tried upgrading to a newer shim, in case something left around in memory was causing problems. This also didn't help.

And quoting the kernel error here for text searchability:

[ 675.724534] ACPI Error: Aborting method \_SB.PCI0.PEG0.PEGP.NVP0 due to previous error (AE_AML_LOOP_TIMEOUT) (20201113/psparse-529)
[ 675.724698] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20201113/psparse-529)
[ 675.724704] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20201113/psparse-529)
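
For anyone searching their own logs for the same pattern, here is a minimal sketch of pulling the aborted ACPI method names and error codes out of kernel log text. It is an illustration only: the regular expression and the helper name `parse_acpi_aborts` are assumptions based solely on the three lines quoted above, and other kernel versions may format these messages differently.

```python
import re

# Matches lines such as:
# [  675.724698] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (...)
ACPI_ABORT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+ACPI Error: Aborting method (?P<method>\S+) "
    r"due to previous error \((?P<code>AE_[A-Z0-9_]+)\)"
)


def parse_acpi_aborts(text):
    """Return a list of (timestamp, method, error_code) tuples found in text."""
    return [
        (float(m["ts"]), m["method"], m["code"])
        for m in ACPI_ABORT.finditer(text)
    ]


if __name__ == "__main__":
    sample = r"""
[  675.724534] ACPI Error: Aborting method \_SB.PCI0.PEG0.PEGP.NVP0 due to previous error (AE_AML_LOOP_TIMEOUT) (20201113/psparse-529)
[  675.724698] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20201113/psparse-529)
[  675.724704] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20201113/psparse-529)
"""
    for ts, method, code in parse_acpi_aborts(sample):
        print(f"{ts:12.6f}  {code}  {method}")
```

The same function can be pointed at saved `dmesg` or journal output read from a file.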

Steve Langasek (vorlon) wrote :

Here's a related bug on Arch Linux.

https://forums.lenovo.com/t5/Other-Linux-Discussions/ACPI-Errors-Hanging-the-Shutdown-Process-X1-Extreme-Gen-2/m-p/4643802

The user there says installing acpid resolved the issue for them. acpid is already installed here (it's part of the ubuntu-desktop task).

AaronMa (mapengyu) wrote :

Could you uninstall nvidia-driver to narrow down the root cause?

Steve Langasek (vorlon) wrote :

Before I had a chance to try uninstalling the nvidia driver, the laptop crashed again, and then I was unable to get it to boot across multiple (10+) attempts. After catching kernel output on one hung boot (by pressing Esc to clear plymouth), I was able to see that the hardware hang happened right after the start of gpu-manager.service. So I went into the firmware and tried disabling hybrid graphics in favor of discrete graphics.

The machine booted successfully on the very next boot and has been stable now for just over a day (at the expense of having no brightness control).

Steve Langasek (vorlon) wrote :

The system has now been running stably for 6 days, which is an order of magnitude better than what I was experiencing since filing this bug report.

I will at some point flip the firmware setting back to see how it behaves. Are there any other tests I should try?

Steve Langasek (vorlon) wrote :

The tradeoff of disabling hybrid graphics is that, with discrete nvidia graphics only running, I lose all power management, screen power off, and suspend/resume support (because suspend works, but after resume video does not come back).

Since I'm now at a sprint, these are bigger practical issues that have caused me to revisit this bug. I've confirmed that re-enabling hybrid graphics still makes the system hang early after boot.

I've now removed the nvidia drivers from my system, and so far have an uptime of 55 minutes, which is a substantial improvement.

Kleber Sacilotto de Souza (kleber-souza) wrote :

Hi Steve,

This bug was reported against the Hirsute 5.11 kernel, which is EOL, so I'm closing it as Won't Fix. Please feel free to re-open it if you are still having the issue with more recent kernels.

Thanks.

Changed in linux (Ubuntu):
status: Confirmed → Won't Fix