soft lockup / stall on CPU when shutting down with hwe 4.10 kernel

Bug #1713751 reported by Michael Pardee on 2017-08-29
This bug affects 1 person
Affects          Importance   Assigned to
linux (Ubuntu)   High         Joseph Salisbury
Zesty            High         Joseph Salisbury
Artful           High         Joseph Salisbury
Bionic           High         Joseph Salisbury

Bug Description

Instead of normal, complete shutdowns we're getting soft lockup failures. This started when the 16.04 hwe packages switched to the 4.10 kernel about a month ago. I help manage a few hundred machines spanning several different sites and several different hardware models, and they're all experiencing this intermittently: approximately 5% get stuck on shutdown each day.

Here is an example of what is on the screen after it happens; the machine is unresponsive and requires a hard reset. I can't see anything in syslog or dmesg that differs when this happens. I think all logging has stopped by this point in the shutdown.

[54566.220003]  (t=6450529 jiffies g=141935 c=141934 q=1288)
[54592.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54620.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54648.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54676.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54704.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54732.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54746.232003] INFO: rcu_sched self-detected stall on CPU
[54746.232003]  1-...: (6495431 ticks this GP) idle=5c7/140000000000001/0 softirq=218389/218389 fqs=3247712

This repeats every ~22 seconds; sometimes it is stuck for 23s instead of 22s:
... NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s!
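
The watchdog messages follow a regular layout, so the spacing between them can be checked mechanically. The following is a minimal sketch and not part of the original report: `parse_lockups`, `intervals`, and `SAMPLE` are hypothetical names, and the regular expression assumes only the message layout shown in the excerpt above.

```python
import re
from decimal import Decimal

# Assumed layout, copied from the excerpt in this report:
# [54592.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
# Both "(comm:pid)" and "[comm:pid]" are accepted around the task name.
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"[\[(](?P<comm>.+):(?P<pid>\d+)[\])]"
)


def parse_lockups(text):
    """Return one dict per soft lockup message found in text."""
    events = []
    for line in text.splitlines():
        match = LOCKUP_RE.search(line)
        if match:
            events.append({
                # Decimal keeps the microsecond timestamps exact.
                "ts": Decimal(match.group("ts")),
                "cpu": int(match.group("cpu")),
                "stuck_s": int(match.group("secs")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
            })
    return events


def intervals(events):
    """Seconds elapsed between consecutive soft lockup messages."""
    return [b["ts"] - a["ts"] for a, b in zip(events, events[1:])]


# Three lines taken from the excerpt above, plus one unrelated line.
SAMPLE = """\
[54592.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54620.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54648.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54746.232003] INFO: rcu_sched self-detected stall on CPU
"""

if __name__ == "__main__":
    found = parse_lockups(SAMPLE)
    for event, gap in zip(found[1:], intervals(found)):
        print(f"CPU#{event['cpu']} stuck {event['stuck_s']}s, "
              f"{gap}s after the previous message")
```

Run against the excerpt, this shows the timestamps of consecutive messages 28 seconds apart; the 22s (or 23s) figure is the stuck duration the watchdog reports in each message.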

Reverting to 4.8.0-58 avoids the problem. I believe the problem has been present with every hwe 4.10 kernel package through the current linux-image-4.10.0-33-generic. This bug was filed with data right after it occurred with linux-image-4.10.0-33-generic.

This only happens approximately 5% of the time, with no discernible pattern. I am able to reproduce the issue on one particular machine by scheduling shutdowns 3 times per day and waiting up to a few days for the problem to occur. Shutting down and starting up more frequently, such as every 5 minutes or even every hour, will not trigger the problem; it seems the machine needs to be running for a while. It does not seem to depend on any user actions, as it happens even if you never log in. It has happened on reboots as well as shutdowns. I found a few similar bug reports but nothing for these exact symptoms.

I have tried blacklisting mei_me with no change in behavior. I'm not certain, but the majority of the affected machines appear to use Intel video chips. Next I am going to try a mainline 4.10 kernel.

lsb_release -rd
Description: Ubuntu 16.04.3 LTS
Release: 16.04

apt-cache policy linux-image-4.10.0-33-generic
linux-image-4.10.0-33-generic:
  Installed: 4.10.0-33.37~16.04.1
  Candidate: 4.10.0-33.37~16.04.1
  Version table:
 *** 4.10.0-33.37~16.04.1 500
        500 http://us.archive.ubuntu.com/ubuntu xenial-security/main amd64 Packages
        100 /var/lib/dpkg/status

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.10.0-33-generic 4.10.0-33.37~16.04.1
ProcVersionSignature: Ubuntu 4.10.0-33.37~16.04.1-generic 4.10.17
Uname: Linux 4.10.0-33-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
CurrentDesktop: XFCE
Date: Tue Aug 29 08:57:26 2017
SourcePackage: linux-hwe
UpgradeStatus: No upgrade log present (probably fresh install)

I've been running the 4.10.0-041000-generic upstream kernel for a week with 3 reboots a day and not a single problem - so the issue is probably in the Ubuntu kernel. What is the next step I should take to solve this?

I switched back to the Ubuntu linux-image-4.10.0-33-generic kernel and about 6 reboots later it happened again, so it's definitely specific to the Ubuntu kernel. I was able to catch it right as it happened; the last thing before the NMI watchdog messages occur is:

[ OK ] Reached target Shutdown.

The only errors visible on the screen are [FAILED] unmounting /tmp and /var. I always get the failure unmounting /var but not /tmp, so maybe that is significant? I had /tmp on a separate partition; I removed its fstab entry so it will be on the / root partition. We'll see if that changes anything.

The problem still occurs without a separate /tmp partition. Still experiencing the failure on Ubuntu linux-image-4.10.0-33-generic while 4.10.0-041000-generic is problem free.

Changed in linux-hwe (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu):
importance: Undecided → High
Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a bisect to identify the Ubuntu-specific commit that introduced this. First, can you test the Zesty and Artful kernels to see if the bug has already been fixed in the newer releases? If it has, we can focus on finding the fix with a "Reverse" bisect.

The kernels can be downloaded from:
Zesty: https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/13563574
Artful: https://launchpad.net/~canonical-kernel-security-team/+archive/ubuntu/ppa2/+build/13567624

To install the kernels, just be sure to install both the linux-image and linux-image-extra .deb packages.

Thanks in advance!

Changed in linux (Ubuntu):
status: New → In Progress
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux-hwe (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
status: New → In Progress

OK, I'm trying out 4.13.0-16-generic (and extra) from artful now on an 8 hour shutdown cycle. It may take a few days to know if the problem will occur.

There were no shutdown issues with 4.13.0-16-generic in 20 shutdowns - very promising. I am switching back to 4.10.0-33 now to make sure the problem still occurs and nothing else has changed.
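
How much a run of clean shutdowns says about a fix depends on the underlying failure rate. The sketch below is a back-of-the-envelope check and not part of the original investigation. It assumes each shutdown fails independently at the roughly 5% rate quoted earlier in this report, which is a simplification, and `p_all_clean` and `shutdowns_needed` are hypothetical helper names.

```python
import math


def p_all_clean(n_shutdowns, p_fail):
    """Chance that n_shutdowns in a row all succeed even though the bug
    is still present, assuming independent failures at rate p_fail."""
    return (1.0 - p_fail) ** n_shutdowns


def shutdowns_needed(p_fail, confidence):
    """Smallest number of consecutive clean shutdowns for which a
    still-present bug would have shown up with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    # Assumption: 5% per-shutdown failure rate, taken from the report's
    # fleet-wide estimate; a single machine may behave differently.
    rate = 0.05
    print(f"P(20 clean shutdowns by chance) = {p_all_clean(20, rate):.3f}")
    print(f"Clean shutdowns for 95% confidence = {shutdowns_needed(rate, 0.95)}")
```

Under those assumptions, 20 clean shutdowns would still happen by chance a little over a third of the time, and about 59 consecutive clean shutdowns would be needed before a still-present bug becomes unlikely at the 95% level.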

Changed in linux (Ubuntu Artful):
status: New → In Progress
Changed in linux (Ubuntu Zesty):
status: New → Incomplete
Changed in linux (Ubuntu Artful):
importance: Undecided → High
Changed in linux (Ubuntu Zesty):
importance: Undecided → High
Changed in linux (Ubuntu Artful):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Zesty):
assignee: nobody → Joseph Salisbury (jsalisbury)
no longer affects: linux-hwe (Ubuntu)
no longer affects: linux-hwe (Ubuntu Zesty)
no longer affects: linux-hwe (Ubuntu Artful)
no longer affects: linux-hwe (Ubuntu Bionic)

The problem no longer seems to occur, even with the 4.10.0-33-generic kernel. Since 2017-09-11 all of the affected machines have been getting automatic package updates, and I'm guessing some non-kernel package has changed something that prevents the problem.

I would like to know what fixed it, and I could start reverting those package updates or start over with an old install, but for now, as long as it stays fixed, it might not be worth investigating further.

Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Released
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Released
Changed in linux (Ubuntu Zesty):
status: Incomplete → Fix Released