soft lockup / stall on CPU when shutting down with hwe 4.10 kernel
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | High | Unassigned |
Zesty | Fix Released | High | Unassigned |
Artful | Fix Released | High | Unassigned |
Bionic | Fix Released | High | Unassigned |
Bug Description
Instead of shutting down cleanly, machines are hanging with soft lockup failures. This started when the 16.04 hwe packages switched to the 4.10 kernel about a month ago. I help manage a few hundred machines spanning several sites and several hardware models, and all of them experience this intermittently: approximately 5% get stuck on shutdown each day.
Here is an example of what is on the screen after it happens. The machine is unresponsive and requires a hard reset. I can't see anything in syslog or dmesg that differs when this happens; I think all logging has stopped by this point in the shutdown.
[54566.220003] ? (t=6450529 jiffies g=141935 c=141934 q=1288)
[54592.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54620.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54648.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54676.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54704.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54732.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54746.232003] INFO: rcu_sched self-detected stall on CPU
[54746.232003] ?1-...: (6495431 ticks this GP) idle=5c7/
This message keeps repeating, usually reporting the CPU stuck for 22s, sometimes for 23s instead:
... NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s!
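If someone manages to capture this console output (for example over a serial console or netconsole), a small script can pull out the timestamps and reported stall durations so different hangs can be compared. The following is only a rough sketch in Python: the function names and the `SAMPLE` constant are made up for illustration, and the pattern assumes the message layout shown in the excerpt above (it accepts either parentheses or square brackets around the task name).

```python
import re

# Matches lines such as:
# [54592.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
# The task/PID part may be wrapped in () or [] depending on how it was captured.
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!\s+"
    r"[\[(](?P<task>.+):(?P<pid>\d+)[\])]"
)

# Three lines copied from the console excerpt in this report.
SAMPLE = """\
[54592.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54620.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
[54648.092003] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! (systemd:1)
"""


def parse_lockups(text):
    """Return one dict per soft lockup line found in the captured text."""
    events = []
    for line in text.splitlines():
        match = LOCKUP_RE.search(line)
        if match:
            events.append({
                "ts": float(match.group("ts")),       # seconds since boot
                "cpu": int(match.group("cpu")),
                "stuck_s": int(match.group("secs")),  # duration the kernel reports
                "task": match.group("task"),
                "pid": int(match.group("pid")),
            })
    return events


def gaps_between(events):
    """Seconds between consecutive soft lockup messages."""
    return [round(b["ts"] - a["ts"], 3) for a, b in zip(events, events[1:])]
```

Calling `parse_lockups(SAMPLE)` and then `gaps_between(...)` on the result gives the spacing between messages, which can be compared with the `stuck_s` values to see whether the same CPU and task are stuck for the whole hang.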
Reverting to 4.8.0-58 avoids the problem. I believe the problem has been present with every hwe 4.10 kernel package through the current linux-image-
This only happens approximately 5% of the time, with no discernible pattern. I am able to reproduce the issue on one particular machine by scheduling shutdowns 3 times per day and waiting up to a few days for the problem to occur. Shutting down and starting up more frequently, such as every 5 minutes or even every hour, will not trigger the problem; it seems the machine needs to be running for a while. It does not seem to depend on any user actions, as it happens even if you never log in. It has happened on reboots as well as shutdowns. I found a few similar bug reports but nothing for these exact symptoms.
I have tried blacklisting mei_me with no change in behavior. I'm not sure, but the majority of the affected machines use Intel video chips. Next I am going to try a mainline 4.10 kernel.
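To give a feel for how long a reproduction attempt may need to run, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that each shutdown hangs independently with a fixed probability of about 5%. The real behavior may well differ, since the hang seems to depend on uptime, and both function names are made up for this example.

```python
import math


def p_at_least_one_hang(p_hang, shutdowns):
    """Chance of seeing one or more hangs in `shutdowns` independent attempts."""
    return 1.0 - (1.0 - p_hang) ** shutdowns


def shutdowns_needed(p_hang, confidence):
    """Smallest number of attempts that yields a hang with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_hang))


# Assumed inputs: 5% hang rate per shutdown, 3 scheduled shutdowns per day.
P_HANG = 0.05
PER_DAY = 3

for days in (1, 3, 5, 7):
    chance = p_at_least_one_hang(P_HANG, days * PER_DAY)
    print(f"{days} day(s): {chance:.0%} chance of at least one hang")

print("attempts for 95% confidence:", shutdowns_needed(P_HANG, 0.95))
```

Under these assumptions a clean run of a few days does not prove much either way, so a test kernel is worth leaving in place for a couple of weeks before drawing conclusions.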
lsb_release -rd
Description: Ubuntu 16.04.3 LTS
Release: 16.04
apt-cache policy linux-image-
linux-image-
Installed: 4.10.0-
Candidate: 4.10.0-
Version table:
*** 4.10.0-
500 http://
100 /var/lib/
ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-
ProcVersionSign
Uname: Linux 4.10.0-33-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
CurrentDesktop: XFCE
Date: Tue Aug 29 08:57:26 2017
SourcePackage: linux-hwe
UpgradeStatus: No upgrade log present (probably fresh install)
Changed in linux-hwe (Ubuntu): | |
importance: | Undecided → High |
Changed in linux (Ubuntu): | |
importance: | Undecided → High |
Changed in linux (Ubuntu Artful): | |
status: | New → In Progress |
Changed in linux (Ubuntu Zesty): | |
status: | New → Incomplete |
Changed in linux (Ubuntu Artful): | |
importance: | Undecided → High |
Changed in linux (Ubuntu Zesty): | |
importance: | Undecided → High |
Changed in linux (Ubuntu Artful): | |
assignee: | nobody → Joseph Salisbury (jsalisbury) |
Changed in linux (Ubuntu Zesty): | |
assignee: | nobody → Joseph Salisbury (jsalisbury) |
no longer affects: | linux-hwe (Ubuntu) |
no longer affects: | linux-hwe (Ubuntu Zesty) |
no longer affects: | linux-hwe (Ubuntu Artful) |
no longer affects: | linux-hwe (Ubuntu Bionic) |
Changed in linux (Ubuntu Bionic): | |
status: | In Progress → Fix Released |
Changed in linux (Ubuntu Artful): | |
status: | In Progress → Fix Released |
Changed in linux (Ubuntu Zesty): | |
status: | Incomplete → Fix Released |
I've been running the 4.10.0-041000-generic upstream kernel for a week with 3 reboots a day and not a single problem, so the issue is probably in the Ubuntu kernel. What is the next step I should take to solve this?