Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

Bug #1730717 reported by Iain Lane on 2017-11-07
This bug affects 2 people
Affects            Importance  Assigned to
linux (Ubuntu)     High        Seth Forshee
  Zesty            High        Joseph Salisbury
  Artful           High        Joseph Salisbury
  Bionic           High        Seth Forshee
qemu-kvm (Ubuntu)  Undecided   Unassigned
  Zesty            Undecided   Unassigned
  Artful           Undecided   Unassigned
  Bionic           Undecided   Unassigned

Bug Description

== SRU Justification ==

The fix for bug 1672819 can cause a lockup because it can spin indefinitely waiting for a child to exit.

[FIX]
Add a sauce patch on top of the original fix to insert a reasonably small delay and an upper bound on the number of retries made before bailing out with an error. This avoids the lockup and is also less aggressive in the retry loop.

[TEST]
Without the fix the machine hangs. With the fix, the lockup no longer occurs.

[REGRESSION POTENTIAL]
There may be an issue with the interruptible sleep having some unforeseen impact on racy userspace code that expects the system call to return quickly when the race condition occurs, but instead gets delayed by a few milliseconds while the retry loop spins. However, code that relies on fork/exec timing inside pthreads, where this particular code path could bite, is generally non-POSIX-conforming racy code anyhow.

-----------------------------------

This is impacting us for ubuntu autopkgtests. Eventually the whole region ends up dying because each worker is hit by this bug in turn and backs off until the next reset (6 hourly).

17.10 (and bionic) guests are sometimes failing to reboot. When this happens, you see the following in the console

  [  OK  ] Reached target Shutdown.
  [ 191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
  [ 219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
  [ 226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
  [ 226.704958] (detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
  [ 226.706093] All QSes seen, last rcu_sched kthread activity 15002 (4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
  [ 226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

One host that exhibits this behaviour was:

  Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

guest running:

  Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0 (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017 (Ubuntu 4.13.0-16.19-generic 4.13.4)

The affected cloud region is running the xenial/Ocata cloud archive, so the version of qemu-kvm in there may also be relevant.

Here's how I reproduced it in lcy01:

  $ for n in {1..30}; do nova boot --flavor m1.small --image ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; done
  $ <ssh to each instance> sudo reboot
  # wait a minute or so for the instances to all reboot
  $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} | tail; done

On bad instances you'll see the "soft lockup" message - on good it'll reboot as normal.

We've seen good and bad instances on multiple compute hosts - it doesn't feel to me like a host problem but rather a race condition somewhere that's somehow either triggered or triggered much more often by what lcy01 is running. I always saw this on the first reboot - never on first boot, and never on n>1th boot. (But if it's a race then that might not mean much.)

I'll attach a bad and a good console-log for reference.

If you're at Canonical then see internal rt #107135 for some other details.

Iain Lane (laney) wrote :

Oh also see https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1713751 which has some superficially similar symptoms (cpu stuck on shutdown).

description: updated
Iain Lane (laney) wrote :

I tried 28 (then my quota ran out) xenial guests BTW and none of those failed.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1730717

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: xenial
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-key
Joseph Salisbury (jsalisbury) wrote :

Do you know if this bug is also happening with Zesty, or just Artful and Bionic (>4.13)?

I'm going to work on bisecting bug 1713751 in case it's related.

Also, it would be good to know if this bug is already fixed in the latest mainline kernel. Do you have a way to reproduce this bug? If so, could you give 4.14-rc8 a try? It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc8

Changed in linux (Ubuntu):
status: Incomplete → Triaged
Changed in linux (Ubuntu Artful):
status: New → Triaged
importance: Undecided → High

Out of the IRC discussions documenting potentially related issues:
- this bug: KVM: Host-Kernel: Xenial-GA, Qemu: Xenial-Ocata, Guest: Bionic
- bug 1722311 KVM: Host-Kernel: Xenial-GA, Qemu: Xenial, Guest: Artful - some relation to cache pressure
- bug 1713751 AWS: triggered by Xenial kernel update, supposed fixed but shown up again and again
- bug 1655842 Host-Kernel: Xenial-GA, Qemu: Xenial, Guest: Artful - some relation to cache pressure
These might after all just run into the same soft lockup symptom, but I thought it was worth mentioning for those not reading the IRC log.

These cases seem to somewhat agree on:
- Recent guest kernel
- Xenial Host kernel
- some memory pressure

To get further, I thought some sort of local reproducer would be easier for the kernel team to work on than needing a full cloud.
But so far I failed at setting such a local case up (http://paste.ubuntu.com/25916781/).

Thanks Laney for the openstack based repro description.
@Laney I found it interesting that you essentially only needed to start+reboot.
I assume on the host you had other workloads going on in the background (since it is lcy01)?
If you have any sort of non-busy but otherwise comparable system - could you check to confirm the assumption we have so far that everything is fine there?
If yes - then the memory pressure theory gets more likely; if not, we can focus on simpler reproducers - so we can only win by that check.

Crossing fingers for jsalisbury's hope that 4.14 might already have a fix.

On Wed, Nov 08, 2017 at 11:08:05AM -0000, ChristianEhrhardt wrote:
> @Laney I found it interesting that you essentially only needed to start+reboot.
> I assume on the host you had other workloads going on in the background (since it is lcy01)?

I don't have visibility into what else the hosts are doing - I'm just a
client here. But I do know that *I* had other workloads running at the
same time. I assume that there were a lot of buildd jobs too. These
compute nodes are probably loaded most of the time.

> If you have any sort of non-busy but otherwise comparable system - could you check to confirm the assumption we have so far that everything is fine there?
> If yes - then the memory pressure theory gets more likely; if not, we can focus on simpler reproducers - so we can only win by that check.

Afraid not, sorry. I did try on my artful+artful host but didn't
reproduce the problem - it probably wasn't under enough stress anyway.

> Crossing fingers for jsalisbury's hope that 4.14 might already have a
> fix.

Right now I'm trying to build a new cloud image with this kernel that I
can try.

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Torkoal (our Jenkins node) was idle atm and Ryan reported he had seen the issues there before, so trying there as well.
This is LTS + HWE - Kernel 4.10.0-38-generic, qemu: 1:2.5+dfsg-5ubuntu10

I thought about your case since you seem just to start a lot of them and reboot,
this shouldn't be so much different to:
$ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=amd64 label=daily release=artful
$ for i in {1..30}; do uvt-kvm create --log-console-output --password=ubuntu artful-${i}-bug1730717 release=artful arch=amd64 label=daily; done
$ for i in {1..30}; do uvt-kvm wait --insecure artful-${i}-bug1730717; done
$ for i in {1..30}; do uvt-kvm ssh --insecure artful-${i}-bug1730717 "sudo reboot"; done
$ sudo grep "soft lockup" /var/log/libvirt/qemu/artful-*-bug1730717.log

But this works for me :-/

Waiting for your feedback if you can trigger the same issue on a non-busy openstack system (could after all be some openstack magic at work that makes it behave differently).

Iain Lane (laney) wrote :

On Wed, Nov 08, 2017 at 11:56:02AM -0000, ChristianEhrhardt wrote:
> Torkoal (our Jenkins node) was idle atm and Ryan reported he had seen the issues there before, so trying there as well.
> This is LTS + HWE - Kernel 4.10.0-38-generic, qemu: 1:2.5+dfsg-5ubuntu10
>
> I thought about your case since you seem just to start a lot of them and reboot,
> this shouldn't be so much different to:
> $ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=amd64 label=daily release=artful
> $ for i in {1..30}; do uvt-kvm create --log-console-output --password=ubuntu artful-${i}-bug1730717 release=artful arch=amd64 label=daily; done
> $ for i in {1..30}; do uvt-kvm wait --insecure artful-${i}-bug1730717; done
> $ for i in {1..30}; do uvt-kvm ssh --insecure artful-${i}-bug1730717 "sudo reboot"; done
> $ sudo grep "soft lockup" /var/log/libvirt/qemu/artful-*-bug1730717.log

Sounds like it's similar, but maybe you have to put the system under
load - you might need more instances, or maybe start a whole bunch first
and get them to run something memory intensive before running that same
test again. In the cloud there will be buildds and tests running on the
compute nodes too, as well as these 'empty' instances that I use to
reproduce the problem.

> Waiting for your feedback if you can trigger the same issue on a non-
> busy openstack system (could after all be some openstack magic at work
> that makes it behave differently).

I don't have access to a non busy cloud I'm afraid.

ANYWAY! My results are in. I created an image by booting the stock
artful cloud image and installing the mainline kernel v4.14-rc8
(39dae59d66acd86d1de24294bd2f343fd5e7a625) packages, on lcy01 (the busy
cloud that exhibits this problem).

I started 34 (17 × 2 in two runs - that's all I could squeeze in before
I hit my quota) instances, and they were all good. This isn't definitive
proof, but it looks like that kernel might be good.

Cheers,

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Joseph Salisbury (jsalisbury) wrote :

@Laney, thanks for testing the mainline kernel. It's promising that a fix might be in that kernel. The time-consuming part will be identifying which commit in that kernel is the actual fix. We could perform a "reverse" kernel bisect, which would require testing 12 or so test kernels. However, it sounds like setting up a reproducer is time consuming as well.

The easiest thing to try next would be to test the latest upstream 4.13 stable kernels. It's possible the fix that is in 4.14-rc8 was also cc'd to upstream stable and made its way into 4.13 through the normal stable update process.

If possible to test, the latest 4.13 upstream kernel is 4.13.12 and is available here:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13.12/

Iain Lane (laney) wrote :

On Wed, Nov 08, 2017 at 03:16:29PM -0000, Joseph Salisbury wrote:
> @Laney, thanks for testing the mainline kernel. It's promising that a
> fix might be in that kernel. The time consuming part will be
> identifying what commit in that kernel is the actual fix. We could
> perform a "reverse" kernel bisect, which would require testing 12 or so
> test kernels. However, it sounds like to set up a reproducer is time
> consuming as well.
>
> The easiest thing to try next would be to test the latest upstream 4.13
> stable kernels. It's possible the fix that is in 4.14-rc8 was also cc'd
> to upstream stable and made its way into 4.13 through the normal
> stable update process.
>
> If possible to test, the latest 4.13 upstream kernel is 4.13.12 and is available here:
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13.12/

20 instances, all good.

Maybe I should try one that we think is 'bad' to confirm that the way
I'm doing this is actually capable of reproducing the issue? If you
think that's sensible, could you recommend me a kernel to try?

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Joseph Salisbury (jsalisbury) wrote :

Maybe give 4.13.0-16 a try:

https://launchpad.net/~canonical-kernel-security-team/+archive/ubuntu/ppa2/+build/13567624

It could also be that the bug is being triggered by an Ubuntu-specific SAUCE patch, so it won't happen with upstream kernels.

Iain Lane (laney) wrote :

On Wed, Nov 08, 2017 at 06:57:46PM -0000, Joseph Salisbury wrote:
> Maybe give 4.13.0-16 a try:
>
> https://launchpad.net/~canonical-kernel-security-
> team/+archive/ubuntu/ppa2/+build/13567624
>
> It could also be that the bug is being triggered by an Ubuntu-specific SAUCE
> patch, so it won't happen with upstream kernels.

It looked to me like that's the same kernel that's in artful release, so
instead I tried with the kernel in artful-proposed (4.13.0-17.20) and
managed to reproduce the bug on 1/6 instances after a few reboot cycles.
So I think my method is okay to check candidate kernels. Feel free to
throw some more at me if you want to bisect.

(I think IS took some of the slower machines out of rotation so the
problem might become slightly harder to reproduce - definitely is still
happening though.)

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Adam Conrad (adconrad) wrote :

Note that if it *is* the same bug as #1713751, that reporter already mentioned that using mainline kernels fixed it for him (and he was hitting this on 4.10), so it seems more plausible not that 4.14 has a fix, but that Ubuntu's sauce has the breakage. Of course, they may well not be the same bug, because the symptom itself is pretty generic.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qemu-kvm (Ubuntu Artful):
status: New → Confirmed
Changed in qemu-kvm (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

The bug reporter in bug 1713751 has been unable to reproduce the bug with the 4.13.0-16-generic kernel. He's re-testing with the original kernel that exhibited the bug, to ensure he can reproduce it consistently. If he finds that 4.13.0-16-generic is really good, he might be hitting a different bug. I'll update this bug with that answer.

laney, we could start a bisect in this bug at the same time if you want. The first thing we would have to do is identify the last "Good" Ubuntu kernel and the first "Bad" Ubuntu kernel. That would mean trying some of the earlier 4.13 kernels and, based on that, some 4.12 kernels, etc. Bug 1713751 suggests that its reporter is unable to reproduce the bug with 4.10.0-33, so maybe you could try that one next? It's available from:

https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/13234886

Based on what you see with that, we can decide to jump to the middle of the versions and try a 4.11 or 4.12 based kernel.

Changed in linux (Ubuntu Artful):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Artful):
status: Triaged → In Progress
Changed in linux (Ubuntu Bionic):
status: Triaged → In Progress
Changed in linux (Ubuntu Zesty):
status: New → Incomplete
importance: Undecided → High
assignee: nobody → Joseph Salisbury (jsalisbury)
Iain Lane (laney) wrote :

OK a few days ago apw pointed me at http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13.8/ which is the mainline kernel that the artful-proposed one I identified as bad is based on.

I ran 35 instances and rebooted them 30 times - all successful. So I think that says this kernel is good.

Will try another one later on.

Joseph Salisbury (jsalisbury) wrote :

The fact that 4.13.8 doesn't reproduce the bug might be another indicator that the bug was introduced by a SAUCE patch.

Iain Lane (laney) wrote :

Just a heads up - this apparently became much harder for me to reproduce at will. We're still seeing it in actual workloads but I'm having trouble recreating manually.

My current strategy is to start stress-ng on a number of machines and then constantly reboot them, with the idea that this will stress the cloud and make the bug more likely, which relies on the assumption that this bug is somehow to do with busyness on the underlying machines.

Andy Whitcroft (apw) wrote :

I would note that the kernel watchdog timeouts here are always at 20-odd seconds. They are not increasing, so whatever is occurring is progressing, at least as far as the kernel is concerned. If we assume the systemd log is still working (and it was shortly before the event, when it reported reaching shutdown state), then we would expect it to be in the process of attempting to deconstruct the system before calling reboot. Most of the deconstructors it calls are reported before calling. There is one, cg_trim(), which is not announced. Looking at the implementation, it is doing a hierarchical remove of the /sys/fs/cgroup hierarchy. On my system this is some 15000 files in 1300 directories. If there was a performance issue in there we could easily spend hours in this call with nothing logged.

tags: added: kernel-da-key
removed: kernel-key
Seth Forshee (sforshee) wrote :

We've traced the problem to "UBUNTU: SAUCE: exec: ensure file system accounting in check_unsafe_exec is correct." cking has a fix which will be used for zesty and artful. I've reverted the patch in bionic since there's a fix available for golang and we do not want Ubuntu userspace to become reliant on the non-standard kernel behavior in the patch.

Changed in linux (Ubuntu Bionic):
assignee: Joseph Salisbury (jsalisbury) → Seth Forshee (sforshee)
status: In Progress → Fix Committed
description: updated
Stefan Bader (smb) on 2018-01-23
Changed in linux (Ubuntu Zesty):
status: Incomplete → Won't Fix
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Glyn M Burton (modiford) wrote :

Hello Everyone,

Google led me here with a search for a "soft lockup CPU#" error I am experiencing.

I run Ubuntu Server on Citrix XenServer and have never had this issue with Ubuntu Server 12.04, 14.04 or 16.04. I am here because 18.04 has this issue upon using 'reboot', while 'poweroff' works as intended.

If I can be of any assistance to you in retrieving log files, testing or whatever else may help remedy this bug for the good of everyone, then please let me know. Keen to improve Ubuntu for all.

Thank you and best regards,

Modiford.

Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-artful
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-38.43

---------------
linux (4.13.0-38.43) artful; urgency=medium

  * linux: 4.13.0-38.43 -proposed tracker (LP: #1755762)

  * Servers going OOM after updating kernel from 4.10 to 4.13 (LP: #1748408)
    - i40e: Fix memory leak related filter programming status
    - i40e: Add programming descriptors to cleaned_count

  * [SRU] Lenovo E41 Mic mute hotkey is not responding (LP: #1753347)
    - platform/x86: ideapad-laptop: Increase timeout to wait for EC answer

  * fails to dump with latest kpti fixes (LP: #1750021)
    - kdump: write correct address of mem_section into vmcoreinfo

  * headset mic can't be detected on two Dell machines (LP: #1748807)
    - ALSA: hda/realtek - Support headset mode for ALC215/ALC285/ALC289
    - ALSA: hda - Fix headset mic detection problem for two Dell machines
    - ALSA: hda - Fix a wrong FIXUP for alc289 on Dell machines

  * CIFS SMB2/SMB3 does not work for domain based DFS (LP: #1747572)
    - CIFS: make IPC a regular tcon
    - CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl
    - CIFS: dump IPC tcon in debug proc file

  * i2c-thunderx: erroneous error message "unhandled state: 0" (LP: #1754076)
    - i2c: octeon: Prevent error message on bus error

  * hisi_sas: Add disk LED support (LP: #1752695)
    - scsi: hisi_sas: directly attached disk LED feature for v2 hw

  * EDAC, sb_edac: Backport 1 patch to Ubuntu 17.10 (Fix missing DIMM sysfs
    entries with KNL SNC2/SNC4 mode) (LP: #1743856)
    - EDAC, sb_edac: Fix missing DIMM sysfs entries with KNL SNC2/SNC4 mode

  * [regression] Colour banding and artefacts appear system-wide on an Asus
    Zenbook UX303LA with Intel HD 4400 graphics (LP: #1749420)
    - drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA

  * DVB Card with SAA7146 chipset not working (LP: #1742316)
    - vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems

  * [Asus UX360UA] battery status in unity-panel is not changing when battery is
    being charged (LP: #1661876) // AC adapter status not detected on Asus
    ZenBook UX410UAK (LP: #1745032)
    - ACPI / battery: Add quirk for Asus UX360UA and UX410UAK

  * ASUS UX305LA - Battery state not detected correctly (LP: #1482390)
    - ACPI / battery: Add quirk for Asus GL502VSK and UX305LA

  * support thunderx2 vendor pmu events (LP: #1747523)
    - perf pmu: Extract function to get JSON alias map
    - perf pmu: Pass pmu as a parameter to get_cpuid_str()
    - perf tools arm64: Add support for get_cpuid_str function.
    - perf pmu: Add helper function is_pmu_core to detect PMU CORE devices
    - perf vendor events arm64: Add ThunderX2 implementation defined pmu core
      events
    - perf pmu: Add check for valid cpuid in perf_pmu__find_map()

  * lpfc.ko module doesn't work (LP: #1746970)
    - scsi: lpfc: Fix loop mode target discovery

  * Ubuntu 17.10 crashes on vmalloc.c (LP: #1739498)
    - powerpc/mm/book3s64: Make KERN_IO_START a variable
    - powerpc/mm/slb: Move comment next to the code it's referring to
    - powerpc/mm/hash64: Make vmalloc 56T on hash

  * ethtool -p fails to light NIC LED on HiSilicon D05 systems (LP: #1748567)
    - net...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released

Thanks for getting this into the releases now.
I wonder about Bionic's status - any update on that?

Changed in qemu-kvm (Ubuntu Zesty):
status: New → Won't Fix
Changed in qemu-kvm (Ubuntu Bionic):
status: Confirmed → Won't Fix
Changed in qemu-kvm (Ubuntu Artful):
status: Confirmed → Won't Fix

Also, since the change was identified to be in the kernel, setting qemu to Won't Fix.

Seth Forshee (sforshee) wrote :

Bionic should have been fixed for a while now, updating the status.

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released