Some VMs fail to reboot with "watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]"

Bug #1730717 reported by Iain Lane on 2017-11-07
This bug affects 2 people
Affects            Importance  Assigned to
linux (Ubuntu)     High        Seth Forshee
  Zesty            High        Joseph Salisbury
  Artful           High        Joseph Salisbury
  Bionic           High        Seth Forshee
qemu-kvm (Ubuntu)  Undecided   Unassigned
  Zesty            Undecided   Unassigned
  Artful           Undecided   Unassigned
  Bionic           Undecided   Unassigned

Bug Description

== SRU Justification ==

The fix for bug 1672819 can cause a lockup because it can spin indefinitely waiting for a child to exit.

[FIX]
Add a sauce patch on top of the original fix to insert a reasonably small delay and an upper bound on the number of retries made before bailing out with an error. This avoids the lockup and is also less aggressive in the retry loop.

[TEST]
Without the fix the machine hangs. With the fix, the lockup no longer occurs.

[REGRESSION POTENTIAL]
There may be an issue with the interruptible sleep having some unforeseen impact on racy userspace code that expects the system call to return quickly when the race condition occurs, but instead gets delayed by a few milliseconds while the retry loop spins. However, code that relies on fork/exec timing inside pthreads, where this particular code path could bite, is generally non-POSIX-conforming racy code anyhow.

-----------------------------------

This is impacting us for ubuntu autopkgtests. Eventually the whole region ends up dying because each worker is hit by this bug in turn and backs off until the next reset (6 hourly).

17.10 (and bionic) guests are sometimes failing to reboot. When this happens, you see the following in the console

  [  OK  ] Reached target Shutdown.
  [ 191.698969] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
  [ 219.698438] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]
  [ 226.702150] INFO: rcu_sched detected stalls on CPUs/tasks:
  [ 226.704958] (detected by 0, t=15002 jiffies, g=5347, c=5346, q=187)
  [ 226.706093] All QSes seen, last rcu_sched kthread activity 15002 (4294949060-4294934058), jiffies_till_next_fqs=1, root ->qsmask 0x0
  [ 226.708202] rcu_sched kthread starved for 15002 jiffies! g5347 c5346 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0

One host that exhibits this behaviour was:

  Linux klock 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

guest running:

  Linux version 4.13.0-16-generic (buildd@lcy01-02) (gcc version 7.2.0 (Ubuntu 7.2.0-8ubuntu2)) #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017 (Ubuntu 4.13.0-16.19-generic 4.13.4)

The affected cloud region is running the xenial/Ocata cloud archive, so the version of qemu-kvm in there may also be relevant.

Here's how I reproduced it in lcy01:

  $ for n in {1..30}; do nova boot --flavor m1.small --image ubuntu/ubuntu-artful-17.10-amd64-server-20171026.1-disk1.img --key-name testbed-`hostname` --nic net-name=net_ues_proposed_migration laney-test${n}; done
  $ <ssh to each instance> sudo reboot
  # wait a minute or so for the instances to all reboot
  $ for n in {1..30}; do echo "=== ${n} ==="; nova console-log laney-test${n} | tail; done

On bad instances you'll see the "soft lockup" message - on good it'll reboot as normal.

We've seen good and bad instances on multiple compute hosts - it doesn't feel to me like a host problem but rather a race condition somewhere that's somehow either triggered or triggered much more often by what lcy01 is running. I always saw this on the first reboot - never on first boot, and never on n>1th boot. (But if it's a race then that might not mean much.)

I'll attach a bad and a good console-log for reference.

If you're at Canonical then see internal rt #107135 for some other details.

Iain Lane (laney) wrote :

Oh also see https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1713751 which has some superficially similar symptoms (cpu stuck on shutdown).

description: updated
Iain Lane (laney) wrote :

I tried 28 (then my quota ran out) xenial guests BTW and none of those failed.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1730717

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: xenial
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-key
Joseph Salisbury (jsalisbury) wrote :

Do you know if this bug is also happening with Zesty, or just Artful and Bionic (>4.13)?

I'm going to work on bisecting bug 1713751 in case it's related.

Also, it would be good to know if this bug is already fixed in the latest mainline kernel. Do you have a way to reproduce this bug? If so, could you give 4.14-rc8 a try? It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc8

Changed in linux (Ubuntu):
status: Incomplete → Triaged
Changed in linux (Ubuntu Artful):
status: New → Triaged
importance: Undecided → High

Out of the IRC discussions documenting potentially related issues:
- this bug: KVM: Host-Kernel: Xenial-GA, Qemu: Xenial-Ocata, Guest: Bionic
- bug 1722311 KVM: Host-Kernel: Xenial-GA, Qemu: Xenial, Guest: Artful - some relation to cache pressure
- bug 1713751 AWS: triggered by Xenial kernel update, supposed fixed but shown up again and again
- bug 1655842 Host-Kernel: Xenial-GA, Qemu: Xenial, Guest: Artful - some relation to cache pressure
These might after all just run into the same soft lockup symptom, but I thought it was worth mentioning for those not reading the IRC log.

These cases seem to somewhat agree on:
- Recent guest kernel
- Xenial Host kernel
- some memory pressure

To get further, I thought some sort of local reproducer would be easier for the kernel team to work on than needing a full cloud.
But so far I failed at setting such a local case up (http://paste.ubuntu.com/25916781/).

Thanks Laney for the openstack based repro description.
@Laney I found it interesting that you essentially only needed to start+reboot.
I assume on the host you had other workloads going on in the background (since it is lcy01)?
If you have any sort of non-busy but otherwise comparable system - could you check to confirm the assumption we have so far that everything is fine there?
If yes - then the memory pressure theory gets more likely; if not, we can focus on simpler reproducers - so we can only win by that check.

Crossing fingers for jsalisbury's hope that 4.14 might already have a fix.

On Wed, Nov 08, 2017 at 11:08:05AM -0000, ChristianEhrhardt wrote:
> @Laney I found it interesting that you essentially only needed to start+reboot.
> I assume on the host you had other workloads going on in the background (since it is lcy01)?

I don't have visibility into what else the hosts are doing - I'm just a
client here. But I do know that *I* had other workloads running at the
same time. I assume that there were a lot of buildd jobs too. These
compute nodes are probably loaded most of the time.

> If you have any sort of non-busy but otherwise comparable system - could you check to confirm the assumption we have so far that everything is fine there?
> If yes - then the memory pressure theory gets more likely; if not, we can focus on simpler reproducers - so we can only win by that check.

Afraid not, sorry. I did try on my artful+artful host but didn't
reproduce the problem - it probably wasn't under enough stress anyway.

> Crossing fingers for jsalisbury's hope that 4.14 might already have a
> fix.

Right now I'm trying to build a new cloud image with this kernel that I
can try.

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Torkoal (our Jenkins node) was idle atm and Ryan reported he had seen the issues there before, so trying there as well.
This is LTS + HWE - Kernel 4.10.0-38-generic, qemu: 1:2.5+dfsg-5ubuntu10

I thought about your case since you seem just to start a lot of them and reboot,
this shouldn't be so much different to:
$ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=amd64 label=daily release=artful
$ for i in {1..30}; do uvt-kvm create --log-console-output --password=ubuntu artful-${i}-bug1730717 release=artful arch=amd64 label=daily; done
$ for i in {1..30}; do uvt-kvm wait --insecure artful-${i}-bug1730717; done
$ for i in {1..30}; do uvt-kvm ssh --insecure artful-${i}-bug1730717 "sudo reboot"; done
$ sudo grep "soft lockup" /var/log/libvirt/qemu/artful-*-bug1730717.log

But this works for me :-/

Waiting for your feedback if you can trigger the same issue on a non-busy openstack system (could after all be some openstack magic at work that makes it behave differently).

Iain Lane (laney) wrote :

On Wed, Nov 08, 2017 at 11:56:02AM -0000, ChristianEhrhardt wrote:
> Torkoal (our Jenkins node) was idle atm and Ryan reported he had seen the issues there before, so trying there as well.
> This is LTS + HWE - Kernel 4.10.0-38-generic, qemu: 1:2.5+dfsg-5ubuntu10
>
> I thought about your case since you seem just to start a lot of them and reboot,
> this shouldn't be so much different to:
> $ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=amd64 label=daily release=artful
> $ for i in {1..30}; do uvt-kvm create --log-console-output --password=ubuntu artful-${i}-bug1730717 release=artful arch=amd64 label=daily; done
> $ for i in {1..30}; do uvt-kvm wait --insecure artful-${i}-bug1730717; done
> $ for i in {1..30}; do uvt-kvm ssh --insecure artful-${i}-bug1730717 "sudo reboot"; done
> $ sudo grep "soft lockup" /var/log/libvirt/qemu/artful-*-bug1730717.log

Sounds like it's similar, but maybe you have to put the system under
load - you might need more instances, or maybe start a whole bunch first
and get them to run something memory intensive before running that same
test again. In the cloud there will be buildds and tests running on the
compute nodes too, as well as these 'empty' instances that I use to
reproduce the problem.

> Waiting for your feedback if you can trigger the same issue on a non-
> busy openstack system (could after all be some openstack magic at work
> that makes it behave differently).

I don't have access to a non busy cloud I'm afraid.

ANYWAY! My results are in. I created an image by booting the stock
artful cloud image and installing the mainline kernel v4.14-rc8
(39dae59d66acd86d1de24294bd2f343fd5e7a625) packages, on lcy01 (the busy
cloud that exhibits this problem).

I started 34 (17 × 2 in two runs - that's all I could squeeze in before
I hit my quota) instances, and they were all good. This isn't definitive
proof, but it looks like that kernel might be good.

Cheers,

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Joseph Salisbury (jsalisbury) wrote :

@Laney, thanks for testing the mainline kernel. It's promising that a fix might be in that kernel. The time-consuming part will be identifying which commit in that kernel is the actual fix. We could perform a "reverse" kernel bisect, which would require testing 12 or so test kernels. However, it sounds like setting up a reproducer is time consuming as well.

The easiest thing to try next would be to test the latest upstream 4.13 stable kernels. It's possible the fix that is in 4.14-rc8 was also cc'd to upstream stable and made its way into 4.13 through the normal stable update process.

If possible to test, the latest 4.13 upstream kernel is 4.13.12 and is available here:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13.12/

Iain Lane (laney) wrote :

On Wed, Nov 08, 2017 at 03:16:29PM -0000, Joseph Salisbury wrote:
> @Laney, thanks for testing the mainline kernel. It's promising that a
> fix might be in that kernel. The time consuming part will be
> identifying what commit in that kernel is the actual fix. We could
> perform a "reverse" kernel bisect, which would require testing 12 or so
> test kernels. However, it sounds like to set up a reproducer is time
> consuming as well.
>
> The easiest thing to try next would be to test the latest upstream 4.13
> stable kernels. It's possible the fix that is in 4.14-rc8 was also cc'd
> to upstream stable and made its way into 4.13 through the normal
> stable update process.
>
> If possible to test, the latest 4.13 upstream kernel is 4.13.12 and is available here:
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13.12/

20 instances, all good.

Maybe I should try one that we think is 'bad' to confirm that the way
I'm doing this is actually capable of reproducing the issue? If you
think that's sensible, could you recommend me a kernel to try?

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Joseph Salisbury (jsalisbury) wrote :

Maybe give 4.13.0-16 a try:

https://launchpad.net/~canonical-kernel-security-team/+archive/ubuntu/ppa2/+build/13567624

It could also be that the bug is being triggered by an Ubuntu-specific SAUCE patch, so it won't happen with upstream kernels.

Iain Lane (laney) wrote :

On Wed, Nov 08, 2017 at 06:57:46PM -0000, Joseph Salisbury wrote:
> Maybe give 4.13.0-16 a try:
>
> https://launchpad.net/~canonical-kernel-security-
> team/+archive/ubuntu/ppa2/+build/13567624
>
> It could also be that the bug is being triggered by an Ubuntu-specific SAUCE
> patch, so it won't happen with upstream kernels.

It looked to me like that's the same kernel that's in artful release, so
instead I tried with the kernel in artful-proposed (4.13.0-17.20) and
managed to reproduce the bug on 1/6 instances after a few reboot cycles.
So I think my method is okay to check candidate kernels. Feel free to
throw some more at me if you want to bisect.

(I think IS took some of the slower machines out of rotation so the
problem might become slightly harder to reproduce - definitely is still
happening though.)

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Adam Conrad (adconrad) wrote :

Note that if it *is* the same bug as #1713751, that reporter already mentioned that using mainline kernels fixed it for him (and he was hitting this on 4.10), so it seems more plausible not that 4.14 has a fix, but that Ubuntu's sauce has the breakage. Of course, they may well not be the same bug, because the symptom itself is pretty generic.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qemu-kvm (Ubuntu Artful):
status: New → Confirmed
Changed in qemu-kvm (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

The bug reporter in bug 1713751 has been unable to reproduce the bug with the 4.13.0-16-generic kernel. He's re-testing with the original kernel that exhibited the bug, to ensure he can reproduce it consistently. If he finds that 4.13.0-16-generic is really good, he might be hitting a different bug. I'll update this bug with that answer.

laney, we could start a bisect in this bug at the same time if you want. The first thing we would have to do is identify the last "Good" Ubuntu kernel and the first "Bad" Ubuntu kernel. That would mean trying some of the earlier 4.13 kernels and, based on that, some 4.12 kernels, etc. Bug 1713751 suggests that its reporter is unable to reproduce the bug with 4.10.0-33, so maybe you could try that one next? It's available from:

https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/13234886

Based on what you see with that, we can decide to jump to the middle of the versions and try a 4.11 or 4.12 based kernel.

Changed in linux (Ubuntu Artful):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Artful):
status: Triaged → In Progress
Changed in linux (Ubuntu Bionic):
status: Triaged → In Progress
Changed in linux (Ubuntu Zesty):
status: New → Incomplete
importance: Undecided → High
assignee: nobody → Joseph Salisbury (jsalisbury)
Iain Lane (laney) wrote :

OK a few days ago apw pointed me at http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13.8/ which is the mainline kernel that the artful-proposed one I identified as bad is based on.

I ran 35 instances and rebooted them 30 times - all successful. So I think that says this kernel is good.

Will try another one later on.

Joseph Salisbury (jsalisbury) wrote :

The fact that 4.13.8 doesn't reproduce the bug might be another indicator that the bug was introduced by a SAUCE patch.

Iain Lane (laney) wrote :

Just a heads up - this apparently became much harder for me to reproduce at will. We're still seeing it in actual workloads but I'm having trouble recreating manually.

My current strategy is to start stress-ng on a number of machines and then constantly reboot them, with the idea that this will stress the cloud and make the bug more likely, which relies on the assumption that this bug is somehow to do with busyness on the underlying machines.

Andy Whitcroft (apw) wrote :

I would note that the kernel watchdog timeouts here are always at 20-odd seconds. They are not increasing, so whatever is occurring is progressing, at least as far as the kernel is concerned. If we assume the systemd log is still working (and it was shortly before the event, when it reported reaching shutdown state), then we would expect it to be in the process of attempting to deconstruct the system before calling reboot. Most of the deconstructors it calls are reported before calling. There is one, cg_trim(), which is not announced. Looking at the implementation, it is doing a hierarchical remove of the /sys/fs/cgroup hierarchy. On my system this is some 15000 files in 1300 directories. If there was a performance issue in there we could easily spend hours in this call with nothing logged.

tags: added: kernel-da-key
removed: kernel-key
Seth Forshee (sforshee) wrote :

We've traced the problem to "UBUNTU: SAUCE: exec: ensure file system accounting in check_unsafe_exec is correct." cking has a fix which will be used for zesty and artful. I've reverted the patch in bionic since there's a fix available for golang and we do not want Ubuntu userspace to become reliant on the non-standard kernel behavior in the patch.

Changed in linux (Ubuntu Bionic):
assignee: Joseph Salisbury (jsalisbury) → Seth Forshee (sforshee)
status: In Progress → Fix Committed
description: updated
Stefan Bader (smb) on 2018-01-23
Changed in linux (Ubuntu Zesty):
status: Incomplete → Won't Fix
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Glyn M Burton (modiford) wrote :

Hello Everyone,

Google led me here with a search for a "soft lockup CPU#" error I am experiencing.

I run Ubuntu Server on Citrix XenServer and have never had this issue with Ubuntu Server 12.04, 14.04 or 16.04. I am here because 18.04 has this issue upon using 'reboot', while 'poweroff' works as intended.

If I can be of any assistance to you in retrieving log files, testing or whatever else may help remedy this bug for the good of everyone, then please let me know. Keen to improve Ubuntu for all.

Thank you and best regards,

Modiford.

Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-artful
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-38.43

---------------
linux (4.13.0-38.43) artful; urgency=medium

  * linux: 4.13.0-38.43 -proposed tracker (LP: #1755762)

  * Servers going OOM after updating kernel from 4.10 to 4.13 (LP: #1748408)
    - i40e: Fix memory leak related filter programming status
    - i40e: Add programming descriptors to cleaned_count

  * [SRU] Lenovo E41 Mic mute hotkey is not responding (LP: #1753347)
    - platform/x86: ideapad-laptop: Increase timeout to wait for EC answer

  * fails to dump with latest kpti fixes (LP: #1750021)
    - kdump: write correct address of mem_section into vmcoreinfo

  * headset mic can't be detected on two Dell machines (LP: #1748807)
    - ALSA: hda/realtek - Support headset mode for ALC215/ALC285/ALC289
    - ALSA: hda - Fix headset mic detection problem for two Dell machines
    - ALSA: hda - Fix a wrong FIXUP for alc289 on Dell machines

  * CIFS SMB2/SMB3 does not work for domain based DFS (LP: #1747572)
    - CIFS: make IPC a regular tcon
    - CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl
    - CIFS: dump IPC tcon in debug proc file

  * i2c-thunderx: erroneous error message "unhandled state: 0" (LP: #1754076)
    - i2c: octeon: Prevent error message on bus error

  * hisi_sas: Add disk LED support (LP: #1752695)
    - scsi: hisi_sas: directly attached disk LED feature for v2 hw

  * EDAC, sb_edac: Backport 1 patch to Ubuntu 17.10 (Fix missing DIMM sysfs
    entries with KNL SNC2/SNC4 mode) (LP: #1743856)
    - EDAC, sb_edac: Fix missing DIMM sysfs entries with KNL SNC2/SNC4 mode

  * [regression] Colour banding and artefacts appear system-wide on an Asus
    Zenbook UX303LA with Intel HD 4400 graphics (LP: #1749420)
    - drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA

  * DVB Card with SAA7146 chipset not working (LP: #1742316)
    - vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems

  * [Asus UX360UA] battery status in unity-panel is not changing when battery is
    being charged (LP: #1661876) // AC adapter status not detected on Asus
    ZenBook UX410UAK (LP: #1745032)
    - ACPI / battery: Add quirk for Asus UX360UA and UX410UAK

  * ASUS UX305LA - Battery state not detected correctly (LP: #1482390)
    - ACPI / battery: Add quirk for Asus GL502VSK and UX305LA

  * support thunderx2 vendor pmu events (LP: #1747523)
    - perf pmu: Extract function to get JSON alias map
    - perf pmu: Pass pmu as a parameter to get_cpuid_str()
    - perf tools arm64: Add support for get_cpuid_str function.
    - perf pmu: Add helper function is_pmu_core to detect PMU CORE devices
    - perf vendor events arm64: Add ThunderX2 implementation defined pmu core
      events
    - perf pmu: Add check for valid cpuid in perf_pmu__find_map()

  * lpfc.ko module doesn't work (LP: #1746970)
    - scsi: lpfc: Fix loop mode target discovery

  * Ubuntu 17.10 crashes on vmalloc.c (LP: #1739498)
    - powerpc/mm/book3s64: Make KERN_IO_START a variable
    - powerpc/mm/slb: Move comment next to the code it's referring to
    - powerpc/mm/hash64: Make vmalloc 56T on hash

  * ethtool -p fails to light NIC LED on HiSilicon D05 systems (LP: #1748567)
    - net...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released

Thanks for getting this into the releases now.
I wonder about Bionic's status - any update on that?

Changed in qemu-kvm (Ubuntu Zesty):
status: New → Won't Fix
Changed in qemu-kvm (Ubuntu Bionic):
status: Confirmed → Won't Fix
Changed in qemu-kvm (Ubuntu Artful):
status: Confirmed → Won't Fix

Also, since the change was identified to be in the kernel, setting qemu to Won't Fix.

Seth Forshee (sforshee) wrote :

Bionic should have been fixed for a while now, updating the status.

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released