Boot/Installation crash of Ubuntu-16.04.3 HWE kernel on R940

Bug #1719697 reported by Sujith Pandel
This bug affects 1 person
Affects          Status         Importance   Assigned to        Milestone
dellserver       Fix Released   Undecided    Joseph Salisbury
linux (Ubuntu)   Fix Released   Medium       Joseph Salisbury
Zesty            Fix Released   Medium       Joseph Salisbury

Bug Description

== SRU Justification ==
The kernel crashes during installation of Ubuntu-16.04.3 with the HWE kernel (ISO).
The same crash is observed when booting the 4.10.0-28 HWE kernel of Ubuntu-16.04.3,
and the 4.10.0-33 HWE kernel as well.

Seen only with 4.10 HWE kernels of Ubuntu-16.04.3. 4.4 kernels of
Ubuntu-16.04.3 work fine. Daily builds of Ubuntu Server 17.10 work fine.

Reducing the core count to <26 cores helps here. Boot & installation of
the HWE kernel then work fine.
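
To compare a machine with the reported configuration (2 sockets x 28 cores), the physical core count per socket can be read from /proc/cpuinfo. The following is a minimal Python sketch and not part of the original report: the function name `cores_per_socket` is hypothetical, and it assumes the x86 `physical id` and `core id` fields are present. The report does not state whether the "<26 cores" threshold is per socket or in total, so the sketch only reports counts and applies no threshold.

```python
from collections import defaultdict


def cores_per_socket(cpuinfo_text):
    """Return {physical_id: count of distinct core ids} from /proc/cpuinfo text.

    Hypothetical helper for illustration. Hyperthread siblings share a
    core id, so they are counted once per physical core.
    """
    cores = defaultdict(set)
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            # A line without a colon (normally blank) ends one logical-CPU stanza.
            physical_id = None
            continue
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = int(value)
        elif key == "core id" and physical_id is not None:
            cores[physical_id].add(int(value))
    return {socket: len(ids) for socket, ids in cores.items()}
```

On a live system this could be called as `cores_per_socket(open("/proc/cpuinfo").read())`. On the setup described in this report, one would expect two sockets with 28 cores each.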

This bug was introduced by commit:
dc6db24d2476 ("x86/acpi: Set persistent cpuid <-> nodeid mapping when booting")

It is resolved by reverting commit dc6db24d2476, which was done in mainline by
commit c962cff17df as of v4.11-rc3.

The same patch author submitted three additional commits together with
commit c962cff17df. However, it was confirmed that only the single
revert is needed to fix this particular bug. Upstream thread:
 https://lkml.org/lkml/2017/2/20/66

== Fix ==
commit c962cff17dfa11f4a8227ac16de2b28aea3312e4
Author: Dou Liyang <email address hidden>
Date: Fri Mar 3 16:02:23 2017 +0800

    Revert "x86/acpi: Set persistent cpuid <-> nodeid mapping when booting"

== Regression Potential ==
This is reverting a commit that introduced a bug. This commit has also
been reverted upstream.

== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

Setup:
Dell PowerEdge R940 with 2 sockets populated with 28-core CPUs.

Impact:
This is a boot and installation failure for R940 users with 2 sockets x 28 cores.
Requesting an SRU for the fix.

Steps:
1. Set up a Dell PowerEdge R940 with 28-core CPUs on 2 sockets.
2. Start installation of Ubuntu-16.04.3 with HWE kernel.
3. Observe that the screen remains blank. Console logs indicate a kernel crash.

Additional Info:
* Seen only with 4.10 HWE kernels of Ubuntu-16.04.3.
  4.4 kernels of Ubuntu-16.04.3 work fine. Daily builds of Ubuntu Server 17.10 work fine.

* Reducing the core count to <26 cores helps here. Boot & installation of the HWE kernel then work fine.

* Attaching the console log and acpidump from the setup.

* Patch causing this failure: https://github.com/torvalds/linux/commit/dc6db24d2476cd09c0ecf2b8d80313539f737a89
x86/acpi: Set persistent cpuid <-> nodeid mapping when booting

* Fix patch series: https://lkml.org/lkml/2017/2/20/66

Revert "x86/acpi: Set persistent cpuid <-> nodeid mapping when booting"
https://github.com/torvalds/linux/commit/c962cff17dfa11f4a8227ac16de2b28aea3312e4#diff-6bd7ef719bca1a2a56b9ebf4bd0bd88d

Revert "x86/acpi: Enable MADT APIs to return disabled apicids"
https://github.com/torvalds/linux/commit/09c3f2bd5c7e5f18687663acb6adc6b167484ca5

acpi/processor: Implement DEVICE operator for processor enumeration
https://github.com/torvalds/linux/commit/8c8cb30f49b86333d8e036e1945cf1a78c03577e

acpi/processor: Check for duplicate processor ids at hotplug time
https://github.com/torvalds/linux/commit/a77d6cd968497792e072b74dff45b891ba778ddb

Sujith Pandel (sujithpandel) wrote :

acpidump from the server

Narinder Gupta (narindergupta) wrote :

As a rough check, I also tried the upstream kernel with and without this one patch (Revert "x86/acpi: Set persistent cpuid <-> nodeid mapping when booting").
Booting was fine with it and failed without it. This pointed me to the patch set it belongs to.
I hope this is correct.

tags: added: kernel-da-key zesty
Joseph Salisbury (jsalisbury) wrote :

I built a 17.04 (Zesty) test kernel with the four patches mentioned in the description. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1719697/

Can you test this kernel and see if it resolves this bug?

Sujith Pandel (sujithpandel) wrote :

Thank you, Joseph.
At first I checked that the previous HWE kernel (4.10.0-28-generic) still hits the issue on the R940 setup.
After that I verified the test kernel shared above and confirm that the issue is not seen with it.

Can you please let me know if there are plans to include this fix in the Ubuntu-16.04.3 HWE kernel?
The fix is already present in the hwe-edge kernel (4.11+) and it works fine in this setup.
But that kernel's release (as the HWE kernel of Ubuntu-16.04.4) is sometime in Feb.
It would help us a lot to get the fix sooner.

Joseph Salisbury (jsalisbury) wrote :

I will submit an SRU request to get this fix into the Ubuntu-16.04.3 HWE kernel.

I'd first like to see if all four of the requested patches are needed or just commit c962cff17. I built a Zesty test kernel with just commit c962cff17 this time. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1719697/c962cff17

Can you see if this kernel resolves the bug, or if the other three patches are indeed also required?

Sujith Pandel (sujithpandel) wrote :

Confirmed no crash-issue with kernel @ http://kernel.ubuntu.com/~jsalisbury/lp1719697/c962cff17

$ uname -a
Linux dhcp-165-167 4.10.0-37-generic #41~lp1719697JustCommitc962cff17

Joseph Salisbury (jsalisbury) wrote :

Thanks for testing. So it sounds like I just need to SRU commit c962cff17 and not all four patches.

Changed in dellserver:
status: New → In Progress
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :
Changed in linux (Ubuntu):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Zesty):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Joseph Salisbury (jsalisbury)
description: updated
Sujith Pandel (sujithpandel) wrote :

Hi Joseph,
Thanks for taking this up.
Can you let us know if this is confirmed for upcoming SRU?

From the schedule, I see we are already past the freeze date:
 * Next cycle: 27-Oct through 18-Nov
               27-Oct Last day for kernel commits for this cycle.
      30-Oct - 04-Nov Kernel prep week.
      05-Nov - 17-Nov Bug verification & Regression testing.
               20-Nov Release to -updates.

Thanks,
Sujith

Joseph Salisbury (jsalisbury) wrote :

Yes, that should be correct. This bug will change to "Fix Committed" when the fix gets into the -proposed repository.

Changed in linux (Ubuntu Zesty):
status: In Progress → Fix Released
status: Fix Released → Fix Committed
Khaled El Mously (kmously) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-zesty' to 'verification-done-zesty'. If the problem still exists, change the tag 'verification-needed-zesty' to 'verification-failed-zesty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-zesty
Sujith Pandel (sujithpandel) wrote :

Thank you.
We will report back with the findings on the xenial-proposed kernel.

Sujith Pandel (sujithpandel) wrote :

After adding xenial-proposed to /etc/apt/sources.list,

# apt-get install linux-generic-hwe-16.04/xenial-proposed

Confirmed that booting passes with:

# uname -a
Linux dhcp-165-167 4.10.0-41-generic #45~16.04.1-Ubuntu SMP Fri Nov 24 15:06:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Thanks for including the fix.
http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/commit/?h=hwe&id=dcde44490772d2059ecc5035a3b6c994176c6212

Sujith Pandel (sujithpandel) wrote :

>If the problem is solved, change the tag 'verification-needed-zesty' to 'verification-done-zesty'.

Can someone help me here?
Is this action on me? Problem is solved with the above kernel.

tags: added: verification-done-zesty
removed: kernel-da-key verification-needed-zesty zesty
Kleber Sacilotto de Souza (kleber-souza) wrote :

Hi @sujithpandel,

Thank you for verifying the fix and changing the tag!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.10.0-42.46

---------------
linux (4.10.0-42.46) zesty; urgency=low

  * linux: 4.10.0-42.46 -proposed tracker (LP: #1736152)

  * CVE-2017-1000405
    - mm, thp: Do not make page table dirty unconditionally in touch_p[mu]d()

  * CVE-2017-16939
    - ipsec: Fix aborted xfrm policy dump crash

linux (4.10.0-41.45) zesty; urgency=low

  * linux: 4.10.0-41.45 -proposed tracker (LP: #1733524)

  * tar -x sometimes fails on overlayfs (LP: #1728489)
    - ovl: check if all layers are on the same fs
    - ovl: persistent inode number for directories

  * CVE-2017-12146
    - driver core: platform: fix race condition with driver_override

  * NVMe timeout is too short (LP: #1729119)
    - nvme: update timeout module parameter type

  * Set PANIC_TIMEOUT=10 on Power Systems (LP: #1730660)
    - [Config]: Set PANIC_TIMEOUT=10 on ppc64el

  * Cannot pair BLE remote devices when using combo BT SoC (LP: #1731467)
    - Bluetooth: increase timeout for le auto connections

  * Plantronics P610 does not support sample rate reading (LP: #1719853)
    - ALSA: usb-audio: Add sample rate quirk for Plantronics P610

  * Invalid btree pointer causes the kernel NULL pointer dereference
    (LP: #1729256)
    - xfs: reinit btree pointer on attr tree inactivation walk

  * Samba mount/umount in docker container triggers kernel Oops (LP: #1729637)
    - ipv6: only call ip6_route_dev_notify() once for NETDEV_UNREGISTER
    - ipv6: fix NULL dereference in ip6_route_dev_notify()

  * Device hotplugging with MPT SAS cannot work for VMWare ESXi (LP: #1730852)
    - scsi: mptsas: Fixup device hotplug for VMWare ESXi

  * Boot/Installation crash of Ubuntu-16.04.3 HWE kernel on R940 (LP: #1719697)
    - Revert "x86/acpi: Set persistent cpuid <-> nodeid mapping when booting"

 -- Stefan Bader <email address hidden> Mon, 04 Dec 2017 15:04:01 +0100
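
The changelog above lists the revert under 4.10.0-41.45, while the report names 4.10.0-28 and 4.10.0-33 as crashing. A minimal Python sketch of that check follows. It is not part of the original report: the helper `has_revert` and the constant `FIXED_ABI` are hypothetical names, and it assumes the `uname -r` string has the usual Ubuntu form `4.10.0-N-flavour`.

```python
import re

# Assumption taken from the changelog in this report: 4.10.0-41 is the
# first official Ubuntu 4.10 build that carries the revert.
FIXED_ABI = 41


def has_revert(release):
    """Classify a `uname -r` string.

    Returns True or False for Ubuntu 4.10.0-N kernels, and None for any
    other kernel series, which this sketch does not try to judge.
    """
    match = re.match(r"^4\.10\.0-(\d+)(?:\D|$)", release)
    if match is None:
        return None
    return int(match.group(1)) >= FIXED_ABI
```

For example, `has_revert("4.10.0-28-generic")` evaluates to `False` and `has_revert("4.10.0-41-generic")` to `True`. Custom test builds, such as the 4.10.0-37 lp1719697 kernel used earlier in this thread, carry the fix under a lower ABI number and are not recognised by this check.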

Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
Michael Reed (mreed8855)
Changed in dellserver:
status: In Progress → Fix Released
Changed in linux (Ubuntu):
status: In Progress → Fix Released
Brad Figg (brad-figg)
tags: added: cscc