Kernel 4.15.0-15 breaks Dell PowerEdge 12th Gen servers

Bug #1765232 reported by Nikolas Britton on 2018-04-18
This bug affects 7 people
Affects          Importance  Assigned to
linux (Ubuntu)   Critical    Seth Forshee
Bionic           Critical    Seth Forshee

Bug Description

SRU Justification

Impact: Since 4.15.0-15 some machines have been failing to boot due to IO hangs. This is caused by patches applied for LP #1759723, which assigned managed interrupt vectors and reply queues for all possible CPUs, not just present CPUs. Some drivers were not prepared to cope with this and end up selecting reply queues not mapped to an online CPU, causing IO hangs during boot.
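The failure mode described above can be illustrated with a small model. This is only a sketch in Python, not kernel code: the CPU counts, the contiguous spreading, the `cpu % nr_queues` selection rule, and all function names are assumptions chosen for illustration and are not taken from the affected drivers.

```python
def spread_vectors(cpus, nr_queues):
    """Split `cpus` into contiguous groups, one group per reply queue (simplified)."""
    per_queue = -(-len(cpus) // nr_queues)  # ceiling division
    return {q: cpus[q * per_queue:(q + 1) * per_queue] for q in range(nr_queues)}


def dead_queues(affinity, online):
    """Return the queues whose interrupt vector is mapped to no online CPU."""
    online = set(online)
    return sorted(q for q, cpus in affinity.items() if not online & set(cpus))


def naive_queue(cpu, nr_queues):
    """Hypothetical driver rule that ignores the vector affinity entirely."""
    return cpu % nr_queues


def mapped_queue(cpu, affinity):
    """Pick the queue whose vector is actually mapped to the submitting CPU."""
    for queue, cpus in affinity.items():
        if cpu in cpus:
            return queue
    return 0  # fall back to queue 0 if the CPU is not in any group


# Hypothetical machine: 8 possible CPUs, of which only 4 are present and online.
possible = list(range(8))
online = [0, 1, 2, 3]
nr_queues = 4

by_possible = spread_vectors(possible, nr_queues)  # {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
by_online = spread_vectors(online, nr_queues)      # {0: [0], 1: [1], 2: [2], 3: [3]}

print(dead_queues(by_possible, online))  # [2, 3]: these queues can never complete IO
print(dead_queues(by_online, online))    # []
print(naive_queue(2, nr_queues))         # 2: CPU 2 submits to a dead queue and hangs
print(mapped_queue(2, by_possible))      # 1: a queue served by an online CPU
```

In this model, IO submitted from CPU 2 or 3 with the naive rule lands on a queue whose completions are never delivered, which corresponds to the hang at boot. Looking the queue up from the affinity map avoids the dead queues.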

Fix: There are driver fixes available upstream, but there are 8-ish patches in total and we're extremely close to release, so the safer bet is to just revert the patches for LP #1759723. We can consider reintroducing them with required fixes at a later time.

Regression Potential: This is obviously going to reintroduce the problem the patches were intended to fix. That problem is less serious than the ones the patches introduced, and IBM has given their okay to revert them as well.

Test Case: Verified to fix affected hardware on LP #1765232.

---

On Ubuntu 18.04 amd64 server, I updated the kernel from 4.15.0-13 to 4.15.0-15, and the system now hangs at boot. After rolling back to the old kernel, everything works as expected. It appears to hang while trying to enumerate an SD card device, which isn't even installed. I have tried this on a Dell R620 and an R820, and both have exactly the same problem. I even downloaded a new installer ISO from the daily builds to reinstall the OS, and the installer now hangs too.

Support for Dell PowerEdge 12th gen servers appears to be broken with kernel 4.15.0-15.

https://imgur.com/VT9zd0w
https://imgur.com/4BQPCXo

Nikolas Britton (nbritton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1765232/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Paul White (paulw2u) on 2018-04-19
affects: ubuntu → linux (Ubuntu)

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1765232

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Tyler Doherty (tylerjd) wrote :

Looks similar to the issue I am having on my HP ProLiant, where it hangs then panics at boot on 4.15.0-15. In my case, it's hanging on the HP Smart Array driver. Here's the bug in question: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1765105

tags: added: bionic
Nikolas Britton (nbritton) wrote :

Re bot, the stated command can't be run because the system will not boot to a bash prompt.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the proposed kernel and post back if it resolves this bug?
See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.

Thank you in advance!

Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
importance: High → Critical
tags: added: kernel-key
Joseph Salisbury (jsalisbury) wrote :

If the bug still exists in proposed, we can perform a kernel bisect to identify the commit that introduced this regression.

Also, if this bug still exists in proposed, it might be good to see if this bug also exists in the latest mainline kernel, which can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.17-rc1/
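The kernel bisect mentioned above is essentially a binary search over the commits between a known-good and a known-bad build. A minimal sketch in Python, assuming a linear history in which every commit after the first bad one is also bad; the commit labels and the `boots_badly` check are hypothetical stand-ins for building and booting a test kernel:

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the first one is known
    good, the last one is known bad, and badness never goes away again.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history between a working and a hanging kernel build.
history = [f"commit-{n}" for n in range(16)]
culprit = "commit-11"
tests_run = []


def boots_badly(commit):
    """Stand-in for building the kernel at `commit` and checking for the hang."""
    tests_run.append(commit)
    return history.index(commit) >= history.index(culprit)


found = first_bad(history, boots_badly)
print(found, len(tests_run))  # commit-11 4
```

Each step halves the remaining range, so 16 candidate commits need only 4 test boots in this example.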

Nikolas Britton (nbritton) wrote :

I just tried the proposed kernel (4.15.0-18.19) and it still has the same problem.

Joseph Salisbury (jsalisbury) wrote :

Can you also give the mainline kernel a try:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.17-rc1/

If the mainline kernel does not have the bug, please test the latest upstream 4.15 kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15.18/

Seth Forshee (sforshee) wrote :

I'd suggest prioritizing testing of the v4.15.17 mainline kernel as this is the latest update bionic has incorporated. If that is broken then try v4.15.18 to see if there's a fix there, otherwise it points to some Ubuntu-specific changes or backports.

Seth Forshee (sforshee) wrote :

I've been investigating, and I strongly suspect these commits for bug 1759723 are to blame.

 f0aff9ccc834 genirq/affinity: assign vectors to all possible CPUs
 9403a13fd07e blk-mq: simplify queue mapping & schedule with each possisble CPU

As far as I can tell there are 8 other fixes we'd need to look at including, either addressing the same original commit that these patches addressed or addressing bugs related to these two patches:

 16ccfff28976 nvme: pci: pass max vectors as num_possible_cpus() to pci_alloc_irq_vectors
 8b834bff1b73 scsi: hpsa: fix selection of reply queue
 adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
 b5b6e8c8d3b4 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
 7bed45954b95 blk-mq: make sure hctx->next_cpu is set correctly
 a1c735fb7907 blk-mq: make sure that correct hctx->next_cpu is set
 bffa9909a6b4 blk-mq: don't keep offline CPUs mapped to hctx 0
 d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible

Or we can revert those two patches. I'm considering both options, will provide one or two test kernels soon.

Seth Forshee (sforshee) wrote :

Here's a test kernel with the reverts, please let us know ASAP whether or not this fixes the issue.

http://people.canonical.com/~sforshee/lp1765232/linux-4.15.0-18.19+lp1765232v201804211006/

Thanks!

Seth Forshee (sforshee) wrote :

And now here's another test build with the upstream fixes. Note that the last patch needed too much backporting to do in short order, and I had to include one additional prerequisite:

http://people.canonical.com/~sforshee/lp1765232/linux-4.15.0-18.19+lp1765232v201804211053/

Please test the kernel from comment #12 first, though, as I'm more inclined to revert the patches for the moment. We may consider adding them back later with these additional fixes if they are confirmed to resolve these hangs. Thanks!

Tyler Doherty (tylerjd) wrote :

I tried both these kernels, and unfortunately they are both still leading to hangs on my HP ProLiant (from bug #1765105).

That being said, I've not installed an out-of-tree kernel on Ubuntu before. I assume it's just installing linux-image and linux-modules? I just want to verify that I'm not messing something up on my end.

Seth Forshee (sforshee) wrote :

Make sure you got linux-image-unsigned, linux-modules, *and* linux-modules-extra. linux-modules-extra has the hpsa driver.
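A quick way to check a download directory for the three packages named above is sketched below in Python. It relies only on the usual `name_version_arch.deb` file naming; the example file names and versions are hypothetical, not the actual names of the test builds.

```python
import re

REQUIRED = ("linux-image-unsigned", "linux-modules", "linux-modules-extra")


def missing_packages(deb_files):
    """Return the required package families with no matching .deb file."""
    # The package name is everything before the first underscore.
    names = [f.split("_", 1)[0] for f in deb_files]
    missing = []
    for family in REQUIRED:
        # Require a digit right after the family name so that
        # "linux-modules" does not match "linux-modules-extra-...".
        pattern = re.escape(family) + r"-\d.*"
        if not any(re.fullmatch(pattern, name) for name in names):
            missing.append(family)
    return missing


# Hypothetical download that forgot linux-modules-extra.
downloaded = [
    "linux-image-unsigned-4.15.0-18-generic_4.15.0-18.19_amd64.deb",
    "linux-modules-4.15.0-18-generic_4.15.0-18.19_amd64.deb",
]
print(missing_packages(downloaded))  # ['linux-modules-extra']
```

An empty list means all three package families are present.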

Tyler Doherty (tylerjd) wrote :

Seth, thanks for the guidance, that did it. Looks like the kernel from #12 works without issue. Glad you were able to identify the commits that did it.

Seth Forshee (sforshee) wrote :

Thanks for testing, I'll get those reverts sent to the list.

Nikolas Britton (nbritton) wrote :

The kernel from post #12 also works without issue for me.

Seth Forshee (sforshee) on 2018-04-21
description: updated
Changed in linux (Ubuntu Bionic):
assignee: nobody → Seth Forshee (sforshee)
status: Confirmed → In Progress
Seth Forshee (sforshee) on 2018-04-21
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-19.20

---------------
linux (4.15.0-19.20) bionic; urgency=medium

  * linux: 4.15.0-19.20 -proposed tracker (LP: #1766021)

  * Kernel 4.15.0-15 breaks Dell PowerEdge 12th Gen servers (LP: #1765232)
    - Revert "blk-mq: simplify queue mapping & schedule with each possisble CPU"
    - Revert "genirq/affinity: assign vectors to all possible CPUs"

linux (4.15.0-18.19) bionic; urgency=medium

  * linux: 4.15.0-18.19 -proposed tracker (LP: #1765490)

  * [regression] Ubuntu 18.04:[4.15.0-17-generic #18] KVM Guest Kernel:
    meltdown: rfi/fallback displacement flush not enabled bydefault (kvm)
    (LP: #1765429)
    - powerpc/pseries: Fix clearing of security feature flags

  * signing: only install a signed kernel (LP: #1764794)
    - [Packaging] update to Debian like control scripts
    - [Packaging] switch to triggers for postinst.d postrm.d handling
    - [Packaging] signing -- switch to raw-signing tarballs
    - [Packaging] signing -- switch to linux-image as signed when available
    - [Config] signing -- enable Opal signing for ppc64el
    - [Packaging] printenv -- add signing options

  * [18.04 FEAT] Sign POWER host/NV kernels (LP: #1696154)
    - [Packaging] signing -- add support for signing Opal kernel binaries

  * Please cherrypick s390 unwind fix (LP: #1765083)
    - s390/compat: fix setup_frame32

  * Ubuntu 18.04 installer does not detect any IPR based HDD/RAID array [S822L]
    [ipr] (LP: #1751813)
    - d-i: move ipr to storage-core-modules on ppc64el

  * drivers/gpu/drm/bridge/adv7511/adv7511.ko missing (LP: #1764816)
    - SAUCE: (no-up) rename the adv7511 drm driver to adv7511_drm

  * Miscellaneous Ubuntu changes
    - [Packaging] Add linux-oem to rebuild test blacklist.

linux (4.15.0-17.18) bionic; urgency=medium

  * linux: 4.15.0-17.18 -proposed tracker (LP: #1764498)

  * Eventual OOM with profile reloads (LP: #1750594)
    - SAUCE: apparmor: fix memory leak when duplicate profile load

linux (4.15.0-16.17) bionic; urgency=medium

  * linux: 4.15.0-16.17 -proposed tracker (LP: #1763785)

  * [18.04] [bug] CFL-S(CNP)/CNL GPIO testing failed (LP: #1757346)
    - [Config]: Set CONFIG_PINCTRL_CANNONLAKE=y

  * [Ubuntu 18.04] USB Type-C test failed on GLK (LP: #1758797)
    - SAUCE: usb: typec: ucsi: Increase command completion timeout value

  * Fix trying to "push" an already active pool VP (LP: #1763386)
    - SAUCE: powerpc/xive: Fix trying to "push" an already active pool VP

  * hisi_sas: Revert and replace SAUCE patches w/ upstream (LP: #1762824)
    - Revert "UBUNTU: SAUCE: scsi: hisi_sas: export device table of v3 hw to
      userspace"
    - Revert "UBUNTU: SAUCE: scsi: hisi_sas: config for hip08 ES"
    - scsi: hisi_sas: modify some register config for hip08
    - scsi: hisi_sas: add v3 hw MODULE_DEVICE_TABLE()

  * Realtek card reader - RTS5243 [VEN_10EC&DEV_5260] (LP: #1737673)
    - misc: rtsx: Move Realtek Card Reader Driver to misc
    - updateconfigs for Realtek Card Reader Driver
    - misc: rtsx: Add support for RTS5260
    - misc: rtsx: Fix symbol clashes

  * Mellanox [mlx5] [bionic] UBSAN: Undefined behaviour in
    ./include/linux/net_dim.h (LP: #1...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Nikolas Britton (nbritton) wrote :

I just did a fresh install of the April 24 daily build with kernel 4.15.0-19 on my Dell R820 and I can confirm this has been fixed. Thank you!
