System randomly hangs during suspend when mei_wdt is loaded

Bug #1803942 reported by Robert Liu on 2018-11-19
This bug affects 1 person
Affects          Importance   Assigned to
HWE Next         Undecided    Unassigned
linux (Ubuntu)   Medium       Unassigned
Bionic           Medium       Unassigned

Bug Description

Problem description:
System randomly hangs during suspend when mei_wdt is loaded.

Platform:
Intel Dawson Canyon I5 (NUC7i5DNHE) and I7 (NUC7i7DNHE)

Test procedure:
1. Install Ubuntu 18.04 and run apt full-upgrade

2. Enable mei_wdt:
$ sudo modprobe mei_wdt

3. Run a system S3 (suspend-to-RAM) test:
$ sudo systemctl suspend
or
$ sudo rtcwake -v -m mem -s 15

4. Observe that the system occasionally hangs during suspend/resume.

Expected result:
Suspend/resume completes without affecting the system.

Actual result:
The system randomly hangs during the S3 test.

Additional info:
- BIOS version: V57
- Another i3 platform (NUC7i3DNHNC) doesn't have this issue.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1803942

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Robert Liu (robertliu) wrote :

I tried mainline kernels 4.20-rc2, 4.18, 4.17, 4.16 and 4.15. All of them passed a 50-run test.
To determine when the issue was introduced, I installed several Ubuntu kernel packages and found that the regression appeared between 4.15.0-34 and 4.15.0-36.

I finished the kernel bisect and have probably found the first bad commit.
$ git bisect log
# bad: [fd01374000c83b59d3ce234fbb970cb91404bd42] UBUNTU: Ubuntu-4.15.0-36.39
# good: [ffaad0a9f46742f6d71af975a6a061ffe6963aca] UBUNTU: Ubuntu-4.15.0-34.37
git bisect start 'Ubuntu-4.15.0-36.39' 'Ubuntu-4.15.0-34.37'
# good: [003ae88ae88d48643e71dc69c18d4eda598339d5] Revert "UBUNTU: SAUCE: net: hns3: Fix for VF mailbox receiving unknown message"
git bisect good 003ae88ae88d48643e71dc69c18d4eda598339d5
# good: [d3d4b20788eee72dcb1ed5ace7dbee6aafbe65cf] net: hns3: Fix for mac pause not disable in pfc mode
git bisect good d3d4b20788eee72dcb1ed5ace7dbee6aafbe65cf
# good: [8e499f213175b65bcc08a3c685ea6717e7668cec] arm64: ssbd: Introduce thread flag to control userspace mitigation
git bisect good 8e499f213175b65bcc08a3c685ea6717e7668cec
# good: [11cdaf61c1986ea682398b48e238fd915393b2e6] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
git bisect good 11cdaf61c1986ea682398b48e238fd915393b2e6
# good: [db6800337d38a5b2adbaa78fcb2c299362222e1d] s390: fix br_r1_trampoline for machines without exrl
git bisect good db6800337d38a5b2adbaa78fcb2c299362222e1d
# good: [f1f016ed54582502d59de991ddbecfe2373722c0] x86/speculation/l1tf: Increase l1tf memory limit for Nehalem+
git bisect good f1f016ed54582502d59de991ddbecfe2373722c0
# good: [f1f016ed54582502d59de991ddbecfe2373722c0] x86/speculation/l1tf: Increase l1tf memory limit for Nehalem+
git bisect good f1f016ed54582502d59de991ddbecfe2373722c0
# bad: [cb3b0751997c8bd45c76a7401e2edb019cdaaab3] UBUNTU: Start new release
git bisect bad cb3b0751997c8bd45c76a7401e2edb019cdaaab3
# bad: [cb3b0751997c8bd45c76a7401e2edb019cdaaab3] UBUNTU: Start new release
git bisect bad cb3b0751997c8bd45c76a7401e2edb019cdaaab3
# bad: [d5bad4136d14cda66cb7c06b10b80c64ae695c78] UBUNTU: Ubuntu-4.15.0-35.38
git bisect bad d5bad4136d14cda66cb7c06b10b80c64ae695c78
# bad: [876dcb5f4576934a1a11b091b40ce548f048340e] UBUNTU: SAUCE: vfio -- release device lock before userspace requests
git bisect bad 876dcb5f4576934a1a11b091b40ce548f048340e
# first bad commit: [876dcb5f4576934a1a11b091b40ce548f048340e] UBUNTU: SAUCE: vfio -- release device lock before userspace requests

Robert Liu (robertliu) wrote :

After commenting out the following change from commit 876dcb, I can no longer reproduce the issue.

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 2c964f5..37c0105 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -868,6 +868,13 @@ static void __device_release_driver(struct device *dev, struct device *parent)
 			dev->bus->remove(dev);
 		else if (drv->remove)
 			drv->remove(dev);
+		/*
+		 * A concurrent invocation of the same function might
+		 * have released the driver successfully while this one
+		 * was waiting, so check for that.
+		 */
+		if (dev->driver != drv)
+			return;

 		device_links_driver_cleanup(dev);
 		dma_deconfigure(dev);

Jesse Sung (wenchien) on 2018-11-19
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Jesse Sung (wenchien) on 2018-11-19
tags: added: dawson originate-from-1803076
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged
Changed in linux (Ubuntu Bionic):
status: New → Triaged
importance: Undecided → Medium
tags: added: kernel-da-key
Jesse Sung (wenchien) wrote :

This happens because 876dcb relies on dev->driver to tell whether __device_release_driver() was completed by another caller while remove() was running. However, drivers such as mei_wdt set dev->driver to NULL once their remove() completes, so the function returns early instead of continuing with the rest of __device_release_driver().
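
The early return can be illustrated with a small user-space model. This is only a sketch, not kernel code: the struct definitions are cut-down stand-ins, and the names remove_plain, remove_clearing_driver, release_driver_with_check, cleanup_skipped and the cleaned_up flag are hypothetical, introduced purely for illustration. The cleaned_up flag stands in for the steps that follow remove(), such as device_links_driver_cleanup() and dma_deconfigure().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures (assumption, not the real layout). */
struct device;

struct device_driver {
	void (*remove)(struct device *dev);
};

struct device {
	struct device_driver *driver;
	bool cleaned_up;	/* set when the post-remove() cleanup runs */
};

/* Models a driver whose remove() leaves dev->driver untouched. */
static void remove_plain(struct device *dev)
{
	(void)dev;
}

/* Models a driver whose remove() path ends with dev->driver cleared. */
static void remove_clearing_driver(struct device *dev)
{
	dev->driver = NULL;
}

/* Models __device_release_driver() with the check added by 876dcb. */
static void release_driver_with_check(struct device *dev)
{
	struct device_driver *drv = dev->driver;

	if (!drv)
		return;
	if (drv->remove)
		drv->remove(dev);

	/* Intended for concurrent releases, but it also triggers here. */
	if (dev->driver != drv)
		return;

	dev->cleaned_up = true;	/* stands in for the remaining cleanup steps */
	dev->driver = NULL;
}

/* Returns true when the post-remove() cleanup was skipped for this driver. */
static bool cleanup_skipped(struct device_driver *drv)
{
	struct device dev = { .driver = drv, .cleaned_up = false };

	assert(drv != NULL);
	release_driver_with_check(&dev);
	return !dev.cleaned_up;
}
```

In this model, cleanup_skipped() returns false for a driver using remove_plain and true for one using remove_clearing_driver, which mirrors the skipped cleanup described above.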

Changed in linux (Ubuntu Bionic):
status: Triaged → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Robert Liu (robertliu) wrote :

With the proposed kernel, my system passed a 30-run suspend/resume stress test.

1. Enable the -proposed repository
2. Install the proposed kernel and reboot
3. uname -r
  4.15.0-43-generic
4. sudo modprobe mei_wdt
5. for s in $(seq 30); do sudo rtcwake -v -m mem -s 15; sleep 15; done
6. Verify that every suspend/resume cycle succeeds

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-43.46

---------------
linux (4.15.0-43.46) bionic; urgency=medium

  * linux: 4.15.0-43.46 -proposed tracker (LP: #1806659)

  * System randomly hangs during suspend when mei_wdt is loaded (LP: #1803942)
    - SAUCE: base/dd: limit release function changes to vfio driver only

  * Workaround CSS timeout on AMD SNPS 3.0 xHC (LP: #1806838)
    - xhci: Allow more than 32 quirks
    - xhci: workaround CSS timeout on AMD SNPS 3.0 xHC

  * linux-buildinfo: pull out ABI information into its own package
    (LP: #1806380)
    - [Packaging] limit preparation to linux-libc-dev in headers
    - [Packaging] commonise debhelper invocation
    - [Packaging] ABI -- accumulate abi information at the end of the build
    - [Packaging] buildinfo -- add basic build information
    - [Packaging] buildinfo -- add firmware information to the flavour ABI
    - [Packaging] buildinfo -- add compiler information to the flavour ABI
    - [Packaging] buildinfo -- add buildinfo support to getabis
    - [Config] buildinfo -- add retpoline version markers

  * linux packages should own /usr/lib/linux/triggers (LP: #1770256)
    - [Packaging] own /usr/lib/linux/triggers

  * CVE-2018-12896
    - posix-timers: Sanitize overrun handling

  * CVE-2018-16276
    - USB: yurex: fix out-of-bounds uaccess in read handler

  * CVE-2018-10902
    - ALSA: rawmidi: Change resized buffers atomically

  * CVE-2018-18710
    - cdrom: fix improper type cast, which can leat to information leak.

  * CVE-2018-18690
    - xfs: don't fail when converting shortform attr to long form during
      ATTR_REPLACE

  * CVE-2018-14734
    - infiniband: fix a possible use-after-free bug

  * CVE-2018-18445
    - bpf: 32-bit RSH verification must truncate input before the ALU op

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts

 -- Kleber Sacilotto de Souza <email address hidden> Thu, 06 Dec 2018 13:52:12 +0000

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc