Kernel Oops - general protection fault: 0000 [#1] SMP PTI after disconnecting thunderbolt docking station

Bug #1864754 reported by Piotr Morgwai Kotarbiński on 2020-02-25
This bug affects 1 person
Affects                    Importance  Assigned to
linux (Ubuntu)             Medium      Andrea Righi
  Eoan                     Medium      Andrea Righi
  Focal                    Medium      Andrea Righi
linux-signed-hwe (Ubuntu)  Undecided   Unassigned

Bug Description

[Impact]

Disconnecting a thunderbolt docking station on a Dell Inc. XPS 13 9360/0D4J15 can cause a general protection fault (with kernel 5.3.0-40 and above).

The bug was introduced by this upstream commit:

  ffe3bcaf02c4 ptp: fix the race between the release of ptp_clock and cdev

Reverting the commit is not a viable option, because we would re-introduce another bug.

The proper fix is to do something similar to this:

  75718584cb3c64e6269109d4d54f888ac5a5fd15 "ptp: free ptp device pin descriptors properly"

and call pps_unregister_source() in ptp_clock_release(). NOTE: this bug is also present upstream.

[Test case]

The bug reporter provided the test case: physically disconnect the docking station. The problem is easily reproduced, and it no longer seems to occur with the fix applied.

[Fix]

Call pps_unregister_source() from ptp_clock_release() instead of ptp_clock_unregister().
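
The ordering change can be illustrated with a small user-space toy model. This is only a sketch, not kernel code: the `Clock` class, its methods and `simulate()` are hypothetical names, and the model assumes that the clock object is reference counted and that its release callback runs only when the last reference is dropped. It shows when the PPS teardown happens relative to a user that still holds the device open.

```python
class Clock:
    """Toy stand-in for a refcounted clock device (hypothetical, not kernel code)."""

    def __init__(self, teardown_in_release):
        self.refs = 1                # reference held by the driver
        self.pps_registered = True   # stand-in for the PPS source
        self.teardown_in_release = teardown_in_release
        self.events = []             # records the order of operations

    def open(self):
        # A user opens the device and takes a reference.
        self.refs += 1

    def close(self):
        # The user closes the device and drops its reference.
        self._put()

    def unregister(self):
        # Models the driver unregistering the clock (e.g. device unplugged).
        self.events.append("unregister")
        if not self.teardown_in_release:
            self._drop_pps()         # old ordering: tear down immediately
        self._put()                  # drop the driver's reference

    def _put(self):
        self.refs -= 1
        if self.refs == 0:
            self._release()

    def _release(self):
        # Runs only once the last reference is gone.
        self.events.append("release")
        if self.teardown_in_release:
            self._drop_pps()         # new ordering: tear down at release time

    def _drop_pps(self):
        self.pps_registered = False
        self.events.append("pps_unregister")


def simulate(teardown_in_release):
    clock = Clock(teardown_in_release)
    clock.open()                     # a user still holds the device open
    clock.unregister()               # the device goes away
    alive_while_open = clock.pps_registered
    clock.close()                    # the last reference is dropped
    return alive_while_open, clock.events
```

In this model, `simulate(False)` tears down the PPS stand-in while the device is still held open, whereas `simulate(True)` keeps it until the final `close()`. A consequence of the second ordering, at least in the model, is that the teardown never runs if some reference is never dropped.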

[Regression potential]

Minimal: the change is limited to the ptp clock unregistering code path.

[Original report]

Happens each time on kernel 5.3.0-40 when I disconnect the thunderbolt docking station.
Does *not* happen on 5.3.0-28.
After it happens, almost everything still works OK: the only malfunction I've noticed is that after reconnecting the dock, USB devices (mouse, keyboard) connected via the dock don't work at all. However, the monitor connected via the dock still works fine.

Hardware essentials:
Dell Inc. XPS 13 9360/0D4J15, BIOS 2.13.0
Plugable TBT3-UDV Docking Station

Does not happen when disconnecting "simpler" devices like Apple's USB-C to HDMI/USB/USB-C adapter.

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-5.3.0-40-generic 5.3.0-40.32~18.04.1
ProcVersionSignature: Ubuntu 5.3.0-40.32~18.04.1-generic 5.3.18
Uname: Linux 5.3.0-40-generic x86_64
ApportVersion: 2.20.9-0ubuntu7.11
Architecture: amd64
Date: Wed Feb 26 05:23:00 2020
InstallationDate: Installed on 2020-01-01 (55 days ago)
InstallationMedia: Ubuntu-MATE 18.04.3 LTS "Bionic Beaver" - Release amd64 (20190805)
ProcEnviron:
 LANGUAGE=en_IE:en
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_IE.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed-hwe
UpgradeStatus: No upgrade log present (probably fresh install)

Andrea Righi (arighi) wrote :

I'm wondering if this commit introduced this issue:

  ffe3bcaf02c4 ptp: fix the race between the release of ptp_clock and cdev

Can you do a test with the following kernel and check if it fixes the problem?

  https://kernel.ubuntu.com/~arighi/lp1864754/

Thanks!

Hi!
Thank you very much for looking into this and preparing the kernel for me: much appreciated!

I've just tried it and upon disconnecting the dock I still get an Oops, but a completely different one than before. Also, after reconnecting the dock, the USB hub in it now works OK.
So my guess is that your changes (reverting the commit you mentioned, I presume) fixed my previous problem, but kernel 5.3.0-42 independently introduced another one.
The other option is that both problems were already present in the 5.3.0-40 kernel, but the first one somehow suppressed the one I've just noticed in "your" kernel. I'm not sure whether one Oops can suppress another like that, though: I'm not a kernel expert in any way.

Please kindly have a look at the attached dmesg.log and advise me if I should file another bug against 5.3.0-42 kernel.

Andrea Righi (arighi) wrote :

It looks like we also need this fix:

  94bc1e522b32c866d85b5af0ede55026b585ae73 igb/igc: Don't warn on fatal read failures when the device is removed

I've prepared another test kernel (5.3.0-42.34~18.04.1+lp1864754v2), you can find it at the same location:

  https://kernel.ubuntu.com/~arighi/lp1864754/

Can you try this one and see if it also fixes the new kernel oops? Thanks!

Changed in linux-signed-hwe (Ubuntu):
assignee: nobody → Andrea Righi (arighi)
importance: Undecided → Medium

Hi,
Now everything seems to work ok :) I've connected and disconnected the dock multiple times with various stuff connected to it (usb-3, usb-c, ethernet) and no Oops occurred.

Many thanks for your help with this! :)

Have you just reverted the problematic commits, or have you managed to fix them somehow? I'm asking because I'd like to know whether your changes can make it into the next official release ;)

Thanks again!

Andrea Righi (arighi) wrote :

Great! Thanks for testing it!

About the changes I made: I reverted the ptp fix and applied the missing igb/igc fix. Unfortunately, reverting the ptp fix would re-introduce another bug, so I don't think it's a viable option. However, now we know where the problem is.

I think I have an idea how to fix both bugs without reverting the ptp commit. I'll upload a new test kernel soon.

Andrea Righi (arighi) wrote :

I've uploaded a new kernel (5.3.0-42.34~18.04.1+lp1864754v3) here:

  https://kernel.ubuntu.com/~arighi/lp1864754/

It may not fix the initial bug, so make sure you don't have anything important running if you repeat the test! :)

Thanks.

Congratulations: everything seems to work fine with this v3 kernel too! :) I've again attached all kinds of stuff that I could find to the dock and connected and disconnected it 2 times: no Oops and all attached stuff seems to work OK :)

Many thanks for fixing it! Looking forward to the next official release :)

Andrea Righi (arighi) on 2020-03-04
description: updated
Changed in linux-signed-hwe (Ubuntu Bionic):
importance: Undecided → High
assignee: nobody → Andrea Righi (arighi)
Changed in linux-signed-hwe (Ubuntu Focal):
importance: Medium → High
Andrea Righi (arighi) wrote :

Patch set sent to the kernel-team mailing list:

https://lists.ubuntu.com/archives/kernel-team/2020-March/108018.html

Andrea Righi (arighi) wrote :

Hi Piotr, it looks like my fix is introducing a resource leak to prevent the kernel oops, so it's not an ideal fix (see this thread for more details https://lkml.org/lkml/2020/3/4/805).

I have another fix that would be interesting to test, so I've uploaded a new kernel here (5.3.0-42.34~18.04.1+lp1864754v4):

  https://kernel.ubuntu.com/~arighi/lp1864754/

If you have time, it would be great if you could test this one as well. Thanks again!

huh, so it's getting more complicated... thanks to Vladis for catching the leak!

I've just tested the v4 kernel and it also works fine :) Same test as always: connected various stuff to the dock, then disconnected and reconnected the dock: no Oops and all connected devices work OK :)
Hope it's a proper solution now, but I guess some more feedback from LKML would be useful...

Many many thanks for your help!

Hi Andrea,
Does the recently released kernel 5.3.0.42.99 contain your fix yet?
(Unfortunately the changelog is not very helpful; it only says:

Version 5.3.0.42.99:
  * Bump ABI 5.3.0-42
Version 5.3.0.41.98:
  * Bump ABI 5.3.0-41
  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log

)

It seems the fix was not included (at least not fully), as I'm still experiencing an Oops when disconnecting (although everything seems to work fine after it regardless). Please find the dmesg output attached.

Andrea Righi (arighi) on 2020-03-27
description: updated
Andrea Righi (arighi) wrote :

Posted a new patch for focal:

https://lists.ubuntu.com/archives/kernel-team/2020-March/108562.html

Considering that we are going to rebase bionic:linux-hwe-edge to the focal kernel (5.4), this fix should naturally land in that kernel as well.

Stefan Bader (smb) on 2020-04-03
Changed in linux-signed-hwe (Ubuntu Focal):
status: New → Fix Committed

I've just tried today's (20200404) daily build of Ubuntu 20.04 with kernel 5.4.0-21.25: no Oops there after disconnecting the dock and everything seems to be working fine.

@Stefan is this the kernel to which Andrea's fix was already applied? uname says it was built on 20200328, a few days before your message about committing the fix, but maybe your message was delayed?

Thanks!

Per Andrea's request, I've tried the kernel 5.3.0-46 that he provided here: https://kernel.ubuntu.com/~arighi/LP-1864754/
An Oops does occur on this kernel after disconnecting my dock. Please find the dmesg output attached.

I've just tried kernel 5.3.0-46 prepared by Andrea with an additional commit added:

94bc1e522b32c ("igb/igc: Don't warn on fatal read failures when the device is removed")

no Oops there after disconnecting the dock :)

Andrea Righi (arighi) on 2020-04-14
Changed in linux (Ubuntu Focal):
status: New → Fix Committed
no longer affects: linux-signed-hwe (Ubuntu)
Changed in linux (Ubuntu Eoan):
status: New → Fix Committed
no longer affects: linux (Ubuntu Bionic)
no longer affects: linux-signed-hwe (Ubuntu Bionic)
no longer affects: linux-signed-hwe (Ubuntu Eoan)
no longer affects: linux-signed-hwe (Ubuntu Focal)
Changed in linux (Ubuntu Eoan):
importance: Undecided → Medium
Changed in linux (Ubuntu Focal):
importance: Undecided → Medium
Changed in linux (Ubuntu Eoan):
assignee: nobody → Andrea Righi (arighi)
Changed in linux (Ubuntu Focal):
assignee: nobody → Andrea Righi (arighi)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-24.28

---------------
linux (5.4.0-24.28) focal; urgency=medium

  * focal/linux: 5.4.0-24.28 -proposed tracker (LP: #1871939)

  * getitimer returns it_value=0 erroneously (LP: #1349028)
    - [Config] CONTEXT_TRACKING_FORCE policy should be unset

  * 12d1:1038 Dual-Role OTG device on non-HNP port - unable to enumerate USB
    device on port 1 (LP: #1047527)
    - [Config] USB_OTG_FSM policy not needed

  * Add DCPD backlight support for HP CML system (LP: #1871589)
    - SAUCE: drm/i915: Force DPCD backlight mode for HP CML 2020 system

  * Backlight brightness cannot be adjusted using keys (LP: #1860303)
    - SAUCE drm/i915: Force DPCD backlight mode for HP Spectre x360 Convertible
      13t-aw100

  * CVE-2020-11494
    - slcan: Don't transmit uninitialized stack data in padding

  * Ubuntu Kernel Support for OpenPOWER NV Secure & Trusted Boot (LP: #1866909)
    - powerpc: Detect the secure boot mode of the system
    - powerpc/ima: Add support to initialize ima policy rules
    - powerpc: Detect the trusted boot state of the system
    - powerpc/ima: Define trusted boot policy
    - ima: Make process_buffer_measurement() generic
    - certs: Add wrapper function to check blacklisted binary hash
    - ima: Check against blacklisted hashes for files with modsig
    - powerpc/ima: Update ima arch policy to check for blacklist
    - powerpc/ima: Indicate kernel modules appended signatures are enforced
    - powerpc/powernv: Add OPAL API interface to access secure variable
    - powerpc: expose secure variables to userspace via sysfs
    - x86/efi: move common keyring handler functions to new file
    - powerpc: Load firmware trusted keys/hashes into kernel keyring
    - x86/efi: remove unused variables

  * [roce-0227]sync mainline kernel 5.6rc3 roce patchset into ubuntu HWE kernel
    branch (LP: #1864950)
    - RDMA/hns: Cleanups of magic numbers
    - RDMA/hns: Optimize eqe buffer allocation flow
    - RDMA/hns: Add the workqueue framework for flush cqe handler
    - RDMA/hns: Delayed flush cqe process with workqueue
    - RDMA/hns: fix spelling mistake: "attatch" -> "attach"
    - RDMA/hns: Initialize all fields of doorbells to zero
    - RDMA/hns: Treat revision HIP08_A as a special case
    - RDMA/hns: Use flush framework for the case in aeq
    - RDMA/hns: Stop doorbell update while qp state error
    - RDMA/hns: Optimize qp destroy flow
    - RDMA/hns: Optimize qp context create and destroy flow
    - RDMA/hns: Optimize qp number assign flow
    - RDMA/hns: Optimize qp buffer allocation flow
    - RDMA/hns: Optimize qp param setup flow
    - RDMA/hns: Optimize kernel qp wrid allocation flow
    - RDMA/hns: Optimize qp doorbell allocation flow
    - RDMA/hns: Check if depth of qp is 0 before configure

  * [hns3-0316]sync mainline kernel 5.6rc4 hns3 patchset into ubuntu HWE kernel
    branch (LP: #1867586)
    - net: hns3: modify an unsuitable print when setting unknown duplex to fibre
    - net: hns3: add enabled TC numbers and DWRR weight info in debugfs
    - net: hns3: add support for dump MAC ID and loopback status in debugfs
    - net: hns3: add missing help info for QS shaper...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-eoan' to 'verification-done-eoan'. If the problem still exists, change the tag 'verification-needed-eoan' to 'verification-failed-eoan'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-eoan

tested on kernel 5.3.0-52: everything works fine :) no Oops, and after reconnecting the dock, all connected devices work ok.

tags: added: verification-done-eoan
removed: verification-needed-eoan
Changed in linux-signed-hwe (Ubuntu):
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.3.0-53.47

---------------
linux (5.3.0-53.47) eoan; urgency=medium

  * eoan/linux: 5.3.0-53.47 -proposed tracker (LP: #1877257)

  * Intermittent display blackouts on event (LP: #1875254)
    - drm/i915: Limit audio CDCLK>=2*BCLK constraint back to GLK only

  * Unable to handle kernel pointer dereference in virtual kernel address space
    on Eoan (LP: #1876645)
    - SAUCE: overlayfs: fix shitfs special-casing

linux (5.3.0-52.46) eoan; urgency=medium

  * eoan/linux: 5.3.0-52.46 -proposed tracker (LP: #1874752)

  * alsa: make the dmic detection align to the mainline kernel-5.6
    (LP: #1871284)
    - ALSA: hda: add Intel DSP configuration / probe code
    - ALSA: hda: fix intel DSP config
    - ALSA: hda: Allow non-Intel device probe gracefully
    - ALSA: hda: More constifications
    - ALSA: hda: Rename back to dmic_detect option
    - [Config] SND_INTEL_DSP_CONFIG=m
    - [packaging] Remove snd-intel-nhlt from modules

  * built-using constraints preventing uploads (LP: #1875601)
    - temporarily drop Built-Using data

  * ubuntu/focal64 fails to mount Vagrant shared folders (LP: #1873506)
    - [Packaging] Move virtualbox modules to linux-modules
    - [Packaging] Remove vbox and zfs modules from generic.inclusion-list

  * linux-image-5.0.0-35-generic breaks checkpointing of container
    (LP: #1857257)
    - SAUCE: overlayfs: use shiftfs hacks only with shiftfs as underlay

  * shiftfs: broken shiftfs nesting (LP: #1872094)
    - SAUCE: shiftfs: record correct creator credentials

  * Add debian/rules targets to compile/run kernel selftests (LP: #1874286)
    - [Packaging] add support to compile/run selftests

  * shiftfs: O_TMPFILE reports ESTALE (LP: #1872757)
    - SAUCE: shiftfs: fix dentry revalidation

  * getitimer returns it_value=0 erroneously (LP: #1349028)
    - [Config] CONTEXT_TRACKING_FORCE policy should be unset

  * 5.3.0-46-generic - i915 - frequent GPU hangs / resets rcs0 (LP: #1872001)
    - drm/i915/execlists: Preempt-to-busy
    - drm/i915/gt: Detect if we miss WaIdleLiteRestore
    - drm/i915/execlists: Always force a context reload when rewinding RING_TAIL

  * alsa/sof: external mic can't be deteced on Lenovo and HP laptops
    (LP: #1872569)
    - SAUCE: ASoC: intel/skl/hda - set autosuspend timeout for hda codecs

  * Eoan update: upstream stable patchset 2020-04-22 (LP: #1874325)
    - ARM: dts: sun8i-a83t-tbs-a711: HM5065 doesn't like such a high voltage
    - bus: sunxi-rsb: Return correct data when mixing 16-bit and 8-bit reads
    - net: vxge: fix wrong __VA_ARGS__ usage
    - hinic: fix a bug of waitting for IO stopped
    - hinic: fix wrong para of wait_for_completion_timeout
    - cxgb4/ptp: pass the sign of offset delta in FW CMD
    - qlcnic: Fix bad kzalloc null test
    - i2c: st: fix missing struct parameter description
    - cpufreq: imx6q: Fixes unwanted cpu overclocking on i.MX6ULL
    - media: venus: hfi_parser: Ignore HEVC encoding for V1
    - firmware: arm_sdei: fix double-lock on hibernate with shared events
    - null_blk: Fix the null_add_dev() error path
    - null_blk: Handle null_add_dev() failures properly
    - null_blk: fix spuri...

Changed in linux (Ubuntu Eoan):
status: Fix Committed → Fix Released