Hang on network interface removal in Xen virtual machine

Bug #1771620 reported by aegiap on 2018-05-16
This bug affects 1 person
Affects                 Status        Importance  Assigned to
linux (Ubuntu)          Fix Released  Medium      Joseph Salisbury
linux (Ubuntu Bionic)   Fix Released  Medium      Joseph Salisbury

Bug Description

== SRU Justification ==
Upstream commit 5b5971df3bc2 introduced a regression in v4.15-rc2. This
regression causes a hang on network interface removal in a Xen virtual machine.

This regression is fixed by commit c2d2e6738a209 in v4.16-rc4.

== Fix ==
c2d2e6738a20 ("xen-netfront: Fix hang on device removal")

== Regression Potential ==
Low. The patch fixes an existing regression and is specific to Xen.

== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

On a hosting platform running the Xen hypervisor, in a virtual machine running Ubuntu 18.04 with the default Ubuntu kernel, I tried to detach a virtual network interface. On the Xen side, the virtual interface is removed from the VM, but the guest kernel still has the interface. A couple of minutes later, the kernel log shows this trace:

INFO: task xenwatch:108 blocked for more than 120 seconds.
      Tainted: G W 4.15.0-20-generic #21-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
xenwatch D 0 108 2 0x80000000
Call Trace:
 __schedule+0x297/0x8b0
 schedule+0x2c/0x80
 xennet_remove+0xda/0x1c0
 ? wait_woken+0x80/0x80
 xenbus_dev_remove+0x54/0xa0
 device_release_driver_internal+0x15b/0x220
 device_release_driver+0x12/0x20
 bus_remove_device+0xec/0x160
 ? xenbus_otherend_changed+0x110/0x110
 device_del+0x13d/0x360
 ? xenbus_otherend_changed+0x110/0x110
 ? xenbus_otherend_changed+0x110/0x110
 device_unregister+0x1a/0x60
 xenbus_dev_changed+0xa3/0x1e0
 ? xenwatch_thread+0xcc/0x160
 frontend_changed+0x21/0x50
 xenwatch_thread+0xc4/0x160
 ? wait_woken+0x80/0x80
 kthread+0x121/0x140
 ? find_watch+0x40/0x40
 ? kthread_create_worker_on_cpu+0x70/0x70
 ret_from_fork+0x35/0x40
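
For anyone triaging similar reports, here is a minimal Python sketch that pulls the blocked task and the call chain out of a hung-task message like the one above. The function name `parse_hung_task` and the `SAMPLE` excerpt are illustrative assumptions, not part of any kernel tooling. Frames prefixed with `?` are skipped because the kernel marks them as unreliable stack entries.

```python
import re

# Header line printed by the kernel's hung-task detector.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)
# One call-trace frame, e.g. " ? wait_woken+0x80/0x80".
FRAME_RE = re.compile(
    r"^\s*(?P<unreliable>\? )?(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)

# Shortened excerpt of the trace in this report (illustrative input).
SAMPLE = """
INFO: task xenwatch:108 blocked for more than 120 seconds.
Call Trace:
 __schedule+0x297/0x8b0
 schedule+0x2c/0x80
 xennet_remove+0xda/0x1c0
 ? wait_woken+0x80/0x80
 xenbus_dev_remove+0x54/0xa0
"""


def parse_hung_task(log_text):
    """Return task name, pid, timeout and reliable frames, or None."""
    header = HUNG_TASK_RE.search(log_text)
    if header is None:
        return None
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if line.strip() == "Call Trace:":
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.match(line)
        if match is None:
            break  # first non-frame line ends the trace
        if match.group("unreliable") is None:
            frames.append(match.group("symbol"))
    return {
        "task": header.group("name"),
        "pid": int(header.group("pid")),
        "timeout_secs": int(header.group("secs")),
        "frames": frames,
    }


report = parse_hung_task(SAMPLE)
```

With the excerpt above, `report["frames"]` lists `xennet_remove` right after the scheduler entries, which points at the xen-netfront removal path as the place where the xenwatch thread is waiting.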

In the git repository of Linux, the commit c2d2e6738a209f0f9dffa2dc8e7292fc45360d61 (xen-netfront: Fix hang on device removal) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c2d2e6738a209f0f9dffa2dc8e7292fc45360d61 seems to be related to this situation.

I rebuilt the Ubuntu kernel from the package source with this patch applied. Once the VM had booted with the new kernel, I was able to remove the network interface without the kernel hanging.

I also booted the VM with the Ubuntu kernel 4.13.0-42-generic and was able to remove the network interface successfully.

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-20-generic 4.15.0-20.21
ProcVersionSignature: Ubuntu 4.15.0-20.21-generic 4.15.17
Uname: Linux 4.15.0-20-generic x86_64
ApportVersion: 2.20.9-0ubuntu7
Architecture: amd64
Date: Wed May 16 16:36:06 2018
ProcEnviron:
 TERM=rxvt-unicode
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed
UpgradeStatus: No upgrade log present (probably fresh install)

aegiap (nicolas-chipaux) wrote :
affects: linux-signed (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
status: New → Triaged
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
status: Triaged → In Progress
Changed in linux (Ubuntu):
status: Triaged → In Progress
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commit c2d2e6738a209f0f9dffa2dc8e7292fc45360d61. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1771620

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-image-unsigned, linux-modules and linux-modules-extra .deb packages.
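
The two rules above can be sketched as a small Python helper. This is only an illustration: the function name `packages_for_test_kernel` is an assumption, and it returns the package name prefixes from the note rather than exact .deb file names.

```python
def packages_for_test_kernel(version):
    """Return the package name prefixes to install for a test kernel.

    `version` is a kernel version string such as "4.15.0-20-generic";
    only the leading major.minor part is inspected.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) < (4, 15):
        # Kernels prior to 4.15 (Bionic)
        return ["linux-image", "linux-image-extra"]
    # 4.15 (Bionic) or newer
    return ["linux-image-unsigned", "linux-modules", "linux-modules-extra"]


bionic_packages = packages_for_test_kernel("4.15.0-20-generic")
older_packages = packages_for_test_kernel("4.13.0-42-generic")
```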

Thanks in advance!

aegiap (nicolas-chipaux) wrote :

I booted the VM with your kernel from http://kernel.ubuntu.com/~jsalisbury/lp1771620 by installing the linux-image-unsigned, linux-modules and linux-modules-extra packages.

I could remove the network interface in the VM successfully, with no task hangs. Thank you.

Joseph Salisbury (jsalisbury) wrote :
description: updated
Stefan Bader (smb) on 2018-05-23
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
aegiap (nicolas-chipaux) wrote :

I can confirm that the kernel 4.15.0-23-generic from bionic-proposed/main solves the network interface detach issue on the Xen platform.
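
For reference, a rough Python sketch of how a Bionic 4.15 kernel release string could be compared against 4.15.0-23, the first version confirmed fixed in this report. The names `FIXED_RELEASE`, `parse_ubuntu_release` and `has_xen_netfront_fix` are hypothetical, and the check deliberately refuses other kernel series because this report only establishes the 4.15.0-20 (affected) and 4.15.0-23 (fixed) data points for Bionic.

```python
import re

# First Bionic kernel confirmed fixed in this report: 4.15.0-23.
FIXED_RELEASE = (4, 15, 0, 23)


def parse_ubuntu_release(release):
    """Turn "4.15.0-20-generic" into (4, 15, 0, 20)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(group) for group in match.groups())


def has_xen_netfront_fix(release):
    """True if a Bionic 4.15 kernel is at or past 4.15.0-23."""
    version = parse_ubuntu_release(release)
    if version[:2] != (4, 15):
        # Assumption: other series are out of scope for this sketch.
        raise ValueError("only meaningful for Bionic 4.15 kernels")
    return version >= FIXED_RELEASE


affected = has_xen_netfront_fix("4.15.0-20-generic")
fixed = has_xen_netfront_fix("4.15.0-23-generic")
```

On a running system, the release string could be taken from `uname -r` or from Python's `platform.release()`.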

description: updated
tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-23.25

---------------
linux (4.15.0-23.25) bionic; urgency=medium

  * linux: 4.15.0-23.25 -proposed tracker (LP: #1772927)

  * arm64 SDEI support needs trampoline code for KPTI (LP: #1768630)
    - arm64: mmu: add the entry trampolines start/end section markers into
      sections.h
    - arm64: sdei: Add trampoline code for remapping the kernel

  * Some PCIe errors not surfaced through rasdaemon (LP: #1769730)
    - ACPI: APEI: handle PCIe AER errors in separate function
    - ACPI: APEI: call into AER handling regardless of severity

  * qla2xxx: Fix page fault at kmem_cache_alloc_node() (LP: #1770003)
    - scsi: qla2xxx: Fix session cleanup for N2N
    - scsi: qla2xxx: Remove unused argument from qlt_schedule_sess_for_deletion()
    - scsi: qla2xxx: Serialize session deletion by using work_lock
    - scsi: qla2xxx: Serialize session free in qlt_free_session_done
    - scsi: qla2xxx: Don't call dma_free_coherent with IRQ disabled.
    - scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()
    - scsi: qla2xxx: Prevent relogin trigger from sending too many commands
    - scsi: qla2xxx: Fix double free bug after firmware timeout
    - scsi: qla2xxx: Fixup locking for session deletion

  * Several hisi_sas bug fixes (LP: #1768974)
    - scsi: hisi_sas: dt-bindings: add an property of signal attenuation
    - scsi: hisi_sas: support the property of signal attenuation for v2 hw
    - scsi: hisi_sas: fix the issue of link rate inconsistency
    - scsi: hisi_sas: fix the issue of setting linkrate register
    - scsi: hisi_sas: increase timer expire of internal abort task
    - scsi: hisi_sas: remove unused variable hisi_sas_devices.running_req
    - scsi: hisi_sas: fix return value of hisi_sas_task_prep()
    - scsi: hisi_sas: Code cleanup and minor bug fixes

  * [bionic] machine stuck and bonding not working well when nvmet_rdma module
    is loaded (LP: #1764982)
    - nvmet-rdma: Don't flush system_wq by default during remove_one
    - nvme-rdma: Don't flush delete_wq by default during remove_one

  * Warnings/hang during error handling of SATA disks on SAS controller
    (LP: #1768971)
    - scsi: libsas: defer ata device eh commands to libata

  * Hotplugging a SATA disk into a SAS controller may cause crash (LP: #1768948)
    - ata: do not schedule hot plug if it is a sas host

  * ISST-LTE:pKVM:Ubuntu1804: rcu_sched self-detected stall on CPU follow by CPU
    ATTEMPT TO RE-ENTER FIRMWARE! (LP: #1767927)
    - powerpc/powernv: Handle unknown OPAL errors in opal_nvram_write()
    - powerpc/64s: return more carefully from sreset NMI
    - powerpc/64s: sreset panic if there is no debugger or crash dump handlers

  * fsnotify: Fix fsnotify_mark_connector race (LP: #1765564)
    - fsnotify: Fix fsnotify_mark_connector race

  * Hang on network interface removal in Xen virtual machine (LP: #1771620)
    - xen-netfront: Fix hang on device removal

  * HiSilicon HNS NIC names are truncated in /proc/interrupts (LP: #1765977)
    - net: hns: Avoid action name truncation

  * Ubuntu 18.04 kernel crashed while in degraded mode (LP: #1770849)
    - SAUCE: powerpc/perf: Fix memory allocation for...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: In Progress → Fix Released