Hang on network interface removal in Xen virtual machine

Bug #1771620 reported by aegiap
This bug affects 1 person
Affects          Status         Importance   Assigned to        Milestone
linux (Ubuntu)   Fix Released   Medium       Joseph Salisbury
Bionic           Fix Released   Medium       Joseph Salisbury

Bug Description

== SRU Justification ==
Upstream commit 5b5971df3bc2 introduced a regression in v4.15-rc2. This
regression causes a hang on network interface removal in a Xen virtual machine.

This regression is fixed by commit c2d2e6738a209 in v4.16-rc4.

== Fix ==
c2d2e6738a20 ("xen-netfront: Fix hang on device removal")

== Regression Potential ==
Low. The patch fixes an existing regression and is specific to Xen.

== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

On a hosting platform running the Xen hypervisor, in a virtual machine running Ubuntu 18.04 with the default Ubuntu kernel, I tried to detach a virtual network interface. On the Xen side, the virtual interface is removed from the VM, but the kernel still has the interface. A couple of minutes later, the kernel log shows this trace:

INFO: task xenwatch:108 blocked for more than 120 seconds.
      Tainted: G W 4.15.0-20-generic #21-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
xenwatch D 0 108 2 0x80000000
Call Trace:
 __schedule+0x297/0x8b0
 schedule+0x2c/0x80
 xennet_remove+0xda/0x1c0
 ? wait_woken+0x80/0x80
 xenbus_dev_remove+0x54/0xa0
 device_release_driver_internal+0x15b/0x220
 device_release_driver+0x12/0x20
 bus_remove_device+0xec/0x160
 ? xenbus_otherend_changed+0x110/0x110
 device_del+0x13d/0x360
 ? xenbus_otherend_changed+0x110/0x110
 ? xenbus_otherend_changed+0x110/0x110
 device_unregister+0x1a/0x60
 xenbus_dev_changed+0xa3/0x1e0
 ? xenwatch_thread+0xcc/0x160
 frontend_changed+0x21/0x50
 xenwatch_thread+0xc4/0x160
 ? wait_woken+0x80/0x80
 kthread+0x121/0x140
 ? find_watch+0x40/0x40
 ? kthread_create_worker_on_cpu+0x70/0x70
 ret_from_fork+0x35/0x40
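
To check whether a given kernel log contains this hang, the trace above can be scanned mechanically. The following is a minimal sketch, not part of the original report: the function names `find_hung_tasks` and `is_xennet_remove_hang` are hypothetical, and it assumes the log uses the same hung-task message format and frame layout as the trace quoted here (with or without dmesg timestamps).

```python
import re

# Header line printed by the kernel's hung-task detector, e.g.
# "INFO: task xenwatch:108 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# A call-trace frame such as " xennet_remove+0xda/0x1c0".
# An optional "[  123.456] " timestamp and an optional "? " marker
# (an unreliable frame) may precede the symbol name.
FRAME_RE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?\s*(?P<unreliable>\? )?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def find_hung_tasks(log_text):
    """Return one dict per hung-task report found in log_text.

    Each dict holds the task name, its pid, the timeout in seconds and
    the list of reliable call-trace frames (those without a "? " marker).
    """
    reports = []
    current = None
    for line in log_text.splitlines():
        header = HUNG_TASK_RE.search(line)
        if header:
            current = {
                "comm": header["comm"],
                "pid": int(header["pid"]),
                "seconds": int(header["secs"]),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None and not frame["unreliable"]:
            current["frames"].append(frame["func"])
    return reports


def is_xennet_remove_hang(report):
    """True if the report matches the signature seen in this bug:
    the xenwatch thread blocked inside xennet_remove."""
    return report["comm"] == "xenwatch" and "xennet_remove" in report["frames"]
```

Feeding the output of `dmesg` (or the contents of the kernel log file) to `find_hung_tasks` and filtering with `is_xennet_remove_hang` gives a quick yes/no answer; matching on the symbol name rather than on offsets keeps the check independent of the exact kernel build.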

In the Linux git repository, commit c2d2e6738a209f0f9dffa2dc8e7292fc45360d61 (xen-netfront: Fix hang on device removal) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c2d2e6738a209f0f9dffa2dc8e7292fc45360d61 seems to be related to this situation.

I rebuilt the Ubuntu kernel from the package source with this patch applied. Once the VM had booted with the new kernel, I was able to remove the network interface without the kernel hanging.

I also booted the VM with the Ubuntu kernel 4.13.0-42-generic and was able to remove the network interface successfully.

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-20-generic 4.15.0-20.21
ProcVersionSignature: Ubuntu 4.15.0-20.21-generic 4.15.17
Uname: Linux 4.15.0-20-generic x86_64
ApportVersion: 2.20.9-0ubuntu7
Architecture: amd64
Date: Wed May 16 16:36:06 2018
ProcEnviron:
 TERM=rxvt-unicode
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed
UpgradeStatus: No upgrade log present (probably fresh install)

aegiap (nicolas-chipaux) wrote :
affects: linux-signed (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
status: New → Triaged
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
status: Triaged → In Progress
Changed in linux (Ubuntu):
status: Triaged → In Progress
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commit c2d2e6738a209f0f9dffa2dc8e7292fc45360d61. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1771620

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is older than 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-image-unsigned, linux-modules and linux-modules-extra .deb packages.
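
The package rule in the note above can be written down as a small helper. This is only an illustrative sketch: the function name `required_test_kernel_packages` is hypothetical, and it assumes the version string starts with `major.minor` (as in `4.15.0-20-generic`).

```python
def required_test_kernel_packages(kernel_version):
    """Return the .deb package name prefixes to install for a test kernel.

    kernel_version is a string such as "4.15.0-20-generic"; only the
    leading major and minor numbers are inspected.
    """
    major, minor = (int(part) for part in kernel_version.split(".")[:2])
    if (major, minor) < (4, 15):
        # Kernels older than 4.15 (Bionic)
        return ["linux-image", "linux-image-extra"]
    # 4.15 (Bionic) or newer
    return ["linux-image-unsigned", "linux-modules", "linux-modules-extra"]
```

For the kernels mentioned in this report, `4.13.0-42-generic` falls in the first group and `4.15.0-20-generic` in the second.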

Thanks in advance!

aegiap (nicolas-chipaux) wrote :

I booted the VM with your kernel from http://kernel.ubuntu.com/~jsalisbury/lp1771620 after installing the linux-image-unsigned, linux-modules and linux-modules-extra packages.

I could remove the network interface in the VM successfully, with no hung tasks. Thank you.

Joseph Salisbury (jsalisbury) wrote :
description: updated
Stefan Bader (smb)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
aegiap (nicolas-chipaux) wrote :

I can confirm that kernel 4.15.0-23-generic from bionic-proposed/main solves the network interface detach issue on the Xen platform.

description: updated
tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-23.25

---------------
linux (4.15.0-23.25) bionic; urgency=medium

  * linux: 4.15.0-23.25 -proposed tracker (LP: #1772927)

  * arm64 SDEI support needs trampoline code for KPTI (LP: #1768630)
    - arm64: mmu: add the entry trampolines start/end section markers into
      sections.h
    - arm64: sdei: Add trampoline code for remapping the kernel

  * Some PCIe errors not surfaced through rasdaemon (LP: #1769730)
    - ACPI: APEI: handle PCIe AER errors in separate function
    - ACPI: APEI: call into AER handling regardless of severity

  * qla2xxx: Fix page fault at kmem_cache_alloc_node() (LP: #1770003)
    - scsi: qla2xxx: Fix session cleanup for N2N
    - scsi: qla2xxx: Remove unused argument from qlt_schedule_sess_for_deletion()
    - scsi: qla2xxx: Serialize session deletion by using work_lock
    - scsi: qla2xxx: Serialize session free in qlt_free_session_done
    - scsi: qla2xxx: Don't call dma_free_coherent with IRQ disabled.
    - scsi: qla2xxx: Fix warning in qla2x00_async_iocb_timeout()
    - scsi: qla2xxx: Prevent relogin trigger from sending too many commands
    - scsi: qla2xxx: Fix double free bug after firmware timeout
    - scsi: qla2xxx: Fixup locking for session deletion

  * Several hisi_sas bug fixes (LP: #1768974)
    - scsi: hisi_sas: dt-bindings: add an property of signal attenuation
    - scsi: hisi_sas: support the property of signal attenuation for v2 hw
    - scsi: hisi_sas: fix the issue of link rate inconsistency
    - scsi: hisi_sas: fix the issue of setting linkrate register
    - scsi: hisi_sas: increase timer expire of internal abort task
    - scsi: hisi_sas: remove unused variable hisi_sas_devices.running_req
    - scsi: hisi_sas: fix return value of hisi_sas_task_prep()
    - scsi: hisi_sas: Code cleanup and minor bug fixes

  * [bionic] machine stuck and bonding not working well when nvmet_rdma module
    is loaded (LP: #1764982)
    - nvmet-rdma: Don't flush system_wq by default during remove_one
    - nvme-rdma: Don't flush delete_wq by default during remove_one

  * Warnings/hang during error handling of SATA disks on SAS controller
    (LP: #1768971)
    - scsi: libsas: defer ata device eh commands to libata

  * Hotplugging a SATA disk into a SAS controller may cause crash (LP: #1768948)
    - ata: do not schedule hot plug if it is a sas host

  * ISST-LTE:pKVM:Ubuntu1804: rcu_sched self-detected stall on CPU follow by CPU
    ATTEMPT TO RE-ENTER FIRMWARE! (LP: #1767927)
    - powerpc/powernv: Handle unknown OPAL errors in opal_nvram_write()
    - powerpc/64s: return more carefully from sreset NMI
    - powerpc/64s: sreset panic if there is no debugger or crash dump handlers

  * fsnotify: Fix fsnotify_mark_connector race (LP: #1765564)
    - fsnotify: Fix fsnotify_mark_connector race

  * Hang on network interface removal in Xen virtual machine (LP: #1771620)
    - xen-netfront: Fix hang on device removal

  * HiSilicon HNS NIC names are truncated in /proc/interrupts (LP: #1765977)
    - net: hns: Avoid action name truncation

  * Ubuntu 18.04 kernel crashed while in degraded mode (LP: #1770849)
    - SAUCE: powerpc/perf: Fix memory allocation for...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: In Progress → Fix Released