xen-netfront: potential deadlock in xennet_remove()

Bug #1888510 reported by Andrea Righi on 2020-07-22
This bug affects 1 person
Affects                   Importance   Assigned to
linux-aws (Ubuntu)        Undecided    Unassigned
  Bionic                  Undecided    Unassigned
  Focal                   High         Unassigned
linux-aws-5.3 (Ubuntu)    Undecided    Unassigned
  Bionic                  High         Unassigned
  Focal                   Undecided    Unassigned

Bug Description

[Impact]

During our AWS testing we experienced deadlocks on hibernate across all Xen instance types.
The trace showed that the system was stuck in xennet_remove():

[ 358.109087] Freezing of tasks failed after 20.006 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 358.115102] modprobe D 0 4892 4833 0x00004004
[ 358.115104] Call Trace:
[ 358.115112] __schedule+0x2a8/0x670
[ 358.115115] schedule+0x33/0xa0
[ 358.115118] xennet_remove+0x1f0/0x230 [xen_netfront]
[ 358.115121] ? wait_woken+0x80/0x80
[ 358.115124] xenbus_dev_remove+0x51/0xa0
[ 358.115126] device_release_driver_internal+0xe0/0x1b0
[ 358.115127] driver_detach+0x49/0x90
[ 358.115129] bus_remove_driver+0x59/0xd0
[ 358.115131] driver_unregister+0x2c/0x40
[ 358.115132] xenbus_unregister_driver+0x12/0x20
[ 358.115134] netif_exit+0x10/0x7aa [xen_netfront]
[ 358.115137] __x64_sys_delete_module+0x146/0x290
[ 358.115140] do_syscall_64+0x5a/0x130
[ 358.115142] entry_SYSCALL_64_after_hwframe+0x44/0xa9

This prevented hibernation from completing.

The cause of this problem is a race condition in xennet_remove(): the driver reads the current state of the bus, requests a state change to "Closing", and then waits for the state to become "Closing". However, if the state becomes "Closed" between the read and the state-change request, the wait never ends, because the state will never go from "Closed" back to "Closing".
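
The race can be illustrated with a small userspace model. This is only a sketch, not the driver code: the enum bus_state values and the helpers old_wait_done() and first_wakeup() are hypothetical names made up for this illustration, and the array of "seen" states stands in for successive reads of the bus state.

```c
#include <stddef.h>

/* Simplified stand-in for the bus states named in this report (hypothetical). */
enum bus_state { ST_CONNECTED, ST_CLOSING, ST_CLOSED };

/* Wait condition as described above: satisfied only by "Closing". */
int old_wait_done(enum bus_state s)
{
    return s == ST_CLOSING;
}

/*
 * Walk a sequence of observed states and return the index at which the
 * waiter would wake up, or -1 if the condition is never satisfied
 * (which models the hang).
 */
int first_wakeup(const enum bus_state *seen, size_t n,
                 int (*done)(enum bus_state))
{
    for (size_t i = 0; i < n; i++) {
        if (done(seen[i]))
            return (int)i;
    }
    return -1;
}
```

With the sequence Connected, Closing, Closed the waiter wakes up at the second observation. If the state is already "Closed" by the time waiting starts, every observation is "Closed" and first_wakeup() returns -1, which corresponds to the stuck modprobe task in the trace.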

[Test case]

Create any Xen-based instance in AWS and hibernate/resume it multiple times. Sometimes the system gets stuck (hung task timeout).

[Fix]

Prevent the deadlock by changing the wait condition to also check for state == Closed.
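
As a rough sketch of the idea (not the actual patch), the change to the wait condition can be modelled as two predicates. The enum bus_state and the function names wait_done_before() and wait_done_after() are assumptions made for this illustration; the real driver works with the xenbus state values.

```c
#include <stdbool.h>

/* Simplified stand-in for the bus states named in this report (hypothetical). */
enum bus_state { ST_CONNECTED, ST_CLOSING, ST_CLOSED };

/* Condition before the fix: only "Closing" ends the wait. */
bool wait_done_before(enum bus_state s)
{
    return s == ST_CLOSING;
}

/* Condition after the fix: "Closed" ends the wait as well. */
bool wait_done_after(enum bus_state s)
{
    return s == ST_CLOSING || s == ST_CLOSED;
}
```

A waiter that only ever sees "Closed" is released by wait_done_after() but not by wait_done_before(), while any other state still keeps it waiting under both conditions.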

[Regression potential]

Minimal: this change affects only Xen, and more precisely only the xen-netfront driver.

Stefan Bader (smb) on 2020-07-23
no longer affects: linux-aws (Ubuntu Eoan)
no longer affects: linux-aws-5.3 (Ubuntu Eoan)
Changed in linux-aws-5.3 (Ubuntu Focal):
status: New → Invalid
Changed in linux-aws-5.3 (Ubuntu):
status: New → Invalid
Stefan Bader (smb) on 2020-07-24
Changed in linux-aws (Ubuntu):
status: New → Triaged
status: Triaged → Invalid
Changed in linux-aws (Ubuntu Bionic):
status: New → Incomplete
Changed in linux-aws (Ubuntu Focal):
status: New → Fix Committed
importance: Undecided → High
Changed in linux-aws-5.3 (Ubuntu Bionic):
status: New → Fix Committed
importance: Undecided → High
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-aws-5.3 - 5.3.0-1032.34~18.04.2

---------------
linux-aws-5.3 (5.3.0-1032.34~18.04.2) bionic; urgency=medium

  * bionic/linux-aws-5.3: 5.3.0-1032.34~18.04.2 -proposed tracker (LP: #1888815)

  * xen-netfront: potential deadlock in xennet_remove() (LP: #1888510)
    - SAUCE: xen-netfront: fix potential deadlock in xennet_remove()

linux-aws-5.3 (5.3.0-1032.34~18.04.1) bionic; urgency=medium

  * bionic/linux-aws-5.3: 5.3.0-1032.34~18.04.1 -proposed tracker (LP: #1887074)

  [ Ubuntu: 5.3.0-1032.34 ]

  * eoan/linux-aws: 5.3.0-1032.34 -proposed tracker (LP: #1887075)
  * eoan/linux: 5.3.0-64.58 -proposed tracker (LP: #1887088)
  * linux 4.15.0-109-generic network DoS regression vs -108 (LP: #1886668)
    - SAUCE: Revert "netprio_cgroup: Fix unlimited memory leak of v2 cgroups"

linux-aws-5.3 (5.3.0-1031.33~18.04.1) bionic; urgency=medium

  * bionic/linux-aws-5.3: 5.3.0-1031.33~18.04.1 -proposed tracker (LP: #1885480)

  [ Ubuntu: 5.3.0-1031.33 ]

  * eoan/linux-aws: 5.3.0-1031.33 -proposed tracker (LP: #1885481)
  * eoan/linux: 5.3.0-63.57 -proposed tracker (LP: #1885495)
  * seccomp_bpf fails on powerpc (LP: #1885757)
    - SAUCE: selftests/seccomp: fix ptrace tests on powerpc
  * The thread level parallelism would be a bottleneck when searching for the
    shared pmd by using hugetlbfs (LP: #1882039)
    - hugetlbfs: take read_lock on i_mmap for PMD sharing
  * Eoan update: upstream stable patchset 2020-06-30 (LP: #1885775)
    - ipv6: fix IPV6_ADDRFORM operation logic
    - net_failover: fixed rollback in net_failover_open()
    - bridge: Avoid infinite loop when suppressing NS messages with invalid
      options
    - vxlan: Avoid infinite loop when suppressing NS messages with invalid options
    - tun: correct header offsets in napi frags mode
    - Input: mms114 - fix handling of mms345l
    - ARM: 8977/1: ptrace: Fix mask for thumb breakpoint hook
    - sched/fair: Don't NUMA balance for kthreads
    - Input: synaptics - add a second working PNP_ID for Lenovo T470s
    - drivers/net/ibmvnic: Update VNIC protocol version reporting
    - powerpc/xive: Clear the page tables for the ESB IO mapping
    - ath9k_htc: Silence undersized packet warnings
    - RDMA/uverbs: Make the event_queue fds return POLLERR when disassociated
    - x86/cpu/amd: Make erratum #1054 a legacy erratum
    - perf probe: Accept the instance number of kretprobe event
    - mm: add kvfree_sensitive() for freeing sensitive data objects
    - aio: fix async fsync creds
    - x86_64: Fix jiffies ODR violation
    - x86/PCI: Mark Intel C620 MROMs as having non-compliant BARs
    - x86/speculation: Prevent rogue cross-process SSBD shutdown
    - x86/reboot/quirks: Add MacBook6,1 reboot quirk
    - efi/efivars: Add missing kobject_put() in sysfs entry creation error path
    - ALSA: es1688: Add the missed snd_card_free()
    - ALSA: hda/realtek - add a pintbl quirk for several Lenovo machines
    - ALSA: usb-audio: Fix inconsistent card PM state after resume
    - ALSA: usb-audio: Add vendor, product and profile name for HP Thunderbolt
      Dock
    - ACPI: sysfs: Fix reference count leak in acpi_sysfs_add_hotplug_profile()
    -...

Changed in linux-aws-5.3 (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-aws - 5.4.0-1021.21

---------------
linux-aws (5.4.0-1021.21) focal; urgency=medium

  * focal/linux-aws: 5.4.0-1021.21 -proposed tracker (LP: #1888811)

  * xen-netfront: potential deadlock in xennet_remove() (LP: #1888510)
    - SAUCE: xen-netfront: fix potential deadlock in xennet_remove()

 -- Stefan Bader <email address hidden> Fri, 24 Jul 2020 11:24:21 +0200

Changed in linux-aws (Ubuntu Focal):
status: Fix Committed → Fix Released
Changed in linux-aws (Ubuntu):
status: Invalid → Fix Released