Case [Azure] Fix VM crash/hang issues due to fast VF add/remove events

Bug #2023594 reported by Tim Gardner
This bug affects 1 person
Affects                 Status         Importance   Assigned to   Milestone
linux-azure (Ubuntu)    New            Undecided    Unassigned
  Focal                 Fix Released   Medium       Tim Gardner
  Jammy                 Fix Released   Medium       Tim Gardner
  Lunar                 Fix Released   Medium       Tim Gardner

Bug Description

SRU Justification

[Impact]

A Linux guest on Hyper-V/Azure can occasionally crash during early kernel boot because of an unusual sequence of host events:
1. The host assigns a VF to the guest;
2. The host immediately unassigns the VF from the guest (Dexuan: because of race condition bugs in the Linux vPCI driver, Linux can crash at this point);
3. The host assigns the VF to the guest again.

Starting in late 2022 (around November 2022), Linux guests on Azure began crashing more frequently because of a host-side update: a new host/hypervisor feature for handling "correctable memory errors" can generate many successive VF remove/add events, so the race condition bugs in the Linux vPCI driver surface much more easily. The Hyper-V team is implementing a batching mechanism so that the guest receives far fewer VF remove/add events (ETA: June 2023). Meanwhile, the Linux race condition bugs should also be fixed so that Linux guests do not crash even when they receive successive VF remove/add events.
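
To make the failure mode concrete, here is a minimal user-space sketch of the idea behind one of the fixes listed in the changelogs of this report ("PCI: hv: Add a per-bus mutex state_lock"). It is a toy model in Python, not the hv_pci driver code: the `Bus` class, the `devices` dictionary, the `violations` counter and the `stress` helper are all hypothetical names invented for illustration. Only the lock name `state_lock` is borrowed from the patch title. The sketch assumes that VF setup is a multi-step operation and shows that, when add and remove handlers take the same per-bus lock, a remove can never observe a half-initialized device, however quickly the events alternate.

```python
import threading


class Bus:
    """Toy stand-in for a vPCI bus (hypothetical, not the kernel structure)."""

    def __init__(self):
        # One lock per bus; every state change on the bus takes it.
        self.state_lock = threading.Lock()
        self.devices = {}      # slot number -> dict describing the VF
        self.violations = 0    # times a remove saw a half-built device

    def add_vf(self, slot):
        with self.state_lock:
            dev = {"slot": slot, "ready": False}
            self.devices[slot] = dev   # step 1: publish the device
            dev["ready"] = True        # step 2: finish its setup

    def remove_vf(self, slot):
        with self.state_lock:
            dev = self.devices.pop(slot, None)
            if dev is not None and not dev["ready"]:
                self.violations += 1   # would be the crash window
            return dev is not None


def stress(bus, rounds=200, workers=4):
    """Replay the assign / unassign / assign pattern from several threads."""

    def worker():
        for _ in range(rounds):
            bus.add_vf(0)
            bus.remove_vf(0)
            bus.add_vf(0)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bus.violations
```

Calling `stress(Bus())` replays the three-step pattern from the list above on slot 0 from four threads. Because both handlers hold `state_lock` for their whole body, the returned violation count stays at zero. Removing the `with self.state_lock:` lines would reopen the window between step 1 and step 2 of `add_vf`.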

[Test Plan]

Tested by Microsoft.

[Regression potential]

PCI devices may not get registered, or VMs may crash.

[Other Info]

SF: #00349076

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Focal):
assignee: nobody → Tim Gardner (timg-tpi)
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu Jammy):
assignee: nobody → Tim Gardner (timg-tpi)
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu Lunar):
assignee: nobody → Tim Gardner (timg-tpi)
importance: Undecided → Medium
status: New → In Progress
affects: linux (Ubuntu) → linux-azure (Ubuntu)
Tim Gardner (timg-tpi)
Changed in linux-azure (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in linux-azure (Ubuntu Jammy):
status: In Progress → Fix Committed
Changed in linux-azure (Ubuntu Lunar):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.15.0-1043.50 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-azure verification-needed-jammy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/6.2.0-1009.9 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-lunar' to 'verification-done-lunar'. If the problem still exists, change the tag 'verification-needed-lunar' to 'verification-failed-lunar'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-lunar-linux-azure verification-needed-lunar
Tim Gardner (timg-tpi)
tags: added: verification-done-jammy verification-done-lunar
removed: verification-needed-jammy verification-needed-lunar
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.4.0-1113.119 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-azure verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 6.2.0-1009.9

---------------
linux-azure (6.2.0-1009.9) lunar; urgency=medium

  * lunar/linux-azure: 6.2.0-1009.9 -proposed tracker (LP: #2026476)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis

  * Azure: Fix lockup in swiotlb when used as a CVM (LP: #2026736)
    - swiotlb: remove swiotlb_max_segment
    - swiotlb: fix the deadlock in swiotlb_do_find_slots
    - swiotlb: use wrap_area_index() instead of open-coding it
    - swiotlb: fix slot alignment checks
    - swiotlb: fix a braino in the alignment check fix

  * [Azure] Fix VM crash/hang issues due to fast VF add/remove events
    (LP: #2023071) // Case [Azure] Fix VM crash/hang issues due to fast VF
    add/remove events (LP: #2023594)
    - PCI: hv: Fix a race condition bug in hv_pci_query_relations()
    - PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
    - PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
    - Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
    - PCI: hv: Add a per-bus mutex state_lock
    - PCI: hv: Use async probing to reduce boot time

  * Azure: Fix perf regression: remove rx_cqes, tx_cqes counters for MANA
    (LP: #2022940)
    - net: mana: Fix perf regression: remove rx_cqes, tx_cqes counters

  * [Azure][MANA][VLANTagging] Support for VLAN Tagging for MANA (LP: #2023695)
    - net: mana: Add support for vlan tagging

  [ Ubuntu: 6.2.0-27.28 ]

  * lunar/linux: 6.2.0-27.28 -proposed tracker (LP: #2026488)
  * Packaging resync (LP: #1786013)
    - [Packaging] resync update-dkms-versions helper
    - [Packaging] update annotations scripts
  * CVE-2023-2640 // CVE-2023-32629
    - Revert "UBUNTU: SAUCE: overlayfs: handle idmapped mounts in
      ovl_do_(set|remove)xattr"
    - Revert "UBUNTU: SAUCE: overlayfs: Skip permission checking for
      trusted.overlayfs.* xattrs"
    - SAUCE: overlayfs: default to userxattr when mounted from non initial user
      namespace
  * UNII-4 5.9G Band support request on 8852BE (LP: #2023952)
    - wifi: rtw89: 8851b: add 8851B basic chip_info
    - wifi: rtw89: introduce realtek ACPI DSM method
    - wifi: rtw89: regd: judge UNII-4 according to BIOS and chip
    - wifi: rtw89: support U-NII-4 channels on 5GHz band
  * Disable hv-kvp-daemon if /dev/vmbus/hv_kvp is not present (LP: #2024900)
    - [Packaging] disable hv-kvp-daemon if needed
  * A deadlock issue in scsi rescan task while resuming from S3 (LP: #2018566)
    - ata: libata-scsi: Avoid deadlock on rescan after device resume
  * [SRU] Intel Sapphire Rapids HBM support needs CONFIG_NUMA_EMU (LP: #2008745)
    - [Config] Intel Sapphire Rapids HBM support needs CONFIG_NUMA_EMU
  * Lunar update: v6.2.15 upstream stable release (LP: #2025067)
    - ASOC: Intel: sof_sdw: add quirk for Intel 'Rooks County' NUC M15
    - ASoC: Intel: soc-acpi: add table for Intel 'Rooks County' NUC M15
    - ASoC: soc-pcm: fix hw->formats cleared by soc_pcm_hw_init() for dpcm
    - x86/hyperv: Block root partition functionality in a Confidential VM
    - ASoC: amd: yc: Add DMI entries to support Victus by HP Laptop 16-e1xxx
      (8A22)
    - iio:...

Changed in linux-azure (Ubuntu Lunar):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.15.0-1044.51

---------------
linux-azure (5.15.0-1044.51) jammy; urgency=medium

  * jammy/linux-azure: 5.15.0-1044.51 -proposed tracker (LP: #2029291)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync update-dkms-versions helper
    - [Packaging] update variants

linux-azure (5.15.0-1043.50) jammy; urgency=medium

  * jammy/linux-azure: 5.15.0-1043.50 -proposed tracker (LP: #2026495)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync update-dkms-versions helper
    - [Packaging] resync getabis

  * kdump fails on big arm64 systems when offset is not specified (LP: #2024479)
    - arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
    - arm64: kdump: Reimplement crashkernel=X
    - docs: kdump: Update the crashkernel description for arm64
    - arm64: kdump: Do not allocate crash low memory if not needed
    - arm64/mm: Define defer_reserve_crashkernel()
    - arm64: kdump: Provide default size when crashkernel=Y, low is not specified
    - arm64: kdump: Support crashkernel=X fall back to reserve region above DMA
      zones

  * Azure: MANA: Fix doorbell access for receives (LP: #2027615)
    - SAUCE: net: mana: Batch ringing RX queue doorbell on receiving packets
    - SAUCE: net: mana: Use the correct WQE count for ringing RQ doorbell

  * [Azure][MANA][InfinitiBand] Features Support and InfiniBand for MANA
    (LP: #2024917)
    - bpf: Let bpf_warn_invalid_xdp_action() report more info
    - PCI: Move PCI_VENDOR_ID_MICROSOFT/PCI_DEVICE_ID_HYPERV_VIDEO definitions to
      pci_ids.h
    - net: mana: Assign interrupts to CPUs based on NUMA nodes
    - net: mana: Add support for auxiliary device
    - net: mana: Record the physical address for doorbell page region
    - net: mana: Handle vport sharing between devices
    - net: mana: Set the DMA device max segment size
    - net: mana: Export Work Queue functions for use by RDMA driver
    - net: mana: Record port number in netdev
    - net: mana: Move header files to a common location
    - net: mana: Define max values for SGL entries
    - net: mana: Define and process GDMA response code GDMA_STATUS_MORE_ENTRIES
    - net: mana: Define data structures for allocating doorbell page from GDMA
    - net: mana: Define data structures for protection domain and memory
      registration
    - net: mana: Fix return type of mana_start_xmit()
    - RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter
    - RDMA/mana: Remove redefinition of basic u64 type
    - RDMA/mana_ib: Prevent array underflow in mana_ib_create_qp_raw()
    - net: mana: Fix accessing freed irq affinity_hint
    - [Config] azure: Enable MANA_INFINIBAND

  * [Azure] Fix VM crash/hang issues due to fast VF add/remove events
    (LP: #2023071) // Case [Azure] Fix VM crash/hang issues due to fast VF
    add/remove events (LP: #2023594)
    - PCI: hv: Fix a race condition bug in hv_pci_query_relations()
    - PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
    - PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
    - Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
    - ...

Changed in linux-azure (Ubuntu Jammy):
status: Fix Committed → Fix Released
Tim Gardner (timg-tpi)
tags: added: verification-done-focal
removed: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.4.0-1113.119

---------------
linux-azure (5.4.0-1113.119) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1113.119 -proposed tracker (LP: #2026551)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync update-dkms-versions helper
    - [Packaging] resync getabis
    - debian/dkms-versions -- update from kernel-versions (main/s2023.06.12)

  * Case [Azure] Fix VM crash/hang issues due to fast VF add/remove events
    (LP: #2023594)
    - SAUCE: PCI: hv: Fix a race condition bug in hv_pci_query_relations()
    - SAUCE: PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
    - SAUCE: PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
    - SAUCE: PCI: hv: Add a per-bus mutex state_lock

  * [Azure][MANA][VLANTagging] Support for VLAN Tagging for MANA (LP: #2023695)
    - net: mana: Add support for vlan tagging

  [ Ubuntu: 5.4.0-156.173 ]

  * focal/linux: 5.4.0-156.173 -proposed tracker (LP: #2026585)
  * CVE-2023-3390
    - netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE
  * Focal update: v5.4.241 upstream stable release (LP: #2023930)
    - scsi: ses: Handle enclosure with just a primary component gracefully
    - x86/PCI: Add quirk for AMD XHCI controller that loses MSI-X state in D3hot
    - cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach()
    - treewide: Replace DECLARE_TASKLET() with DECLARE_TASKLET_OLD()
    - smb3: fix problem with null cifs super block with previous patch
    - pinctrl: amd: Use irqchip template
    - pinctrl: amd: disable and mask interrupts on probe
    - pinctrl: amd: Disable and mask interrupts on resume
    - pwm: cros-ec: Explicitly set .polarity in .get_state()
    - pwm: sprd: Explicitly set .polarity in .get_state()
    - wifi: mac80211: fix invalid drv_sta_pre_rcu_remove calls for non-uploaded
      sta
    - icmp: guard against too small mtu
    - net: don't let netpoll invoke NAPI if in xmit context
    - sctp: check send stream number after wait_for_sndbuf
    - ipv6: Fix an uninit variable access bug in __ip6_make_skb()
    - gpio: davinci: Add irq chip flag to skip set wake
    - sunrpc: only free unix grouplist after RCU settles
    - NFSD: callback request does not use correct credential for AUTH_SYS
    - xhci: also avoid the XHCI_ZERO_64B_REGS quirk with a passthrough iommu
    - USB: serial: cp210x: add Silicon Labs IFS-USB-DATACABLE IDs
    - usb: typec: altmodes/displayport: Fix configure initial pin assignment
    - USB: serial: option: add Telit FE990 compositions
    - USB: serial: option: add Quectel RM500U-CN modem
    - iio: adc: ti-ads7950: Set `can_sleep` flag for GPIO chip
    - iio: dac: cio-dac: Fix max DAC write value check for 12-bit
    - tty: serial: sh-sci: Fix transmit end interrupt handler
    - tty: serial: sh-sci: Fix Rx on RZ/G2L SCI
    - tty: serial: fsl_lpuart: avoid checking for transfer complete when
      UARTCTRL_SBK is asserted in lpuart32_tx_empty
    - nilfs2: fix potential UAF of struct nilfs_sc_info in nilfs_segctor_thread()
    - nilfs2: fix sysfs interface lifetime
    - ALSA: hda/realtek: Add quirk for Clevo X370SNW
...

Changed in linux-azure (Ubuntu Focal):
status: Fix Committed → Fix Released
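
As a rough aid for checking whether a given guest already runs a kernel that carries these patches, the helper below compares a kernel release string against the first linux-azure versions that list the fix in the changelogs above (5.4.0-1113 for Focal, 5.15.0-1043 for Jammy, 6.2.0-1009 for Lunar). This is a sketch under assumptions: the function name `has_vf_race_fix` and the table `FIRST_FIXED_ABI` are hypothetical, and the release string is assumed to have the form `5.15.0-1043-azure`, as typically printed by `uname -r` on these kernels. Other kernel flavours or series are reported as unknown rather than guessed.

```python
import re

# First ABI number per base version whose changelog in this report lists the fix.
FIRST_FIXED_ABI = {
    "5.4.0": 1113,    # focal linux-azure 5.4.0-1113.119
    "5.15.0": 1043,   # jammy linux-azure 5.15.0-1043.50
    "6.2.0": 1009,    # lunar linux-azure 6.2.0-1009.9
}


def has_vf_race_fix(release):
    """Return True/False for known linux-azure series, None if unknown.

    `release` is assumed to look like '5.15.0-1043-azure'.
    """
    match = re.fullmatch(r"(\d+\.\d+\.\d+)-(\d+)-azure", release)
    if match is None:
        return None  # not a linux-azure release string of the assumed form
    base, abi = match.group(1), int(match.group(2))
    if base not in FIRST_FIXED_ABI:
        return None  # series not covered by this bug report
    return abi >= FIRST_FIXED_ABI[base]
```

For example, `has_vf_race_fix("5.15.0-1042-azure")` evaluates to `False`, while a release string from a series not mentioned in this report yields `None`.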