Azure: multi-MSI patches break fio tests on NVMe

Bug #1982613 reported by Tim Gardner
Affects               Status        Importance  Assigned to  Milestone
linux-azure (Ubuntu)  Fix Released  Undecided   Unassigned
  Focal               Fix Released  High        Tim Gardner
  Jammy               Fix Released  Undecided   Unassigned

Bug Description

SRU Justification

[Impact]

The patches recently added to focal/azure (5.4.0-1087.92) to support multi-MSI on the Azure hypervisor break fio tests on NVMe disks: the test run ends in a hung thread. The failure was demonstrated on a Standard_L64s_v2 instance.

The initial fix is to revert the following four patches:

06ad0de407e7ab357f5be5c0dc2290e39ddbf936 PCI: hv: Fix interrupt mapping for multi-MSI
6f7265bdd6161702cc1530ce7a2d126c49977d3d PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()
c6a7a02229c2acd4e19602f7b2fcd95ea163eada PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI
9dc3b7f1d164efda04c4756002011e2ddc9f1c73 PCI: hv: Fix multi-MSI to allow more than one MSI vector
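
As a rough illustration of the revert ordering, the sketch below builds the revert subjects newest-first from the order in which the focal changelog (5.4.0-1087.92) lists the patches as applied. The names APPLIED and revert_subjects are hypothetical and exist only for this example; the sketch does not touch a git tree and is not the tooling used for the actual SRU.

```python
# Patch subjects in the order the 5.4.0-1087.92 changelog lists them as applied.
APPLIED = [
    "PCI: hv: Fix multi-MSI to allow more than one MSI vector",
    "PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI",
    "PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()",
    "PCI: hv: Fix interrupt mapping for multi-MSI",
]


def revert_subjects(applied):
    """Return revert subjects newest-first, so each revert undoes the top patch."""
    return ['Revert "{}"'.format(subject) for subject in reversed(applied)]


for line in revert_subjects(APPLIED):
    print(line)
```

The resulting order matches the revert entries listed in the 5.4.0-1088.93 changelog quoted further down in this report.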

[Test Plan]

Run the attached reproducer script.
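
One possible way to check a reproducer run for the failure is to scan the kernel log for soft lockup reports like the one quoted in the first comment. The sketch below is not the attached reproducer. It only assumes the watchdog message format shown in this report, and the names SOFT_LOCKUP and find_soft_lockups are hypothetical.

```python
import re

# Pattern modeled on the watchdog line quoted in this report, for example:
#   watchdog: BUG: soft lockup - CPU#1 stuck for 21s! [ksoftirqd/1:17]
# Other kernel versions may format this message differently (assumption).
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft lockup report found in the given log lines."""
    hits = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append({
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "task": match.group("task"),
                "pid": int(match.group("pid")),
            })
    return hits
```

For example, the lines of a saved kernel log can be passed in as a list of strings; an empty result means no soft lockup message in the assumed format was found.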

[Where things could go wrong]

After the revert, Focal will not support multi-MSI. Because the feature was added only recently (in 5.4.0-1087.92), removing it is likely to have little consequence.

[Other]

SF: #00339521

Tim Gardner (timg-tpi) wrote :

Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 382.545090] watchdog: BUG: soft lockup - CPU#1 stuck for 21s! [ksoftirqd/1:17]
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] Modules linked in: nls_iso8859_1 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 hv_balloon xt_owner xt_tcpudp serio_raw iptable_security bpfilter joydev sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx4_en mlx4_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hid_generic aesni_intel crypto_simd hid_hyperv cryptd pata_acpi glue_helper hyperv_keyboard hyperv_fb hid hv_netvsc hv_utils
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted 5.4.0-1088-azure #93
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] RIP: 0010:find_next_and_bit+0x1c/0x70
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] Code: 47 c2 c3 c3 66 2e 0f 1f 84 00 00 00 00 00 48 89 d0 48 39 ca 76 63 49 89 c8 49 c1 e8 06 4a 8b 14 c7 48 85 f6 74 04 4a 23 14 c6 <49> c7 c0 ff ff ff ff 49 d3 e0 48 83 e1 c0 49 21 d0 75 2d 48 83 c1
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] RSP: 0018:ffffb409cc6fbac8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff12
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] RAX: 0000000000000040 RBX: ffffb409cc6fbb28 RCX: 0000000000000001
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] RDX: 0000000000000001 RSI: ffff89239f429050 RDI: ffff89239d4d44e0
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] RBP: ffffb409cc6fbad0 R08: 0000000000000000 R09: 0000000000005000
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] R10: 00000000ffffffff R11: 0000000000000002 R12: ffffb409cc6fbd28
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] R13: 0000000000000003 R14: ffff89239d4d44c0 R15: 0000000000000000
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] FS: 0000000000000000(0000) GS:ffff89239fa40000(0000) knlGS:0000000000000000
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] CR2: 00007efe0d000000 CR3: 000000106e436000 CR4: 00000000003506e0
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] Call Trace:
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] ? cpumask_next_and+0x1e/0x20
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] update_sd_lb_stats+0x14a/0x7d0
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] find_busiest_group+0x49/0x520
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.001461] load_balance+0x16f/0xaf0
Jul 22 19:47:31 selfprovisioned-rtg-bionic kernel: [ 390.00...


description: updated
Changed in linux-azure (Ubuntu Focal):
assignee: nobody → Tim Gardner (timg-tpi)
importance: Undecided → High
status: New → In Progress
Changed in linux-azure (Ubuntu):
status: New → Fix Released
Tim Gardner (timg-tpi) wrote :
description: updated
description: updated
Tim Gardner (timg-tpi)
Changed in linux-azure (Ubuntu Focal):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.15.0-1017.20

---------------
linux-azure (5.15.0-1017.20) jammy; urgency=medium

  [ Ubuntu: 5.15.0-46.49 ]

  * CVE-2022-2585
    - SAUCE: posix-cpu-timers: Cleanup CPU timers before freeing them during exec
  * CVE-2022-2586
    - SAUCE: netfilter: nf_tables: do not allow SET_ID to refer to another table
    - SAUCE: netfilter: nf_tables: do not allow CHAIN_ID to refer to another table
    - SAUCE: netfilter: nf_tables: do not allow RULE_ID to refer to another chain
  * CVE-2022-2588
    - SAUCE: net_sched: cls_route: remove from list when handle is 0

linux-azure (5.15.0-1016.19) jammy; urgency=medium

  * jammy/linux-azure: 5.15.0-1016.19 -proposed tracker (LP: #1982619)

  * Azure: multi-MSI patches break fio tests on NVMe (LP: #1982613)
    - Revert "PCI: hv: Fix interrupt mapping for multi-MSI"
    - Revert "PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()"
    - Revert "PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI"
    - Revert "PCI: hv: Remove unused hv_set_msi_entry_from_desc()"
    - Revert "PCI: hv: Avoid the retarget interrupt hypercall in irq_unmask() on
      ARM64"
    - Revert "PCI: hv: Fix multi-MSI to allow more than one MSI vector"
    - Revert "genirq/msi, treewide: Use a named struct for PCI/MSI attributes"
    - Revert "PCI/MSI: Remove msi_desc_to_pci_sysdata()"
    - Revert "PCI/MSI: Make pci_msi_domain_write_msg() static"
    - Revert "genirq/msi: Fixup includes"
    - Revert "genirq/msi: Remove unused domain callbacks"
    - Revert "genirq/msi: Guard sysfs code"

linux-azure (5.15.0-1015.18) jammy; urgency=medium

  * jammy/linux-azure: 5.15.0-1015.18 -proposed tracker (LP: #1982272)

  * Azure: Add support for multi-MSI (LP: #1981577)
    - genirq/msi: Guard sysfs code
    - genirq/msi: Remove unused domain callbacks
    - genirq/msi: Fixup includes
    - PCI/MSI: Make pci_msi_domain_write_msg() static
    - PCI/MSI: Remove msi_desc_to_pci_sysdata()
    - genirq/msi, treewide: Use a named struct for PCI/MSI attributes
    - PCI: hv: Fix multi-MSI to allow more than one MSI vector
    - PCI: hv: Avoid the retarget interrupt hypercall in irq_unmask() on ARM64
    - PCI: hv: Remove unused hv_set_msi_entry_from_desc()
    - PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI
    - PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()
    - PCI: hv: Fix interrupt mapping for multi-MSI

  * AMD ACP 6.x DMIC Supports (LP: #1949245)
    - [Config] azure: Disable AMD ACP 6 DMIC Support

  * Ubuntu 22.04 and 20.04 DPC Fixes for Failure Cases of DownPort Containment
    events (LP: #1965241)
    - [Config] azure: Enable config option CONFIG_PCIE_EDR

  * CVE-2022-29900 // CVE-2022-29901
    - [Config]: azure: Enable speculation mitigations

  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/2022.07.11)

  [ Ubuntu: 5.15.0-45.48 ]

  * CVE-2022-29900 // CVE-2022-29901
    - x86/lib/atomic64_386_32: Rename things
    - x86: Prepare asm files for straight-line-speculation
    - x86: Prepare inline-asm for straight-line-speculation
    - x86/alternative: Relax text_poke_bp() constraint
    - kbuild: move objtool_ar...

Changed in linux-azure (Ubuntu Jammy):
status: New → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.4.0-1089.94

---------------
linux-azure (5.4.0-1089.94) focal; urgency=medium

  [ Ubuntu: 5.4.0-124.140 ]

  * CVE-2022-2586
    - SAUCE: netfilter: nf_tables: do not allow SET_ID to refer to another table
    - SAUCE: netfilter: nf_tables: do not allow RULE_ID to refer to another chain
  * CVE-2022-2588
    - SAUCE: net_sched: cls_route: remove from list when handle is 0
  * CVE-2022-34918
    - netfilter: nf_tables: stricter validation of element data

linux-azure (5.4.0-1088.93) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1088.93 -proposed tracker (LP: #1982614)

  * Azure: multi-MSI patches break fio tests on NVMe (LP: #1982613)
    - Revert "PCI: hv: Fix interrupt mapping for multi-MSI"
    - Revert "PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()"
    - Revert "PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI"
    - Revert "PCI: hv: Fix multi-MSI to allow more than one MSI vector"

linux-azure (5.4.0-1087.92) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1087.92 -proposed tracker (LP: #1981257)

  * Azure: Add support for multi-MSI (LP: #1981577)
    - PCI: hv: Fix multi-MSI to allow more than one MSI vector
    - PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI
    - PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()
    - PCI: hv: Fix interrupt mapping for multi-MSI

  [ Ubuntu: 5.4.0-123.139 ]

  * focal/linux: 5.4.0-123.139 -proposed tracker (LP: #1981284)
  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/2022.07.11)
  * Hairpin traffic does not work with centralized NAT gw (LP: #1967856)
    - net: openvswitch: fix misuse of the cached connection on tuple changes
  * [UBUNTU 20.04] Include patches to avoid self-detected stall with Secure
    Execution (LP: #1979296)
    - KVM: s390: pv: add macros for UVC CC values
    - KVM: s390: pv: avoid stalls when making pages secure
    - KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm
  * Focal update: v5.4.195 upstream stable release (LP: #1980407)
    - batman-adv: Don't skb_split skbuffs with frag_list
    - hwmon: (tmp401) Add OF device ID table
    - mac80211: Reset MBSSID parameters upon connection
    - net: Fix features skip in for_each_netdev_feature()
    - ipv4: drop dst in multicast routing path
    - drm/nouveau: Fix a potential theorical leak in nouveau_get_backlight_name()
    - netlink: do not reset transport header in netlink_recvmsg()
    - mac80211_hwsim: call ieee80211_tx_prepare_skb under RCU protection
    - dim: initialize all struct fields
    - hwmon: (ltq-cputemp) restrict it to SOC_XWAY
    - s390/ctcm: fix variable dereferenced before check
    - s390/ctcm: fix potential memory leak
    - s390/lcs: fix variable dereferenced before check
    - net/sched: act_pedit: really ensure the skb is writable
    - net/smc: non blocking recvmsg() return -EAGAIN when no data and
      signal_pending
    - net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe()
    - gfs2: Fix filesystem block deallocation for short writes
    - hwmon: (f71882fg) Fix negative temperature
    - ASoC: max98090: Reject invalid values in custom control put()
...

Changed in linux-azure (Ubuntu Focal):
status: Fix Committed → Fix Released