Drivers: hv: vmbus: Fix duplicate CPU assignments within a device

Bug #1937078 reported by Tim Gardner
Affects                 Importance   Assigned to
linux-azure (Ubuntu)    Undecided    Unassigned
  Focal                 Undecided    Tim Gardner
  Hirsute               Undecided    Tim Gardner

Bug Description

SRU Justification

[Impact]

Customers see degraded network performance on Hyper-V and Azure guests.

This is a request to pick up an upstream patch that fixes an issue with Ubuntu running as a Hyper-V and Azure guest. The patch needs to be picked up for 20.04 and 18.04. The link to the upstream patch follows:

https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/commit/?h=hyperv-fixes&id=7c9ff3deeee61b253715dcf968a6307af148c9b2

Description of issue and solution:

The vmbus module uses a rotational algorithm to assign target CPUs to
a device's channels. Depending on the timing of different devices' channel
offers, different channels of a device may be assigned to the same CPU.

For example, on a VM with 2 CPUs, if NIC A and B's channels are offered
in the following order, NIC A will have both channels on CPU0, and
NIC B will have both channels on CPU1 -- see below. This kind of
assignment causes RSS load that should be spread across different
channels to end up on the same CPU.

Timing of channel offers:
NIC A channel 0
NIC B channel 0
NIC A channel 1
NIC B channel 1

VMBUS ID 14: Class_ID = {f8615163-df3e-46c5-913f-f2d2f965ed0e} - Synthetic network adapter
Device_ID = {cab064cd-1f31-47d5-a8b4-9d57e320cccd}
Sysfs path: /sys/bus/vmbus/devices/cab064cd-1f31-47d5-a8b4-9d57e320cccd
Rel_ID=14, target_cpu=0
Rel_ID=17, target_cpu=0

VMBUS ID 16: Class_ID = {f8615163-df3e-46c5-913f-f2d2f965ed0e} - Synthetic network adapter
Device_ID = {244225ca-743e-4020-a17d-d7baa13d6cea}
Sysfs path: /sys/bus/vmbus/devices/244225ca-743e-4020-a17d-d7baa13d6cea
Rel_ID=16, target_cpu=1
Rel_ID=18, target_cpu=1
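
The collision above can be reproduced with a small simulation. This is only a sketch: it models the rotation as a single global round-robin counter, which is a simplification (the real vmbus code also rotates across NUMA nodes), and the function name assign_round_robin is hypothetical.

```python
from collections import defaultdict


def assign_round_robin(offers, num_cpus):
    """Give each offered channel the next CPU in one global rotation.

    Simplified model: the rotation ignores which device the channel
    belongs to, so the result depends only on the order of the offers.
    """
    assignment = defaultdict(list)
    for position, (device, channel) in enumerate(offers):
        assignment[device].append((channel, position % num_cpus))
    return dict(assignment)


# Offer order taken from the example in this report.
offers = [("NIC A", 0), ("NIC B", 0), ("NIC A", 1), ("NIC B", 1)]
result = assign_round_robin(offers, num_cpus=2)
for device, channels in result.items():
    print(device, channels)
```

With the interleaved offer order, every channel of NIC A lands on CPU 0 and every channel of NIC B on CPU 1, matching the target_cpu values in the listing.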

Update the vmbus CPU assignment algorithm to avoid duplicate CPU
assignments within a device.

The new algorithm iterates num_online_cpus + 1 times.
The existing rotational search for the "next NUMA node & CPU" is retained,
but if the resulting CPU is already used by the same device, the algorithm
tries the next CPU.
In the last iteration, it assigns the channel to the next available CPU,
as the existing algorithm does. This fallback is not normally expected to
be reached, because during device probe the number of channels of a
device is limited to be <= the number of online CPUs.
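
The retry loop described above can be sketched as follows. This is an illustrative model, not the kernel patch: NUMA handling is left out, the rotation is a single global cursor, and the name assign_avoiding_duplicates is hypothetical.

```python
def assign_avoiding_duplicates(offers, num_cpus):
    """Rotational assignment that skips CPUs the same device already uses.

    Each channel gets up to num_cpus + 1 attempts. The final attempt
    accepts whatever CPU the rotation yields, which mirrors the fallback
    to the old behaviour when a device has more channels than CPUs.
    """
    next_cpu = 0        # global rotation cursor shared by all devices
    used = {}           # device -> set of CPUs its channels already use
    assignment = {}     # device -> list of (channel, cpu)
    for device, channel in offers:
        device_cpus = used.setdefault(device, set())
        for attempt in range(num_cpus + 1):
            cpu = next_cpu
            next_cpu = (next_cpu + 1) % num_cpus
            last_attempt = attempt == num_cpus
            if cpu not in device_cpus or last_attempt:
                break
        device_cpus.add(cpu)
        assignment.setdefault(device, []).append((channel, cpu))
    return assignment


offers = [("NIC A", 0), ("NIC B", 0), ("NIC A", 1), ("NIC B", 1)]
fixed = assign_avoiding_duplicates(offers, num_cpus=2)
for device, channels in fixed.items():
    print(device, channels)

# A device with more channels than CPUs still gets every channel assigned.
overflow = assign_avoiding_duplicates([("NIC X", 0), ("NIC X", 1), ("NIC X", 2)], 2)
```

For the same interleaved offers on 2 CPUs, each NIC now ends up with one channel per CPU. The overflow case shows the last-iteration fallback: the third channel has to share a CPU because none is free.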

[Test Plan]

This could be tough to test as the patch fixes a race condition.
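
Although the timing cannot easily be forced, the symptom can at least be spotted after boot by scanning a channel listing for repeated target_cpu values within one device. The sketch below assumes the listing format shown in the bug description (the report does not name the tool that produced it); the function name find_duplicate_cpus and the regular expressions are illustrative only.

```python
import re
from collections import Counter

DEVICE_RE = re.compile(r"Device_ID = \{([0-9a-fA-F-]+)\}")
CHANNEL_RE = re.compile(r"Rel_ID=(\d+), target_cpu=(\d+)")

# Listing copied from the bug description.
SAMPLE = """\
VMBUS ID 14: Class_ID = {f8615163-df3e-46c5-913f-f2d2f965ed0e} - Synthetic network adapter
Device_ID = {cab064cd-1f31-47d5-a8b4-9d57e320cccd}
Rel_ID=14, target_cpu=0
Rel_ID=17, target_cpu=0

VMBUS ID 16: Class_ID = {f8615163-df3e-46c5-913f-f2d2f965ed0e} - Synthetic network adapter
Device_ID = {244225ca-743e-4020-a17d-d7baa13d6cea}
Rel_ID=16, target_cpu=1
Rel_ID=18, target_cpu=1
"""


def find_duplicate_cpus(listing):
    """Return {device_id: [cpus used by more than one channel]}."""
    cpus_by_device = {}
    current = None
    for line in listing.splitlines():
        device = DEVICE_RE.search(line)
        if device:
            current = device.group(1)
            cpus_by_device[current] = Counter()
            continue
        channel = CHANNEL_RE.search(line)
        if channel and current is not None:
            cpus_by_device[current][int(channel.group(2))] += 1
    return {
        dev: sorted(cpu for cpu, n in counts.items() if n > 1)
        for dev, counts in cpus_by_device.items()
        if any(n > 1 for n in counts.values())
    }


duplicates = find_duplicate_cpus(SAMPLE)
print(duplicates)
```

An empty result means no device has two channels on the same CPU. On the listing from this report, both NICs are flagged.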

[Where problems could occur]

Network performance issues could persist.

[Other Info]

SF:#00315347

Tim Gardner (timg-tpi) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1937078

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Focal):
status: New → In Progress
Changed in linux (Ubuntu Hirsute):
status: New → In Progress
Changed in linux (Ubuntu Focal):
assignee: nobody → Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Hirsute):
assignee: nobody → Tim Gardner (timg-tpi)
Stefan Bader (smb)
affects: linux (Ubuntu) → linux-azure (Ubuntu)
Tim Gardner (timg-tpi)
Changed in linux-azure (Ubuntu Hirsute):
status: In Progress → Fix Committed
Tim Gardner (timg-tpi)
Changed in linux-azure (Ubuntu Focal):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-hirsute' to 'verification-done-hirsute'. If the problem still exists, change the tag 'verification-needed-hirsute' to 'verification-failed-hirsute'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-hirsute
Tim Gardner (timg-tpi) wrote :

Validated by Microsoft on test kernels. Tagging as verification done.

tags: added: verification-done-hirsute
removed: verification-needed-hirsute
Marcelo Cerri (mhcerri) wrote :

Hi, Tim. I'm dropping the 5.4 backport from focal/linux-azure in the current cycle because this commit is causing the boot to hang:

http://10.246.72.46:8080/view/all/job/focal-linux-azure-azure-amd64-5.4.0-Standard_A2_v2-ubuntu_boot/4/console

I managed to reproduce the problem manually and reverting the commit solved the problem.

Changed in linux-azure (Ubuntu Focal):
status: Fix Committed → In Progress
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.11.0-1017.18

---------------
linux-azure (5.11.0-1017.18) hirsute; urgency=medium

  * hirsute/linux-azure: 5.11.0-1017.18 -proposed tracker (LP: #1944161)

  * Hirsute update: upstream stable patchset 2021-08-12 (LP: #1939738)
    - [Config] updateconfigs for RESET_INTEL_GW, RESET_BRCMSTB_RESCAL

  * binderfs support is not enabled (LP: #1925246)
    - [Config] linux-azure: CONFIG_ANDROID_BINDERFS=m

  * Drivers: hv: vmbus: Fix duplicate CPU assignments within a device
    (LP: #1937078)
    - Drivers: hv: vmbus: Fix duplicate CPU assignments within a device

  [ Ubuntu: 5.11.0-37.41 ]

  * hirsute/linux: 5.11.0-37.41 -proposed tracker (LP: #1944180)
  * CVE-2021-41073
    - io_uring: ensure symmetry in handling iter types in loop_rw_iter()
  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/2021.09.06)
  * LRMv5: switch primary version handling to kernel-versions data set
    (LP: #1928921)
    - [Packaging] switch to kernel-versions
  * disable “CONFIG_HISI_DMA” config for ubuntu version (LP: #1936771)
    - Disable CONFIG_HISI_DMA
    - [Config] Record hisi_dma no longer built for arm64
  * ubunut_kernel_selftests: memory-hotplug: avoid spamming logs with
    dump_page() (LP: #1941829)
    - selftests: memory-hotplug: avoid spamming logs with dump_page(), ratio limit
      hot-remove error test
  * alsa: the soundwire audio doesn't work on the Dell TGL-H machines
    (LP: #1941669)
    - ASoC: SOF: allow soundwire use desc->default_fw_filename
    - ASoC: Intel: tgl: remove sof_fw_filename set for tgl_3_in_1_default
  * e1000e blocks the boot process when it tried to write checksum to its NVM
    (LP: #1936998)
    - e1000e: Do not take care about recovery NVM checksum
  * Dell XPS 17 (9710) PCI/internal sound card not detected (LP: #1935850)
    - ASoC: Intel: sof_sdw: include rt711.h for RT711 JD mode
    - ASoC: Intel: sof_sdw: add quirk for Dell XPS 9710
  * mute/micmute LEDs no function on HP ProBook 650 G8 (LP: #1939473)
    - ALSA: hda/realtek: fix mute/micmute LEDs for HP ProBook 650 G8 Notebook PC
  * Fix mic noise on HP ProBook 445 G8 (LP: #1940610)
    - ALSA: hda/realtek: Limit mic boost on HP ProBook 445 G8
  * GPIO error logs in start and dmesg after update of kernel (LP: #1937897)
    - ODM: mfd: Check AAEON BFPI version before adding device
  * External displays not working on Thinkpad T490 with ThinkPad Thunderbolt 3
    Dock (LP: #1938999)
    - drm/i915/ilk-glk: Fix link training on links with LTTPRs
  * Fix kernel panic caused by legacy devices on AMD platforms (LP: #1936682)
    - SAUCE: iommu/amd: Keep swiotlb enabled to ensure devices with 32bit DMA
      still work
  * Hirsute update: upstream stable patchset 2021-08-30 (LP: #1942123)
    - drm/i915: Revert "drm/i915/gem: Asynchronous cmdparser"
    - Revert "drm/i915: Propagate errors on awaiting already signaled fences"
    - regulator: rtmv20: Fix wrong mask for strobe-polarity-high
    - regulator: rt5033: Fix n_voltages settings for BUCK and LDO
    - spi: stm32h7: fix full duplex irq handler handling
    - ASoC: tlv320aic31xx: fix reversed bclk/wclk master b...

Changed in linux-azure (Ubuntu Hirsute):
status: Fix Committed → Fix Released
Tim Gardner (timg-tpi)
Changed in linux-azure (Ubuntu Focal):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.4.0-1064.67 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Tim Gardner (timg-tpi) wrote :

Microsoft verified that the patch in https://lists.ubuntu.com/archives/kernel-team/2021-November/125347.html fixes the bug.

tags: added: verification-done-focal
removed: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.4.0-1064.67

---------------
linux-azure (5.4.0-1064.67) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1064.67 -proposed tracker (LP: #1949816)

  * Packaging resync (LP: #1786013)
    - [Packaging] update Ubuntu.md

  * Drivers: hv: vmbus: Fix duplicate CPU assignments within a device
    (LP: #1937078)
    - Drivers: hv: vmbus: Fix duplicate CPU assignments within a device

  * linux-azure: make mana.ko built-in (LP: #1949357)
    - [Config] CONFIG_MICROSOFT_MANA=y

  [ Ubuntu: 5.4.0-91.102 ]

  * focal/linux: 5.4.0-91.102 -proposed tracker (LP: #1949840)
  * Packaging resync (LP: #1786013)
    - [Packaging] update Ubuntu.md
    - debian/dkms-versions -- update from kernel-versions (main/2021.11.08)
  * KVM emulation failure when booting into VM crash kernel with multiple CPUs
    (LP: #1948862)
    - KVM: x86: Properly reset MMU context at vCPU RESET/INIT
  * aufs: kernel bug with apparmor and fuseblk (LP: #1948470)
    - SAUCE: aufs: bugfix, stop omitting path->mnt
  * ebpf: bpf_redirect fails with ip6 gre interfaces (LP: #1947164)
    - net: handle ARPHRD_IP6GRE in dev_is_mac_header_xmit()
  * require CAP_NET_ADMIN to attach N_HCI ldisc (LP: #1949516)
    - Bluetooth: hci_ldisc: require CAP_NET_ADMIN to attach N_HCI ldisc
  * ACL updates on OCFS2 are not revalidated (LP: #1947161)
    - ocfs2: fix remounting needed after setfacl command
  * ppc64 BPF JIT mod by 1 will not return 0 (LP: #1948351)
    - powerpc/bpf: Fix BPF_MOD when imm == 1
  * Drop "UBUNTU: SAUCE: cachefiles: Page leaking in
    cachefiles_read_backing_file while vmscan is active" (LP: #1947709)
    - Revert "UBUNTU: SAUCE: cachefiles: Page leaking in
      cachefiles_read_backing_file while vmscan is active"
  * Reassign I/O Path of ConnectX-5 Port 1 before Port 2 causes NULL dereference
    (LP: #1943464)
    - s390/pci: fix leak of PCI device structure
    - s390/pci: fix use after free of zpci_dev
    - s390/pci: fix zpci_zdev_put() on reserve
  * [SRU][F] USB: serial: pl2303: add support for PL2303HXN (LP: #1948377)
    - USB: serial: pl2303: add support for PL2303HXN
    - USB: serial: pl2303: fix line-speed handling on newer chips
  * Focal update: v5.4.151 upstream stable release (LP: #1947888)
    - tty: Fix out-of-bound vmalloc access in imageblit
    - cpufreq: schedutil: Use kobject release() method to free sugov_tunables
    - cpufreq: schedutil: Destroy mutex before kobject_put() frees the memory
    - usb: cdns3: fix race condition before setting doorbell
    - fs-verity: fix signed integer overflow with i_size near S64_MAX
    - hwmon: (w83793) Fix NULL pointer dereference by removing unnecessary
      structure field
    - hwmon: (w83792d) Fix NULL pointer dereference by removing unnecessary
      structure field
    - hwmon: (w83791d) Fix NULL pointer dereference by removing unnecessary
      structure field
    - scsi: ufs: Fix illegal offset in UPIU event trace
    - mac80211: fix use-after-free in CCMP/GCMP RX
    - x86/kvmclock: Move this_cpu_pvti into kvmclock.h
    - drm/amd/display: Pass PCI deviceid into DC
    - ipvs: check that ip_vs_conn_tab_bits is between 8 and 20
    - hwmon: (mlxre...

Changed in linux-azure (Ubuntu Focal):
status: Fix Committed → Fix Released