netvsc may cause CPU lockup

Bug #1924314 reported by Tim Gardner
This bug affects 2 people
Affects                      Status         Importance   Assigned to   Milestone
linux (Ubuntu)               Incomplete     Undecided    Unassigned
  Bionic                     Invalid        Medium       Unassigned
linux-azure-4.15 (Ubuntu)    New            Undecided    Unassigned
  Bionic                     Fix Released   Medium       Tim Gardner

Bug Description

[SRU Justification]

SF#00308239

[Impact]
A soft lockup is happening to Azure cloud customers due to a bug in netvsc_poll.

Ubuntu 4.15.0-1103 has CONFIG_NET_POLL_CONTROLLER=y in its kernel config file, and there is a bug in netvsc that may cause a CPU lockup when POLL_CONTROLLER is enabled. This bug has been fixed by the following patch:

commit 2a7f8c3b1d3feedee3aa319ac220cbde3725b5d5 ("hv_netvsc: remove ndo_poll_controller")

This commit is in mainline as of 4.20-rc1.

Microsoft would like to request this patch in 4.15 based Azure kernels.
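
Whether a given kernel is built with the option named above can be checked from its config file. The following is a minimal sketch, assuming the config is installed as /boot/config-$(uname -r); the helper name has_poll_controller is hypothetical and not part of any Ubuntu tooling.

```shell
# Hypothetical helper: succeed if a kernel config file enables
# CONFIG_NET_POLL_CONTROLLER. The default path is an assumption about
# where the running kernel's config is installed.
has_poll_controller() {
    config_file="${1:-/boot/config-$(uname -r)}"
    # Match the whole line so "# CONFIG_NET_POLL_CONTROLLER is not set"
    # does not count as enabled. A missing file is treated as "not enabled".
    grep -qx 'CONFIG_NET_POLL_CONTROLLER=y' "$config_file" 2>/dev/null
}

# Example:
#   if has_poll_controller; then echo "poll controller enabled"; fi
```

A result of "enabled" only shows that the option is compiled in; it does not by itself show that the lockup will occur.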

[Test Plan]
Run high stress network traffic on an Azure instance.
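
After such a stress run, the kernel log can be scanned for soft lockup reports. This is a sketch only: it assumes the watchdog message contains the phrase "soft lockup", and the helper name count_soft_lockups and the file path in the example are hypothetical.

```shell
# Hypothetical helper: count soft-lockup reports in a saved kernel log.
# Assumes the watchdog message contains the phrase "soft lockup", e.g.
# "watchdog: BUG: soft lockup - CPU#1 stuck for 22s!".
count_soft_lockups() {
    log_file="$1"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'soft lockup' "$log_file"
}

# Example: save the log after the stress run, then count the reports.
#   dmesg > /tmp/dmesg-after.txt
#   count_soft_lockups /tmp/dmesg-after.txt
```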

[Where problems could occur]
There could be a decrease in performance or dropped packets due to NAPI not being able to keep up.

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Bionic):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Tim Gardner (timg-tpi)
tags: added: bot-stop-nagging
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1924314

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Tim Gardner (timg-tpi) wrote :

Test kernels at https://launchpad.net/~timg-tpi/+archive/ubuntu/netvsc-lp1924314:

sudo add-apt-repository ppa:timg-tpi/netvsc-lp1924314
sudo apt-get update

Tim Gardner (timg-tpi) wrote :
Tim Gardner (timg-tpi)
Changed in linux-azure-4.15 (Ubuntu Bionic):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Tim Gardner (timg-tpi)
description: updated
Tim Gardner (timg-tpi)
Changed in linux-azure-4.15 (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Bionic):
status: In Progress → Invalid
assignee: Tim Gardner (timg-tpi) → nobody
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Tim Gardner (timg-tpi) wrote :

Microsoft says, "We have not seen this issue in our regular testing. The issue only showed up in a customer’s environment. For this patch, regression tests should be good enough. Regression tests are performed before a new SRU kernel is signed off on."

Given that no regressions have been reported, I am marking this verification done.

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure-4.15 - 4.15.0-1115.128

---------------
linux-azure-4.15 (4.15.0-1115.128) bionic; urgency=medium

  * bionic/linux-azure-4.15: 4.15.0-1115.128 -proposed tracker (LP: #1927631)

  * Fix kdump failures (LP: #1927518)
    - video: hyperv_fb: Add ratelimit on error message
    - Drivers: hv: vmbus: Increase wait time for VMbus unload
    - Drivers: hv: vmbus: Initialize unload_event statically

  * netvsc may cause CPU lockup (LP: #1924314)
    - hv_netvsc: remove ndo_poll_controller

  [ Ubuntu: 4.15.0-144.148 ]

  * bionic/linux: 4.15.0-144.148 -proposed tracker (LP: #1927648)
  * Introduce the 465 driver series, fabric-manager, and libnvidia-nscq
    (LP: #1925522)
    - debian/dkms-versions -- add NVIDIA 465 and migrate 450 to 460
  * xfrm_policy.sh / pmtu.sh / udpgso_bench.sh from net in
    ubuntu_kernel_selftests will fail if running the whole suite (LP: #1856010)
    - selftests/net: bump timeout to 5 minutes
  * locking/qrwlock: Fix ordering in queued_write_lock_slowpath() (LP: #1926184)
    - locking/barriers: Introduce smp_cond_load_relaxed() and
      atomic_cond_read_relaxed()
    - locking/qrwlock: Fix ordering in queued_write_lock_slowpath()
  * Bionic update: upstream stable patchset 2021-04-30 (LP: #1926808)
    - net: fec: ptp: avoid register access when ipg clock is disabled
    - powerpc/4xx: Fix build errors from mfdcr()
    - atm: eni: dont release is never initialized
    - atm: lanai: dont run lanai_dev_close if not open
    - Revert "r8152: adjust the settings about MAC clock speed down for RTL8153"
    - ixgbe: Fix memleak in ixgbe_configure_clsu32
    - net: tehuti: fix error return code in bdx_probe()
    - sun/niu: fix wrong RXMAC_BC_FRM_CNT_COUNT count
    - gpiolib: acpi: Add missing IRQF_ONESHOT
    - nfs: fix PNFS_FLEXFILE_LAYOUT Kconfig default
    - NFS: Correct size calculation for create reply length
    - net: hisilicon: hns: fix error return code of hns_nic_clear_all_rx_fetch()
    - net: wan: fix error return code of uhdlc_init()
    - atm: uPD98402: fix incorrect allocation
    - atm: idt77252: fix null-ptr-dereference
    - sparc64: Fix opcode filtering in handling of no fault loads
    - u64_stats,lockdep: Fix u64_stats_init() vs lockdep
    - drm/radeon: fix AGP dependency
    - nfs: we don't support removing system.nfs4_acl
    - ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls
    - ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign
    - squashfs: fix inode lookup sanity checks
    - squashfs: fix xattr id and id lookup sanity checks
    - arm64: dts: ls1046a: mark crypto engine dma coherent
    - arm64: dts: ls1012a: mark crypto engine dma coherent
    - arm64: dts: ls1043a: mark crypto engine dma coherent
    - ARM: dts: at91-sama5d27_som1: fix phy address to 7
    - dm ioctl: fix out of bounds array access when no devices
    - bus: omap_l3_noc: mark l3 irqs as IRQF_NO_THREAD
    - libbpf: Fix INSTALL flag order
    - macvlan: macvlan_count_rx() needs to be aware of preemption
    - net: dsa: bcm_sf2: Qualify phydev->dev_flags based on port
    - e1000e: add rtnl_lock() to e1000_reset_task
    - e1000e: Fix error handling in e1000_set_...

Changed in linux-azure-4.15 (Ubuntu Bionic):
status: Fix Committed → Fix Released
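
Since the fix was released in linux-azure-4.15 4.15.0-1115.128, an instance can compare its running kernel release against that ABI number. The sketch below assumes the release string has the form 4.15.0-<abi>-azure (as printed by uname -r on these kernels); the helper name has_netvsc_fix is hypothetical, and the check does not apply to other kernel series.

```shell
# Hypothetical helper: succeed if a 4.15 Azure kernel release string is at
# or above ABI 1115, the release in which this bug was fixed.
# Returns 2 for release strings that are not of the form 4.15.0-<abi>-azure.
has_netvsc_fix() {
    release="$1"                 # e.g. the output of: uname -r
    case "$release" in
        4.15.0-*-azure) ;;
        *) return 2 ;;           # different series; this check does not apply
    esac
    abi="${release#4.15.0-}"     # strip the leading "4.15.0-"
    abi="${abi%-azure}"          # strip the trailing "-azure"
    [ "$abi" -ge 1115 ]
}

# Example:
#   if has_netvsc_fix "$(uname -r)"; then echo "fix included"; fi
```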