Azure: Mellanox VF NIC crashes when removed

Bug #1973758 reported by Tim Gardner
This bug affects 1 person
Affects               Status        Importance  Assigned to  Milestone
linux-azure (Ubuntu)  Invalid       Undecided   Unassigned
  Focal               Fix Released  Medium      Tim Gardner

Bug Description

SRU Justification

[Impact]

The 5.4.0-1075-azure and newer kernels can easily panic when the Mellanox VF NIC is removed and re-added, either by Azure host servicing events or by the manual "unbind/bind" test below (the GUID can differ between VMs):

for i in `seq 1 1000`;
do
    cd /sys/bus/vmbus/drivers/hv_pci;
    echo abdc2107-402e-4704-8c88-c2b850696c3c > unbind;
    echo abdc2107-402e-4704-8c88-c2b850696c3c > bind;
done
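
Because the GUID differs between VMs, the same test can be sketched without hard-coding it. This is only an illustration: rebind_all is a hypothetical helper name, and it relies on the usual sysfs convention that devices bound to a driver appear in the driver directory as symlinks named after the device. It cycles every device bound to the given driver, so on a VM with other PCI pass-through devices it should be narrowed to the VF NIC first.

```shell
# Unbind and rebind every device bound to a vmbus driver, N times each.
# rebind_all is a hypothetical helper name; run as root on a test VM only.
rebind_all() {
    driver_dir=$1
    iterations=$2
    # Bound devices show up in the driver directory as GUID-named symlinks.
    for entry in "$driver_dir"/????????-????-????-????-????????????; do
        [ -L "$entry" ] || continue
        guid=${entry##*/}
        i=0
        while [ "$i" -lt "$iterations" ]; do
            echo "$guid" > "$driver_dir/unbind" || return 1
            echo "$guid" > "$driver_dir/bind" || return 1
            i=$((i + 1))
        done
    done
}

# Example (as root): rebind_all /sys/bus/vmbus/drivers/hv_pci 1000
```

On an affected kernel the panic would be expected within the loop, as in the original test; on a fixed kernel the function should return 0.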

A sample panic call-trace is:
[ 107.359954] kernel BUG at /build/linux-azure-5.4-4I3kFs/linux-azure-5.4-5.4.0/mm/slub.c:4020!
[ 107.363858] invalid opcode: 0000 [#1] SMP NOPTI
[ 107.365870] CPU: 0 PID: 334 Comm: kworker/0:2 Not tainted 5.4.0-1077-azure #80~18.04.1-Ubuntu
[ 107.369589] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 107.373811] Workqueue: events vmbus_onmessage_work
[ 107.375909] RIP: 0010:kfree+0x1d2/0x240

[ 107.413789] Call Trace:
[ 107.414867] kobject_uevent_env+0x1b5/0x7e0
[ 107.416747] kobject_uevent+0xb/0x10
[ 107.418327] device_release_driver_internal+0x191/0x1c0
[ 107.420653] device_release_driver+0x12/0x20
[ 107.422523] bus_remove_device+0xe1/0x150
[ 107.424279] device_del+0x167/0x380
[ 107.425824] device_unregister+0x1a/0x60
[ 107.427536] vmbus_device_unregister+0x27/0x50
[ 107.429528] vmbus_onoffer_rescind+0x1d0/0x1f0
[ 107.431474] vmbus_onmessage+0x2c/0x70
[ 107.433104] vmbus_onmessage_work+0x22/0x30
[ 107.434919] process_one_work+0x209/0x400
[ 107.436661] worker_thread+0x34/0x40

The cause is a bug in the backport https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/bionic/commit/?id=16a3c750a78d8, which omits the second hunk of the upstream patch https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=877b911a5ba0. As a result, hv_pci_remove() still releases hbus with free_page() instead of kfree().

Please apply the below patch to fix the issue:

--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3653,7 +3653,7 @@ static int hv_pci_remove(struct hv_device *hdev)

        hv_put_dom_num(hbus->bridge->domain_nr);

-       free_page((unsigned long)hbus);
+       kfree(hbus);
        return ret;
 }

BTW, please apply this patch as well (note: it is not strictly required, since it only affects the error handling path, which is rarely taken):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=42c3d41832ef4fcf60aaa6f748de01ad99572adf
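
A quick way to confirm that a source tree carries both changes might be to look for any remaining free_page() call on hbus. This is a sketch under an assumption: that the error-path patch makes the same free_page() to kfree() substitution as the hunk above, so a patched pci-hyperv.c contains no such call at all. check_hbus_free is a hypothetical helper name.

```shell
# Report any place where the hbus buffer is still released with free_page().
# check_hbus_free is a hypothetical helper; pass the path to pci-hyperv.c.
check_hbus_free() {
    src=$1
    # -F matches the text literally, -n prints the offending line numbers.
    if grep -nF 'free_page((unsigned long)hbus)' "$src"; then
        echo "mismatch: hbus is still freed with free_page()" >&2
        return 1
    fi
    echo "ok: no free_page() call on hbus"
}

# Example: check_hbus_free drivers/pci/controller/pci-hyperv.c
```

A non-zero return only says that the pattern is still present; it does not identify which of the two patches is missing.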

[Test Case]

Microsoft tested


Tim Gardner (timg-tpi)
affects: linux (Ubuntu) → linux-azure (Ubuntu)
Changed in linux-azure (Ubuntu):
status: New → Invalid
Changed in linux-azure (Ubuntu Focal):
assignee: nobody → Tim Gardner (timg-tpi)
importance: Undecided → Medium
status: New → In Progress
Tim Gardner (timg-tpi) wrote :

Disregard the previous comment. Wrong bug.

Patch submitted: https://lists.ubuntu.com/archives/kernel-team/2022-May/130376.html

Matthew G McGovern (mcgovern) wrote :

I don't think this fix is correct. I am able to reproduce the crash in a Hyper-V VM, and the repro case generates a different KASAN bad free.

If you want to test (and have an environment with an SR-IOV interface), you can toggle SR-IOV rapidly with this PowerShell snippet:
$vmname= 'name of vm'
while (1){
 Set-VMNetworkAdapter -VMName $vmname -IovWeight 100
 Set-VMNetworkAdapter -VMName $vmname -IovWeight 0
}

Dexuan Cui (decui) wrote :

I checked with Matthew and found that he had applied only the first patch [1]; after I applied the second patch [2], I no longer see any crash or memory corruption in Matthew's VM.

BTW, the Windows Server 2019 host running Matthew's VM does not handle NIC SR-IOV correctly. When SR-IOV is enabled, the host offers an Intel VF NIC to the VM, then immediately removes/rescinds the VF and never re-offers it. The rescind causes hv_pci_probe() to fail, which triggers the bug on its error handling path. In other words, NIC SR-IOV doesn't work on this host, but that's a host bug that the host team needs to investigate.

[0] https://lists.ubuntu.com/archives/kernel-team/2022-May/130378.html
[1] https://lists.ubuntu.com/archives/kernel-team/2022-May/130379.html
[2] https://lists.ubuntu.com/archives/kernel-team/2022-May/130380.html

Tim Gardner (timg-tpi)
Changed in linux-azure (Ubuntu Focal):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.4.0-1081.84 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!
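
Before re-running the test, it can help to confirm that the VM actually booted the -proposed kernel. The sketch below is an assumption-laden helper, not part of the verification process: azure_abi_at_least is a hypothetical name, and it assumes `uname -r` reports a string of the form 5.4.0-<ABI>-azure, as in the panic log above.

```shell
# Check whether a 5.4 linux-azure release string has at least the given ABI.
# azure_abi_at_least is a hypothetical helper; returns 2 for other kernels.
azure_abi_at_least() {
    release=$1   # e.g. "5.4.0-1083-azure"
    wanted=$2    # e.g. 1081
    case $release in
        5.4.0-*-azure) ;;
        *) return 2 ;;
    esac
    abi=${release#5.4.0-}   # strip the version prefix
    abi=${abi%-azure}       # strip the flavour suffix
    [ "$abi" -ge "$wanted" ]
}

# Example: azure_abi_at_least "$(uname -r)" 1081 && echo "running the proposed kernel or newer"
```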

tags: added: verification-needed-focal
Tim Gardner (timg-tpi) wrote :

Microsoft tested. Marking verification done.

tags: added: verification-done-focal
removed: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.4.0-1083.87

---------------
linux-azure (5.4.0-1083.87) focal; urgency=medium

  [ Ubuntu: 5.4.0-117.132 ]

  * CVE-2022-1966
    - netfilter: nf_tables: add nft_set_elem_expr_alloc()
    - netfilter: nf_tables: disallow non-stateful expression in sets earlier

 -- Thadeu Lima de Souza Cascardo <email address hidden> Wed, 01 Jun 2022 21:52:55 -0300

Changed in linux-azure (Ubuntu Focal):
status: Fix Committed → Fix Released