Removing legacy virtio-pci devices causes kernel panic

Noble (24.04)
Bug #2067862

Bug #2067862 reported by Dong Liang on 2024-06-03

This bug affects 1 person

Affects          Status          Importance   Assigned to       Milestone
linux (Ubuntu)   Fix Released    Undecided    Unassigned
Noble            Fix Committed   Medium       Matthew Ruffell

Bug Description

BugLink: https://bugs.launchpad.net/bugs/2067862

[Impact]

If you detach a legacy virtio-pci device from a current Noble system, the kernel dereferences a NULL pointer and panics. This is an issue if you force Noble to use legacy virtio-pci devices, or run Noble on very old hypervisors that only support legacy virtio-pci devices, e.g. Trusty and older.

BUG: kernel NULL pointer dereference, address: 0000000000000000
...
CPU: 2 PID: 358 Comm: kworker/u8:3 Kdump: loaded Not tainted 6.8.0-31-generic #31-Ubuntu
Workqueue: kacpi_hotplug acpi_hotplug_work_fn
RIP: 0010:0x0
...
Call Trace:
<TASK>
? show_regs+0x6d/0x80
? __die+0x24/0x80
? page_fault_oops+0x99/0x1b0
? do_user_addr_fault+0x2ee/0x6b0
? exc_page_fault+0x83/0x1b0
? asm_exc_page_fault+0x27/0x30
vp_del_vqs+0x6e/0x2a0
remove_vq_common+0x166/0x1a0
virtnet_remove+0x61/0x80
virtio_dev_remove+0x3f/0xc0
device_remove+0x40/0x80
device_release_driver_internal+0x20b/0x270
device_release_driver+0x12/0x20
bus_remove_device+0xcb/0x140
device_del+0x161/0x3e0
? pci_bus_generic_read_dev_vendor_id+0x2c/0x1a0
device_unregister+0x17/0x60
unregister_virtio_device+0x16/0x40
virtio_pci_remove+0x43/0xa0
pci_device_remove+0x36/0xb0
device_remove+0x40/0x80
device_release_driver_internal+0x20b/0x270
device_release_driver+0x12/0x20
pci_stop_bus_device+0x7a/0xb0
pci_stop_and_remove_bus_device+0x12/0x30
disable_slot+0x4f/0xa0
acpiphp_disable_and_eject_slot+0x1c/0xa0
hotplug_event+0x11b/0x280
? __pfx_acpiphp_hotplug_notify+0x10/0x10
acpiphp_hotplug_notify+0x27/0x70
acpi_device_hotplug+0xb6/0x300
acpi_hotplug_work_fn+0x1e/0x40
process_one_work+0x16c/0x350
worker_thread+0x306/0x440
? _raw_spin_lock_irqsave+0xe/0x20
? __pfx_worker_thread+0x10/0x10
kthread+0xef/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x44/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>

The issue was introduced in:

commit fd27ef6b44bec26915c5b2b22c13856d9f0ba17a
Author: Feng Liu <email address hidden>
Date: Tue Dec 19 11:32:40 2023 +0200
Subject: virtio-pci: Introduce admin virtqueue
Link: https://github.com/torvalds/linux/commit/fd27ef6b44bec26915c5b2b22c13856d9f0ba17a

Modern virtio-pci devices are not affected. For a legacy virtio device, the is_avq function pointer in its virtio_pci_device structure is never assigned, so the call in if (vp_dev->is_avq(vdev, vq->index)) dereferences a NULL pointer.
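
To illustrate the failure mode, here is a minimal user-space sketch, not kernel code. The names model_vp_dev, model_modern_probe, model_legacy_probe and model_modern_is_avq, and the admin queue index 2, are hypothetical stand-ins. It assumes the modern probe path installs the callback; the only detail taken from this report is that the legacy path leaves is_avq unassigned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct virtio_pci_device. */
struct model_vp_dev {
    /* Optional callback: reports whether a queue index is the admin queue. */
    bool (*is_avq)(const struct model_vp_dev *dev, unsigned int index);
    unsigned int admin_vq_index;
};

static bool model_modern_is_avq(const struct model_vp_dev *dev,
                                unsigned int index)
{
    return index == dev->admin_vq_index;
}

/* Modern probe path (assumed): zero the struct, then install the callback. */
static void model_modern_probe(struct model_vp_dev *dev)
{
    assert(dev != NULL);
    *dev = (struct model_vp_dev){0};
    dev->admin_vq_index = 2; /* arbitrary index chosen for this sketch */
    dev->is_avq = model_modern_is_avq;
}

/* Legacy probe path: zero the struct and never assign is_avq. */
static void model_legacy_probe(struct model_vp_dev *dev)
{
    assert(dev != NULL);
    *dev = (struct model_vp_dev){0};
    /* dev->is_avq stays NULL; an unguarded call would jump to address 0. */
}
```

After model_legacy_probe(), dev.is_avq compares equal to NULL. An unguarded call through it is consistent with the "RIP: 0010:0x0" line in the trace above, where the CPU tried to fetch an instruction from address 0.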

There is no workaround. If you are affected, the only option for now is to avoid detaching devices.

[Fix]

This was fixed in 6.9-rc1 by:

commit c8fae27d141a32a1624d0d0d5419d94252824498
From: Li Zhang <email address hidden>
Date: Sat, 16 Mar 2024 13:25:54 +0800
Subject: virtio-pci: Check if is_avq is NULL
Link: https://github.com/torvalds/linux/commit/c8fae27d141a32a1624d0d0d5419d94252824498

This is a clean cherry-pick to Noble. The commit adds a NULL check on the is_avq pointer before it is called.
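
The effect of the check can be sketched in user-space C. This is a simplified model, not the upstream diff: model_vp_dev, model_is_avq and model_count_deleted_vqs are hypothetical names, and the loop only counts queues instead of freeing them. The guarded condition follows the pattern the commit subject describes, testing is_avq for NULL before calling it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct virtio_pci_device. */
struct model_vp_dev {
    bool (*is_avq)(const struct model_vp_dev *dev, unsigned int index);
    unsigned int admin_vq_index;
};

/* Example callback a modern device might install (an assumption). */
static bool model_is_avq(const struct model_vp_dev *dev, unsigned int index)
{
    return index == dev->admin_vq_index;
}

/*
 * Models a teardown loop over nvqs queues. The admin queue is skipped;
 * every other queue counts as deleted. Returns the number deleted.
 */
static unsigned int model_count_deleted_vqs(const struct model_vp_dev *dev,
                                            unsigned int nvqs)
{
    unsigned int deleted = 0;

    assert(dev != NULL);
    for (unsigned int i = 0; i < nvqs; i++) {
        /* Guard: only call the callback if one was installed. */
        if (dev->is_avq && dev->is_avq(dev, i))
            continue;
        deleted++;
    }
    return deleted;
}
```

With a legacy-style device whose is_avq is NULL, the short-circuit && means the callback is never invoked and every queue is processed. With a callback installed, the admin queue is skipped as before.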

[Testcase]

Start a fresh Noble VM.

Edit the grub kernel command line:

1) sudo vim /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="virtio_pci.force_legacy=1"
2) sudo update-grub
3) sudo reboot

Outside the VM, on the host:

$ qemu-img create -f qcow2 /root/share-device.qcow2 2G
$ cat >> share-device.xml << EOF
<disk type='file' device='disk'>
    <driver name='qemu' type='qcow2' cache='writeback' io='threads'/>
    <source file='/root/share-device.qcow2'/>
    <target dev='vdc' bus='virtio'/>
</disk>
EOF
$ sudo -s
# virsh attach-device noble-test share-device.xml --config --live
# virsh detach-device noble-test share-device.xml --config --live

A kernel panic should occur.

There is a test kernel available in:

https://launchpad.net/~mruffell/+archive/ubuntu/lp2067862-test

If you install it, the panic should no longer occur.

[Where problems could occur]

We are adding a basic null pointer check right before the pointer is about to be used, which is quite low risk.

If a regression were to occur, it would only affect VMs using legacy virtio-pci devices, which is not the default. It could have a large impact on fleets of very old hypervisors running Trusty, Precise or Lucid, but such fleets are very unlikely in this day and age.

[Other Info]

Upstream mailing list discussion and author testcase:
https://lore.kernel<email address hidden>/T/#m167335bf7ab09b12fec3bdc5d46a30bc2e26cac7

Dong Liang (qidong-ld) wrote on 2024-06-03:

The backtrace is as follows:
[   72.571019] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   72.571084] #PF: supervisor instruction fetch in kernel mode
[   72.571128] #PF: error_code(0x0010) - not-present page
[   72.571167] PGD 0 P4D 0 
[   72.571190] Oops: 0010 [#1] PREEMPT SMP NOPTI
[   72.571225] CPU: 2 PID: 358 Comm: kworker/u8:3 Kdump: loaded Not tainted 6.8.0-31-generic #31-Ubuntu
[   72.571344] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   72.571386] RIP: 0010:0x0
[   72.571417] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[   72.571468] RSP: 0018:ffffb0c880307a80 EFLAGS: 00010216
[   72.571508] RAX: 0000000000000000 RBX: ffff8af8c1b08800 RCX: 0000000000000000
[   72.571561] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8af8c1b08800
[   72.571616] RBP: ffffb0c880307ab8 R08: 0000000000000000 R09: 0000000000000000
[   72.571667] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8af8c550c700
[   72.571717] R13: ffff8af8c1b08b28 R14: ffff8af8c550c200 R15: 0000000000000080
[   72.571768] FS:  0000000000000000(0000) GS:ffff8af9e8100000(0000) knlGS:0000000000000000
[   72.571825] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   72.571867] CR2: ffffffffffffffd6 CR3: 000000014f23c006 CR4: 00000000007706f0
[   72.571921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   72.571972] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   72.572023] PKRU: 55555554
[   72.572046] Call Trace:
[   72.572068]  <TASK>
[   72.572087]  ? show_regs+0x6d/0x80
[   72.572117]  ? __die+0x24/0x80
[   72.572144]  ? page_fault_oops+0x99/0x1b0
[   72.572177]  ? do_user_addr_fault+0x2ee/0x6b0
[   72.572211]  ? exc_page_fault+0x83/0x1b0
[   72.572244]  ? asm_exc_page_fault+0x27/0x30
[   72.572279]  vp_del_vqs+0x6e/0x2a0
[   72.572308]  remove_vq_common+0x166/0x1a0
[   72.572341]  virtnet_remove+0x61/0x80
[   72.572370]  virtio_dev_remove+0x3f/0xc0
[   72.572402]  device_remove+0x40/0x80
[   72.572433]  device_release_driver_internal+0x20b/0x270
[   72.572477]  device_release_driver+0x12/0x20
[   72.572510]  bus_remove_device+0xcb/0x140
[   72.572542]  device_del+0x161/0x3e0
[   72.572571]  ? pci_bus_generic_read_dev_vendor_id+0x2c/0x1a0
[   72.572617]  device_unregister+0x17/0x60
[   72.572648]  unregister_virtio_device+0x16/0x40
[   72.572684]  virtio_pci_remove+0x43/0xa0
[   72.572714]  pci_device_remove+0x36/0xb0
[   72.572746]  device_remove+0x40/0x80
[   72.572919]  device_release_driver_internal+0x20b/0x270
[   72.573083]  device_release_driver+0x12/0x20
[   72.573241]  pci_stop_bus_device+0x7a/0xb0
[   72.573394]  pci_stop_and_remove_bus_device+0x12/0x30
[   72.573552]  disable_slot+0x4f/0xa0
[   72.573705]  acpiphp_disable_and_eject_slot+0x1c/0xa0
[   72.573860]  hotplug_event+0x11b/0x280
[   72.574006]  ? __pfx_acpiphp_hotplug_notify+0x10/0x10
[   72.574159]  acpiphp_hotplug_notify+0x27/0x70
[   72.574304]  acpi_device_hotplug+0xb6/0x300
[   72.574452]  acpi_hotplug_work_fn+0x1e/0x40
[   72.574598]  process_one_work+0x16c/0x350
[   72.574742]  worker_thread+0x306/0x440
[   72.574878]  ? _raw_spin_lock_irqsave+0xe/0x20
[   72.575017]  ? __pfx_worker_thread+0x10/0x10
[   72.575152]  kthread+0xef/0x120
[   72.575285]  ? __pfx_kthread+0x10/0x10
[   72.575414]  ret_from_fork+0x44/0x70
[   72.575548]  ? __pfx_kthread+0x10/0x10
[   72.575677]  ret_from_fork_asm+0x1b/0x30
[   72.575803]  </TASK>

When the code reaches if (vp_dev->is_avq(vdev, vq->index)), is_avq is a NULL pointer because it is never initialized for a legacy virtio device, so the call faults and the kernel oopses.

Dong Liang (qidong-ld) on 2024-06-13

tags:	added: linux-image-generic
tags:	added: kernel-bug removed: linux-image-generic

Matthew Ruffell (mruffell) on 2024-06-14

Changed in linux (Ubuntu):
status:	New → Fix Released

Matthew Ruffell (mruffell) wrote on 2024-06-14:

Hi Dong,

I have been reading:

https://lore.kernel<email address hidden>/T/#m167335bf7ab09b12fec3bdc5d46a30bc2e26cac7

and I tried to reproduce the problem with 23.10's userspace, but I can't see the same crash.

By virtio legacy devices, you mean 0.97 virtio devices from ~2014, right? Would I need a 14.04 hypervisor to see this issue? I can try to deploy one.

In any case, I have started building the below commit into a test kernel for you to try.

This kernel is going to take about 3 hours to compile, so wait 3 hours after this message before installing. You can also check status here:
https://launchpad.net/~mruffell/+archive/ubuntu/lp2067862-test

Please note this package is NOT SUPPORTED by Canonical, and is for TESTING
PURPOSES ONLY. ONLY Install in a dedicated test environment.

Instructions to Install (On a noble system):
1) sudo add-apt-repository ppa:mruffell/lp2067862-test
2) sudo apt update
3) sudo apt install linux-image-unsigned-6.8.0-31-generic linux-modules-6.8.0-31-generic linux-modules-extra-6.8.0-31-generic linux-headers-6.8.0-31-generic
4) sudo reboot
5) uname -rv
Look for "6.8.0-31.31+TEST2067862v20240614b1".

You might be asked to abort removing the currently running kernel. Say no.

Does the test kernel fix your issue?

Can you help me define the testcase so I can write an SRU template? How do you reproduce the problem?

Thanks,
Matthew

Dong Liang (qidong-ld) wrote on 2024-06-14:

Hi Matthew Ruffell
First of all, thank you for your response.

If you want to reproduce this issue, you can try the following steps:

1. Boot the Ubuntu 24 system and modify the kernel boot parameters to include virtio_pci.force_legacy=1.
2. Use perf-tools or ftrace to trace the return value of the virtio_pci_legacy_probe function.
3. Insert a virtio network card or disk, and observe the return value of the virtio_pci_legacy_probe function within the guest OS to confirm a successful return of 0x0, indicating that the virtio device has been enabled using virtio_pci_legacy_probe.
4. Remove the network card or disk, and you should be able to reproduce the issue.

I have installed and tested the test kernel you built on Ubuntu 24, and it's working fine. Additionally, I have also backported the patch from https://github.com/torvalds/linux/commit/c8fae27d141a32a1624d0d0d5419d94252824498 (virtio-pci: Check if is_avq is NULL) and compiled the Ubuntu kernel, which also has no issues.

Matthew Ruffell (mruffell) on 2024-06-18

summary:	- remove virtio legacy device make kernel Oops
	+ Removing legacy virtio-pci devices causes kernel panic
Changed in linux (Ubuntu Noble):
status:	New → In Progress
importance:	Undecided → Medium
assignee:	nobody → Matthew Ruffell (mruffell)

Matthew Ruffell (mruffell) on 2024-06-18

description:	updated

Matthew Ruffell (mruffell) wrote on 2024-06-18:

Hi Dong,

Thanks for trying the test kernel and letting me know it works, and for the help with the testcase.

I have submitted the patch to the Ubuntu Kernel Team mailing list:

Cover letter:
https://lists.ubuntu.com/archives/kernel-team/2024-June/151550.html
Patch:
https://lists.ubuntu.com/archives/kernel-team/2024-June/151551.html

The next step is for it to be reviewed by senior members of the Kernel Team.
If it gets accepted, it will likely be in the 2024.07.08 SRU cycle as per
https://kernel.ubuntu.com/.

I will write back once the patch has been reviewed by the kernel team.

Thanks,
Matthew

Stefan Bader (smb) on 2024-06-21

Changed in linux (Ubuntu Noble):
status:	In Progress → Fix Committed

Matthew Ruffell (mruffell) wrote on 2024-06-23:

Hi Dong,

The Kernel Team reviewed the patch, and it got 3 acks by Senior Kernel Team
members.

https://lists.ubuntu.com/archives/kernel-team/2024-June/151552.html
https://lists.ubuntu.com/archives/kernel-team/2024-June/151564.html
https://lists.ubuntu.com/archives/kernel-team/2024-June/151583.html

It has now been applied to the master-next branch of the Noble kernel:

https://lists.ubuntu.com/archives/kernel-team/2024-June/151658.html

This should be accepted into the 2024.07.08 SRU cycle https://kernel.ubuntu.com/

I'll write back once there is a kernel in -proposed to verify.

Thanks,
Matthew

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2024-07-11:

This bug is awaiting verification that the linux/6.8.0-40.40 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-noble-linux' to 'verification-done-noble-linux'. If the problem still exists, change the tag 'verification-needed-noble-linux' to 'verification-failed-noble-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags:	added: kernel-spammed-noble-linux-v2 verification-needed-noble-linux

Matthew Ruffell (mruffell) wrote on 2024-07-12:

Hi Dong,

The Kernel Team has built the new kernel with the fix. Would you be able to help test it and verify that it fixes the issue?

Instructions to Install (On a noble system):
1) cat << EOF | sudo tee /etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed main universe
EOF
2) sudo apt update
3) sudo apt install linux-image-6.8.0-40-generic linux-modules-6.8.0-40-generic linux-modules-extra-6.8.0-40-generic linux-headers-6.8.0-40-generic
4) sudo reboot
5) uname -rv
6.8.0-40-generic #40-Ubuntu SMP PREEMPT_DYNAMIC Fri Jul 5 10:34:03 UTC 2024

Thanks,
Matthew

Dong Liang (qidong-ld) on 2024-07-12

tags:	added: verification-done-noble-linux
	removed: verification-needed-noble-linux

Dong Liang (qidong-ld) wrote on 2024-07-12:

Hi Matthew Ruffell

Thank you for your long-time support on this case.

I have installed the linux-image-6.8.0-40-generic package using the command provided by you, and have verified the issue on the machine running Ubuntu 24 with kernel version (6.8.0-40-generic #40-Ubuntu SMP PREEMPT_DYNAMIC Fri Jul 5 10:34:03 UTC 2024). The problem has been resolved with this kernel version.

I have changed the tag 'verification-needed-noble-linux' to 'verification-done-noble-linux'. Please verify.

Thanks again.

Dong

Matthew Ruffell (mruffell) wrote on 2024-07-14:

Hi Dong,

Yes, the tag is correct, great news that the kernel fixes the issue.

As for a release schedule, have a look at https://kernel.ubuntu.com/ under 2024.07.08, where we will likely see a release to -updates around the week of the 5th August, if everything goes well.

Thanks,
Matthew
