nbd: requests can become stuck when disconnecting from server with qemu-nbd
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Unassigned |
Bionic | Won't Fix | Undecided | Unassigned |
Focal | Fix Released | Medium | Matthew Ruffell |
Impish | Won't Fix | Medium | Matthew Ruffell |
Jammy | Fix Released | Medium | Matthew Ruffell |
Kinetic | Fix Released | Undecided | Unassigned |
Bug Description
BugLink: https:/
[Impact]
After commit 2516ab1 ("nbd: only clear the queue on device teardown"), present in 4.12-rc1 onward, the NBD_CLEAR_SOCK ioctl can no longer clear requests that are currently being processed. This change was made to fix a race between clearing requests with the NBD_CLEAR_SOCK ioctl and clearing requests during device teardown. This worked for the most part, as several years ago systemd was not set up to watch nbd devices for changes in their state.
But after:
commit f82abfcda58168d
From: Tony Asleson <email address hidden>
Date: Fri, 8 Feb 2019 15:47:10 -0600
Subject: rules: watch metadata changes on nbd devices
Link: https:/
in systemd v242-rc1, nbd* devices were added to a udev rule to watch those devices for changes with the inotify subsystem. From man udev:
> watch
> Watch the device node with inotify; when the node is closed after being
> opened for writing, a change uevent is synthesized.
>
> nowatch
> Disable the watching of a device node with inotify.
This changed the behaviour of device teardown: since systemd now keeps tabs on the device with inotify, outstanding requests cannot be cleared, as nbd_xmit_timeout() will always return 'BLK_EH_
The symptom of this issue is that the nbd subsystem gets stuck, with messages like:
block nbd15: NBD_DISCONNECT
block nbd15: Send disconnect failed -32
...
block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264
...
block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264
...
INFO: task qemu-nbd:1267 blocked for more than 120 seconds.
Not tainted 5.15.0-23-generic #23-Ubuntu
"echo 0 > /proc/sys/
task:qemu-nbd state:D stack: 0 pid: 1267 ppid: 1 flags:0x00000002
Call Trace:
<TASK>
__schedule+
? call_rcu+0xe/0x10
schedule+0x4e/0xb0
blk_mq_
? wait_woken+
blk_mq_
nbd_add_
__nbd_
? security_
nbd_ioctl+
blkdev_
? __fget_
block_
__x64_
do_syscall_
entry_
</TASK>
Additionally, in syslog you will also see systemd-udevd get stuck:
systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is taking a long time
$ ps aux
...
419 1194 root D 0.1 systemd-udevd -
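The symptoms above can be counted mechanically when triaging a machine. The following is a minimal sketch, not part of the original report: `count_nbd_hang_symptoms` is a hypothetical helper name, and the patterns are assumptions taken from the log excerpts shown above.

```shell
#!/bin/bash
# Hypothetical helper: count NBD hang symptoms in kernel log text read
# from stdin. The patterns are modelled on the messages quoted in this
# report and may need adjusting for other kernel versions.
count_nbd_hang_symptoms() {
    awk '
        /block nbd[0-9]+: Possible stuck request/ { stuck++ }
        /INFO: task .* blocked for more than [0-9]+ seconds/ { hung++ }
        END { printf "stuck_requests=%d hung_tasks=%d\n", stuck, hung }
    '
}

# Example usage (reading the kernel log usually needs root):
#   sudo dmesg | count_nbd_hang_symptoms
```

Non-zero counts on a machine that has been connecting and disconnecting nbd devices suggest it may be hitting this bug, but they are not proof on their own.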
We can work around the issue by adding a higher-priority udev rule that stops udev from watching nbd* devices.
$ cat << EOF >> /etc/udev/
# Disable inotify watching of change events for NBD devices
ACTION=
EOF
$ sudo udevadm control --reload-rules
$ sudo udevadm trigger
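To see which rule lines set the watch option for nbd devices, before or after adding the override, a helper along these lines may be useful. This is a sketch only: `find_nbd_watch_rules` is a hypothetical name, the rules directory is supplied by the caller rather than assumed, and the pattern assumes a rule mentions nbd before its OPTIONS assignment on the same line.

```shell
#!/bin/bash
# Hypothetical helper: print udev rule lines (file:line:content) that
# mention nbd and set either the "watch" or the "nowatch" option.
# $1 is the rules directory to search; it is an input, not a default.
find_nbd_watch_rules() {
    local rules_dir="$1"
    grep -rnE --include='*.rules' 'nbd.*OPTIONS[+:]?="(no)?watch"' "$rules_dir"
}

# Example usage, with the directory chosen by the caller:
#   find_nbd_watch_rules /path/to/rules.d
```

The function exits non-zero when no rule matches, which is the expected result on a system where nbd* devices are not in any watch list.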
[Fix]
The fix relies on infrastructure provided by the NBD_CMD_INFLIGHT flag, which was introduced in 5.16 and extended in 5.19. We need to backport all commits related to NBD_CMD_INFLIGHT to our kernels for the fix to be effective.
For Focal, Impish and Jammy:
commit 4e6eef5dc25b528
Author: Yu Kuai <email address hidden>
Date: Thu Sep 16 17:33:44 2021 +0800
Subject: nbd: don't handle response without a corresponding request message
Link: https:/
commit 07175cb1baf4c51
Author: Yu Kuai <email address hidden>
Date: Thu Sep 16 17:33:45 2021 +0800
Subject: nbd: make sure request completion won't concurrent
Link: https:/
commit 2895f1831e911ca
Author: Yu Kuai <email address hidden>
Date: Sat May 21 15:37:46 2022 +0800
Subject: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
Link: https:/
commit 09dadb5985023e2
Author: Yu Kuai <email address hidden>
Date: Sat May 21 15:37:47 2022 +0800
Subject: nbd: fix io hung while disconnecting device
Link: https:/
For Focal only (dependency commits):
commit 7b11eab041dacfe
Author: Keith Busch <email address hidden>
Date: Fri May 29 07:51:59 2020 -0700
Subject: blk-mq: blk-mq: provide forced completion method
Link: https:/
commit 15f73f5b3e5958f
Author: Christoph Hellwig <email address hidden>
Date: Thu Jun 11 08:44:47 2020 +0200
Subject: blk-mq: move failure injection out of blk_mq_
Link: https:/
I want to talk about the backport of "blk-mq: move failure injection out of blk_mq_
Note: The Bionic kernel also contains the bug, but the list of backports needed for the 4.15 kernel is unreasonable, and because systemd in Bionic does not have the udev rule set to watch nbd* devices by default, Bionic users do not hit the issue in practice. Users would only see the problem if they add nbd* to the watch list in /usr/lib/
[Testcase]
The issue can be easily reproduced with:
$ sudo apt install qemu-utils
$ cat << EOF >> reproducer.sh
#!/bin/bash
sudo modprobe nbd
while :
do
qemu-img create -f qcow2 foo.img 500M
sudo qemu-nbd --disconnect /dev/nbd15 || true
sudo qemu-nbd --connect=
sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15
sudo qemu-nbd --disconnect /dev/nbd15
done
EOF
$ chmod +x reproducer.sh
$ yes | ./reproducer.sh
On the Ubuntu kernels, you will see the nbd subsystem hang within 30 seconds or so, and the kernel log will be filled with stuck request messages and hung task timeouts after the 120 second mark.
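While the reproducer runs, the hang also shows up as processes stuck in uninterruptible sleep (state D), as in the `ps` output earlier in this report. A minimal sketch for spotting them follows; `list_uninterruptible` is a hypothetical name, and the two process names it filters on are assumptions based on the processes mentioned in this report.

```shell
#!/bin/bash
# Hypothetical helper: read "pid stat comm" lines on stdin and print the
# pid and name of qemu-nbd or systemd-udevd processes whose state begins
# with D (uninterruptible sleep). Process names may differ on other
# systemd versions.
list_uninterruptible() {
    awk '$2 ~ /^D/ && ($3 == "qemu-nbd" || $3 == "systemd-udevd") { print $1, $3 }'
}

# Example usage:
#   ps -eo pid=,stat=,comm= | list_uninterruptible
```

A process can pass through state D briefly during normal I/O, so only entries that persist across repeated runs point to the hang.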
Test kernels are available in the following ppa:
https:/
If you install these test kernels, and re-run the reproducer, your system will be perfectly stable for many hours with no issues.
I have torture tested the backport to the Focal kernel for more than 4 hours with no issues observed.
[Where problems can occur]
Because the NBD_CMD_INFLIGHT infrastructure has to be backported, this change to the NBD subsystem is quite large. It changes how requests are processed and accounted for, depending on whether they are still outstanding.
For Impish and Jammy, these risks are limited to the NBD subsystem itself.
On Focal, the risk of regression is slightly larger due to the backport of "blk-mq: move failure injection out of blk_mq_
If a regression were to occur, users may see NBD requests fail, occur in the incorrect order, or be cleared at incorrect times. There is no way to disable NBD_CMD_INFLIGHT at runtime, and users would need to revert to an older kernel while a fix is made.
[Other info]
My bug report to upstream can be found in the following mailing list thread:
https:/
You may be wondering why we cannot simply backport "nbd: fix io hung while disconnecting device" and resolve the issue with a one-line change. I built test kernels for this scenario to see if it was possible, and they are available here:
https:/
This did partly fix the issue, in that connecting to and disconnecting from nbd devices kept working for longer than the 30 seconds the kernels lasted previously. However, after about 10 minutes of torture testing, I started hitting race conditions in which requests completed multiple times, leading to the following use-after-frees:
Jammy 5.15 test kernel:
https:/
Impish 5.13 test kernel:
https:/
Focal 5.4 test kernel:
https:/
This was similar to a mailing list thread I found previously:
https:/
https://<email address hidden>/T/
Keith Busch identified this as a double completion of the same request, which is resolved through the NBD_CMD_INFLIGHT infrastructure.
Hence, we cannot just pull in the single-line change; we need the NBD_CMD_INFLIGHT infrastructure for a complete fix.
tags: | added: sts |
summary: |
- nbd locks system?
+ nbd: requests can become stuck when disconnecting from server with qemu-nbd
description: | updated |
tags: | added: impish jammy |
Changed in linux (Ubuntu Bionic): | |
status: | New → Won't Fix |
Changed in linux (Ubuntu Focal): | |
status: | New → In Progress |
Changed in linux (Ubuntu Impish): | |
status: | New → In Progress |
Changed in linux (Ubuntu Jammy): | |
status: | New → In Progress |
Changed in linux (Ubuntu Kinetic): | |
status: | Confirmed → Fix Committed |
Changed in linux (Ubuntu Focal): | |
importance: | Undecided → Medium |
Changed in linux (Ubuntu Impish): | |
importance: | Undecided → Medium |
Changed in linux (Ubuntu Jammy): | |
importance: | Undecided → Medium |
Changed in linux (Ubuntu Focal): | |
assignee: | nobody → Matthew Ruffell (mruffell) |
Changed in linux (Ubuntu Impish): | |
assignee: | nobody → Matthew Ruffell (mruffell) |
Changed in linux (Ubuntu Jammy): | |
assignee: | nobody → Matthew Ruffell (mruffell) |
Changed in linux (Ubuntu Impish): | |
status: | In Progress → Won't Fix |
Changed in linux (Ubuntu Jammy): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Focal): | |
status: | In Progress → Fix Released |
Changed in linux (Ubuntu Jammy): | |
status: | Fix Committed → Fix Released |
Changed in linux (Ubuntu Kinetic): | |
status: | Fix Committed → Fix Released |
Changed in linux (Ubuntu Focal): | |
status: | In Progress → Fix Committed |
Here's the apport file from the VM (which is a very recent clean install);
for some reason I couldn't upload it with ubuntu-bug.