nbd: requests can become stuck when disconnecting from server with qemu-nbd

Bug #1896350 reported by Dan Kegel
This bug affects 2 people
Affects           Status        Importance  Assigned to       Milestone
linux (Ubuntu)    Fix Released  Undecided   Unassigned
Bionic            Won't Fix     Undecided   Unassigned
Focal             Fix Released  Medium      Matthew Ruffell
Impish            Won't Fix     Medium      Matthew Ruffell
Jammy             Fix Released  Medium      Matthew Ruffell
Kinetic           Fix Released  Undecided   Unassigned

Bug Description

BugLink: https://bugs.launchpad.net/bugs/1896350

[Impact]

After commit 2516ab1 ("nbd: only clear the queue on device teardown"), present in 4.12-rc1 onward, the NBD_CLEAR_SOCK ioctl can no longer clear requests currently being processed. This change was made to fix a race between using the NBD_CLEAR_SOCK ioctl to clear requests and the device teardown path clearing requests. This worked for the most part, as several years ago systemd was not set up to watch nbd devices for changes in their state.

But after:

commit f82abfcda58168d9f667e2094d438763531d3fa6
From: Tony Asleson <email address hidden>
Date: Fri, 8 Feb 2019 15:47:10 -0600
Subject: rules: watch metadata changes on nbd devices
Link: https://github.com/systemd/systemd/commit/f82abfcda58168d9f667e2094d438763531d3fa6

in systemd v242-rc1, nbd* devices were added to a udev rule to watch those devices for changes with the inotify subsystem. From man udev:

> watch
> Watch the device node with inotify; when the node is closed after being
> opened for writing, a change uevent is synthesized.
>
> nowatch
> Disable the watching of a device node with inotify.

This changed the behaviour of device teardown: because systemd now keeps tabs on the device with inotify, outstanding requests cannot be cleared, as nbd_xmit_timeout() always returns BLK_EH_RESET_TIMER. Requests get stuck, never to complete, because a disconnect has occurred, and never to time out, as their timers keep being reset.
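The inotify mechanism behind udev's "watch" option can be demonstrated directly. The sketch below (Linux-only, using ctypes because the Python standard library has no inotify binding) watches a temporary file standing in for the real /dev/nbd* node, then opens it for writing and closes it; that close-after-write is exactly the IN_CLOSE_WRITE event that makes udev synthesize a "change" uevent:

```python
# Sketch of the inotify mechanism behind udev's "watch" option: closing a
# node that was opened for writing raises IN_CLOSE_WRITE, which udev turns
# into a synthesized "change" uevent. Linux-only; a plain temp file stands
# in for the real /dev/nbd* device node.
import ctypes
import os
import struct
import tempfile

libc = ctypes.CDLL("libc.so.6", use_errno=True)
IN_CLOSE_WRITE = 0x00000008  # from <sys/inotify.h>

def close_write_fires_event(path: str) -> bool:
    """Open path for writing, close it, and report whether inotify saw it."""
    ifd = libc.inotify_init()
    assert ifd >= 0
    wd = libc.inotify_add_watch(ifd, path.encode(), IN_CLOSE_WRITE)
    assert wd >= 0
    fd = os.open(path, os.O_WRONLY)
    os.write(fd, b"x")
    os.close(fd)                 # the close-after-write that udev watches for
    data = os.read(ifd, 4096)    # event was queued above, so this won't block
    os.close(ifd)
    # struct inotify_event: int wd; uint32 mask, cookie, len; (no name for a
    # file watch, so the event is exactly these 16 bytes)
    _, mask, _, _ = struct.unpack_from("iIII", data)
    return bool(mask & IN_CLOSE_WRITE)

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tf:
        path = tf.name
    print(close_write_fires_event(path))  # True on Linux
    os.unlink(path)
```

Because systemd's rule applies this watch to nbd* nodes, every qemu-nbd connect/disconnect cycle triggers a udev worker re-opening the device, which is what keeps requests pinned.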

A symptom of this issue is that the nbd subsystem gets stuck, with kernel messages like:

block nbd15: NBD_DISCONNECT
block nbd15: Send disconnect failed -32
...
block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 30 seconds
...
block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 150 seconds
...
INFO: task qemu-nbd:1267 blocked for more than 120 seconds.
      Not tainted 5.15.0-23-generic #23-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:qemu-nbd state:D stack: 0 pid: 1267 ppid: 1 flags:0x00000002
Call Trace:
 <TASK>
 __schedule+0x23d/0x590
 ? call_rcu+0xe/0x10
 schedule+0x4e/0xb0
 blk_mq_freeze_queue_wait+0x69/0xa0
 ? wait_woken+0x70/0x70
 blk_mq_freeze_queue+0x1b/0x30
 nbd_add_socket+0x76/0x1f0 [nbd]
 __nbd_ioctl+0x18b/0x340 [nbd]
 ? security_capable+0x3d/0x60
 nbd_ioctl+0x81/0xb0 [nbd]
 blkdev_ioctl+0x12e/0x270
 ? __fget_files+0x86/0xc0
 block_ioctl+0x46/0x50
 __x64_sys_ioctl+0x91/0xc0
 do_syscall_64+0x5c/0xc0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
 </TASK>

Additionally, in syslog you will also see systemd-udevd get stuck:

systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is taking a long time

$ ps aux
...
419 1194 root D 0.1 systemd-udevd -

We can work around the issue by adding a higher priority udev rule that disables watching of nbd* devices:

$ sudo tee -a /etc/udev/rules.d/97-nbd-device.rules << EOF
# Disable inotify watching of change events for NBD devices
ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch"
EOF

$ sudo udevadm control --reload-rules
$ sudo udevadm trigger

[Fix]

The fix relies on infrastructure provided by the NBD_CMD_INFLIGHT flag, which was introduced in 5.16 and extended in 5.19. We need to backport all commits related to NBD_CMD_INFLIGHT to our kernels for the fix to be effective.

For Focal, Impish and Jammy:

commit 4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d
Author: Yu Kuai <email address hidden>
Date: Thu Sep 16 17:33:44 2021 +0800
Subject: nbd: don't handle response without a corresponding request message
Link: https://github.com/torvalds/linux/commit/4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d

commit 07175cb1baf4c51051b1fbd391097e349f9a02a9
Author: Yu Kuai <email address hidden>
Date: Thu Sep 16 17:33:45 2021 +0800
Subject: nbd: make sure request completion won't concurrent
Link: https://github.com/torvalds/linux/commit/07175cb1baf4c51051b1fbd391097e349f9a02a9

commit 2895f1831e911ca87d4efdf43e35eb72a0c7e66e
Author: Yu Kuai <email address hidden>
Date: Sat May 21 15:37:46 2022 +0800
Subject: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
Link: https://github.com/torvalds/linux/commit/2895f1831e911ca87d4efdf43e35eb72a0c7e66e

commit 09dadb5985023e27d4740ebd17e6fea4640110e5
Author: Yu Kuai <email address hidden>
Date: Sat May 21 15:37:47 2022 +0800
Subject: nbd: fix io hung while disconnecting device
Link: https://github.com/torvalds/linux/commit/09dadb5985023e27d4740ebd17e6fea4640110e5

For Focal only (dependency commits):

commit 7b11eab041dacfeaaa6d27d9183b247a995bc16d
Author: Keith Busch <email address hidden>
Date: Fri May 29 07:51:59 2020 -0700
Subject: blk-mq: blk-mq: provide forced completion method
Link: https://github.com/torvalds/linux/commit/7b11eab041dacfeaaa6d27d9183b247a995bc16d

commit 15f73f5b3e5958f2d169fe13c420eeeeae07bbf2
Author: Christoph Hellwig <email address hidden>
Date: Thu Jun 11 08:44:47 2020 +0200
Subject: blk-mq: move failure injection out of blk_mq_complete_request
Link: https://github.com/torvalds/linux/commit/15f73f5b3e5958f2d169fe13c420eeeeae07bbf2

I want to talk about the backport of "blk-mq: move failure injection out of blk_mq_complete_request" for Focal, since it changes a number of drivers using the blk_mq_complete_request() call and could be considered a large regression risk. With this patch, blk_should_fake_timeout() relies on CONFIG_FAIL_IO_TIMEOUT being enabled, as well as QUEUE_FLAG_FAIL_IO being set in the blk subsystem. CONFIG_FAIL_IO_TIMEOUT is not enabled on Ubuntu kernels, so blk_should_fake_timeout() just returns false and is more or less a nop on our kernels. Because of this, if (likely(!blk_should_fake_timeout(req->q))) is really just if (true), and I did think about simply folding that into the nbd patch "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed". But after talking with Jay, we decided backporting the entire patch was the best way to proceed, in case any of our users wish to use CONFIG_FAIL_IO_TIMEOUT on a custom kernel build with the nbd subsystem.
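The reasoning above can be modeled in a few lines. This is a toy sketch, not kernel code: with CONFIG_FAIL_IO_TIMEOUT off, the fake-timeout check is a compiled-out stub that always returns false, so the guard the patch adds to every completion path collapses to an unconditional completion:

```python
# Toy model (not the kernel implementation) of why the Focal backport of
# "blk-mq: move failure injection out of blk_mq_complete_request" is low
# risk: with CONFIG_FAIL_IO_TIMEOUT disabled, blk_should_fake_timeout()
# always returns False, so `if not blk_should_fake_timeout(q)` is `if True`.
CONFIG_FAIL_IO_TIMEOUT = False   # the Ubuntu kernel config setting

def blk_should_fake_timeout(queue) -> bool:
    if not CONFIG_FAIL_IO_TIMEOUT:
        return False             # compiled-out stub: never fake a timeout
    # Only reachable on custom builds; models the QUEUE_FLAG_FAIL_IO check,
    # which is normally toggled at runtime via debugfs.
    return getattr(queue, "fail_io", False)

def complete_request(req: dict, completed: list) -> None:
    # The pattern each converted driver now uses on its completion path:
    if not blk_should_fake_timeout(req["queue"]):
        completed.append(req["id"])   # normal completion
    # else: drop the completion so the timeout handler fires instead

done: list = []
complete_request({"id": 1, "queue": None}, done)
print(done)  # [1] -- every request completes normally on Ubuntu kernels
```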

Note: Bionic is also affected, but the backport list for the 4.15 kernel is unreasonable, and because systemd in Bionic does not set the udev rule to watch nbd* devices by default, Bionic is not directly affected by the issue. Users would only see the problem if they add nbd* to the watch list in /usr/lib/udev/rules.d/60-block.rules, or if they run a privileged 20.04 or later LXC container using the host's nbd kernel module. Because of this, Bionic will be marked Won't Fix.

[Testcase]

The issue can be easily reproduced with:

$ sudo apt install qemu-utils

$ cat << EOF > reproducer.sh
#!/bin/bash

sudo modprobe nbd

while :
do
        qemu-img create -f qcow2 foo.img 500M
        sudo qemu-nbd --disconnect /dev/nbd15 || true
        sudo qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img
        sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15
        sudo qemu-nbd --disconnect /dev/nbd15
done
EOF

$ chmod +x reproducer.sh
$ yes | ./reproducer.sh

On the Ubuntu kernels, you will see the nbd subsystem hang within 30 seconds or so, and the kernel log will be filled with stuck request messages and hung task timeouts after the 120 second mark.

Test kernels are available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test

If you install these test kernels, and re-run the reproducer, your system will be perfectly stable for many hours with no issues.

I have torture tested the backport to the Focal kernel for more than 4 hours with no issues observed.

[Where problems can occur]

Due to the need to backport the infrastructure for NBD_CMD_INFLIGHT, this change to the NBD subsystem is quite large, and it changes how requests are processed and accounted for, depending on whether they are still outstanding.

For Impish and Jammy, these risks are limited to the NBD subsystem itself.

On Focal, the risk of regression is slightly larger due to the backport of "blk-mq: move failure injection out of blk_mq_complete_request", but is somewhat mitigated by blk_should_fake_timeout() being a NOP due to CONFIG_FAIL_IO_TIMEOUT being disabled on the Focal kernel.

If a regression were to occur, users may see NBD requests fail, complete in the incorrect order, or be cleared at incorrect times. There is no way to disable NBD_CMD_INFLIGHT at runtime, and users would need to revert to an older kernel while a fix is made.

[Other info]

My bug report to upstream can be found in the following mailing list thread:
https://lkml.org/lkml/2022/4/22/61

You may be wondering why we cannot simply backport "nbd: fix io hung while disconnecting device" and resolve the issue with a one-line change. I actually built test kernels for this scenario to see if it was possible, and they are available here:

https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test-single

While it did sort of fix the issue, in that connects and disconnects to nbd devices lasted longer than the 30 seconds the kernels previously survived, after about 10 minutes of torture testing I started experiencing race conditions where requests completed multiple times, leading to the following use-after-frees:

Jammy 5.15 test kernel:
https://paste.ubuntu.com/p/FSM6DrgjTy/

Impish 5.13 test kernel:
https://paste.ubuntu.com/p/86tNfsM7Vs/

Focal 5.4 test kernel:
https://paste.ubuntu.com/p/zzVzWx23sb/

This was similar to a mailing list thread I found previously:
https://groups.google.com/g/syzkaller-bugs/c/jhvr3Yv_QH8/m/DgGG4xYFEQAJ
https://<email address hidden>/T/

Keith Busch identified this as a double completion of the same request, which is resolved through the NBD_CMD_INFLIGHT infrastructure.

Hence, we cannot just pull in the single-line change; we need the NBD_CMD_INFLIGHT infrastructure for a complete fix.
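The idea behind NBD_CMD_INFLIGHT can be sketched as follows. This is a toy model, not the kernel implementation: each request carries a flag that racing completion paths (the response handler and the disconnect path) must atomically test-and-clear, so exactly one of them gets to complete the request, ruling out the double completion and resulting use-after-free:

```python
# Toy model (not the kernel implementation) of the NBD_CMD_INFLIGHT idea:
# completion paths race, so each request carries a flag that is atomically
# tested-and-cleared. Only the caller that wins the test-and-clear may
# complete the request, so a request can never be completed twice.
import threading

class Request:
    def __init__(self):
        self._lock = threading.Lock()  # stands in for an atomic bit operation
        self.inflight = True           # models the NBD_CMD_INFLIGHT bit
        self.completions = 0

    def test_and_clear_inflight(self) -> bool:
        """Atomically clear the flag; returns True only for the first caller."""
        with self._lock:
            was = self.inflight
            self.inflight = False
            return was

    def complete(self) -> bool:
        if not self.test_and_clear_inflight():
            return False               # some other path already completed it
        self.completions += 1          # safe: only one winner reaches here
        return True

req = Request()
# Racing completers, e.g. the response handler and the disconnect path:
threads = [threading.Thread(target=req.complete) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(req.completions)  # 1 -- exactly one completion wins the race
```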

Revision history for this message
Dan Kegel (dank) wrote :

Here's the apport file from the VM (which is a very recent clean install);
for some reason I couldn't upload it with ubuntu-bug.

Revision history for this message
Dan Kegel (dank) wrote :

FWIW, running same script on ubuntu 16.04 seems to work better.

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Dan Kegel (dank) wrote : Re: nbd locks system?

Also seems to work fine on ubuntu 18.04 (fresh, fully updated).

Revision history for this message
Dan Kegel (dank) wrote :

On today's groovy snapshot (with the default kernel, 5.8.0-generic), the original problem is still present; didn't seem to show up until 2nd run of the bug script.

Revision history for this message
Dan Kegel (dank) wrote :

(The zero size problem mentioned above also occurs occasionally on ubuntu 18.04, and the workaround is to add a sleep in the user script after modprobe nbd, so that's really a separate problem.)

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi Dan,

Thank you for your bug report. I just came across it now, as I ran into the same issue too. Your reproducer works great, and I have started debugging the issue. At this stage it doesn't seem to be a kernel bug, or a qemu-nbd bug. I think the culprit is systemd-udevd or multipathd, as when I disable systemd-udevd, things work fine.

I will add more details to the bug report in a few days or so, once I have determined the root cause and come up with a fix.

Thanks,
Matthew

Tom Zhou (zhouqt)
tags: added: sts
summary: - nbd locks system?
+ nbd: requests can become stuck when disconnecting from server with qemu-
+ nbd
description: updated
tags: added: impish jammy
Changed in linux (Ubuntu Bionic):
status: New → Won't Fix
Changed in linux (Ubuntu Focal):
status: New → In Progress
Changed in linux (Ubuntu Impish):
status: New → In Progress
Changed in linux (Ubuntu Jammy):
status: New → In Progress
Changed in linux (Ubuntu Kinetic):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu Focal):
importance: Undecided → Medium
Changed in linux (Ubuntu Impish):
importance: Undecided → Medium
Changed in linux (Ubuntu Jammy):
importance: Undecided → Medium
Changed in linux (Ubuntu Focal):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in linux (Ubuntu Impish):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in linux (Ubuntu Jammy):
assignee: nobody → Matthew Ruffell (mruffell)
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.15.0-1014.17 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-jammy
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.4.0-1086.91 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Stefan Bader (smb)
Changed in linux (Ubuntu Impish):
status: In Progress → Won't Fix
Changed in linux (Ubuntu Jammy):
status: In Progress → Fix Committed
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Released
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Jammy):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Kinetic):
status: Fix Committed → Fix Released
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Fix released for linux-azure:

linux-azure (5.4.0-1086.91)
linux-azure (5.15.0.1014.17)

Marking back to Fix Committed for Jammy and In progress for Focal to track progress in -generic variants.

Changed in linux (Ubuntu Jammy):
status: Fix Released → Fix Committed
Changed in linux (Ubuntu Focal):
status: Fix Released → In Progress
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Performing verification for Jammy.

I created a new Jammy VM, and installed qemu-utils.

The kernel is 5.15.0-41-generic from -updates.

I ran my reproducer.sh script from the testcase, and within a minute, the nbd request got stuck, and we started seeing hung task timeout oops messages in dmesg:

Jul 20 04:56:20 jammy-nbd kernel: block nbd15: NBD_DISCONNECT
Jul 20 04:56:20 jammy-nbd kernel: block nbd15: Send disconnect failed -32
Jul 20 04:56:20 jammy-nbd sudo[5267]: pam_unix(sudo:session): session closed for user root
Jul 20 04:56:20 jammy-nbd sudo[5271]: ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/bin/qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img
Jul 20 04:56:20 jammy-nbd sudo[5271]: pam_unix(sudo:session): session opened for user root(uid=0) by ubuntu(uid=1000)
Jul 20 04:56:20 jammy-nbd kernel: ldm_validate_partition_table(): Disk read failed.
Jul 20 04:56:20 jammy-nbd kernel: Dev nbd15: unable to read RDB block 0
Jul 20 04:56:20 jammy-nbd kernel: nbd15: unable to read partition table
Jul 20 04:56:51 jammy-nbd kernel: block nbd15: Possible stuck request 0000000064946bb4: control (read@524087296,65536B). Runtime 30 seconds
Jul 20 04:57:19 jammy-nbd systemd-udevd[440]: nbd15: Worker [2561] processing SEQNUM=3062 is taking a long time
Jul 20 04:57:21 jammy-nbd kernel: block nbd15: Possible stuck request 0000000064946bb4: control (read@524087296,65536B). Runtime 60 seconds
Jul 20 04:57:52 jammy-nbd kernel: block nbd15: Possible stuck request 0000000064946bb4: control (read@524087296,65536B). Runtime 90 seconds
Jul 20 04:58:23 jammy-nbd kernel: block nbd15: Possible stuck request 0000000064946bb4: control (read@524087296,65536B). Runtime 120 seconds
Jul 20 04:58:23 jammy-nbd kernel: INFO: task qemu-nbd:5280 blocked for more than 120 seconds.
Jul 20 04:58:23 jammy-nbd kernel: Not tainted 5.15.0-41-generic #44-Ubuntu
Jul 20 04:58:23 jammy-nbd kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 20 04:58:23 jammy-nbd kernel: task:qemu-nbd state:D stack: 0 pid: 5280 ppid: 1 flags:0x00000002
Jul 20 04:58:23 jammy-nbd kernel: Call Trace:
Jul 20 04:58:23 jammy-nbd kernel: <TASK>
Jul 20 04:58:23 jammy-nbd kernel: __schedule+0x23d/0x590
Jul 20 04:58:23 jammy-nbd kernel: ? call_rcu+0xe/0x10
Jul 20 04:58:23 jammy-nbd kernel: schedule+0x4e/0xb0
Jul 20 04:58:23 jammy-nbd kernel: blk_mq_freeze_queue_wait+0x69/0xa0
Jul 20 04:58:23 jammy-nbd kernel: ? wait_woken+0x70/0x70
Jul 20 04:58:23 jammy-nbd kernel: blk_mq_freeze_queue+0x1b/0x30
Jul 20 04:58:23 jammy-nbd kernel: nbd_add_socket+0x76/0x1f0 [nbd]
Jul 20 04:58:23 jammy-nbd kernel: __nbd_ioctl+0x18b/0x340 [nbd]
Jul 20 04:58:23 jammy-nbd kernel: ? security_capable+0x3d/0x60
Jul 20 04:58:23 jammy-nbd kernel: nbd_ioctl+0x81/0xb0 [nbd]
Jul 20 04:58:23 jammy-nbd kernel: blkdev_ioctl+0x12e/0x270
Jul 20 04:58:23 jammy-nbd kernel: ? __fget_files+0x86/0xc0
Jul 20 04:58:23 jammy-nbd kernel: block_ioctl+0x46/0x50
Jul 20 04:58:23 jammy-nbd kernel: __x64_sys_ioctl+0x91/0xc0
Jul 20 04:58:23 jammy-nbd kernel: do_syscall_64+0x5c/0xc0
Jul 20 04:58:23 jammy-nbd kernel: ? exit_to_user_mode_prepare+0x37/0xb0
Ju...


tags: added: verification-done-jammy
removed: verification-needed-focal verification-needed-jammy
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.15.0-43.46

---------------
linux (5.15.0-43.46) jammy; urgency=medium

  * jammy/linux: 5.15.0-43.46 -proposed tracker (LP: #1981243)

  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/2022.07.11)

  * nbd: requests can become stuck when disconnecting from server with qemu-nbd
    (LP: #1896350)
    - nbd: don't handle response without a corresponding request message
    - nbd: make sure request completion won't concurrent
    - nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
    - nbd: fix io hung while disconnecting device

  * Ubuntu 22.04 and 20.04 DPC Fixes for Failure Cases of DownPort Containment
    events (LP: #1965241)
    - PCI/portdrv: Rename pm_iter() to pcie_port_device_iter()
    - PCI: pciehp: Ignore Link Down/Up caused by error-induced Hot Reset
    - [Config] Enable config option CONFIG_PCIE_EDR

  * [SRU] Ubuntu 22.04 Feature Request-Add support for a NVMe-oF-TCP CDC Client
    - TP 8010 (LP: #1948626)
    - nvme: add CNTRLTYPE definitions for 'identify controller'
    - nvme: send uevent on connection up
    - nvme: expose cntrltype and dctype through sysfs

  * [UBUNTU 22.04] Kernel oops while removing device from cio_ignore list
    (LP: #1980951)
    - s390/cio: derive cdev information only for IO-subchannels

  * Jammy Charmed OpenStack deployment fails over connectivity issues when using
    converged OVS bridge for control and data planes (LP: #1978820)
    - net/mlx5e: TC NIC mode, fix tc chains miss table

  * Hairpin traffic does not work with centralized NAT gw (LP: #1967856)
    - net: openvswitch: fix misuse of the cached connection on tuple changes

  * alsa: asoc: amd: the internal mic can't be dedected on yellow carp machines
    (LP: #1980700)
    - ASoC: amd: Add driver data to acp6x machine driver
    - ASoC: amd: Add support for enabling DMIC on acp6x via _DSD

  * AMD ACP 6.x DMIC Supports (LP: #1949245)
    - ASoC: amd: add Yellow Carp ACP6x IP register header
    - ASoC: amd: add Yellow Carp ACP PCI driver
    - ASoC: amd: add acp6x init/de-init functions
    - ASoC: amd: add platform devices for acp6x pdm driver and dmic driver
    - ASoC: amd: add acp6x pdm platform driver
    - ASoC: amd: add acp6x irq handler
    - ASoC: amd: add acp6x pdm driver dma ops
    - ASoC: amd: add acp6x pci driver pm ops
    - ASoC: amd: add acp6x pdm driver pm ops
    - ASoC: amd: enable Yellow carp acp6x drivers build
    - ASoC: amd: create platform device for acp6x machine driver
    - ASoC: amd: add YC machine driver using dmic
    - ASoC: amd: enable Yellow Carp platform machine driver build
    - ASoC: amd: fix uninitialized variable in snd_acp6x_probe()
    - [Config] Enable AMD ACP 6 DMIC Support

  * [UBUNTU 20.04] Include patches to avoid self-detected stall with Secure
    Execution (LP: #1979296)
    - KVM: s390: pv: add macros for UVC CC values
    - KVM: s390: pv: avoid stalls when making pages secure

  * [22.04 FEAT] KVM: Attestation support for Secure Execution (crypto)
    (LP: #1959973)
    - drivers/s390/char: Add Ultravisor io device
    - s390/uv_uapi: depend on CONFIG_S390
    - [Co...


Changed in linux (Ubuntu Jammy):
status: Fix Committed → Fix Released
Stefan Bader (smb)
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Performing verification for Focal.

I started a fresh Focal VM, and installed qemu-utils. I then ran reproducer.sh from the testcase section.

The kernel is 5.4.0-124-generic from -updates.

Within 30 seconds of starting the reproducer, the testcase script hung, and the following was in dmesg:

Aug 23 04:49:26 focal-nbd kernel: block nbd15: NBD_DISCONNECT
Aug 23 04:49:26 focal-nbd kernel: block nbd15: Send disconnect failed -32
Aug 23 04:49:26 focal-nbd sudo[1804]: pam_unix(sudo:session): session closed for user root
Aug 23 04:49:26 focal-nbd sudo[1807]: ubuntu : TTY=pts/0 ; PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/bin/qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img
Aug 23 04:49:26 focal-nbd sudo[1807]: pam_unix(sudo:session): session opened for user root by ubuntu(uid=0)
Aug 23 04:49:26 focal-nbd kernel: ldm_validate_partition_table(): Disk read failed.
Aug 23 04:49:26 focal-nbd kernel: Dev nbd15: unable to read RDB block 0
Aug 23 04:49:26 focal-nbd kernel: nbd15: unable to read partition table
Aug 23 04:49:56 focal-nbd kernel: block nbd15: Possible stuck request 000000004d5cc344: control (read@523988992,36864B). Runtime 30 seconds
Aug 23 04:50:26 focal-nbd systemd-udevd[419]: nbd15: Worker [1198] processing SEQNUM=5582 is taking a long time
Aug 23 04:50:27 focal-nbd kernel: block nbd15: Possible stuck request 000000004d5cc344: control (read@523988992,36864B). Runtime 60 seconds
Aug 23 04:50:58 focal-nbd kernel: block nbd15: Possible stuck request 000000004d5cc344: control (read@523988992,36864B). Runtime 90 seconds
Aug 23 04:51:29 focal-nbd kernel: block nbd15: Possible stuck request 000000004d5cc344: control (read@523988992,36864B). Runtime 120 seconds
Aug 23 04:51:59 focal-nbd kernel: block nbd15: Possible stuck request 000000004d5cc344: control (read@523988992,36864B). Runtime 150 seconds
Aug 23 04:52:26 focal-nbd systemd-udevd[419]: nbd15: Worker [1198] processing SEQNUM=5582 killed
Aug 23 04:52:30 focal-nbd kernel: block nbd15: Possible stuck request 000000004d5cc344: control (read@523988992,36864B). Runtime 180 seconds
Aug 23 04:53:27 focal-nbd kernel: INFO: task qemu-nbd:1815 blocked for more than 120 seconds.
Aug 23 04:53:27 focal-nbd kernel: Not tainted 5.4.0-124-generic #140-Ubuntu
Aug 23 04:53:27 focal-nbd kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 23 04:53:27 focal-nbd kernel: qemu-nbd D 0 1815 1 0x00000000
Aug 23 04:53:27 focal-nbd kernel: Call Trace:
Aug 23 04:53:27 focal-nbd kernel: __schedule+0x2e3/0x740
Aug 23 04:53:27 focal-nbd kernel: ? __kfifo_to_user_r+0xa0/0xa0
Aug 23 04:53:27 focal-nbd kernel: schedule+0x42/0xb0
Aug 23 04:53:27 focal-nbd kernel: blk_mq_freeze_queue_wait+0x4b/0xb0
Aug 23 04:53:27 focal-nbd kernel: ? __wake_up_pollfree+0x40/0x40
Aug 23 04:53:27 focal-nbd kernel: blk_mq_freeze_queue+0x1b/0x20
Aug 23 04:53:27 focal-nbd kernel: nbd_add_socket+0x5e/0x1d0 [nbd]
Aug 23 04:53:27 focal-nbd kernel: nbd_ioctl+0x2f7/0x410 [nbd]
Aug 23 04:53:27 focal-nbd kernel: blkdev_ioctl+0x383/0xa30
Aug 23 04:53:27 focal-nbd kernel: block_ioctl+0x3d/0x50
Aug 23 04:53:27 focal-nbd kernel: do_vfs_ioctl+0x407/0x670
Aug 23 04:53...


tags: added: verification-done-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-125.141

---------------
linux (5.4.0-125.141) focal; urgency=medium

  * focal/linux: 5.4.0-125.141 -proposed tracker (LP: #1983947)

  * nbd: requests can become stuck when disconnecting from server with qemu-nbd
    (LP: #1896350)
    - blk-mq: blk-mq: provide forced completion method
    - blk-mq: move failure injection out of blk_mq_complete_request
    - nbd: don't handle response without a corresponding request message
    - nbd: make sure request completion won't concurrent
    - nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
    - nbd: fix io hung while disconnecting device

  * CVE-2021-33656
    - vt: drop old FONT ioctls

  * CVE-2021-33061
    - ixgbe: add the ability for the PF to disable VF link state
    - ixgbe: add improvement for MDD response functionality
    - ixgbevf: add disable link state

 -- Stefan Bader <email address hidden> Wed, 10 Aug 2022 10:17:28 +0200

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released