Ubuntu 16.04 (4.4.0-127) hangs on boot with virtio-scsi MQ enabled
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Invalid | High | Unassigned |
Xenial | Fix Released | High | Unassigned |
Bug Description
== SRU Justification ==
The bug reporter noticed that Xenial guests running on Nutanix AHV stopped
booting after they were upgraded to 4.4.0-127. Only guests with scsi mq
enabled suffered from this problem. AHV is one of the few hypervisor
products to offer multiqueue for virtio-scsi devices.
Upon further investigation, they saw that the kernel would hang during the
scanning of scsi targets. More specifically, immediately after coming
across a target without any luns present.
It was found that the following commit introduced this regression:
commit f1f609d8015e1d3
Author: Jay Vosburgh <email address hidden>
Date: Thu Apr 19 21:40:00 2018 +0200
The patch spins on the target's 'reqs' counter waiting for the target to quiesce.
Further study revealed that virtio-scsi itself is broken in a way that it
doesn't increment the 'reqs' counter when submitting requests on MQ in
certain conditions. That caused the counter to go to -1 (on the completion
of the first request) and the CPU to hang indefinitely.
This regression is fixed by the requested SAUCE patch.
== Fix ==
UBUNTU: SAUCE: (no-up) virtio-scsi: Increment reqs counter.
== Regression Potential ==
Low. Limited to virtio and fixes a regression.
== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.
We noticed that Ubuntu 16.04 guests running on Nutanix AHV stopped booting after they were upgraded to the latest kernel (4.4.0-127). Only guests with scsi mq enabled suffered from this problem. AHV is one of the few hypervisor products to offer multiqueue for virtio-scsi devices.
Upon further investigation, we could see that the kernel would hang during the scanning of scsi targets. More specifically, immediately after coming across a target without any luns present. That's the first time the kernel destroys a target (given it doesn't have luns). This could be confirmed with gdb (attached to qemu's gdbserver):
#0 0xffffffffc0045039 in ?? ()
#1 0xffff88022c753c98 in ?? ()
#2 0xffffffff815d1de6 in scsi_target_destroy (starget=
at /build/
This shows the guest vCPU stuck on virtio-scsi's implementation of target_destroy. Despite lacking symbols, we managed to examine the virtio_
(gdb) p *(struct virtio_
$6 = {tgt_seq = {sequence = 0}, reqs = {counter = -1}, req_vq = 0xffff88022cbdd9e8}
(gdb)
This drew our attention to the following patch which is exclusive to the Ubuntu kernel:
commit f1f609d8015e1d3
Author: Jay Vosburgh <email address hidden>
Date: Thu Apr 19 21:40:00 2018 +0200
In a nutshell, the patch spins on the target's 'reqs' counter waiting for the target to quiesce:
--- a/drivers/
+++ b/drivers/
@@ -785,6 +785,10 @@ static int virtscsi_
static void virtscsi_
{
struct virtio_
+
+ /* we can race with concurrent virtscsi_
+ while (atomic_
+ cpu_relax();
kfree(tgt);
}
Personally, I think this is a catastrophic way of waiting for a target to quiesce since virtscsi_
Nevertheless, further study revealed that virtio-scsi itself is broken in a way that it doesn't increment the 'reqs' counter when submitting requests on MQ in certain conditions. That caused the counter to go to -1 (on the completion of the first request) and the CPU to hang indefinitely.
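To make the failure mode concrete, here is a minimal user-space sketch of the counter imbalance described above. It is not the driver code: `model_target`, `model_submit`, `model_complete`, `model_wait_quiesce` and `model_one_request` are hypothetical names, the "uncounted" submission stands in for the MQ path that skips the increment, and the wait is bounded so the sketch terminates instead of hanging the way the real spin loop does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-target state; only 'reqs' mirrors the report. */
struct model_target {
    atomic_int reqs;
};

/* Submission path. counted == false models the MQ path that skips the increment. */
static void model_submit(struct model_target *tgt, bool counted)
{
    if (counted)
        atomic_fetch_add(&tgt->reqs, 1);
}

/* Completion path: assumed to decrement unconditionally, as the report implies. */
static void model_complete(struct model_target *tgt)
{
    atomic_fetch_sub(&tgt->reqs, 1);
}

/*
 * Bounded version of the destroy-time wait. Returns true if the counter
 * reached zero within max_spins polls, false otherwise. The loop in the
 * report has no bound, so a counter stuck at -1 spins forever.
 */
static bool model_wait_quiesce(struct model_target *tgt, int max_spins)
{
    assert(max_spins > 0);
    for (int i = 0; i < max_spins; i++) {
        if (atomic_load(&tgt->reqs) == 0)
            return true;
    }
    return false;
}

/* Runs one submit/complete pair, then the wait; returns the final counter. */
static int model_one_request(bool counted, bool *quiesced)
{
    struct model_target tgt;

    atomic_init(&tgt.reqs, 0);
    model_submit(&tgt, counted);
    model_complete(&tgt);
    *quiesced = model_wait_quiesce(&tgt, 1000);
    return atomic_load(&tgt.reqs);
}
```

Calling `model_one_request(false, &q)` leaves the counter at -1 and the wait never succeeds, which matches the `reqs = {counter = -1}` seen in gdb; with `counted` set to true the counter returns to 0 and the wait finishes on the first poll.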
The following patch fixes the issue:
--- old/linux-
+++ new/linux-
@@ -641,9 +641,10 @@
struct virtio_scsi_vq *req_vq;
- if (shost_
+ if (shost_
- else
+ atomic_
+ } else
return virtscsi_
Signed-off-by: Felipe Franciosi <email address hidden>
Please consider this an urgent fix, as all of our customers who use Ubuntu 16.04 and have MQ enabled for better performance will be affected by your latest update. Our workaround is to recommend that they disable SCSI MQ while you work on the issue.
Best regards,
Felipe
Changed in linux (Ubuntu): | |
status: | Incomplete → Confirmed |
description: | updated |
Changed in linux (Ubuntu): | |
importance: | Undecided → High |
Changed in linux (Ubuntu Xenial): | |
status: | New → Confirmed |
importance: | Undecided → High |
Changed in linux (Ubuntu Xenial): | |
status: | In Progress → Fix Committed |
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1775235
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.