race condition between vhost_net_stop and CHR_EVENT_CLOSED on shutdown crashes qemu
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
QEMU |
Fix Released
|
Undecided
|
Unassigned | ||
Ubuntu Cloud Archive |
Fix Released
|
Undecided
|
Unassigned | ||
Mitaka |
Fix Released
|
Medium
|
Unassigned | ||
Ocata |
Fix Released
|
Medium
|
Unassigned | ||
qemu (Ubuntu) |
Fix Released
|
Medium
|
Unassigned | ||
Trusty |
Won't Fix
|
Medium
|
Unassigned | ||
Xenial |
Fix Released
|
Medium
|
Dan Streetman | ||
Bionic |
Fix Released
|
Medium
|
Unassigned | ||
Cosmic |
Fix Released
|
Medium
|
Unassigned | ||
Disco |
Fix Released
|
Medium
|
Unassigned |
Bug Description
[impact]
on shutdown of a guest, there is a race condition that results in qemu crashing instead of normally shutting down. The bt looks similar to this (depending on the specific version of qemu, of course; this is taken from 2.5 version of qemu):
(gdb) bt
#0 __GI___
#1 0x00005636c0bc4389 in qemu_mutex_lock (mutex=
#2 0x00005636c0988130 in qemu_chr_
#3 0x00005636c08f3483 in vhost_user_write (msg=msg@
at /build/
#4 0x00005636c08f411c in vhost_user_
#5 0x00005636c08efff0 in vhost_virtqueue
#6 0x00005636c08f2944 in vhost_dev_stop (hdev=hdev@
#7 0x00005636c08db2a8 in vhost_net_stop_one (net=0x5636c1bf
#8 0x00005636c08dbe5b in vhost_net_stop (dev=dev@
#9 0x00005636c08d7745 in virtio_
#10 virtio_
#11 0x00005636c08ec42c in virtio_set_status (vdev=0x5636c28
#12 0x00005636c098fed2 in vm_state_notify (running=
#13 0x00005636c089172a in do_vm_stop (state=
#14 vm_stop (state=
#15 0x00005636c085d240 in main_loop_
#16 main_loop () at /build/
#17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /build/
[test case]
unfortunately since this is a race condition, it's very hard to arbitrarily reproduce; it depends very much on the overall configuration of the guest as well as how exactly it's shut down - specifically, its vhost user net must be closed from the host side at a specific time during qemu shutdown.
I have someone with such a setup who has reported to me their setup is able to reproduce this reliably, but the config is too complex for me to reproduce so I have relied on their reproduction and testing to debug and craft the patch for this.
[regression potential]
the change adds a flag to prevent repeated calls to vhost_net_stop(). This also prevents any calls to vhost_net_cleanup() from net_vhost_
[other info]
this was originally seen in the 2.5 version of qemu - specifically, the UCA version in trusty-mitaka (which uses the xenial qemu codebase).
After discussion upstream, it appears this was fixed upstream by commit e7c83a885f8, which is included starting in version 2.9. However, this commit depends on at least commit 5345fdb4467, and likely more other previous commits, which make widespread code changes and are unsuitable to backport. Therefore this seems like it should be specifically worked around in the Xenial qemu codebase.
The specific race condition for this (in the qemu 2.5 code version) is:
as shown in above bt, thread A starts shutting down qemu, e.g.:
vm_stop-
virtio_set_status
virtio_
virtio_
in this function, code gets to an if-else check for (!n->vhost_
While thread A is inside vhost_net_stop(), thread B is triggered by
the vhost net chr handler with a user event and calls:
net_vhost_
qmp_set_link (from case CHR_EVENT_CLOSED)
virtio_
virtio_
notice thread B has now reached the same function that thread A is in; since the checks in the function have not changed, thread B follows the same path that thread A followed, and enters vhost_net_stop().
Since thread A has already shut down and cleaned up some of the internals, once thread B starts trying to also clean up things, it segfaults as the shown in the bt.
Avoiding only this duplicate call to vhost_net_stop() is required, but not enough - let's continue to look at what thread B does after its call to qmp_set_link() returns:
net_vhost_
vhost_user_stop
vhost_
vhost_
However, in main() qemu registers atexit(
net_cleanup
qemu_del_nic (or qemu_del_
qemu_
vhost_
and the duplicate vhost_dev_cleanup fails assertions since things were already cleaned up. Additionally, if thread B's call to vhost_dev_cleanup() comes before thread A finishes vhost_net_stop(), then that will call vhost_dev_stop() and vhost_disable_
Related branches
- Christian Ehrhardt (community): Approve
-
Diff: 147 lines (+119/-0)4 files modifieddebian/changelog (+12/-0)
debian/patches/lp1823458/add-VirtIONet-vhost_stopped-flag-to-prevent-multiple.patch (+67/-0)
debian/patches/lp1823458/do-not-call-vhost_net_cleanup-on-running-net-from-ch.patch (+38/-0)
debian/patches/series (+2/-0)
Changed in qemu (Ubuntu Disco): | |
assignee: | nobody → Dan Streetman (ddstreet) |
Changed in qemu (Ubuntu Cosmic): | |
assignee: | nobody → Dan Streetman (ddstreet) |
Changed in qemu (Ubuntu Bionic): | |
assignee: | nobody → Dan Streetman (ddstreet) |
Changed in qemu (Ubuntu Xenial): | |
assignee: | nobody → Dan Streetman (ddstreet) |
Changed in qemu (Ubuntu Disco): | |
importance: | Undecided → Medium |
Changed in qemu (Ubuntu Cosmic): | |
importance: | Undecided → Medium |
Changed in qemu (Ubuntu Bionic): | |
importance: | Undecided → Medium |
Changed in qemu (Ubuntu Disco): | |
status: | New → In Progress |
Changed in qemu (Ubuntu Cosmic): | |
status: | New → In Progress |
Changed in qemu (Ubuntu Bionic): | |
status: | New → In Progress |
Changed in qemu (Ubuntu Xenial): | |
status: | New → In Progress |
importance: | Undecided → Medium |
description: | updated |
description: | updated |
Changed in qemu (Ubuntu Trusty): | |
status: | New → In Progress |
importance: | Undecided → Medium |
assignee: | nobody → Dan Streetman (ddstreet) |
Changed in qemu: | |
status: | New → In Progress |
assignee: | nobody → Dan Streetman (ddstreet) |
description: | updated |
description: | updated |
Changed in qemu (Ubuntu Disco): | |
status: | In Progress → Fix Released |
Changed in qemu (Ubuntu): | |
status: | In Progress → Fix Released |
Changed in qemu: | |
status: | In Progress → Fix Released |
Changed in qemu (Ubuntu Cosmic): | |
status: | In Progress → Fix Released |
Changed in qemu (Ubuntu Bionic): | |
status: | In Progress → Fix Released |
Changed in qemu (Ubuntu Trusty): | |
status: | In Progress → Won't Fix |
Changed in qemu (Ubuntu Disco): | |
assignee: | Dan Streetman (ddstreet) → nobody |
Changed in qemu (Ubuntu Cosmic): | |
assignee: | Dan Streetman (ddstreet) → nobody |
Changed in qemu (Ubuntu Xenial): | |
assignee: | Dan Streetman (ddstreet) → nobody |
Changed in qemu (Ubuntu Bionic): | |
assignee: | Dan Streetman (ddstreet) → nobody |
Changed in qemu: | |
assignee: | Dan Streetman (ddstreet) → nobody |
Changed in qemu (Ubuntu): | |
assignee: | Dan Streetman (ddstreet) → nobody |
Changed in qemu (Ubuntu Xenial): | |
assignee: | nobody → Dan Streetman (ddstreet) |
Changed in qemu (Ubuntu Trusty): | |
assignee: | Dan Streetman (ddstreet) → nobody |
description: | updated |
Changed in cloud-archive: | |
status: | New → Fix Released |
test builds in https:/ /launchpad. net/~ddstreet/ +archive/ ubuntu/ lp1823458