aio: strengthen memory barriers for bottom half scheduling
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Cloud Archive | Invalid | Undecided | Unassigned |
Kilo | Fix Released | Undecided | Amad Ali |
qemu (Ubuntu) | Fix Released | Undecided | Unassigned |
Trusty | Fix Released | Undecided | Seyeong Kim |
Bug Description
[Impact]
There are two problems with memory barriers in async.c. The fix is
to use atomic_xchg in order to achieve sequential consistency between
the scheduling of a bottom half and the corresponding execution.
First, if bh->scheduled is already 1 in qemu_bh_schedule, QEMU does
not execute a memory barrier to order any writes needed by the callback
before the read of bh->scheduled. If the other side sees req->state as
THREAD_ACTIVE, the callback is not invoked and you get deadlock.
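As an illustration, here is a small self-contained C11 model of that fast path (this is not the QEMU source; schedule_prefix_style, req_state and the fence placement are stand-ins for the pre-fix qemu_bh_schedule behaviour):
#include <stdatomic.h>
#include <stdio.h>
enum { THREAD_ACTIVE, THREAD_DONE };
static int req_state = THREAD_ACTIVE;   /* result the callback will read */
static atomic_int scheduled;            /* models bh->scheduled */
static void schedule_prefix_style(void)
{
    if (atomic_load_explicit(&scheduled, memory_order_relaxed)) {
        /* early return: no barrier, so req_state = THREAD_DONE may still be
         * invisible to the thread that runs the already-scheduled callback */
        return;
    }
    atomic_thread_fence(memory_order_seq_cst);   /* barrier only on this path */
    atomic_store_explicit(&scheduled, 1, memory_order_relaxed);
}
int main(void)
{
    req_state = THREAD_DONE;    /* worker finishes its request... */
    schedule_prefix_style();    /* ...then tries to schedule the bottom half */
    printf("scheduled=%d, req_state=%d\n", atomic_load(&scheduled), req_state);
    return 0;
}
If a second caller hits this while the flag is still 1, its store to req_state is not ordered before the flag load, which is exactly the deadlock path described above.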
Second, the memory barrier in aio_bh_poll is too weak. Without this
patch, it is possible that bh->scheduled = 0 is not "published" until
after the callback has returned. Another thread wants to schedule the
bottom half, but it sees bh->scheduled = 1 and does nothing. This causes
a lost wakeup. The memory barrier should have been changed to smp_mb()
in commit 924fe12 (aio: fix qemu_bh_schedule() bh->ctx race condition,
2014-06-03) together with qemu_bh_schedule()'s.
Both of these involve a store and a load, so they are reproducible on
x86_64 as well. It is however much easier on aarch64, where the
libguestfs test suite triggers the bug fairly easily. Even there the
failure can go away or appear depending on compiler optimization level,
tracing options, or even kernel debugging options.
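To make the ordering concrete, here is a rough sketch of the fixed handshake, using C11 atomic_exchange as a stand-in for QEMU's atomic_xchg (bh_schedule and bh_poll below are simplified models, not the actual async.c code):
#include <stdatomic.h>
#include <stdio.h>
struct bh {
    atomic_int scheduled;               /* 0 = idle, 1 = pending */
    void (*cb)(void *opaque);
    void *opaque;
};
static void bh_schedule(struct bh *bh)  /* models qemu_bh_schedule() */
{
    /* full barrier even when the bottom half is already pending, so every
     * write made for the callback is published before scheduled is set */
    if (atomic_exchange(&bh->scheduled, 1) == 0) {
        puts("kick the event loop");    /* stands in for aio_notify() */
    }
}
static void bh_poll(struct bh *bh)      /* models the aio_bh_poll() loop body */
{
    /* scheduled = 0 is published before the callback runs, so a concurrent
     * bh_schedule() is never lost */
    if (atomic_exchange(&bh->scheduled, 0)) {
        bh->cb(bh->opaque);
    }
}
static void done(void *opaque)
{
    puts(opaque);
}
int main(void)
{
    struct bh bh = { .cb = done, .opaque = "bottom half ran" };
    bh_schedule(&bh);
    bh_poll(&bh);   /* in QEMU, scheduling and polling run in different threads */
    return 0;
}
Because the exchange is a sequentially consistent read-modify-write on both sides, the scheduler's writes are visible before the callback can run even on the already-pending path, and scheduled = 0 is visible before the callback starts, so a concurrent schedule is never lost.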
[Test Case]
Paul Leveille however reported how to trigger the problem within 15
minutes on x86_64 as well. His (untested) recipe, reproduced here
for reference, is the following:
1) Qcow2 (or 3) is critical – raw files alone seem to avoid the problem.
2) Use “cache=directsync” rather than the default of “cache=none” to make it happen more easily.
3) Use a server with a write-back RAID controller to allow for rapid
IO rates.
4) Run a random-access load that (mostly) writes chunks to various
files on the virtual block device.
a. I use ‘diskload.exe c:25’, a Microsoft HCT load
generator, on Windows VMs.
b. Iometer can probably be configured to generate a similar load.
5) Run multiple VMs in parallel, against the same storage device,
to shake the failure out sooner.
6) IvyBridge and Haswell processors for certain; not sure about others.
A similar patch survived over 12 hours of testing, where an unpatched
QEMU would fail within 15 minutes.
[Regression Potential]
Regression potential should be minimal: the change only strengthens the memory barriers used when scheduling and running bottom halves.
[Original text below]
I strongly believe that the guest hangs are caused by the bug described in this upstream commit: https:/
I have been running a test for a couple of days (~4 days) with one qemu without the fix and one with the fix included (https:/
[53280.284059] INFO: task flush-253:16:304 blocked for more than 120 seconds.
[53280.285546] "echo 0 > /proc/sys/
[53280.287046] flush-253:16 D 0000000000000001 0 304 2 0x00000000
[53280.291772] Call Trace:
[call trace truncated in the original report]
[53280.323369] INFO: task dd:25713 blocked for more than 120 seconds.
[53280.324480] "echo 0 > /proc/sys/
[53280.332587] dd D 0000000000000001 0 25713 389 0x00000000
[53280.337258] Call Trace:
[call trace truncated in the original report]
Setup:
DELL (R620) machine with 4 computes and 1 CIC. No EMC.
Steps to reproduce
1. Create a flavor with ephemeral storage:
$ nova flavor-create --ephemeral 120 m1.ephemeral auto 4096 50 2
2. Boot cirros VM
$ nova boot --flavor m1.ephemeral --image TestVM --nic net-id=<some net> foobar
3. Log into cirros VM and execute:
$ sudo umount /mnt
$ sudo mkdir /data
$ sudo mount -t tmpfs none /data/
$ sudo dd if=/dev/urandom bs=1M count=100 of=/data/data.bin
$ sudo mkfs.ext3 -b 4096 -J size=4 /dev/vdb
$ sudo mount -o data=journal,
4. Create write.sh:
#!/bin/sh
while true
do
dd if=/data/data.bin bs=1M count=100 of=/mnt/$$.tmp 2> /dev/null
done
5. Start around 20 instances of the script:
Run 20 times:
$ sudo ./write.sh &
6. Log into the compute-node where the cirros VM is running.
7. Create pin-io.py:
#!/usr/bin/env python
# Pin a qemu process' threads: the main pid and the two vcpu threads go to
# NUMA node 0, every other (IO) thread goes to NUMA node 1.
# Usage: pin-io.py <numa0 cpus> <numa1 cpus> <qemu pid> <vcpu0 tid> <vcpu1 tid>
# NOTE: the original script was truncated in this report; the argument parsing
# and the pinning loop (taskset) below are a reconstruction from context.
import sys
import glob
import os
import random
numa0_cpus = list(map(int, sys.argv[1].split(',')))
numa1_cpus = list(map(int, sys.argv[2].split(',')))
pid = int(sys.argv[3])
exclude = [pid, int(sys.argv[4]), int(sys.argv[5])]
for tid_str in glob.glob('/proc/%d/task/*' % pid):
    tid = int(os.path.basename(tid_str))
    if tid in exclude:
        cpu = random.choice(numa0_cpus)   # main pid + vcpu threads -> node 0
    else:
        cpu = random.choice(numa1_cpus)   # IO threads -> node 1
    os.system('taskset -pc %d %d' % (cpu, tid))
8. Figure out the pid and the "libvirt name" of the qemu-process that runs the cirros-image.
9. Get the thread-ids of the vcpu0 and vcpu1 threads
$ virsh qemu-monitor-
10. Pin the main-threads to numa-node 0 and the io-threads to numa-node 1:
$ ./pin-io.py <numa0 pcpus> <numa1 pcpus> <qemu pid> <vcpu0 tid> <vcpu1 tid>
Example for dell-compute:
$ ./pin-io.py 0,2,4,6,
11. Wait a couple of days
12. You should see hanging tasks in the dmesg of the vm
Note that the pin-io part may not be needed but it should make the bug appear more often.
I think I have reached the end of the line now and that we should request the mentioned fix from Mirantis.
/Rickard Enberg
(Ericsson AB)
information type: Public → Public Security
information type: Public Security → Private Security
information type: Private Security → Public
Changed in cloud-archive:
assignee: nobody → Amad Ali (amad)
Changed in cloud-archive:
status: New → Invalid
assignee: Amad Ali (amad) → nobody
tags: added: verification-kilo-done; removed: verification-kilo-needed
The bug was introduced upstream by commit c2e50e3d11a0bf4c973cc30478c1af0f2d5f8e81 (thread-pool: avoid per-thread-pool EventNotifier). Until that commit, the code in async.c was safe because bottom halves were never used across threads.
It was fixed by upstream commit e8d3b1a25f284cdf9705b7cf0412281cc9ee3a36, released in QEMU 2.3.0: git.qemu.org/?p=qemu.git;a=commit;h=e8d3b1a25f284cdf9705b7cf0412281cc9ee3a36
http://
QEMU 2.2 in the cloud archive has this bug.