aio: strengthen memory barriers for bottom half scheduling

Bug #1587039 reported by Amad Ali
Affects                Status        Importance  Assigned to  Milestone
Ubuntu Cloud Archive   Invalid       Undecided   Unassigned
  Kilo                 Fix Released  Undecided   Amad Ali
qemu (Ubuntu)          Fix Released  Undecided   Unassigned
  Trusty               Fix Released  Undecided   Seyeong Kim

Bug Description

[Impact]

There are two problems with memory barriers in async.c. The fix is
to use atomic_xchg in order to achieve sequential consistency between
the scheduling of a bottom half and the corresponding execution.

First, if bh->scheduled is already 1 in qemu_bh_schedule, QEMU does
not execute a memory barrier to order any writes needed by the callback
before the read of bh->scheduled. If the other side sees req->state as
THREAD_ACTIVE, the callback is not invoked and you get deadlock.

Second, the memory barrier in aio_bh_poll is too weak. Without this
patch, it is possible that bh->scheduled = 0 is not "published" until
after the callback has returned. Another thread wants to schedule the
bottom half, but it sees bh->scheduled = 1 and does nothing. This causes
a lost wakeup. The memory barrier should have been changed to smp_mb()
in commit 924fe12 (aio: fix qemu_bh_schedule() bh->ctx race condition,
2014-06-03) together with qemu_bh_schedule()'s.
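
To make the ordering argument concrete, the sketch below reimplements
the scheme with C11 atomics. The ToyBH struct, the toy_* functions and
the stubbed notify are illustrative stand-ins for QEMU's QEMUBH,
qemu_bh_schedule(), aio_bh_poll() and aio_notify(), not the actual
QEMU code:

    #include <stdatomic.h>
    #include <stdio.h>

    typedef struct ToyBH {
        atomic_int scheduled;
        void (*cb)(void *opaque);
        void *opaque;
    } ToyBH;

    /* Stand-in for aio_notify(): wake up the event loop. */
    static void toy_notify(void)
    {
        printf("notify event loop\n");
    }

    static void toy_bh_schedule(ToyBH *bh)
    {
        /* atomic_exchange is a full (seq_cst) barrier: writes the
         * callback depends on are published before scheduled is set,
         * even when scheduled was already 1; this addresses the first
         * problem above. */
        if (atomic_exchange(&bh->scheduled, 1) == 0) {
            toy_notify();
        }
    }

    static void toy_bh_poll(ToyBH *bh)
    {
        /* The full barrier here publishes scheduled = 0 before the
         * callback runs, so a concurrent toy_bh_schedule() cannot
         * still see 1 after the callback has finished; this addresses
         * the lost wakeup above. */
        if (atomic_exchange(&bh->scheduled, 0)) {
            bh->cb(bh->opaque);
        }
    }

    static void hello(void *opaque)
    {
        printf("bottom half ran: %s\n", (const char *)opaque);
    }

    int main(void)
    {
        ToyBH bh = { .cb = hello, .opaque = "demo" };
        atomic_init(&bh.scheduled, 0);
        toy_bh_schedule(&bh);  /* first schedule notifies the loop */
        toy_bh_schedule(&bh);  /* already scheduled: no second notify */
        toy_bh_poll(&bh);      /* runs the callback exactly once */
        return 0;
    }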

Both of these involve a store and a load, so they are reproducible on
x86_64 as well. It is however much easier on aarch64, where the
libguestfs test suite triggers the bug fairly easily. Even there the
failure can go away or appear depending on compiler optimization level,
tracing options, or even kernel debugging options.

[Test Case]

Paul Leveille, however, reported how to trigger the problem within 15
minutes on x86_64 as well. His (untested) recipe, reproduced here for
reference, is the following:

   1) Qcow2 (or 3) is critical; raw files alone seem to avoid the problem.

   2) Use "cache=directsync" rather than the default of
   "cache=none" to make the problem appear more easily.

   3) Use a server with a write-back RAID controller to allow for rapid
   IO rates.

   4) Run a random-access load that (mostly) writes chunks to various
   files on the virtual block device.

      a. I use 'diskload.exe c:25', a Microsoft HCT load
         generator, on Windows VMs.

      b. Iometer can probably be configured to generate a similar load.

   5) Run multiple VMs in parallel, against the same storage device,
   to shake the failure out sooner.

   6) IvyBridge and Haswell processors for certain; not sure about others.
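
For concreteness, a minimal sketch of how a disk matching points 1 and
2 might be attached; the image path and the rest of the qemu command
line are illustrative assumptions, only format=qcow2 and
cache=directsync are the parts the recipe calls for:

   $ qemu-img create -f qcow2 test0.qcow2 50G
   $ qemu-system-x86_64 ... \
       -drive file=test0.qcow2,format=qcow2,cache=directsync,if=virtio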

A similar patch survived over 12 hours of testing, where an unpatched
QEMU would fail within 15 minutes.

[Regression Potential]

The regression potential of this patch should be minimal.

[Original text below]

I strongly believe that the guest hang is caused by the bug described in this upstream commit: https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg00625.html. I have managed to capture a core dump of a qemu process with a hanging VM. It shows that qemu is waiting for an I/O request to be completed, but that never happens. After some more analysis I can see that the request actually is completed, but the completion routine seems to have never been called. From the core dump analysis and the bug description in the commit mentioned above, I strongly suspect the memory barriers are the problem. Unfortunately I have not yet found a good and fast way to prove that this is the case. The tricky thing is that the guest needs to dispatch a write (or flush) that is a barrier (i.e. no other requests will be dispatched until that request is finished), and that exact request needs to be the one whose completion is lost. This is hard to accomplish.

I have been running a test for a couple of days (~4 days) with one qemu without the fix and one with the fix included (https://lists.gnu.org/archive/html/qemu-devel/2015-04/msg00625.html). The one without the fix has received several "hung_task" messages. The one with the fix has not yet received such a message.

[53280.284059] INFO: task flush-253:16:304 blocked for more than 120 seconds.
[53280.285546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[53280.287046] flush-253:16 D 0000000000000001 0 304 2 0x00000000
[53280.287058] ffff880138731710 0000000000000046 000000000000000e ffff88013fd13770
[53280.288652] ffff880138731fd8 ffff880138731fd8 ffff880138731fd8 0000000000013700
[53280.290255] ffff880138672de0 ffff880136fc96f0 0000000000000001 ffff88006f1d5cf8
[53280.291772] Call Trace:
[53280.292324] [<ffffffff8165badf>] schedule+0x3f/0x60
[53280.293294] [<ffffffff8165c8c7>] __mutex_lock_slowpath+0xd7/0x150
[53280.294431] [<ffffffff8165c482>] mutex_lock+0x22/0x40
[53280.295379] [<ffffffff8125b23a>] __log_wait_for_space+0xda/0x1b0
[53280.296493] [<ffffffff81256271>] start_this_handle.isra.8+0xf1/0x400
[53280.297681] [<ffffffff8108b8e0>] ? add_wait_queue+0x60/0x60
[53280.298693] [<ffffffff812567d0>] journal_start+0xc0/0x100
[53280.299751] [<ffffffff81209956>] ext3_journal_start_sb+0x36/0x70
[53280.300868] [<ffffffff811f7f68>] ext3_journalled_writepage+0xd8/0x240
[53280.302078] [<ffffffff81121de7>] __writepage+0x17/0x40
[53280.303053] [<ffffffff81122518>] write_cache_pages+0x208/0x460
[53280.304095] [<ffffffff81121dd0>] ? set_page_dirty_lock+0x50/0x50
[53280.305207] [<ffffffff811227ba>] generic_writepages+0x4a/0x70
[53280.306275] [<ffffffff81123725>] do_writepages+0x35/0x40
[53280.307225] [<ffffffff811a2850>] writeback_single_inode+0x170/0x420
[53280.308407] [<ffffffff811a2f16>] writeback_sb_inodes+0x1b6/0x270
[53280.309542] [<ffffffff811a306e>] __writeback_inodes_wb+0x9e/0xd0
[53280.310745] [<ffffffff811a330b>] wb_writeback+0x26b/0x310
[53280.311704] [<ffffffff810136e5>] ? __switch_to+0xf5/0x360
[53280.312756] [<ffffffff811949b2>] ? get_nr_dirty_inodes+0x52/0x80
[53280.313914] [<ffffffff811a344f>] wb_check_old_data_flush+0x9f/0xb0
[53280.315027] [<ffffffff811a4321>] wb_do_writeback+0x151/0x1d0
[53280.316087] [<ffffffff810777b0>] ? usleep_range+0x50/0x50
[53280.317103] [<ffffffff811a4423>] bdi_writeback_thread+0x83/0x280
[53280.318208] [<ffffffff811a43a0>] ? wb_do_writeback+0x1d0/0x1d0
[53280.319277] [<ffffffff8108ae3c>] kthread+0x8c/0xa0
[53280.320213] [<ffffffff81668174>] kernel_thread_helper+0x4/0x10
[53280.321286] [<ffffffff8108adb0>] ? flush_kthread_worker+0xa0/0xa0
[53280.322398] [<ffffffff81668170>] ? gs_change+0x13/0x13

[53280.323369] INFO: task dd:25713 blocked for more than 120 seconds.
[53280.324480] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[53280.332587] dd D 0000000000000001 0 25713 389 0x00000000
[53280.332591] ffff88011870d918 0000000000000082 ffff88011870d938 ffffffff8105fc08
[53280.334156] ffff88011870dfd8 ffff88011870dfd8 ffff88011870dfd8 0000000000013700
[53280.335696] ffff88013683ade0 ffff880136fc8000 0000000000000002 ffff88006f1d5cf8
[53280.337258] Call Trace:
[53280.337781] [<ffffffff8105fc08>] ? load_balance+0x78/0x370
[53280.338803] [<ffffffff8165badf>] schedule+0x3f/0x60
[53280.339738] [<ffffffff8165c8c7>] __mutex_lock_slowpath+0xd7/0x150
[53280.340868] [<ffffffff8165c482>] mutex_lock+0x22/0x40
[53280.341841] [<ffffffff8125b23a>] __log_wait_for_space+0xda/0x1b0
[53280.342880] [<ffffffff81256271>] start_this_handle.isra.8+0xf1/0x400
[53280.343975] [<ffffffff8108b8e0>] ? add_wait_queue+0x60/0x60
[53280.345023] [<ffffffff812567d0>] journal_start+0xc0/0x100
[53280.346043] [<ffffffff81118ec8>] ? grab_cache_page_write_begin+0x78/0xe0
[53280.347194] [<ffffffff81209956>] ext3_journal_start_sb+0x36/0x70
[53280.348271] [<ffffffff811f9eef>] ext3_write_begin+0x9f/0x270
[53280.349276] [<ffffffff8111839a>] generic_perform_write+0xca/0x210
[53280.350368] [<ffffffff811fbd4d>] ? ext3_dirty_inode+0x6d/0xa0
[53280.351327] [<ffffffff8111853d>] generic_file_buffered_write+0x5d/0x90
[53280.352457] [<ffffffff81119ea9>] __generic_file_aio_write+0x229/0x440
[53280.353582] [<ffffffff8111a132>] generic_file_aio_write+0x72/0xe0
[53280.354708] [<ffffffff81178e0a>] do_sync_write+0xda/0x120
[53280.355702] [<ffffffff812d9588>] ? apparmor_file_permission+0x18/0x20
[53280.356887] [<ffffffff8129ec6c>] ? security_file_permission+0x2c/0xb0
[53280.358092] [<ffffffff811793b1>] ? rw_verify_area+0x61/0xf0
[53280.359149] [<ffffffff81179713>] vfs_write+0xb3/0x180
[53280.360671] [<ffffffff81179a3a>] sys_write+0x4a/0x90
[53280.361613] [<ffffffff81666002>] system_call_fastpath+0x16/0x1b

Setup:
DELL (R620) machine with 4 computes and 1 CIC. No EMC.

Steps to reproduce

1. Create a flavor with ephemeral storage:
     $ nova flavor-create --ephemeral 120 m1.ephemeral auto 4096 50 2

2. Boot cirros VM
     $ nova boot --flavor m1.ephemeral --image TestVM --nic net-id=<some net> foobar

3. Log into cirros VM and execute:
     $ sudo umount /mnt
     $ sudo mkdir /data
     $ sudo mount -t tmpfs none /data/
     $ sudo dd if=/dev/urandom bs=1M count=100 of=/data/data.bin
     $ sudo mkfs.ext3 -b 4096 -J size=4 /dev/vdb
     $ sudo mount -o data=journal,barrier=1 /dev/vdb /mnt/

4. Create write.sh:

     #!/bin/sh

     # Continuously rewrite a 100 MB file on the journalled mount, named
     # after this shell's pid so parallel instances do not collide.
     while true
     do
         dd if=/data/data.bin bs=1M count=100 of=/mnt/$$.tmp 2> /dev/null
     done

5. Start around 20 instances of the script:

    Run 20 times:
    $ sudo ./write.sh &

6. Log into the compute-node where the cirros VM is running.

7. Create pin-io.py:

    #!/usr/bin/python

    import sys
    import glob
    import os
    import random

    # CPU lists of the two NUMA nodes, given as comma-separated strings.
    numa0_cpus = sys.argv[1]
    numa1_cpus = [int(c) for c in sys.argv[2].split(",")]

    # The qemu main pid and the two vcpu thread ids stay on NUMA node 0.
    pid = int(sys.argv[3])
    exclude = [pid, int(sys.argv[4]), int(sys.argv[5])]

    for tid in exclude:
        os.system("taskset -cp %s %i" % (numa0_cpus, tid))

    # Pin every remaining thread of the qemu process (the I/O threads)
    # to a random CPU on NUMA node 1.
    for tid_str in glob.glob("/proc/%i/task/*" % pid):
        tid = int(os.path.basename(tid_str))

        if tid in exclude:
            continue

        cpu = random.choice(numa1_cpus)

        os.system("taskset -cp %i %i" % (cpu, tid))

8. Figure out the pid and the "libvirt name" of the qemu-process that runs the cirros-image.

9. Get the thread-ids of the vcpu0 and vcpu1 threads

       $ virsh qemu-monitor-command --hmp <libvirt vm-name> "info cpus"

10. Pin the main-threads to numa-node 0 and the io-threads to numa-node 1:

       $ ./pin-io.py <numa0 pcpus> <numa1 pcpus> <qemu pid> <vcpu0 tid> <vcpu1 tid>

    Example for dell-compute:

       $ ./pin-io.py 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39 12468 12501 12502

11. Wait a couple of days

12. You should see hanging tasks in the dmesg of the vm

Note that the pin-io part may not be needed but it should make the bug appear more often.

I think I have reached the end of the line now, and that we should request the mentioned fix from Mirantis.

/Rickard Enberg
(Ericsson AB)

Revision history for this message
Amad Ali (amad) wrote :

The bug was introduced upstream by commit c2e50e3d11a0bf4c973cc30478c1af0f2d5f8e81 (thread-pool: avoid per-thread-pool EventNotifier). Until that commit, the code in async.c was safe because bottom halves were never used across threads.

It was fixed by upstream commit e8d3b1a25f284cdf9705b7cf0412281cc9ee3a36, released in QEMU 2.3.0:
http://git.qemu.org/?p=qemu.git;a=commit;h=e8d3b1a25f284cdf9705b7cf0412281cc9ee3a36

QEMU 2.2 in the cloud archive has this bug.

Revision history for this message
Amad Ali (amad) wrote :

There is a hotfix for this bug in the following PPA:

https://launchpad.net/~amad/+archive/ubuntu/kilo-cloud-archive

Amad Ali (amad)
information type: Public → Public Security
information type: Public Security → Private Security
information type: Private Security → Public
Amad Ali (amad)
Changed in cloud-archive:
assignee: nobody → Amad Ali (amad)
Revision history for this message
Zsolt Babindai (zsolt-babindai) wrote :

Our first tests show that the patched qemu works fine and does not result in hanging tasks. We will let the VM run further and give it more time to check whether any issues come up.
/Ericsson ETH/

Changed in cloud-archive:
status: New → Invalid
assignee: Amad Ali (amad) → nobody
Revision history for this message
James Page (james-page) wrote : Please test proposed package

Hello Amad, or anyone else affected,

Accepted qemu into kilo-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:kilo-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-kilo-needed to verification-kilo-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-kilo-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-kilo-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Amad,

I've run regression tests successfully against kilo-proposed. Once you are able to verify, please tag the bug as specified above. Thanks!

Corey

Revision history for this message
Ryan Beisner (1chb1n) wrote : Update Released

The verification of the Stable Release Update for qemu has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

This bug was fixed in the package qemu - 1:2.2+dfsg-5expubuntu9.7~cloud5
---------------

 qemu (1:2.2+dfsg-5expubuntu9.7~cloud5) trusty-kilo; urgency=medium
 .
   * debian/patches/aio-strengthen-memory-barriers.patch: Fix
     "aio: strengthen memory barriers for bottom half scheduling" (LP: #1587039)

tags: added: verification-kilo-done
removed: verification-kilo-needed
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

This bug seems to be valid for trusty as well, as the fix was accepted upstream in 2.1 (post-trusty).
Seyeong Kim was working on that as part of bug 1640382 and has a fix ready, plus this one backported.

Changed in qemu (Ubuntu):
status: New → Fix Released
Changed in qemu (Ubuntu Trusty):
status: New → Triaged
assignee: nobody → Seyeong Kim (xtrusia)
Revision history for this message
Robie Basak (racb) wrote : Please test proposed package

Hello Amad, or anyone else affected,

Accepted qemu into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/2.0.0+dfsg-2ubuntu1.31 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in qemu (Ubuntu Trusty):
status: Triaged → Fix Committed
tags: added: verification-needed
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1640382

ii qemu 2.0.0+dfsg-2ubuntu1.31 amd64 fast processor emulator
ii qemu-keymaps 2.0.0+dfsg-2ubuntu1.31 all QEMU keyboard maps
ii qemu-system 2.0.0+dfsg-2ubuntu1.31 amd64 QEMU full system emulation binaries
ii qemu-system-arm 2.0.0+dfsg-2ubuntu1.31 amd64 QEMU full system emulation binaries (arm)
ii qemu-system-common 2.0.0+dfsg-2ubuntu1.31 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-mips 2.0.0+dfsg-2ubuntu1.31 amd64 QEMU full system emulation binaries (mips)
ii qemu-system-misc 2.0.0+dfsg-2ubuntu1.31 amd64 QEMU full system emulation binaries (miscelaneous)
ii qemu-system-ppc 2.0.0+dfsg-2ubuntu1.31 amd64 QEMU full system emulation binaries (ppc)
ii qemu-system-sparc 2.0.0+dfsg-2ubuntu1.31 amd64 QEMU full system emulation binaries (sparc)
ii qemu-system-x86 2.0.0+dfsg-2ubuntu1.31 amd64 QEMU full system emulation binaries (x86)
ii qemu-user 2.0.0+dfsg-2ubuntu1.31 amd64 QEMU user mode emulation binaries
ii qemu-utils 2.0.0+dfsg-2ubuntu1.31 amd64 QEMU utilities

tags: added: verification-done
removed: verification-needed
Revision history for this message
Brian Murray (brian-murray) wrote :

Could you explain the testing that you did with the new versions of qemu?

tags: added: verification-needed
removed: verification-done
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

The second test case is the one I was trying to reproduce, but I couldn't reproduce it in my environment.

This is already fixed in kilo and upstream, but not in trusty:

https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1640382

I originally wanted to fix the LP bug above, and found one commit, but then I found that Amad had fixed a commit one step later than the one I wanted. So I'm trying to patch both of them for trusty.

It's not easy to reproduce this again.

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

I'll check this on the other machine

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

OK, here is an update on what I did.

I've been running 10 to 20 VMs on one server for a few weeks, with the fio I/O load tool running on each machine.

The options are the same as those mentioned in the test case section.

There have been no issues so far.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 2.0.0+dfsg-2ubuntu1.31

---------------
qemu (2.0.0+dfsg-2ubuntu1.31) trusty; urgency=medium

  * aio: fix qemu_bh_schedule() bh->ctx race condition (LP: #1640382)
    - d/p/0001-aio-fix-qemu_bh_schedule-bh-ctx-race-condition.patch
  * aio: strengthen memory barriers for bottom half scheduling
    (LP: #1587039)
    - d/p/0002-aio-strengthen-memory-barriers-for-bottom-half-sched.patch

 -- Seyeong Kim <email address hidden> Thu, 24 Nov 2016 10:44:55 +0100

Changed in qemu (Ubuntu Trusty):
status: Fix Committed → Fix Released