Comment 16 for bug 1668829

Rafael David Tinoco (rafaeldtinoco) wrote :

Hello Wangzhike,

I failed to update the case with the latest analysis here. I was able to isolate the offending commit with a bisection. It would also be good if you could provide feedback about this.

#### ANALYSIS

# git bisection log
#
# bad: [a8c40fa2d667e585382080db36ac44e216b37a1c] Update version for v2.5.0 release
# good: [e5b3a24181ea0cebf1c5b20f44d016311b7048f0] Update version for v2.3.0 release
# bad: [6b324b3e5906fd9a9ce7f4f24decd1f1c7afde97] Merge remote-tracking branch 'remotes/stefanha/tags/net-pull-request' into staging
# good: [a3d586f704609a45b6037534cb2f34da5dfd8895] dma/rc4030: create custom DMA address space
# good: [54f3223730736fca1e6e89bb7f99c4f8432fdabb] ahci: factor ncq_finish out of ncq_cb
# good: [e46e1a74ef482f1ef773e750df9654ef4442ca29] target-arm: Fix broken SCTLR_EL3 reset
# bad: [776f87845137a9b300a4815ba6bf6879310795aa] Merge remote-tracking branch 'remotes/mjt/tags/pull-trivial-patches-2015-07-27' into staging
# good: [21a03d17f2edb1e63f7137d97ba355cc6f19d79f] AioContext: fix broken placement of event_notifier_test_and_clear
# bad: [e40db4c6d391419c0039fe274c74df32a6ca1a28] Merge remote-tracking branch 'remotes/jnsnow/tags/cve-2015-5154-pull-request' into staging
# bad: [30fdfae49d53cfc678859095e49ac60b79562d6f] Merge remote-tracking branch 'remotes/rth/tags/pull-tcg-20150723' into staging
# bad: [12e21eb088a51161c78ee39ed54ac56ebcff4243] Merge remote-tracking branch 'remotes/ehabkost/tags/numa-pull-request' into staging
# bad: [dc94bd9166af5236a56bd5bb06845911915a925c] Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging
# good: [b9c46307996856d03ddc1527468ff5401ac03a79] Merge remote-tracking branch 'remotes/mdroth/tags/qga-pull-2015-07-21-tag' into staging
# bad: [05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3] AioContext: optimize clearing the EventNotifier
# first bad commit: [05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3] AioContext: optimize clearing the EventNotifier
#

commit 05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3 (HEAD, refs/bisect/bad)
Author: Paolo Bonzini <email address hidden>
Date: Tue Jul 21 16:07:53 2015 +0200

    It is pretty rare for aio_notify to actually set the EventNotifier. It
    can happen with worker threads such as thread-pool.c's, but otherwise it
    should never be set thanks to the ctx->notify_me optimization. The
    previous patch, unfortunately, added an unconditional call to
    event_notifier_test_and_clear; now add a userspace fast path that
    avoids the call.

    Note that it is not possible to do the same with event_notifier_set;
    it would break, as proved (again) by the included formal model.

    This patch survived over 3000 reboots on aarch64 KVM.

    Signed-off-by: Paolo Bonzini <email address hidden>
    Reviewed-by: Fam Zheng <email address hidden>
    Tested-by: Richard W.M. Jones <email address hidden>
    Message-id: <email address hidden>
    Signed-off-by: Stefan Hajnoczi <email address hidden>

## UNDERSTANDING

The basic logic for AioContext before this change was:

QEMU has its own asynchronous I/O implementation, whose master structure is referred to as the "AioContext". This asynchronous I/O subsystem is used by different parts of QEMU (the main loop and the newer I/O threads, which are also based on AioContext). An AioContext is a GSource implementation (from glib): it handles different types of file-descriptor structures (polling them) and calls their callbacks when re-scheduling is needed (bottom halves), when data is present on the file descriptors (pollfd), or after timeouts. There are callbacks for RFifoLocks (similar to mutexes, with recursion), AioHandlers (using poll to monitor file descriptors), QEMU's deferred-work implementation called "bottom halves" (just like the kernel's bottom-half idea), and notification events.

The mechanism AioContext uses, in a simplified form, is this:

- aio is basically a scheduler where you can register callbacks.
- aio first tries to dispatch the bottom halves and then the aio handlers.
- the aio structure has fields for attention/notification: notify_me, notified and notifier (see the sketch after this list).
- through these notification fields the aio subsystem knows, for example, whether a blocking aio_poll() needs to be woken up.
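
A rough sketch of those notification-related fields (names follow the discussion above; the real struct in QEMU's include/block/aio.h has many more members, and EventNotifier here is a simplified stand-in):

#include <stdbool.h>
#include <stdint.h>

typedef struct EventNotifier {
    int rfd;                  /* fd the poller watches/reads */
    int wfd;                  /* fd aio_notify() writes to */
} EventNotifier;

typedef struct AioContext {
    /* ... */
    uint32_t notify_me;       /* non-zero while a thread is about to block in
                                 aio_poll(), so aio_notify() knows a real
                                 wakeup is needed */
    bool notified;            /* set by aio_notify(), cleared on the "ack" */
    EventNotifier notifier;   /* fd pair used to wake up the poller */
    /* ... */
} AioContext;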

Example:

- an aio user creates a bottom half with a specific callback and schedules it.
- the qemu_bh_schedule() function then calls aio_notify():
  - to set the "notified" variable to true.
  - to signal the event notifier (notifier) by writing a "1" to it.
- the aio subsystem "acks" this event when:
  - running aio_ctx_check() before the next iteration (GSource model).
  - running aio_poll(), the core loop function.
- the "ack" consists of:
  - setting "notified" back to false.
  - clearing the event notifier (notifier), as the sketch below illustrates.
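
As a stand-alone illustration of that notify/ack handshake (Linux eventfd; names mirror the AioContext fields but this is not QEMU code, just a sketch of the idea):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

static int notifier_fd;
static bool notified;

static void aio_notify_sketch(void)
{
    uint64_t one = 1;
    notified = true;                                    /* "notified" goes to true */
    if (write(notifier_fd, &one, sizeof(one)) < 0) {    /* wake up whoever polls the fd */
        perror("write");
    }
}

static void aio_ack_sketch(void)
{
    uint64_t value;
    notified = false;                                   /* "notified" back to false */
    if (read(notifier_fd, &value, sizeof(value)) < 0) { /* drain/clear the notifier */
        perror("read");
    }
}

int main(void)
{
    notifier_fd = eventfd(0, 0);
    aio_notify_sketch();   /* e.g. from qemu_bh_schedule() */
    aio_ack_sketch();      /* e.g. from aio_ctx_check()/aio_poll() */
    close(notifier_fd);
    return 0;
}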

With that said, when observing the change more carefully:

- event_notifier_test_and_clear() is not called directly anymore.
- now, to clean up the event notifier, there is a different "logic" (paraphrased below).
- the logic itself doesn't seem to be the problem:
  - the event notifier is a struct with 2 file descriptors (to be read) in it.
  - clearing it right when called or on the next poll() makes no functional difference.
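
Roughly, the relevant change (paraphrased here from the commit for clarity; not a verbatim diff of async.c) is:

- event_notifier_test_and_clear(&ctx->notifier);     /* old: always drain the notifier */
+ if (atomic_xchg(&ctx->notified, false)) {           /* new: drain it only if          */
+     event_notifier_test_and_clear(&ctx->notifier);  /* aio_notify() really signalled  */
+ }                                                   /* it since the last check        */

with aio_notify() now also doing "atomic_mb_set(&ctx->notified, true);" right after event_notifier_set().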

## ANALYSIS (based on assumptions since I'm not reproducing this locally)

# Probable cause

Unfortunately, synchronizing the new variable required some barriers:

+ atomic_mb_set(&ctx->notified, true);

#ifndef atomic_mb_set
#define atomic_mb_set(ptr, i) do { \
    smp_wmb(); \
    atomic_set(ptr, i); \
    smp_mb(); \
} while (0)
#endif

The "smp_wmb()" macro is either 1 or 2 compiler barrier (depending on the architecture). I believe that x86_64 is using __sync_synchronize() directive, causing a full HW barrier to happen (not only an out-of-order correction, but, a real barrier). The "smp_mb()" is the same thing, 1 full HW barrier. So, for this particular change, we added at least two full HW barriers (bigger cost for sensitive workload).

and

+ if (atomic_xchg(&ctx->notified, false)) {

This is a macro translating to the __atomic_exchange() builtin which, likely, makes GCC emit another memory barrier to keep the exchange atomic with respect to cache coherency across the different cache levels.
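
To get a feel for the cost involved, here is a small standalone micro-benchmark sketch (assumptions: GCC/Clang on x86_64 Linux; purely illustrative, not QEMU code) comparing a plain store with the store-plus-full-barrier pattern that atomic_mb_set() boils down to:

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static volatile bool notified;

static double run(long iters, bool with_barrier)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        notified = true;                 /* the plain store */
        if (with_barrier) {
            __sync_synchronize();        /* full memory barrier (mfence on x86_64) */
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    long iters = 100000000;
    printf("plain store:        %.3f s\n", run(iters, false));
    printf("store + full fence: %.3f s\n", run(iters, true));
    return 0;
}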

# VirtIO

This commit has probably made a difference in the end user's tests because virtio-net uses the QEMU bottom-half implementation from AIO, instead of the "timer" implementation, for its deferred work. I'm not sure whether the timer mode would be better or even worse (txmode=timer for the virtio device).

https://pastebin.ubuntu.com/26417646/

The qemu_bh_new() call creates a bottom half for "virtio_net_tx_bh", which is the function responsible for re-scheduling the bottom half with "qemu_bh_schedule", so virtio_net_flush_tx() can run again and again (causing the barriers to be executed again and again); a reduced sketch of that loop follows.
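
A very reduced stand-alone sketch of that re-scheduling loop (stub functions and made-up "pending" accounting, only to show the shape of the loop; not the real hw/net/virtio-net.c code):

#include <stdio.h>

typedef struct TxQueueSketch {
    int pending;                      /* packets the guest still wants sent */
} TxQueueSketch;

static void qemu_bh_schedule_sketch(void)
{
    /* in real QEMU this marks the bh as scheduled and calls aio_notify(),
     * which is where the new barriers are paid on every re-arm */
    printf("bh scheduled -> aio_notify() -> barriers\n");
}

static int virtio_net_flush_tx_sketch(TxQueueSketch *q)
{
    if (q->pending > 0) {
        q->pending--;                 /* pretend one packet was flushed */
    }
    return q->pending;                /* work left? */
}

static void virtio_net_tx_bh_sketch(TxQueueSketch *q)
{
    if (virtio_net_flush_tx_sketch(q) > 0) {
        qemu_bh_schedule_sketch();    /* re-arm: run again on the next iteration */
    }
}

int main(void)
{
    TxQueueSketch q = { .pending = 3 };
    qemu_bh_schedule_sketch();        /* first kick from the guest */
    while (q.pending > 0) {
        virtio_net_tx_bh_sketch(&q);
    }
    return 0;
}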

# Observations

If we confirm that this is the actual problem - likely - it is possible that we can make this change more efficient, or even remove it, but that will likely depend on upstream accepting what we propose. The reason behind this is simple: they may choose to keep the barriers even if they cause lower throughput for some workloads.