I failed to update the case with the latest analysis here. I was able to isolate the commit with a bisection. Feedback on this would also be appreciated.
#### ANALYSIS
# git bisection log
#
# bad: [a8c40fa2d667e585382080db36ac44e216b37a1c] Update version for v2.5.0 release
# good: [e5b3a24181ea0cebf1c5b20f44d016311b7048f0] Update version for v2.3.0 release
# bad: [6b324b3e5906fd9a9ce7f4f24decd1f1c7afde97] Merge remote-tracking branch 'remotes/stefanha/tags/net-pull-request' into staging
# good: [a3d586f704609a45b6037534cb2f34da5dfd8895] dma/rc4030: create custom DMA address space
# good: [54f3223730736fca1e6e89bb7f99c4f8432fdabb] ahci: factor ncq_finish out of ncq_cb
# good: [e46e1a74ef482f1ef773e750df9654ef4442ca29] target-arm: Fix broken SCTLR_EL3 reset
# bad: [776f87845137a9b300a4815ba6bf6879310795aa] Merge remote-tracking branch 'remotes/mjt/tags/pull-trivial-patches-2015-07-27' into staging
# good: [21a03d17f2edb1e63f7137d97ba355cc6f19d79f] AioContext: fix broken placement of event_notifier_test_and_clear
# bad: [e40db4c6d391419c0039fe274c74df32a6ca1a28] Merge remote-tracking branch 'remotes/jnsnow/tags/cve-2015-5154-pull-request' into staging
# bad: [30fdfae49d53cfc678859095e49ac60b79562d6f] Merge remote-tracking branch 'remotes/rth/tags/pull-tcg-20150723' into staging
# bad: [12e21eb088a51161c78ee39ed54ac56ebcff4243] Merge remote-tracking branch 'remotes/ehabkost/tags/numa-pull-request' into staging
# bad: [dc94bd9166af5236a56bd5bb06845911915a925c] Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging
# good: [b9c46307996856d03ddc1527468ff5401ac03a79] Merge remote-tracking branch 'remotes/mdroth/tags/qga-pull-2015-07-21-tag' into staging
# bad: [05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3] AioContext: optimize clearing the EventNotifier
# first bad commit: [05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3] AioContext: optimize clearing the EventNotifier
#
commit 05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3 (HEAD, refs/bisect/bad)
Author: Paolo Bonzini <email address hidden>
Date: Tue Jul 21 16:07:53 2015 +0200
It is pretty rare for aio_notify to actually set the EventNotifier. It
can happen with worker threads such as thread-pool.c's, but otherwise it
should never be set thanks to the ctx->notify_me optimization. The
previous patch, unfortunately, added an unconditional call to
event_notifier_test_and_clear; now add a userspace fast path that
avoids the call.
Note that it is not possible to do the same with event_notifier_set;
it would break, as proved (again) by the included formal model.
This patch survived over 3000 reboots on aarch64 KVM.
Signed-off-by: Paolo Bonzini <email address hidden>
Reviewed-by: Fam Zheng <email address hidden>
Tested-by: Richard W.M. Jones <email address hidden>
Message-id: <email address hidden>
Signed-off-by: Stefan Hajnoczi <email address hidden>
## UNDERSTANDING
The basic logic of AioContext before this change was as follows:
QEMU has its own asynchronous I/O implementation, whose master structure is referred to as the "AioContext". This asynchronous I/O subsystem is used by different parts of QEMU (the main loop and the newer I/O threads, which are also based on AioContext). An AioContext is a GSource implementation (from glib): it handles different types of file-descriptor structures (polling them) and calls their callbacks when rescheduling is needed (bottom half), when data is present on the file descriptors (pollfd), or after timeouts. There are callbacks for RFifoLocks (similar to mutexes, but recursive), AioHandlers (which use poll to monitor file descriptors), QEMU's deferred-work implementation called "bottom halves" (just like the kernel's bottom-halves idea), and notification events.
The mechanism AioContext uses, in simplified form, is this:
- aio is basically a scheduler where you can register callbacks.
- aio first tries to dispatch bottom halves and then the aio handlers.
- the aio structure has fields for attention notification: notify_me, notified, notifier.
- the aio subsystem knows, through these notifier variables, whether "aio_poll" needs to block, for example.
Example:
- an aio user creates a bottom half with a specific callback and schedules it.
- the qemu_bh_schedule() function already calls "aio_notify()":
  - to set the "notified" variable to true.
  - to signal the event notifier (notifier) by writing "1" to it.
- the aio subsystem "acks" this event when:
  - doing an aio_ctx_check before the next run (g_source model).
  - aio_poll(), the core loop function, runs.
- the "ack" consists of:
  - setting "notified" back to false.
  - clearing the event notifier (notifier).
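The notify/ack cycle above can be sketched as a toy model. This is hypothetical code with names mirroring QEMU's AioContext, not the real implementation; the int "notifier" stands in for the EventNotifier's file descriptors:

```c
#include <stdbool.h>

typedef struct {
    bool notify_me;  /* someone is (about to be) blocked in poll */
    bool notified;   /* an aio_notify() happened since the last ack */
    int  notifier;   /* 1 after event_notifier_set(), 0 after clear */
} AioContextModel;

/* qemu_bh_schedule() ends up here: flag the context and kick the fd. */
static void aio_notify_model(AioContextModel *ctx)
{
    ctx->notified = true;
    ctx->notifier = 1;   /* event_notifier_set(): write "1" to the fd */
}

/* The "ack" done by aio_ctx_check()/aio_poll(): reset the flag and
 * drain the notifier. Returns whether a notification was pending. */
static bool aio_ack_model(AioContextModel *ctx)
{
    if (!ctx->notified) {
        return false;
    }
    ctx->notified = false;
    ctx->notifier = 0;   /* event_notifier_test_and_clear() */
    return true;
}
```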
With that said, observing the change more carefully:
- event_notifier_test_and_clear() is no longer called directly.
- to clean up the event notifier, there is now a different logic.
- the logic itself doesn't seem to be the problem:
  - the event notifier is a struct holding two file descriptors (to be read).
  - clearing it right away or in the next poll() makes no functional difference.
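The behavioral difference can be sketched like this (a simplified model, not QEMU's actual code): before the commit the notifier was drained unconditionally on every check; after it, draining is guarded by an atomic exchange on "notified".

```c
#include <stdbool.h>

static bool notified;
static int  fd_reads;   /* counts read() calls draining the notifier */

static void event_notifier_test_and_clear_sketch(void)
{
    fd_reads++;          /* stands in for read()-until-EAGAIN */
}

/* Before: clear on every aio_ctx_check()/aio_poll() pass. */
static void aio_check_before(void)
{
    event_notifier_test_and_clear_sketch();
}

/* After: clear only when a notification actually arrived. */
static void aio_check_after(void)
{
    if (__atomic_exchange_n(&notified, false, __ATOMIC_SEQ_CST)) {
        event_notifier_test_and_clear_sketch();
    }
}
```

Both variants leave the notifier in the same state; only the number of (and the synchronization around) the drain operations differs.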
## ANALYSIS (based on assumptions since I'm not reproducing this locally)
# Probable cause
Unfortunately, the new variable synchronization required some memory barriers:
+ atomic_mb_set(&ctx->notified, true);
#ifndef atomic_mb_set
#define atomic_mb_set(ptr, i) do { \
smp_wmb(); \
atomic_set(ptr, i); \
smp_mb(); \
} while (0)
#endif
The "smp_wmb()" macro expands to either one or two barriers, depending on the architecture. I believe that on x86_64 it uses the __sync_synchronize() builtin, causing a full hardware barrier (not just a compiler-level reordering fix, but a real fence instruction). "smp_mb()" is the same thing: one full hardware barrier. So, with this particular change, we added at least two full hardware barriers (a higher cost for sensitive workloads).
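Under that assumption, the fallback boils down to two fence instructions around a plain store. This is a sketch only; the actual mapping of smp_wmb() varies per architecture:

```c
#include <stdbool.h>

static volatile bool notified;

/* What the fallback atomic_mb_set() above expands to, assuming (as the
 * analysis does) that both smp_wmb() and smp_mb() map to the GCC
 * __sync_synchronize() builtin, i.e. a full fence (MFENCE on x86_64). */
static void atomic_mb_set_sketch(volatile bool *ptr, bool val)
{
    __sync_synchronize();   /* smp_wmb(): first full barrier */
    *ptr = val;             /* atomic_set(): a plain store */
    __sync_synchronize();   /* smp_mb(): second full barrier */
}
```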
and
+ if (atomic_xchg(&ctx->notified, false)) {
This is a macro translating to the __atomic_exchange() builtin, which likely makes GCC emit another memory barrier to guarantee atomicity and cache coherency across the different cache layers.
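Reduced to the builtin, the guard looks like this (a sketch; __ATOMIC_SEQ_CST is an assumption about the ordering requested). On x86, a sequentially consistent exchange compiles to XCHG, whose implicit LOCK prefix acts as a full memory barrier by itself:

```c
#include <stdbool.h>

static bool notified;

/* Atomically read the old value and store the new one in a single
 * indivisible operation, as the commit's guard on "notified" does. */
static bool atomic_xchg_sketch(bool *ptr, bool val)
{
    return __atomic_exchange_n(ptr, val, __ATOMIC_SEQ_CST);
}
```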
# VirtIO
This commit has probably made a difference for the end user's tests because virtio uses the QEMU bottom-half implementation from AIO instead of the "timer" implementation for deferred work. I'm not sure whether the timer would be better or even worse (txmode=timer for the virtio device).
https://pastebin.ubuntu.com/26417646/
Calling qemu_bh_new() creates a bottom half for "virtio_net_tx_bh", the function responsible for rescheduling the bottom half with "qemu_bh_schedule", so virtio_net_flush_tx can run again and again (causing the barriers to fire again and again).
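That reschedule pattern can be modeled with a toy loop (hypothetical names mirroring virtio-net's functions): the TX bottom half re-arms itself while packets remain, so every flushed batch pays for another aio_notify() and its barriers.

```c
static int pending_pkts;
static int notify_calls;    /* trips through aio_notify()'s barriers */
static int bh_scheduled;

static void qemu_bh_schedule_sketch(void)
{
    bh_scheduled = 1;
    notify_calls++;         /* qemu_bh_schedule() -> aio_notify() */
}

static void virtio_net_tx_bh_sketch(void)
{
    if (pending_pkts > 0) {
        pending_pkts--;                /* virtio_net_flush_tx() batch */
    }
    if (pending_pkts > 0) {
        qemu_bh_schedule_sketch();     /* re-arm: barriers run again */
    }
}

static void run_bh_loop_sketch(void)
{
    while (bh_scheduled) {
        bh_scheduled = 0;
        virtio_net_tx_bh_sketch();
    }
}
```

In this model, N pending batches cost N passes through the notify path, which is why per-notify barrier overhead multiplies under TX load.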
# Observations
If we confirm that this is the actual problem - which seems likely - we can try to make this change more efficient, or even remove it, but that will likely depend on upstream accepting what we propose. The reason is simple: they may choose to keep the barriers even if they cause lower throughput for some workloads.