On Mon, Oct 07, 2019 at 04:58:30PM +0200, Paolo Bonzini wrote:
> On 07/10/19 16:44, dann frazier wrote:
> > On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
> >> On 02/10/19 11:23, Jan Glauber wrote:
> >>> I've looked into this on ThunderX2. The arm64 code generated for the
> >>> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> >>> memory barriers. It is just plain ldaxr/stlxr.
> >>>
> >>> From my understanding this is not sufficient for SMP sync.
> >>>
> >>> If I read this comment correctly:
> >>>
> >>> void aio_notify(AioContext *ctx)
> >>> {
> >>>     /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> >>>      * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >>>      */
> >>>     smp_mb();
> >>>     if (ctx->notify_me) {
> >>>
> >>> it points out that the smp_mb() should be paired. But as I said, the
> >>> atomics used don't generate any barriers at all.
> >>
> >> Based on the rest of the thread, this patch should also fix the bug:
> >>
> >> diff --git a/util/async.c b/util/async.c
> >> index 47dcbfa..721ea53 100644
> >> --- a/util/async.c
> >> +++ b/util/async.c
> >> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
> >>      aio_notify_accept(ctx);
> >>
> >>      for (bh = ctx->first_bh; bh; bh = bh->next) {
> >> -        if (bh->scheduled) {
> >> +        if (atomic_mb_read(&bh->scheduled)) {
> >>              return true;
> >>          }
> >>      }
> >>
> >> And the memory barrier in aio_notify can also be replaced with a
> >> SEQ_CST load:
> >>
> >> diff --git a/util/async.c b/util/async.c
> >> index 47dcbfa..721ea53 100644
> >> --- a/util/async.c
> >> +++ b/util/async.c
> >> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
> >>
> >>  void aio_notify(AioContext *ctx)
> >>  {
> >> -    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> >> -     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >> +    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
> >> +     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
> >> +     * atomic_add in aio_poll.
> >>       */
> >> -    smp_mb();
> >> -    if (ctx->notify_me) {
> >> +    if (atomic_mb_read(&ctx->notify_me)) {
> >>          event_notifier_set(&ctx->notifier);
> >>          atomic_mb_set(&ctx->notified, true);
> >>      }
> >>
> >> Would you be able to test these (one by one possibly)?
> >
> > Paolo,
> >   I tried them both separately and together on a Hi1620 system; each
> > time it hung in the first iteration. Here's a backtrace of a run with
> > both patches applied:
>
> Ok, not a great start... I'll find myself an aarch64 machine and look
> at it more closely. I'd like the patch to be something we can
> understand and document, since this is probably the second most-used
> memory barrier idiom in QEMU.
>
> Paolo

I'm still not sure what the actual issue is here, but could it be some
bad interaction between notify_me and list_lock? They are both 4 bytes
and sit side by side:

address notify_me: 0xaaaadb528aa0  sizeof notify_me: 4
address list_lock: 0xaaaadb528aa4  sizeof list_lock: 4

AFAICS the generated code looks OK (all load/store exclusive done with
32-bit size):

     e6c:       885ffc01        ldaxr   w1, [x0]
     e70:       11000821        add     w1, w1, #0x2
     e74:       8802fc01        stlxr   w2, w1, [x0]

...but if I bump the notify_me size to uint64_t the issue goes away.

BTW, the image file I convert in the testcase is ~20 GB.
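FWIW, the pairing that the aio_notify() comment describes is the classic
store-buffering pattern: each side writes its flag, a full barrier follows,
and then each side reads the other's flag; the guarantee is that at least
one side must observe the other's write. Below is a minimal standalone C11
sketch of that idiom -- my own toy example, not QEMU code; the names
(sb.c, notifier, poller) are made up, and scheduled/notify_me merely stand
in for the fields discussed above:

/* sb.c -- build with: gcc -O2 -pthread sb.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int scheduled;   /* stands in for bh->scheduled   */
static atomic_int notify_me;   /* stands in for ctx->notify_me  */

/* The aio_notify() side: publish work, then check whether a poller
 * is waiting.  The SEQ_CST store and load play the role of smp_mb(). */
static void *notifier(void *arg)
{
    atomic_store(&scheduled, 1);
    if (atomic_load(&notify_me)) {
        puts("notifier: poller is inside poll, kick it");
    }
    return NULL;
}

/* The aio_poll() side: announce the intent to block, then check for
 * already-published work.  (Real aio_poll would go on to block in
 * ppoll and subtract 2 again afterwards.) */
static void *poller(void *arg)
{
    atomic_fetch_add(&notify_me, 2);   /* like atomic_add in aio_poll */
    if (atomic_load(&scheduled)) {
        puts("poller: work already scheduled, don't block");
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, notifier, NULL);
    pthread_create(&b, NULL, poller, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* With SEQ_CST on all four accesses, both threads reading the
     * initial zero is forbidden, so at least one message is printed.
     * Weaken the ordering and the lost wakeup becomes possible. */
    return 0;
}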
--Jan

diff --git a/include/block/aio.h b/include/block/aio.h
index a1d6b9e24939..e8a5ea3860bb 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -83,7 +83,7 @@ struct AioContext {
      * Instead, the aio_poll calls include both the prepare and the
      * dispatch phase, hence a simple counter is enough for them.
      */
-    uint32_t notify_me;
+    uint64_t notify_me;
 
     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
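In case anyone wants to reproduce the codegen quoted above without
building all of QEMU, something along these lines should do (the file
and function names are made up; assumes GCC's __atomic builtins):

/* notify-me-add.c -- build and inspect with:
 *     gcc -O2 -c notify-me-add.c && objdump -d notify-me-add.o
 * With GCC circa 2019 targeting aarch64 (i.e. without LSE instructions
 * or the newer out-of-line atomics), the SEQ_CST fetch-add compiles to
 * the same ldaxr/add/stlxr exclusive loop shown above -- acquire/release
 * semantics on the exclusive pair, but no separate dmb instruction. */
#include <stdint.h>

void notify_me_add(uint32_t *notify_me)
{
    __atomic_fetch_add(notify_me, 2, __ATOMIC_SEQ_CST);
}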