QEMU

Bug #1846427
Comment #21

Comment 21 for bug 1846427

Revision history for this message

Kevin Wolf (kwolf-redhat) wrote on 2019-10-23:

#21

> So it's much more likely that is_zero_cow() has a side-effect that somehow
> causes corruption later on even without handle_alloc_space() ever calling
> bdrv_co_pwrite_zeroes().

Yes, looks like it. I think we have ruled out that a changing return value is the cause of the problems because the return code was completely ignored and it still broke for you.

Basically the only other thing I see that our commit has changed is that qcow2_detect_metadata_preallocation() runs now. I assume that if you replace its call in qcow2_co_block_status() with a fixed ret = true or ret = false, the problem will still disappear.

Now what is problematic inside qcow2_detect_metadata_preallocation()? At the moment I see two options:

1. qcow2_get_refcount() is the only thing that does something with the qcow2 internals, the other calls are about bs->file->bs (the raw image file), which is pretty certainly harmless. The interesting thing about the qcow2_get_refcount() call is that other code paths call it with s->lock locked, but this one is unlocked. I wonder if moving qemu_co_mutex_lock(&s->lock); in qcow2_co_block_status() to above the qcow2_detect_metadata_preallocation() call would change anything.

2. Or the problem isn't even related to what qcow2_detect_metadata_preallocation() does, but it's a race elsewhere that is just uncovered because of the timing - preallocation detection must be pretty slow because it checks the refcount of every single cluster in the image. In that case, replacing it with something like qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 100000000); should have the same effect and cause corruption, too.