> I tried to reproduce the problem locally, on the same commit, with the
> steps you described, but I wasn't lucky. I tried keeping the image on my
> home directory (XFS), on tmpfs, and finally on a newly created ext4
> filesystem on a spare LVM volume, but the image just wouldn't break even
> after letting the loop run for a quite a while.

That's certainly an important data point. Is it possible that we're
talking about some kind of miscompilation here, maybe because gcc-9.2.0
is just that tiny bit too spanking current?

> So as the next step I would like to test my theory that the problem
> isn't bdrv_co_block_status() returning a different value after the
> commit, but that qcow2_detect_metadata_preallocation() even runs. I
> think the easiest way to do this would be modifying handle_alloc_space()
> so that it performs the checks, but skips its optimisation regardless of
> the is_zero_cow() return value:
>
>     if (!is_zero_cow(bs, m) || true) {
>         continue;
>     }

I made the change and the problem went away. Then, extrapolating the
gist of your methodology :), I went ahead and disabled only
bdrv_co_pwrite_zeroes() by placing a continue in front of it but let
qcow2_pre_write_overlap_check() execute, and the problem reappeared. I
certainly did not expect that to happen because the function name ends
in _check(), suggesting read-only access. And it's not even touched by
the commit.

This had me so rattled that I revalidated that the problem does indeed
not occur with the commit before. And it does not. I left it running for
about half an hour without problems.

After some more tests I finally figured out that even with -g and no -O
gcc is smart enough to optimize out (!is_zero_cow() || true) and that
corruption only happens if is_zero_cow() is actually called. Corruption
also does not occur if I make is_zero_cow() or is_unallocated() return 0
always.
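For anyone who wants to repeat this kind of experiment without having to
guess what the compiler does with a constant condition, here is a minimal
sketch of one way to do it. It is not QEMU code: probe_is_zero_cow(),
probe_calls, run_requests() and skip_optimisation are hypothetical
stand-ins I made up for illustration. The idea is to pass the "always
skip" switch in at run time, so it cannot be folded away at compile time,
and to count invocations so it is observable that the check really ran.

```c
#include <stdbool.h>

/* Hypothetical stand-in for is_zero_cow(); NOT the QEMU function. */
static unsigned long probe_calls;

static bool probe_is_zero_cow(int request)
{
    probe_calls++;              /* visible side effect: proves the call ran */
    return (request % 2) == 0;  /* arbitrary answer, just for the sketch */
}

/*
 * Mirrors the shape of the loop in question: perform the check for every
 * request, then skip the optimisation whenever skip_optimisation is set.
 * Returns how many requests took the (pretend) optimised path.
 */
static unsigned long run_requests(int n, bool skip_optimisation)
{
    unsigned long optimised = 0;

    for (int i = 0; i < n; i++) {
        bool zero = probe_is_zero_cow(i);  /* always evaluated first */

        if (!zero || skip_optimisation) {
            continue;
        }
        optimised++;
    }
    return optimised;
}
```

With skip_optimisation set, run_requests() should report zero optimised
requests while probe_calls still grows by one per request, which is the
"check runs, optimisation does not" situation described above.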
So my first guess was that is_unallocated() sometimes returns false
positives, making is_zero_cow() report false positives which are not
caught by qcow2_pre_write_overlap_check() and cause
bdrv_co_pwrite_zeroes() to zero out actual data. That seemed a bit
convoluted to me.

But then I realized that corruption still occurs if the rest of
handle_alloc_space() is disabled like so:

--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2185,9 +2185,8 @@ static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
             continue;
         }
 
-        if (!is_zero_cow(bs, m)) {
-            continue;
-        }
+        is_zero_cow(bs, m);
+        continue;
 
         /*
          * instead of writing zero COW buffers,

So it's much more likely that is_zero_cow() has a side-effect that
somehow causes corruption later on even without handle_alloc_space()
ever calling bdrv_co_pwrite_zeroes(). That would also explain why
qcow2_pre_write_overlap_check() does not catch those false positives
overwriting metadata: there simply are none.

Putting a breakpoint on handle_alloc_space() and single stepping into
is_zero_cow() I do indeed end up in bdrv_co_block_status():

(gdb) bt
#0  0x0000555555d610fd in bdrv_co_block_status (bs=0x5555567c69e0, want_zero=false, offset=5242880, bytes=12288, pnum=0x7ffedffd7b28, map=0x0, file=0x0) at block/io.c:2048
#1  0x0000555555d6167e in bdrv_co_block_status_above (bs=0x5555567c69e0, base=0x0, want_zero=false, offset=5242880, bytes=12288, pnum=0x7ffedffd7b28, map=0x0, file=0x0) at block/io.c:2190
#2  0x0000555555d61764 in bdrv_block_status_above_co_entry (opaque=0x7ffedffd7a10) at block/io.c:2220
#3  0x0000555555d6188f in bdrv_common_block_status_above (bs=0x5555567c69e0, base=0x0, want_zero=false, offset=5242880, bytes=12288, pnum=0x7ffedffd7b28, map=0x0, file=0x0) at block/io.c:2255
#4  0x0000555555d61afa in bdrv_is_allocated (bs=0x5555567c69e0, offset=5242880, bytes=12288, pnum=0x7ffedffd7b28) at block/io.c:2285
#5  0x0000555555d61b8c in bdrv_is_allocated_above (top=0x5555567c69e0, base=0x0, offset=5242880, bytes=12288, pnum=0x7ffedffd7b80) at block/io.c:2323
#6  0x0000555555d12d48 in is_unallocated (bs=0x5555567c69e0, offset=5242880, bytes=12288) at block/qcow2.c:2151
#7  0x0000555555d12dbc in is_zero_cow (bs=0x5555567c69e0, m=0x5555569d35b0) at block/qcow2.c:2162
#8  0x0000555555d12e9c in handle_alloc_space (bs=0x5555567c69e0, l2meta=0x5555569d35b0) at block/qcow2.c:2188
#9  0x0000555555d13321 in qcow2_co_pwritev (bs=0x5555567c69e0, offset=5255168, bytes=4096, qiov=0x7fffe82ec310, flags=0) at block/qcow2.c:2302
#10 0x0000555555d5e6d5 in bdrv_driver_pwritev (bs=0x5555567c69e0, offset=5255168, bytes=4096, qiov=0x7fffe82ec310, flags=0) at block/io.c:1043
#11 0x0000555555d6014b in bdrv_aligned_pwritev (child=0x55555675cf80, req=0x7ffedffd7e50, offset=5255168, bytes=4096, align=1, qiov=0x7fffe82ec310, flags=0) at block/io.c:1670
#12 0x0000555555d60d77 in bdrv_co_pwritev (child=0x55555675cf80, offset=5255168, bytes=4096, qiov=0x7fffe82ec310, flags=0) at block/io.c:1897
#13 0x0000555555d47cc7 in blk_co_pwritev (blk=0x5555567c6730, offset=5255168, bytes=4096, qiov=0x7fffe82ec310, flags=0) at block/block-backend.c:1183
#14 0x0000555555d484aa in blk_aio_write_entry (opaque=0x7fffe823f920) at block/block-backend.c:1382
#15 0x0000555555e3ff91 in coroutine_trampoline (i0=-399759776, i1=32767) at util/coroutine-ucontext.c:116
#16 0x00007ffff5fc61a0 in  () at /lib64/libc.so.6
#17 0x00007ffff17c5920 in  ()
#18 0x0000000000000000 in  ()

At that point it had gotten too late to even attempt to wrap my brain
around the whole BDRV_BLOCK_RECURSE logic. But I think the above gives a
strong(er|ish) connection between the change and the corruption and how
handle_alloc_space() ties into it.

Let me know what else I could check to help track this down.