Bug #1846427 “4.1.0: qcow2 corruption on savevm/quit/loadvm cycl...” : Bugs : QEMU

Revision history for this message

Dr. David Alan Gilbert (dgilbert-h) wrote on 2019-10-03:

#1

cc'd in kwolf since he signed off on that change.

Michael Weiser (michael-weiser) wrote on 2019-10-16:

#2

> I'm seeing massive corruption of qcow2 images with qemu 4.1.0 and git master
> as of 7f21573c822805a8e6be379d9bcf3ad9effef3dc after a few
> savevm/quit/loadvm cycles.
[...]
> bisected the introduction of the problem to commit
> 69f47505ee66afaa513305de0c1895a224e52c45
> (block: avoid recursive block_status call if possible).

In case it got lost in all the blurb: qemu 4.1.0 is essentially eating VMs by corrupting their images in very short order. Assuming no aggravating circumstances on my end, I'd expect this to have the potential to hit a lot of users very hard once qemu 4.1.0 starts appearing in distros.

Downgrading to 4.0.0 works around the problem for me for now.

Just let me know if there's anything I can do to assist.

Dr. David Alan Gilbert (dgilbert-h) wrote on 2019-10-16:

#3

Hi Michael,
How sure are you that it's that commit - have you checked the commit before it?

Michael Weiser (michael-weiser) wrote on 2019-10-16:

#4

Yes. As said:

> qemu compiled from the commit before does not exhibit the issue, from that
> commit on it does and reverting the commit off of current master makes it
> disappear.

In my tests the problem only occurs with that commit in the code. I used git bisect to narrow it down to that commit. Even just reverting it off of current master made it go away with my automated reproducer.

If helpful I can retest manually with a real-world VM. OTOH it would certainly be helpful if someone else said they can or cannot reproduce the problem based on my description of the reproducer.

Michael Weiser (michael-weiser) wrote on 2019-10-16:

#5

Attachment: revert commit 69f47505ee66afaa513305de0c1895a224e52c45 (9.2 KiB, text/plain)

I just quickly retested with today's master (commit 69b81893bc28feb678188fbcdce52eff1609bdad) and the automated reproducer. With the attached revert patch applied, the loadvm/sleep 10/savevm/quit loop ran 50 times without a problem. As soon as I removed the patch, recompiled, and replaced the qemu binary with the unpatched, newly compiled one, it took only two more runs of the loop to produce this output:

QEMU 4.1.50 monitor - type 'help' for more information
(qemu) loadvm foo
(qemu) c
(qemu) stop
(qemu) savevm foo
(qemu) quit
QEMU 4.1.50 monitor - type 'help' for more information
(qemu) loadvm foo
(qemu) c
(qemu) stop
(qemu) savevm foo
Error: Error while deleting snapshot on device 'd': Failed to free the cluster and L1 table: Invalid argument
(qemu) quit
QEMU 4.1.50 monitor - type 'help' for more information
(qemu) loadvm foo
Error: Device 'd' does not have the requested snapshot 'foo'
(qemu) c
(qemu) qcow2_free_clusters failed: Invalid argument
qcow2_free_clusters failed: Invalid argument
qcow2_free_clusters failed: Invalid argument
qcow2_free_clusters failed: Invalid argument
qcow2_free_clusters failed: Invalid argument
qcow2_free_clusters failed: Invalid argument
^Cqemu-system-x86_64: terminating on signal 2
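
An automated reproducer needs some way to decide that a run has gone wrong. The following is only a minimal sketch of such a check, assuming the monitor output is captured as text; the function name and the list of markers are illustrative and are taken from the strings in the transcript above, not from QEMU itself.

```python
import re
from typing import List

# Failure markers copied from the transcript above. This list is an
# assumption for illustration and is certainly not exhaustive.
FAILURE_PATTERNS = [
    re.compile(r"^Error: "),
    re.compile(r"qcow2_free_clusters failed"),
]


def find_failures(transcript: str) -> List[str]:
    """Return the transcript lines that match any known failure marker."""
    hits = []
    for line in transcript.splitlines():
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS):
            hits.append(line.strip())
    return hits


# Excerpt of the monitor output quoted in this comment.
sample = """\
(qemu) savevm foo
Error: Error while deleting snapshot on device 'd': Failed to free the cluster and L1 table: Invalid argument
(qemu) quit
(qemu) loadvm foo
Error: Device 'd' does not have the requested snapshot 'foo'
(qemu) c
(qemu) qcow2_free_clusters failed: Invalid argument
qcow2_free_clusters failed: Invalid argument
"""

for hit in find_failures(sample):
    print(hit)
```

A loop driver could stop at the first iteration for which `find_failures` returns a non-empty list.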

qemu-img check then reports:

48857 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.

115210 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.
259259/327680 = 79.12% allocated, 2.51% fragmented, 0.00% compressed clusters
Image end offset: 17942052864
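
For comparing runs, the numbers in this report can be pulled out mechanically. The sketch below assumes the text layout shown above; the function name and dictionary keys are made up for illustration, and the regular expressions are matched only against the lines quoted in this comment.

```python
import re


def parse_check_summary(text):
    """Extract error, leak, and allocation counts from the report text above."""
    summary = {}
    m = re.search(r"^(\d+) errors were found on the image", text, re.M)
    if m:
        summary["errors"] = int(m.group(1))
    m = re.search(r"^(\d+) leaked clusters were found on the image", text, re.M)
    if m:
        summary["leaked_clusters"] = int(m.group(1))
    m = re.search(r"^(\d+)/(\d+) = [\d.]+% allocated", text, re.M)
    if m:
        summary["allocated_clusters"] = int(m.group(1))
        summary["total_clusters"] = int(m.group(2))
    return summary


report = """\
48857 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.

115210 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.
259259/327680 = 79.12% allocated, 2.51% fragmented, 0.00% compressed clusters
Image end offset: 17942052864
"""

summary = parse_check_summary(report)
# Recompute the allocation percentage from the two cluster counts.
allocated_pct = 100.0 * summary["allocated_clusters"] / summary["total_clusters"]
print(summary)
print(f"{allocated_pct:.2f}% allocated")
```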

Laszlo Ersek (Red Hat) (lersek) wrote on 2019-10-16:

#6

I haven't done any sort of "narrowing down", but recent QEMUs (built from the master branch, post-v4.1) have corrupted at least two VM disk images (qcow2) for me as well. I had to reinstall both VMs.

I didn't make any noise because I was sure that, if I wasn't seeing ghosts, then others must have encountered the symptom earlier than I did, and filed bug reports with more details than I had time for.

Perhaps relevant: my use case lacks savevm/loadvm. I only boot and shut down VMs.

My symptoms have been:
- qemu refusing to start, due to the qcow2 image being corrupt
- qemu-img reporting the image as corrupt
- applications in guests that checksum data reporting problems (such as RPM complaining about RPMDB corruption)

I think the affected qcow2 images may have had compressed clusters. (I no longer have the images.)

psyhomb (psyhomb) wrote on 2019-10-16:

#7

I can confirm exactly the same issue on Arch Linux running qemu-4.1.0.

After downgrading from 4.1.0 => 4.0.0, everything is running normally again: no corruption detected, and all qcow2 images stay healthy.

Laszlo Ersek (Red Hat) (lersek) wrote on 2019-10-17:

#8

After reading the message of commit 69f47505ee66 ("block: avoid
recursive block_status call if possible", 2019-06-04), I'm none the
wiser. But, I can at least confirm that all my qcow2 images are
pre-allocated, as a norm. I create them with the following command:

qemu-img create \
  -f qcow2 \
  -o compat=1.1 \
  -o cluster_size=65536 \
  -o preallocation=metadata \
  -o lazy_refcounts=on \
  $FILENAME \
  100G

Perhaps this helps with reproducing the issue. The commit message says,
"However, lseek is needed when we have metadata-preallocated image", so
that might be a special case that I've hit with some frequency.

I do have a vague suspicion that the following idea:

    The idea is to compare allocation size in POV of filesystem with
    allocations size in POV of Qcow2 (by refcounts). If allocation in fs is
    significantly lower, consider it as metadata-preallocation case.

is not robust enough. From the description, the "metadata-preallocation
case" appears to be determined with *heuristics*, but then again, "lseek
is needed when we have metadata-preallocated image". So if there is a
clear requirement to behave differently / particularly for
metadata-preallocated images, why is it safe to (basically) *guess*
whether a given image had its metadata pre-allocated?

+ threshold = MAX(real_clusters * 10 / 9, real_clusters + 2);

Where do those constants come from?
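
To see what the two constants do in practice, the quoted expression can be evaluated for a few values. This is only a sketch mirroring the single line quoted above; the function name is made up, and the interpretation in the comments assumes `real_clusters` is the cluster count as seen by the filesystem, as the commit message suggests.

```python
def preallocation_threshold(real_clusters: int) -> int:
    """Mirror of the quoted C expression.

    `//` matches C integer division here because the values are non-negative.
    """
    return max(real_clusters * 10 // 9, real_clusters + 2)


# The "+ 2" term wins for small counts, the "* 10 / 9" term for large ones.
for real_clusters in (0, 9, 17, 18, 27, 90, 900):
    ratio_term = real_clusters * 10 // 9
    additive_term = real_clusters + 2
    print(real_clusters, ratio_term, additive_term,
          preallocation_threshold(real_clusters))
```

Since `n * 10 // 9` equals `n + n // 9`, the two terms are equal at 18 clusters; below that the threshold is simply "two clusters more", and above it the threshold sits roughly 11% over `real_clusters`.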

... Not sure if it matters: the host filesystem holding my qcow2 images
is "ext4". Filesystem features (dumped with the fs being mounted r/w at
the moment): has_journal, ext_attr, resize_inode, dir_index, filetype,
needs_recovery, extent, flex_bg, sparse_super, large_file, huge_file,
uninit_bg, dir_nlink, extra_isize. Filesystem flags:
signed_directory_hash.

Thanks.

Laszlo Ersek (Red Hat) (lersek) wrote on 2019-10-17:

#9

(See also / possible duplicate: <https://bugs.launchpad.net/qemu/+bug/1847793>.)

Michael Weiser (michael-weiser) wrote on 2019-10-18:

#10

My qcow2 images also reside on an ext4 with features "has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file dir_nlink extra_isize metadata_csum" on a luks-encrypt(ed|ing) device mapper device backed by a partition on an NVMe SSD. The setup is rock solid and I had no other indications of it causing corruption or being corrupted.

I did a quick test with a 32GB USB3 flash drive formatted as a super floppy (without partitions or encryption) with XFS and saw the same corruption, though less heavily so, likely because the drive is much slower (~60MB/s write instead of ~600MB/s write for the NVMe SSD).

The savevm/loadvm cycle was basically the first reliable and fast reproducer I was able to find. I have a dim recollection that some of my corruptions also did not involve any loadvm/savevm but were much rarer and not as easily reproducible.

Simon John (sej7278) wrote on 2019-10-20:

#11

Not sure if I have exactly the same problem, as my qcow2 corruption seems to be limited to Windows 10 guests - win2019 and debian10 guests with the same virtio-scsi setup are fine (as are various virtio-blk or ide/sata images from linux/solaris/macos guests).

I find that I randomly have disk image corruption from little more than boot/shutdown cycles - no heavy usage or anything is required. "qemu-img check -r all" usually makes things worse, as does chkdsk.

host filesystem is an ssd with ext4 on top of luks, discard not used (fstrim.timer instead) with features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum

Reported to Red Hat, as I assumed it was a virtio-win bug: https://bugzilla.redhat.com/show_bug.cgi?id=1762944 - includes virt-install method to reproduce my test VMs (I don't use qemu directly).

Host is debian sid running qemu version 4.1.0 (Debian 1:4.1-1+b3), libvirt 5.6.0-2, kernel 5.2.0-3 (5.2.17-1)

Simon John (sej7278) wrote on 2019-10-20:

#12

Can't seem to reproduce if I convert the qcow2 image to raw+sparse.

Kevin Wolf (kwolf-redhat) wrote on 2019-10-21:

#13

After reading some related code, I have more questions than before, but let's see... As more qcow2 code was merged since, I would suggest that we debug the problem on commit 69f4750 (the bisection result) rather than on anything newer.

First of all: Michael, you didn't specify explicitly how your images were created, but can I assume that the test image is not preallocated (in contrast to Laszlo's)?

I find Laszlo's case with a preallocated image particularly surprising because the behaviour isn't supposed to have changed at all for preallocated images, at least if the heuristic still detects them as such. Once a preallocated image becomes almost fully allocated, it's expected that we won't detect it any more. So, Laszlo, do you know how much of your images was allocated? 'qemu-img check' prints the allocation statistics.
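
As a rough illustration of "almost fully allocated", the sketch below models the heuristic from the description quoted in comment #8. It is not the QEMU implementation: the function name, the `>=` comparison, and the 1000-cluster image are assumptions made for the example; only the threshold expression comes from the quoted patch line.

```python
def looks_metadata_preallocated(qcow2_clusters: int, fs_clusters: int) -> bool:
    """Guess whether an image is metadata-preallocated.

    qcow2_clusters: clusters the qcow2 refcounts consider allocated.
    fs_clusters:    clusters the host filesystem has actually allocated.
    The `>=` comparison is an assumption based on the commit message wording
    ("if allocation in fs is significantly lower").
    """
    threshold = max(fs_clusters * 10 // 9, fs_clusters + 2)
    return qcow2_clusters >= threshold


# Hypothetical image whose refcounts cover 1000 clusters from the start,
# while the filesystem allocation grows as the guest writes data.
QCOW2_CLUSTERS = 1000
for fs_clusters in (100, 500, 900, 901, 950):
    detected = looks_metadata_preallocated(QCOW2_CLUSTERS, fs_clusters)
    print(fs_clusters, detected)
```

Under these assumptions, detection stops once the filesystem allocation exceeds about 90% of what the refcounts report.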

The next mystery is why bdrv_co_block_status() is even called. I found only a single call that happens with normal guest I/O and savevm/loadvm, and that's the one in handle_alloc_space(). This function is suspicious because it's relatively new, but commit 69f4750 shouldn't have any effect on it because BDRV_BLOCK_ALLOCATED is set independently of BDRV_BLOCK_RECURSE - and even if the change had an effect, it would be that the function is used less, so if anything, a bug could be expected to be hidden rather than become visible.

I think it might be worth a try reproducing with the handle_alloc_space() call commented out. If that doesn't fix/hide the bug, it would be interesting to see what else calls qcow2_detect_metadata_preallocation(), e.g. by setting a breakpoint there in gdb and getting the stack backtrace when it triggers.

Another caller I see in the code, but which didn't get run in my guest, is qcow2_co_pwrite_zeroes(). This is not discard, but maybe the discard mount option does cause a write_zeroes call (WRITE SAME in SCSI) sometimes? But then, your reproducer seems to use AHCI and I can't see a write_zeroes call in the AHCI or IDE device emulation.

The possible (intended) effect of commit 69f4750 is that a block that was previously detected as containing only zeros (BDRV_BLOCK_ZERO) doesn't get this flag any more. This could cause unaligned qcow2_co_pwrite_zeroes() to fail, but then we'd just get a fallback to a normal write, which wouldn't explain any metadata-level corruption.

Michael, would you like to give it a try and figure out in which code path qcow2_detect_metadata_preallocation() is even called in your reproducer and if handle_alloc_space() is linked to this bug somehow?