qemu ide/sata disks do not work well with discard/trim

Bug #1974100 reported by Brent Baccala
This bug affects 1 person
Affects         Status      Importance   Assigned to   Milestone
cloud-images    New         Undecided    Unassigned
qemu (Ubuntu)   Confirmed   Low          Unassigned

Bug Description

I encountered this problem using this script to create a GNS3 appliance:

https://github.com/BrentBaccala/NPDC/blob/master/GNS3/ubuntu.py

I'm running a stock gns3-server on Ubuntu 18. I called it like this:

./ubuntu.py -r 20 -s $((1024*1024)) --vnc --boot-script opendesktop.sh

This uses the Ubuntu 20 server cloudimg to create a new GNS3 appliance with a thin provisioned disk resized to 1024*1024 MB = 1 TB.

Although the disk holds less than 5 GB of data once the installation script finishes, the qcow2 disk file balloons to over 35 GB, and watching the processes running in the virtual machine shows that the culprit is "ext4lazyinit".

I posted a question about this at https://unix.stackexchange.com/questions/700050

Shutting down the instance, mounting the disk using qemu-nbd, and running "debugfs -R dump_unused" on the first partition shows all kinds of junk in the unused blocks.
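
For reference, roughly the inspection sequence I mean (the nbd device and file name here are placeholders, not necessarily the exact ones I used):

sudo modprobe nbd max_part=8
sudo qemu-nbd --connect=/dev/nbd0 hda_disk.qcow2
sudo debugfs -R dump_unused /dev/nbd0p1 | less
sudo qemu-nbd --disconnect /dev/nbd0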

Running "zerofree" on the partition shows hundreds of thousands of blocks being zeroed

Then using "qemu-img convert -O qcow2" to copy the disk image to another qcow2 file and discard zero blocks reduces its size to 4.3 GB.

I tried modifying gns3 to set the discard=on option on the qemu command line; this seems to have no effect.

Here's an actual qemu command line (called from gns3):

/usr/bin/qemu-system-x86_64 -name ubuntu -m 4096M -smp cpus=1,sockets=1 -enable-kvm -machine smm=off -boot order=c -cdrom /home/gns3/GNS3/projects/29706745-5a53-44bf-9313-c8e78089c2f5/29706745-5a53-44bf-9313-c8e78089c2f5_ubuntu.iso -drive file=/home/gns3/GNS3/projects/29706745-5a53-44bf-9313-c8e78089c2f5/project-files/qemu/1deb3b2f-421e-460b-91df-eb36b13d17e9/hda_disk.qcow2,if=ide,index=0,media=disk,id=drive0,discard=on -uuid 1deb3b2f-421e-460b-91df-eb36b13d17e9 -vnc 0.0.0.0:5 -monitor tcp:127.0.0.1:40269,server,nowait -net none -device virtio-net-pci,mac=0c:eb:3b:2f:00:00,netdev=gns3-0 -netdev socket,id=gns3-0,udp=127.0.0.1:20015,localaddr=127.0.0.1:20014

The qemu disk image was created by "qemu-img create", backed by ubuntu-20.04-server-cloudimg-amd64.img, and resizing it with "qemu-img resize".
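
Roughly like this (the size here is illustrative, and newer qemu-img versions may also want an explicit -F qcow2 for the backing format):

qemu-img create -f qcow2 -b ubuntu-20.04-server-cloudimg-amd64.img hda_disk.qcow2
qemu-img resize hda_disk.qcow2 1T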

My best guess is that this is a bug in qemu's handling of disk DISCARD operations.

Tags: bot-comment
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Libera.chat.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1974100/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Revision history for this message
Brent Baccala (baccala) wrote :

best guess as to the package

affects: ubuntu → qemu (Ubuntu)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Brent,
thank you for all the detail and dedication already spent on this!

Just to be sure as there are two theories up for discussion:

1. ext4lazyinit fills it all
2. gns3 with discard=on fills it all due to maybe DISCARD being ignored

I was reading through the links and discussions and I wanted to ask to maybe cut this down to just one thing to talk about.

The references about early/lazy ext4 init indicate that no matter whether it is done early or late, you'd only lose the "normal" inode overhead, which is usually <2%. So while you might see this lazy init working and filling a few blocks, I'd not expect it to fill the image massively.
Can you confirm that?
If so, we can mostly ignore this aspect, other than hinting the team that builds the images to consider using "assume_storage_prezeroed" once that has landed.
If so, we could then focus solely on #2: whether and why discards are being ignored.

OTOH, if I read this right, you said the disk is sized to 1 TB.
So if we assume the usual ~1.5% inode overhead, that would already be ~15 G, right?
So even more so - are we talking about trying to have less inode init, or about a lack of proper DISCARD handling?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Even without proper trimming from GNS3 there would be a regular cleanup on the FS anyway.
Ubuntu has a timer based fstrim to clean up space that was freed without trim/discard awareness.

Assuming that your consumed space is not just inode overhead (I can't help with that), you could check if this ever runs and what it finds:

journalctl -u fstrim.service

example:
May 19 07:34:42 j systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab...
May 19 07:34:42 j fstrim[1301]: /boot/efi: 99.1 MiB (103965696 bytes) trimmed on /dev/vda15
May 19 07:34:42 j fstrim[1301]: /: 2.1 GiB (2282033152 bytes) trimmed on /dev/vda1
May 19 07:34:42 j systemd[1]: fstrim.service: Deactivated successfully.
May 19 07:34:42 j systemd[1]: Finished Discard unused blocks on filesystems from /etc/fstab.

is that active and working in your case?
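
To see whether the timer driving it is enabled at all, something like this should work too:

systemctl status fstrim.timer
systemctl list-timers fstrim.timer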

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Finally about discard to the guest.
I see you have .... ,discard=on in your qemu command line.
But that is only half the deal; depending on various other setup details, the guest may or may not recognize it.

I do not know all the details of your setup, but at least in the past there was quite a difference between disk types. You use if=ide; at least in the past if=scsi, and nowadays if=virtio, were better at passing through any kind of advanced disk feature.

If e.g. the above fstrim service runs but nothing happens, that might be an indicator that discard support is missing.

Here an example from mine connected via virtio:
$ lsblk --discard
NAME     DISC-ALN  DISC-GRAN  DISC-MAX  DISC-ZERO
vda           512       512B        2G          0
├─vda1          0       512B        2G          0
├─vda14         0       512B        2G          0
└─vda15         0       512B        2G          0
vdb           512       512B        2G          0

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Also to attack this from a different angle simultaneously - have you maybe already tried this on older/newer Ubuntu Hosts and did it behave differently there?

Changed in qemu (Ubuntu):
status: New → Incomplete
Revision history for this message
Brent Baccala (baccala) wrote :

I've come up with a more minimal test that doesn't require the whole GNS3 infrastructure.

1. Create a one-line 'meta-data' file:

{instance-id: ubuntu, local-hostname: ubuntu}

2. Create the following 'user-data' file:

#cloud-config
hostname: ubuntu
network:
  config: disabled
resize_rootfs: noblock
users:
  - name: ubuntu
    plain_text_passwd: foobar
    lock_passwd: false
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL

3. Build a CIDATA image acceptable to cloud-init:

genisoimage -input-charset utf-8 -o cloudinit.iso -l -relaxed-filenames -V cidata -graft-points meta-data user-data

4. With a copy of Ubuntu's cloudimg in the current directory, create a 1 TB thin provisioned disk:

qemu-img create -f qcow2 -b ubuntu-20.04-server-cloudimg-amd64.img test.qcow2 1T

5. Start qemu with a VNC server on whatever port you'd like (you must be in group kvm):

qemu-system-x86_64 -enable-kvm -cdrom cloudinit.iso -drive file=test.qcow2,if=ide,media=disk,discard=on -m 4G -vnc 0.0.0.0:88 -net none

I can change ide to virtio, but if I change it to scsi it either hangs during boot (Ubuntu 18 host) or complains "machine type does not support if=scsi,bus=0,unit=0" (Ubuntu 20 host).

I can do the following to get scsi:

qemu-system-x86_64 -enable-kvm -cdrom cloudinit.iso -device virtio-scsi-pci,id=scsi -drive file=test.qcow2,id=root-img,if=none,discard=on -device scsi-hd,drive=root-img -m 4G -vnc 0.0.0.0:88 -net none

For any drive type, you start it running, let it sit there at a boot prompt, and watch the size of test.qcow2. The bad behavior is that it grows into the 30-40 GB range.

On an Ubuntu 18 host, I see the bad behavior for drive types ide and virtio. scsi seems to be OK.

On an Ubuntu 20 host, I only see the bad behavior for drive type ide. virtio and scsi seem OK.

Revision history for this message
Brent Baccala (baccala) wrote :

Incidentally, I was able to fix my immediate problem by switching the disk type to scsi, based on the suggestions given by Christian Ehrhardt.

It's still an open bug that could cause other people some grief. It sure caused me plenty.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Brent,
first of all I'm glad that you got around things via my suggestions.

There is a reason why those device types are usually recommended in newer guides, and why the default in higher-level tools like virt-manager, uvtool, ... is virtio nowadays: it is just more capable.

Thanks for all the cross-checks that you did!
Taking GNS3 out of the picture and comparing just the disk attachments is great.

We can even take cloud-init out of this; all it does is (as instructed) extend the root FS on those disks. Via "resize_rootfs: noblock" it does so in the background, but all it does is grow the fs to the size of the partition.
Therefore we can just as well mkfs.ext4 (or any other fs of anyone's choice) a fresh disk to compare, and even use hot add/remove of disks instead of booting into a new system each time.
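
A minimal sketch of such a comparison disk (names and sizes here are just placeholders):

$ qemu-img create -f qcow2 fresh.qcow2 25G
# attach it to the guest as an extra drive (or hot add it), then inside the guest:
$ sudo mkfs.ext4 /dev/vdb
$ sudo mount /dev/vdb /mnt    # the first mount triggers ext4lazyinit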

So (to me) it really seems to only come down to "qemu ide/sata disks do not work well with discard/trim".
That doesn't seem to be much of a bug, more of a feature request for the known older variant of disk attachments, which is unlikely to happen.

So if we would want to compare it feels that it comes down to:
A) disk attachments, features and options (on the hypervisor)
  examples: discard, detect-zeroes, virtio vs scsi vs sata and such
B) filesystems options
  - in your case defined by how the cloud-image was created
  - But to compare fs-effects we could as well mkfs* a few extra disks

summary: - inode lazy init in a VM fills virtual disk with garbage
+ qemu ide/sata disks do not work well with discard/trim
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Compare:
- 25G qemu images each set to cache=none
- S = ide/sata / V = virtio
- Disk Options:
  1 - discard=unmap
  2 - discard=unmap + detect_zeroes=on
  3 - discard=ignore
  4 - defaults

One can check with lsblk --discard (as mentioned above) how the system thinks
it can discard, and with dumpe2fs we can check the features used on the fs (I'll
use all defaults and compare it with the cloud image root fs).
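
For example (device names here are placeholders for the test disks):
$ lsblk --discard /dev/vdc
$ sudo dumpe2fs -h /dev/vdc | grep -i features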

Steps:
- Boot system, mkfs.ext4 all of them.
- Gather lsblk and dumpe2fs data.
- Shutdown and check image sizes on host
- Lazy zeroing is done on first mount - check #1 before mount.
- boot, mount, wait for lazy init to complete
- shutdown and check image size again - check #2

lsblk:
They all had the same features reported (even ide/sata)
NAME  DISC-ALN  DISC-GRAN  DISC-MAX  DISC-ZERO
sda          0       512B        2G          0
sdb          0       512B        2G          0
sdc          0       512B        2G          0
sdd          0       512B        2G          0
vdc        512       512B        2G          0
vdd        512       512B        2G          0
vde        512       512B        2G          0
vdf        512       512B        2G          0

dumpe2fs:
Visible options are the same, just "recovery" being set on the cloud image.

Size results:

     check #1   check #2
V1   ~10M       ~11M
V2   ~10M       ~11M
V3   ~10M       ~11M
V4   ~10M       ~11M
S1   ~140M      ~540M
S2   ~10M       ~11M
S3   ~140M      ~540M
S4   ~140M      ~540M

Comparing S2 vs the other S* on both checks, we see that discard alone doesn't work
there as well as it does on the newer attachment type via virtio.
But if you - for other reasons - strictly need to get zeroes back on ide, one
can use zero detection for ide/sata disks. This can consume CPU time though,
and in general the right way is just to use a more modern disk type (which - as
mentioned before - is what all tools & automation have been doing for years already).
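
A hedged example of what such an ide/sata attachment could look like on a plain qemu command line (the file name is a placeholder; in libvirt XML this corresponds to discard='unmap' detect_zeroes='unmap' on the <driver> element):

-drive file=test.qcow2,if=ide,media=disk,discard=unmap,detect-zeroes=unmap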

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

@Brent - Let me know if your data and/or lsblk/dumpe2fs look vastly different.

I pondered whether I should call this "a feature request for an old attachment type", which is unlikely
to get much attention. But the only stomach-ache that I have with this (and this is why I do not close it) is that it has announced discard to the guest - which, if it is not working, it should not do.

But I haven't had the time to track down where/if this is lost. For many other performance/feature reasons using a newer attachment type is already strongly encouraged, therefore I'm not sure how much it is worth spending much more time here (interesting it is, but worth the time?).

Others can please chime in here if many people think tracing this down is super-important.

Changed in qemu (Ubuntu):
status: Incomplete → Triaged
importance: Undecided → Low
Revision history for this message
Brent Baccala (baccala) wrote (last edit ):

@Christian - my lsblk output looks very similar to yours. In particular, all device types are reporting discard support; the only difference is the reported discard sizes.

I would suggest that at the end of your tests you check the disk with "debugfs -R dump_unused". I'm seeing disk blocks filled with garbage. I expected them to be filled with zeros.

So, I don't think it's just a feature request. Not only has the device announced discard support, but when you read the blocks back they are not zero filled.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hmm,
interesting - here our results differ then.

All of my 8 attachments do not have that problem.
"debugfs -R dump_unused" does not report anything and zerofree agrees reporting all of them as "none to modify/free" / "almost all is free" / "total blocks"

example:
$ sudo zerofree -vn /dev/sdd
0/6406707/6553600

Have you run anything else to get debugfs/zerofree to show this in your case, or does that even show with the simplified repro in comment #7?

Furthermore I was trying to separate this from the base cloud-image - so all my tests are on new qemu images + mkfs.ext4. Could it be that you see that behavior only on the base image that had content and was extended?

I was looking at this theory (only a problem of the extended base cloud image) by attaching them to nbd in the host after guest shutdown.

zerofree on:
new image / mkfs.ext4 via virtio - 0/6406451/6553344
new image / mkfs.ext4 via ide/sata - 0/6406707/6553600
cloud image / resize2fs via virtio - 4381/1631019/2068731

So indeed I have some (but not much, ~17 MB) crap on the latter.
But the amount could as well be from padding the image in the first place on creation.

I do not have your combination (as I'd never use the old disk type for my root, and all tools that make it easy for me to deploy one also pick virtio by default), but I assume that your case
"cloud image / resize2fs via ide/sata" might be even worse, since (as shown above) you'd also need active zero detection for it to work out better.

@Brent, could you please confirm me that you also only see that problem on the extended cloud-image and not on otherwise new/fresh disks?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

# I have fetched a new cloud image.

$ wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64-disk-kvm.img
$ file jammy-server-cloudimg-amd64-disk-kvm.img
jammy-server-cloudimg-amd64-disk-kvm.img: QEMU QCOW2 Image (v2), 2361393152 bytes

# Then I have extended it to 25G size.

$ qemu-img resize jammy-server-cloudimg-amd64-disk-kvm.img 25G
Image resized.
$ qemu-img info jammy-server-cloudimg-amd64-disk-kvm.img
image: jammy-server-cloudimg-amd64-disk-kvm.img
file format: qcow2
virtual size: 25 GiB (26843545600 bytes)
disk size: 569 MiB
cluster_size: 65536
Format specific information:
    compat: 0.10
    compression type: zlib
    refcount bits: 16

# Right before using it I have checked the image with zerofree

$ sudo zerofree -vn /dev/nbd3p1
0/185630/548091

# Now I replaced the 8 different images (those are my 8 attachment types I used above) with this non-fresh image.

$ for f in /var/lib/libvirt/images/jdisk*.qcow2; do cp -v jammy-server-cloudimg-amd64-disk-kvm.img $f; done

# Then I started my guest, ran growpart and resize2fs on each of them and mounted them (to trigger the lazy allocation)

$ virsh start jdisk
# (in guest now)
$ for d in sda sdb sdc sdd vdc vdd vde vdf; do sudo growpart /dev/$d 1; done
# All report "CHANGED: partition=1 start=227328 old: size=4384735 end=4612063 new: size=52201439 end=52428767"
$ for d in sda sdb sdc sdd vdc vdd vde vdf; do sudo e2fsck -f /dev/${d}1; sudo resize2fs /dev/${d}1; done
$ for d in sda sdb sdc sdd vdc vdd vde vdf; do sudo mkdir /mnt/${d}; sudo mount /dev/${d}1 /mnt/${d}; done
# a while later umount them all
$ for d in sda sdb sdc sdd vdc vdd vde vdf; do sudo umount /mnt/${d}; done

# Now I'm taking the same look at each of them, in two ways

# A - from the host's POV for image size

Again the ide/sata attachments without detect_zeroes=on are the ones growing to 1.3G in this case. The rest all stayed at 573M (just 3M more than when downloaded).

# B - from the guest's POV for FS behavior

They ALL reported the very same:
3/5976603/6525179

Which means from the guest/FS POV all those disks are primarily free and do not have much crap.
I do not mind so much about the three blocks being off; I understood you, @Brent, to say that you had a lot more of such blocks, right?

P.S. if anyone does the same as repro, there is some bonus non-fun in mounting via "root=LABEL=cloudimg-rootfs" as all of them will have the very same label :-)
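
For example (a sketch; the device and label names are placeholders), one can check and change the labels with:
$ sudo blkid -s LABEL /dev/vdc1
$ sudo e2label /dev/vdc1 jdisk-vdc1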

Changed in qemu (Ubuntu):
status: Triaged → Incomplete
Revision history for this message
Brent Baccala (baccala) wrote :

@Christian,

I haven't tried all of your combinations, but when I prepare a second disk using the growpart/e2fsck/resize2fs sequence that you suggest and then mount it, it grows to 725 MB and then zerofree reports what you observed:

baccala@osito:~/NPDC/GNS3/bug$ sudo zerofree -v /dev/nbd0p1
3/5963617/6525179

But when I use an identical disk as the root filesystem, it grows to 789 MB and then zerofree reports this:

baccala@osito:~/NPDC/GNS3/bug$ sudo zerofree -v /dev/nbd0p1
1524/5953550/6525179

Both virtual disks are Ubuntu 20.04 cloudimgs extended to 25 GB. I used the procedure I described on May 20 (cloudinit.iso on a virtual CD-ROM) with one disk as root and the other as /dev/sdb. Both were identical when I booted the VM.

You mentioned that "all tools that make it easy for me to deploy one also pick virtio by default"; can you run qemu from the command line on a bare metal system? If so, I thought my May 20 procedure was pretty simple. If not, then I guess it's harder to verify this.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Brent,

> can you run qemu from the command line on a bare metal system? If so, I thought my May 20 procedure was pretty simple.

Oh yes you can run it from cli on bare metal - and your procedure was indeed simple and very useful. I was not trying to challenge any of that - sorry if that was misleading.

Trying again :-): Qemu is like a swiss army knife, doing plenty of things and learning new twists every now and then. The command line can do all of it, and because of that you will find plenty of howtos (or just people trying options) that - to stay in the swiss army knife metaphor - still use the blade to cut down a tree. It does work, but in the meantime a saw was added which is much better for the task. That is what virtio is to ide/sata here.
It is in no way "invalidating" that there seems to be an issue with what ide/sata announces and how it then works - but it lowers the severity of this case a lot.

> I haven't tried all of your combinations ...

Thanks for trying the most important ones - that was enough IMHO.

I think with the info so far, by now we might have to split this discussion into two - almost separate - cases:

a) what is going on with discard on ide/sata attachments and why; confirmed, but priority rather low (as outlined above)

b) why the unused non-zero blocks on an extended cloud-image used as root grow

We might later need to split out an extra bug for (b), but for now I've added a bug task here (they might know more, and maybe have already seen or discussed it before).

I think by now for (b) we confirmed:
1. A new fs on those disks has no lost blocks
2. The cloudimage extended has a few lost blocks (12 KB - negligible)
3. The cloudimage extended and used as root grows more lost blocks (seems like 5 MB above)

I could try to explain the few lost ones in #2 by padding when creating the image.
But why are they increasing more when used as root device - that I can't explain yet.

@CPC team - do you happen to know more about this already?

Changed in qemu (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
John Chittum (jchittum) wrote :

I haven't come across this before. Our builds for the cloud images from cloud-images.ubuntu.com run the following cleanup code, which should prevent this sort of thing:

https://git.launchpad.net/livecd-rootfs/tree/live-build/functions#n231

It's hard to read the exact ramifications, but the gist is: we create a filesystem, mount it, create the image, then at the end during cleanup we run

# rootfs_dev_mapper is set in https://git.launchpad.net/livecd-rootfs/tree/live-build/functions#n62 when mounting an image

e2fsck -y -E discard ${rootfs_dev_mapper}
zerofree ${rootfs_dev_mapper}

before running kpartx to remove the mount

this _should_ be discarding the empty blocks. That matches the expectations in (b) points 1 and 2.

But I haven't seen

> The cloudimage extended and used as root grows more lost blocks (seems like 5 MB above)

so that bears investigating.
