image format change during migration breaks instances

Bug #2038898 reported by Pavlo Shchelokovskyy
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Undecided
Assigned to: Pavlo Shchelokovskyy

Bug Description

Discovered in an environment that was configured with

[libvirt]
images_type = raw

only; the other relevant options were at their defaults (use_cow_images = True, force_raw_images = True).

Symptom: after cold migration the instances were unresponsive and not running (e.g. no console log at all); live migration works fine.
Workaround: setting use_cow_images = False solved the problem.

Reproduction on a current multinode devstack:

1. Configure computes as described above - set [libvirt]images_type = raw, leave the rest per default devstack / nova settings.
2. Create a raw image in Glance.
3. Boot an instance from that raw image.
4. Inspect the image on the file system - the image is in fact raw.
5. Cold-migrate the server.
6. Migration finishes successfully, instance is reported as up and running on the new host - but in fact it has completely failed to start (not accessible, no console log, nothing).
7. If you check the image file nova uses on the new compute - it is now qcow2, not raw.
8. But the libvirt XML of the instance still defines the disk as raw!

Oct 09 12:15:35 pshchelo-devstack-jammy nova-compute[427994]: <devices>
Oct 09 12:15:35 pshchelo-devstack-jammy nova-compute[427994]: <disk type="file" device="disk">
Oct 09 12:15:35 pshchelo-devstack-jammy nova-compute[427994]: <driver name="qemu" type="raw" cache="none"/>
Oct 09 12:15:35 pshchelo-devstack-jammy nova-compute[427994]: <source file="/opt/stack/data/nova/instances/22749d77-83a1-4ae9-ade8-7bd9548406cd/disk"/>
Oct 09 12:15:35 pshchelo-devstack-jammy nova-compute[427994]: <target dev="vda" bus="virtio"/>
Oct 09 12:15:35 pshchelo-devstack-jammy nova-compute[427994]: </disk>

Stopping the instance and manually converting the disk back to raw allows the instance to start properly.
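
The mismatch described in steps 7 and 8 can be checked mechanically. Below is a minimal sketch, assuming the domain XML has already been obtained (e.g. with `virsh dumpxml`) and the real format of each disk file is already known (e.g. from `qemu-img info`). The function name `find_format_mismatches` and its inputs are hypothetical and not part of nova.

```python
import xml.etree.ElementTree as ET


def find_format_mismatches(domain_xml, actual_formats):
    """Return {path: (declared, actual)} for disks whose XML format is wrong.

    domain_xml: libvirt domain XML as a string.
    actual_formats: dict mapping disk file path -> format found on disk.
    """
    mismatches = {}
    root = ET.fromstring(domain_xml)
    for disk in root.iter("disk"):
        driver = disk.find("driver")
        source = disk.find("source")
        if driver is None or source is None:
            continue
        path = source.get("file")
        declared = driver.get("type")
        actual = actual_formats.get(path)
        # Only report disks for which on-disk information was supplied.
        if actual is not None and actual != declared:
            mismatches[path] = (declared, actual)
    return mismatches
```

For the instance above, passing the XML together with a mapping that records the `disk` file as qcow2 would report that file with the pair `("raw", "qcow2")`.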

I tracked it down to this place in the finish_migration method:

https://opendev.org/openstack/nova/src/branch/stable/2023.2/nova/virt/libvirt/driver.py#L11739

            if (disk_name != 'disk.config' and
                        info['type'] == 'raw' and CONF.use_cow_images):
                self._disk_raw_to_qcow2(info['path'])

Effectively, nova changes the disk format but does not update the XML to reflect the new format, and thus the instance fails to start.

Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

What's more, disk.info is also not updated:

❯ qemu-img info disk | head -n2
image: disk
file format: qcow2

❯ qemu-img info disk.eph0 | head -n2
image: disk.eph0
file format: qcow2

❯ qemu-img info disk.swap | head -n2
image: disk.swap
file format: qcow2

❯ cat disk.info
{"/opt/stack/data/nova/instances/be952acc-2774-48ca-ad87-bd2fa773423a/disk": "raw", "/opt/stack/data/nova/instances/be952acc-2774-48ca-ad87-bd2fa773423a/disk.eph0": "raw", "/opt/stack/data/nova/instances/be952acc-2774-48ca-ad87-bd2fa773423a/disk.swap": "raw", "/opt/stack/data/nova/instances/be952acc-2774-48ca-ad87-bd2fa773423a/disk.config": "raw"}
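
The stale entries can be listed by comparing disk.info with the formats actually found on disk. This is a rough sketch only: `stale_disk_info_entries` is a hypothetical helper, and the `actual_formats` mapping is assumed to have been collected separately (for example from the `format` field of `qemu-img info --output=json`).

```python
import json


def stale_disk_info_entries(disk_info_text, actual_formats):
    """Return sorted paths whose format in disk.info differs from reality.

    disk_info_text: contents of disk.info (a JSON object path -> format).
    actual_formats: dict mapping path -> format detected on disk.
    """
    recorded = json.loads(disk_info_text)
    return sorted(
        path
        for path, fmt in recorded.items()
        # Skip paths for which no on-disk format was supplied.
        if path in actual_formats and actual_formats[path] != fmt
    )
```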

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/897842

Changed in nova:
status: New → In Progress
description: updated
description: updated
Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote : Re: image format change during migration is not reflected in libvirt XML

An alternative question to ask: why is the image format converted in the first place? The image is booted as raw the first time on an identically configured compute, so why does the disk format change after migration?

It seems it is all due to the confusing 'CONF.use_cow_images' setting, which is used only to choose between the Qcow2 and Flat image backends when libvirt.images_type is left at its equally confusing historic default of 'default'.

As the comment in this place suggests,
> # Convert raw disks to qcow2 if migrating to host which uses
> # qcow2 from host which uses raw.
so the conversion should be done only when the host is actually using the Qcow2 image backend, which is not tested for here.
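
The stricter condition argued for here might look roughly like the following. This is an illustrative sketch, not the patch under review and not nova's code: `should_convert_to_qcow2` and its `target_backend` parameter are hypothetical names, and the way the destination backend is determined is left out.

```python
def should_convert_to_qcow2(disk_name, disk_type, target_backend):
    """Decide whether a migrated raw disk should be converted to qcow2.

    target_backend is assumed to be the image backend the destination
    host actually uses (e.g. "qcow2" or "flat"), rather than the value
    of use_cow_images.
    """
    # Mirrors the disk.config exclusion in the quoted check.
    if disk_name == "disk.config":
        return False
    return disk_type == "raw" and target_backend == "qcow2"
```

With this condition, a raw disk arriving on a host using the flat backend stays raw, which matches the XML generated on that host.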

summary: - image format change during migration is not reflected in libvirt XML
+ image format change during migration breaks instances
Revision history for this message
sean mooney (sean-k-mooney) wrote :

so setting
images_type = raw
and
use_cow_images = True, force_raw_images = True

is not actually intended to be a valid configuration.

force_raw_images = True is not relevant here, as that just forces the shared backing file in the image cache to be raw.

but images_type = raw and use_cow_images = True should not both be set at the same time.

As such, I'm not convinced that the patch you submitted to fix this is the correct fix.

Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

@sean, so maybe then just add a check that these are not both set at once, and fail to start the compute service if they are?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/898229

Changed in nova:
assignee: nobody → Pavlo Shchelokovskyy (pshchelo)
Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

Note that the same happens if I set

[libvirt]
images_type = flat

instead of 'raw', they fail identically in this regard.

Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

I did the testing with full set of permutations of:

- libvirt.images_type = default | qcow2 | flat | raw
- use_cow_images = True | False
- force_raw_images = True | False
- Glance image = qcow2 | raw

For each permutation I booted an instance and ran `qemu-img info $nova_state_path/instances/<instance_id>/disk` to check the image format, whether a backing file is used, and the backing file's format.

Results: https://paste.opendev.org/show/bAMQ9pR9tOewEDKB4roK/

Summary:

- `libvirt.images_type = raw` behaves exactly the same as `flat`
- `use_cow_images` has an effect only with `libvirt.images_type = default`

  `images_type = default` + `use_cow_images = True` == `images_type = qcow2`

  `images_type = default` + `use_cow_images = False` == `images_type = flat`

- when non-default `images_type` is configured, use_cow_images has no effect at all
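
The summary above can be restated as a small mapping. This is only a sketch of the behavior observed in these tests, not nova's implementation; `effective_backend` is a hypothetical name.

```python
def effective_backend(images_type, use_cow_images):
    """Return the disk backend observed for a given configuration.

    Encodes the test results summarized above: use_cow_images matters
    only for images_type = default, and raw behaves like flat.
    """
    if images_type == "default":
        return "qcow2" if use_cow_images else "flat"
    if images_type == "raw":
        return "flat"
    # Any other explicit images_type is used as configured.
    return images_type
```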

That's it: it plays no other role in the libvirt driver when booting the instance (it is, however, used in several places in the hyperv driver).

It is, however, used in finish_migration to decide whether to convert the image during cold migration. I believe this is a leftover from the times when there was no separation between qcow2 and 'raw/flat', and this option alone determined the format of the instance disk.
AFAIK we generate the XML on the target node from scratch but tell it to use the existing image copied from the source host. The problem arises when the XML generated from the settings of the current (target) host says one thing but the actual file format is something else: if we think it is qcow2 but it is raw, we have a security vulnerability; if we think it is raw but it is qcow2, the instance does not start at all.
So it seems this check in finish_migration is now broken: it does not account for all the possible ways this vulnerability may be triggered and, moreover, it breaks other legitimate scenarios.

What should be done IMO:
1. fix the check in finish_migration to not rely on `use_cow_images`
2. deprecate the usage of `use_cow_images` with libvirt driver
3. deprecate `default` value for `libvirt.images_type`

We could start by emitting a startup warning in the libvirt driver when use_cow_images = False and images_type = default, to warn those relying on that behavior to choose 'flat' that it is deprecated and that they must set it explicitly instead.

At the same time, we should deprecate use_cow_images, copy it over to hyperv.use_cow_images, and use that in the hyperv driver.
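
The startup warning suggested above could be sketched as follows. The function name `warn_on_implicit_flat`, the message text, and the idea of passing the two option values in as plain arguments are all assumptions made for illustration; this is not an existing nova check.

```python
import logging

LOG = logging.getLogger(__name__)


def warn_on_implicit_flat(images_type, use_cow_images):
    """Warn when 'flat' is selected implicitly via use_cow_images.

    Returns True if a warning was emitted, False otherwise.
    """
    if images_type == "default" and not use_cow_images:
        LOG.warning(
            "images_type=default with use_cow_images=False implicitly "
            "selects the flat backend; this is deprecated, set "
            "[libvirt]images_type=flat explicitly."
        )
        return True
    return False
```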

Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

And now another very funny thing: flat + force_raw_images behaves inconsistently, depending on how it was configured before! It looks like this is because of the image cache.

- set flat + force_raw_images = False
- boot the instance from qcow2 image
- as expected, the image disk is qcow2, both as file and in instance XML
- delete the instance
- reconfigure to force_raw_images = True
- DO NOT touch anything in the image cache/backing files path (/opt/stack/data/nova/instances/_base in DevStack)
- restart n-cpu
- boot the instance from the same qcow2 glance image on this host again
- now the VM disk would be expected to be raw, BUT it is STILL the same qcow2, both on disk and in XML!

If I clean up the image cache while reconfiguring, everything works as expected.
