libvirt: resize with deleted backing image fails

Bug #1546778 reported by Chris St. Pierre
This bug affects 4 people
Affects                   Status        Importance  Assigned to    Milestone
OpenStack Compute (nova)  Fix Released  Medium      Matthew Booth
Kilo                      Won't Fix     Undecided   Unassigned
Liberty                   Fix Released  Medium      Matthew Booth

Bug Description

Once the Glance image from which an instance was spawned is deleted, resizes of that instance fail if they would take place across more than one compute node. Migration and live block migration both succeed.

Resize fails, I believe, because 'qemu-img resize' is called (https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7218-L7221) before the backing image has been transferred from the source compute node (https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7230-L7233).

Replication requires two compute nodes. To replicate:

1. Boot an instance from an image or snapshot.
2. Delete the image from Glance.
3. Resize the instance. It will fail with an error similar to:

Stderr: u"qemu-img: Could not open '/var/lib/nova/instances/f77f1c5c-71f7-4645-afa1-dd30bacef874/disk': Could not open backing file: Could not open '/var/lib/nova/instances/_base/ca94b18d94077894f4ccbaafb1881a90225f1224': No such file or directory\n"
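The error above means the instance disk records a backing file that is not present on the host running qemu-img. A minimal sketch of how one might check for that condition follows. It assumes `qemu-img info --output=json` can read the top-level image without opening its backing file, and the helper names `backing_file_from_info` and `missing_backing_file` are hypothetical, not part of Nova.

```python
import json
import os
import subprocess


def backing_file_from_info(info_json):
    """Return the backing file recorded in qemu-img's JSON output, or None."""
    info = json.loads(info_json)
    # 'full-backing-filename' is the resolved path when qemu-img reports it;
    # otherwise fall back to the name as stored in the image header.
    return info.get("full-backing-filename") or info.get("backing-filename")


def missing_backing_file(disk_path):
    """Return the backing file path if it is recorded but absent on this host."""
    out = subprocess.check_output(
        ["qemu-img", "info", "--output=json", disk_path])
    backing = backing_file_from_info(out)
    if backing and not os.path.exists(backing):
        return backing
    return None
```

Running `missing_backing_file()` against the instance's `disk` on the destination host before the resize would be expected to return the `_base` path named in the error, if the diagnosis in this report is right.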

Tags: libvirt resize
Revision history for this message
stgleb (gstepanov) wrote :

I think there is no need to delete the image from Glance; this bug occurs
because a migration is triggered during resize, but the backing file is not copied in this
code flow. I would like to propose a patch that fixes this problem.

Steps to reproduce:

1. Deploy OpenStack with 2 or more computes.
2. Boot one or more instances on the same compute with flavor micro or tiny.
3. Resize one of the instances, which will probably cause a migration.

If the instance disk has a backing file, the migration will fail, because in the current implementation
backing files are not copied from one compute to another.

Revision history for this message
Chris St. Pierre (stpierre) wrote :

I can replicate and confirm that. If the destination hypervisor doesn't have the backing image in its cache, the resize fails, regardless of whether or not the image is available in Glance.

Revision history for this message
Chris St. Pierre (stpierre) wrote :

Your replication steps are slightly incomplete; there are two ways to replicate it. Either:

1. Create a new image. (This ensures that the image backing file isn't in any of your compute nodes' image cache.)
2. Boot an instance.
3. Resize.

Or:

1. Delete all instances, and then delete all backing images from the nova image cache on all compute nodes.
2. Boot.
3. Resize.

Either way, you need to ensure that the destination hypervisor doesn't have the backing image in its cache. Deleting the image from Glance may also do this, eventually, since the Nova image cache manager purges cached images that are not in use and have been deleted from Glance, but just resizing the first instance booted from a new image is easier and more reliable.
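To confirm the precondition described above, one could check the destination's image cache directly. The sketch below is only an illustration: it assumes the cache lives under `/var/lib/nova/instances/_base` (the path in the reported error) and that cached files are named after the SHA-1 hex digest of the Glance image ID, which matches the 40-character name in the error but should be verified against your Nova version. The function names are hypothetical.

```python
import hashlib
import os

# Cache directory as it appears in the error message from the bug description.
BASE_DIR = "/var/lib/nova/instances/_base"


def cached_image_path(image_id, base_dir=BASE_DIR):
    """Build the assumed cache path: SHA-1 hex digest of the image ID."""
    digest = hashlib.sha1(image_id.encode("utf-8")).hexdigest()
    return os.path.join(base_dir, digest)


def destination_has_backing_image(image_id, base_dir=BASE_DIR):
    """Return True if the assumed cache file exists on this host."""
    return os.path.exists(cached_image_path(image_id, base_dir))
```

If `destination_has_backing_image()` returns False on the destination compute node for the instance's image ID, the resize should hit the failure described in this report.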

Changed in nova:
assignee: nobody → stgleb (gstepanov)
status: New → In Progress
Revision history for this message
Tardis Xu (xiaoxubeii) wrote :

Resize may trigger a cold migration. I think if resize failed because of the backing file, migration would fail too.

Revision history for this message
Matthew Booth (mbooth-9) wrote :

Can I just clear up some jargon here which is a bit confusing. Resize doesn't trigger a cold migration. Resize *is* a cold migration. Sometimes it migrates to the same host it started on. What you're talking about here is a regular resize where the source and destination hosts are different, which is normal. When I read about a migration happening during a resize, I think about 2 separate user-initiated actions and a whole different class of bug.

Secondly, there is code already in place which should be handling this, specifically finish_migration(). The interesting question is why that isn't working.

Revision history for this message
Matthew Booth (mbooth-9) wrote :

Reproducer:

$ nova flavor-create test1 test1 256 1 1
$ nova flavor-create test2 test2 256 2 1

$ glance image-create --name cirros --container-format bare --disk-format qcow2 --file cirros-0.3.4-x86_64-disk.img
$ nova boot --image cirros --flavor test1 foo
# nova show foo indicates that foo was launched on node2
$ nova service-disable node2.example.com nova-compute
$ glance image-delete <cirros image-id>
$ nova resize foo test2

Revision history for this message
Matthew Booth (mbooth-9) wrote :

So, it seems that the bug is in finish_migration. The problem is that it attempts to resize the disk before checking that the backing file is available. I'll post a patch.
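The ordering problem can be sketched as follows. This is not the Nova code or the actual patch; `finish_disk` and its callback parameters are hypothetical stand-ins that only illustrate checking for (and restoring) the backing file before resizing.

```python
def finish_disk(disk, backing_exists, fetch_backing, resize):
    """Make sure the backing file is present, then resize the disk.

    disk is a dict with 'path', 'new_size' and an optional 'backing_file'.
    The three callables are injected so the ordering can be shown in isolation.
    """
    backing = disk.get("backing_file")
    # Step 1: restore the backing file if the disk needs one and it is absent.
    if backing and not backing_exists(backing):
        fetch_backing(backing)
    # Step 2: only now resize; qemu-img must be able to open the backing file.
    resize(disk["path"], disk["new_size"])
```

The failure described in this report corresponds to running step 2 before step 1.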

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/288640

Changed in nova:
assignee: stgleb (gstepanov) → Matthew Booth (mbooth-9)
Revision history for this message
Matthew Booth (mbooth-9) wrote :

That patch may require tests; I'll have a look on Monday. I've confirmed that it resolves the reproducer I posted above.

Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
tags: added: libvirt resize
tags: added: liberty-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/288640
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=db7fd539f261ea53f6c005478049424b9dae1543
Submitter: Jenkins
Branch: master

commit db7fd539f261ea53f6c005478049424b9dae1543
Author: Matthew Booth <email address hidden>
Date: Fri Mar 4 18:34:21 2016 +0000

    libvirt: Fix resize of instance with deleted glance image

    finish_migration() in the libvirt driver was attempting to resize an
    image before checking that its backing file was present. This patch
    re-orders these 2 operations. In doing so, we also have to resolve an
    overloading of the 'disk_info' variable.

    Closes-Bug: #1546778

    Change-Id: I03e08fae97416ebe5cdedcf238a06d1b90203c5d

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/290561

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/290563

Matt Riedemann (mriedem)
tags: removed: liberty-backport-potential
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/nova 13.0.0.0rc1

This issue was fixed in the openstack/nova 13.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/liberty)

Reviewed: https://review.openstack.org/290561
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=359dd54d667d9b0a6a05497488aa48076d35991a
Submitter: Jenkins
Branch: stable/liberty

commit 359dd54d667d9b0a6a05497488aa48076d35991a
Author: Matthew Booth <email address hidden>
Date: Fri Mar 4 18:34:21 2016 +0000

    libvirt: Fix resize of instance with deleted glance image

    finish_migration() in the libvirt driver was attempting to resize an
    image before checking that its backing file was present. This patch
    re-orders these 2 operations. In doing so, we also have to resolve an
    overloading of the 'disk_info' variable.

    (cherry picked from commit db7fd539f261ea53f6c005478049424b9dae1543)

    Conflicts:
      nova/virt/libvirt/driver.py

    Minor context difference, as liberty explicitly converted image_meta
    to an object before use.

    Closes-Bug: #1546778

    Change-Id: I03e08fae97416ebe5cdedcf238a06d1b90203c5d

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Gleb Stepanov (<email address hidden>) on branch: master
Review: https://review.openstack.org/282275

Revision history for this message
Matt Riedemann (mriedem) wrote :

There is a backport proposed to stable/kilo, but I don't think we should take it, since the original fix for this introduced a regression which we are now having to fix on master, stable/mitaka and stable/liberty. I'd rather not deal with that in stable/kilo too.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/kilo)

Change abandoned by Matthew Booth (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/290563
Reason: No problem.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/nova 12.0.3

This issue was fixed in the openstack/nova 12.0.3 release.
