Root disk lost when resizing instance from imagebackend to rbd backed flavor

Bug #1803331 reported by Logan V
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

We have flavor classes using different nova disk backends, separated by host aggregates. For example, we have a flavor named l1.tiny which uses the file-based imagebackend, and s1.small which uses the rbd backend. The hypervisors configured for imagebackend are added to the host aggregate where l1.* instances are scheduled, and the rbd hypervisors are in an aggregate where s1.* instances are scheduled.
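
A setup along these lines can be expressed with host aggregates and flavor extra specs. The sketch below is illustrative only: the aggregate names, hostnames, and the disk_backend property key are assumptions, and it presumes the AggregateInstanceExtraSpecsFilter scheduler filter is enabled.

# Illustrative only: aggregate names, hostnames, and the disk_backend key are made up.
openstack aggregate create --property disk_backend=local local-disk-agg
openstack aggregate add host local-disk-agg compute-local-01
openstack aggregate create --property disk_backend=rbd rbd-agg
openstack aggregate add host rbd-agg compute-rbd-01
# Pin each flavor class to its aggregate via a matching extra spec.
openstack flavor set --property aggregate_instance_extra_specs:disk_backend=local l1.tiny
openstack flavor set --property aggregate_instance_extra_specs:disk_backend=rbd s1.small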

When resizing an instance from l1.tiny to s1.small, the instance fails to resize and enters error state. The root disk is also lost during the failed resize. The host of the instance is set to one of the s1.* aggregate HVs, and the imagebackend disk is no longer present on the original l1.* hypervisor.

The error provided in 'instance show' is:

fault:
  message: [errno 2] error opening image 5a8ab7a3-3e59-442c-a603-2c24652788cb_disk at snapshot None
  code: 500
  created: 2018-11-14T11:03:11Z
  details:
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/nova/compute/manager.py", line 204, in decorated_function
      return function(self, context, *args, **kwargs)
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/nova/compute/manager.py", line 4062, in finish_resize
      self._set_instance_obj_error_state(context, instance)
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
      self.force_reraise()
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
      six.reraise(self.type_, self.value, self.tb)
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/nova/compute/manager.py", line 4050, in finish_resize
      disk_info, image_meta)
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/nova/compute/manager.py", line 4012, in _finish_resize
      old_instance_type)
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
      self.force_reraise()
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
      six.reraise(self.type_, self.value, self.tb)
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/nova/compute/manager.py", line 4007, in _finish_resize
      block_device_info, power_on)
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7454, in finish_migration
      fallback_from_host=migration.source_compute)
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3160, in _create_image
      fallback_from_host)
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 3264, in _create_and_inject_local_root
      backend.create_snap(libvirt_utils.RESIZE_SNAPSHOT_NAME)
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/nova/virt/libvirt/imagebackend.py", line 941, in create_snap
      return self.driver.create_snap(self.rbd_name, name)
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/nova/virt/libvirt/storage/rbd_utils.py", line 392, in create_snap
      with RBDVolumeProxy(self, str(volume), pool=pool) as vol:
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/nova/virt/libvirt/storage/rbd_utils.py", line 78, in __init__
      driver._disconnect_from_rados(client, ioctx)
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
      self.force_reraise()
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
      six.reraise(self.type_, self.value, self.tb)
    File "/openstack/venvs/nova-untagged/local/lib/python2.7/site-packages/nova/virt/libvirt/storage/rbd_utils.py", line 74, in __init__
      read_only=read_only))
    File "rbd.pyx", line 1392, in rbd.Image.__init__ (/build/ceph-12.2.2/obj-x86_64-linux-gnu/src/pybind/rbd/pyrex/rbd.c:13540)

We are currently seeing this behavior on Ocata. I'm not certain if more recent nova releases experience this also.
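
The "[errno 2] error opening image" message comes from the destination compute node, which is configured with images_type = rbd and looks for the instance's root disk in its RBD pool, where it was never created. One way to confirm that from the destination host is to query the pool directly; "vms" below is only a placeholder for whatever images_rbd_pool is set to:

rbd --pool vms info 5a8ab7a3-3e59-442c-a603-2c24652788cb_disk
# Expected output when the image was never imported into the pool, something like:
# rbd: error opening image 5a8ab7a3-3e59-442c-a603-2c24652788cb_disk: (2) No such file or directory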

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Based on the IRC conversation, this is a more general bug than the title suggests.
It will likely affect resize and cold migrate between any two hosts where the image backend changes,
e.g. any combination of lvm, rbd, or file-based image where the value differs on each host.
This may or may not also affect live migration.

tags: added: migration resize
Revision history for this message
sean mooney (sean-k-mooney) wrote :

Note this use case is currently not supported, so we should either prevent this in the scheduler/conductor
or in the virt driver so that we do not attempt the migration in this case.

We could also track this as a future feature and convert it to a blueprint/spec if we want to enable cross-backend migration.

Changed in nova:
status: New → Triaged
importance: Undecided → High
Revision history for this message
sean mooney (sean-k-mooney) wrote :

Set as High due to the possibility of data loss.

Revision history for this message
Matthew Booth (mbooth-9) wrote :

As discussed on IRC, I don't believe any data has been lost here, but it would be good if the reporter could confirm.

I believe the source node will have copied the root disk to the instance directory on the destination, but the destination is configured to expect an RBD disk, so doesn't find it. The root disk should also still exist on the source node in the _resize directory, and should be restored with a revert resize.

Revision history for this message
Logan V (loganv) wrote :

On the source node, yes I found the disk is still present in the _resize directory.

# ls /var/lib/nova/instances/a1b29f17-e910-4216-9165-e148e62d1ba1_resize/ -lha
total 20M
drwxr-xr-x 2 nova nova 4.0K Nov 14 12:48 .
drwxr-xr-x 29 nova nova 4.0K Nov 14 12:50 ..
-rw------- 1 root root 54K Nov 14 12:50 console.log
-rw-r--r-- 1 root root 20M Nov 14 12:50 disk
-rw-r--r-- 1 nova nova 79 Nov 14 12:48 disk.info

In the logs I see that it did try to copy the disk to the destination node. If the disk is small enough, it will copy, but usually this will fail because the destination hosts (our rbd hypervisors) are booted from a very small ramdisk which cannot hold the instance root disk.

Even though it copied my test instance's disk, the end result is that the instance goes to error state due to the error in my initial bug report. The resize is not revertible:

ubuntu@b0ca2dda2a32:~$ openstack server resize --flavor s1.small --wait a1b29f17-e910-4216-9165-e148e62d1ba1
Error resizing server: a1b29f17-e910-4216-9165-e148e62d1ba1
Error resizing server
ubuntu@b0ca2dda2a32:~$ openstack server resize --revert a1b29f17-e910-4216-9165-e148e62d1ba1
Cannot 'revertResize' instance a1b29f17-e910-4216-9165-e148e62d1ba1 while it is in vm_state error (HTTP 409) (Request-ID: req-7ec082bf-7eb3-4b7f-be1f-fe2221ad0f39)
ubuntu@b0ca2dda2a32:~$ openstack server set --state active a1b29f17-e910-4216-9165-e148e62d1ba1
ubuntu@b0ca2dda2a32:~$ openstack server resize --revert a1b29f17-e910-4216-9165-e148e62d1ba1
Cannot 'revertResize' instance a1b29f17-e910-4216-9165-e148e62d1ba1 while it is in vm_state active (HTTP 409) (Request-ID: req-fd828140-405b-4747-b6c5-7f846a46e2dc)

So even though I have the disk present on the source and (maybe) the destination node, I don't see a way to easily recover the instance back to its previous state.

Lastly, I tested what happens if I delete the instance now that it is stuck in this state. The disk was deleted from the destination node; however, it is not cleaned up from the source node.
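
If the instance has been deleted and the data in that leftover directory is definitely not needed, the cleanup on the source node apparently has to be done by hand, roughly along these lines (path taken from the ls output above; double-check before deleting anything):

rm -rf /var/lib/nova/instances/a1b29f17-e910-4216-9165-e148e62d1ba1_resize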

Revision history for this message
Matthew Booth (mbooth-9) wrote :

ubuntu@b0ca2dda2a32:~$ openstack server set --state active a1b29f17-e910-4216-9165-e148e62d1ba1
ubuntu@b0ca2dda2a32:~$ openstack server resize --revert a1b29f17-e910-4216-9165-e148e62d1ba1
Cannot 'revertResize' instance a1b29f17-e910-4216-9165-e148e62d1ba1 while it is in vm_state active (HTTP 409) (Request-ID: req-fd828140-405b-4747-b6c5-7f846a46e2dc)

...

I believe the instance needs to be in 'resize_confirm', not 'active', to be reverted.

Revision history for this message
Logan V (loganv) wrote :

ubuntu@b0ca2dda2a32:~$ openstack server set --state resize_confirm a44021f5-a462-4b0c-9ed5-f49149e0059d
usage: openstack server set [-h] [--name <new-name>] [--root-password]
                            [--property <key=value>] [--state <state>]
                            <server>
openstack server set: error: argument --state: invalid choice: u'resize_confirm' (choose from 'active', 'error')

Simply hacking instances.vm_state for this UUID in the nova database did not help. I did that and now revert resize gives:
ubuntu@b0ca2dda2a32:~$ openstack server resize --revert a44021f5-a462-4b0c-9ed5-f49149e0059d
Instance has not been resized. (HTTP 400) (Request-ID: req-eb57a679-0d60-40e4-85c8-5bd7bd51c574)
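
For what it's worth, flipping instances.vm_state alone is probably not enough: revert resize also looks up a migration record in 'finished' status for the instance, which appears to be where the "Instance has not been resized" 400 comes from. A purely hypothetical repair sketch (the database name, state values, and the migrations update are assumptions, not a tested procedure; credentials omitted):

# Hypothetical, untested: adjust both the instance state and the migration row.
mysql nova <<'SQL'
UPDATE instances SET vm_state = 'resized', task_state = NULL
    WHERE uuid = 'a44021f5-a462-4b0c-9ed5-f49149e0059d';
UPDATE migrations SET status = 'finished'
    WHERE instance_uuid = 'a44021f5-a462-4b0c-9ed5-f49149e0059d'
    AND migration_type = 'resize' AND deleted = 0;
SQL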

tags: added: ceph