instance corrupted after volume retype

Bug #1896621 reported by Craig McIntyre on 2020-09-22
This bug affects 1 person
Affects                   Importance   Assigned to
OpenStack Compute (nova)  High         Lee Yarwood
Queens                    Undecided    Lee Yarwood
Rocky                     Undecided    Lee Yarwood
Stein                     Undecided    Lee Yarwood
Train                     Undecided    Unassigned
Ussuri                    Undecided    Unassigned
Victoria                  Undecided    Unassigned

Bug Description

Description
===========

Following a cinder volume retype on a volume attached to a running instance, the instance became corrupted and can no longer boot into the guest operating system.

Upon further investigation, it appears the retype operation failed. The nova-compute logs registered the following error:

Exception during message handling: libvirtError: block copy still active: domain has active block job

see log extract: http://paste.openstack.org/show/798201/

Steps to reproduce
==================

I'm not sure how easy this would be to replicate the exact problem.

As an admin user within the project, in Horizon go to Project | Volumes | Volumes, then from the context menu of the required volume select "Change Volume Type".

Select the new type and migration policy 'on-demand'.

Following this, it was reported that the instance was non-responsive; when checking the console, the instance was unable to boot from the volume.

Environment
===========
DISTRIB_ID="OSA"
DISTRIB_RELEASE="18.1.5"
DISTRIB_CODENAME="Rocky"
DISTRIB_DESCRIPTION="OpenStack-Ansible"

# nova-manage --version
18.1.1

# virsh version
Compiled against library: libvirt 4.0.0
Using library: libvirt 4.0.0
Using API: QEMU 4.0.0
Running hypervisor: QEMU 2.11.1

Cinder v13.0.3 backed volumes using Zadara VPSA driver

Lee Yarwood (lyarwood) wrote :

Are users trying to migrate multiple volumes attached to the same instance at the same time?

Craig McIntyre (ceemac) wrote :

That is a distinct possibility; do you think that is part of the issue?

Lee Yarwood (lyarwood) wrote :

Yes, the logs suggest multiple blockjobs are running in parallel and causing a failure to update the persistent domain configuration to point at the new volumes.

There's zero locking around this operation within Nova so the fix is trivial if it is the root cause. I'll try to work up a reproducer this week alongside a bugfix introducing an instance lock on the compute.

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: New → Confirmed
importance: Undecided → High
Craig McIntyre (ceemac) wrote :

Thanks. Do you have any thoughts on how this failure would have resulted in a corrupted instance?

Kashyap Chamarthy (kashyapc) wrote :

Paste-bins expire, so copy/pasting content from the paste-bin (http://paste.openstack.org/show/798201/) in the description below:

--------------------------------------------
2020-09-21 11:12:27.538 3887 INFO nova.compute.manager [req-1363655f-fa54-4674-a5ae-d1d26d2ea86d 473593595cf04e41a6382d261eea3576 19151bb269ae4214bdf83e18083e8b54 - default default] [instance: cd879458-57d9-4f4c-8214-1f99456b6f42] Swapping volume ed93fa66-13d0-447e-8084-4fd523cf3015 for 1bde1669-657e-4608-a8c6-b654ef58121f
2020-09-21 11:12:28.979 3887 INFO os_brick.initiator.connectors.iscsi [req-1363655f-fa54-4674-a5ae-d1d26d2ea86d 473593595cf04e41a6382d261eea3576 19151bb269ae4214bdf83e18083e8b54 - default default] Trying to connect to iSCSI portal 172.20.213.204:3260
2020-09-21 11:13:56.292 3887 ERROR nova.compute.manager [req-1363655f-fa54-4674-a5ae-d1d26d2ea86d 473593595cf04e41a6382d261eea3576 19151bb269ae4214bdf83e18083e8b54 - default default] [instance: cd879458-57d9-4f4c-8214-1f99456b6f42] Failed to swap volume ed93fa66-13d0-447e-8084-4fd523cf3015 for 1bde1669-657e-4608-a8c6-b654ef58121f: libvirtError: block copy still active: domain has active block job
2020-09-21 11:13:56.292 3887 ERROR nova.compute.manager [instance: cd879458-57d9-4f4c-8214-1f99456b6f42] Traceback (most recent call last):
2020-09-21 11:13:56.292 3887 ERROR nova.compute.manager [instance: cd879458-57d9-4f4c-8214-1f99456b6f42] File "/openstack/venvs/nova-18.1.5/lib/python2.7/site-packages/nova/compute/manager.py", line 5726, in _swap_volume
2020-09-21 11:13:56.292 3887 ERROR nova.compute.manager [instance: cd879458-57d9-4f4c-8214-1f99456b6f42] mountpoint, resize_to)
2020-09-21 11:13:56.292 3887 ERROR nova.compute.manager [instance: cd879458-57d9-4f4c-8214-1f99456b6f42] File "/openstack/venvs/nova-18.1.5/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1568, in swap_volume
2020-09-21 11:13:56.292 3887 ERROR nova.compute.manager [instance: cd879458-57d9-4f4c-8214-1f99456b6f42] self._swap_volume(guest, disk_dev, conf, resize_to)
2020-09-21 11:13:56.292 3887 ERROR nova.compute.manager [instance: cd879458-57d9-4f4c-8214-1f99456b6f42] File "/openstack/venvs/nova-18.1.5/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 1530, in _swap_volume
2020-09-21 11:13:56.292 3887 ERROR nova.compute.manager [instance: cd879458-57d9-4f4c-8214-1f99456b6f42] self._host.write_instance_config(xml)
2020-09-21 11:13:56.292 3887 ERROR nova.compute.manager [instance: cd879458-57d9-4f4c-8214-1f99456b6f42] File "/openstack/venvs/nova-18.1.5/lib/python2.7/site-packages/nova/virt/libvirt/host.py", line 864, in write_instance_config
2020-09-21 11:13:56.292 3887 ERROR nova.compute.manager [instance: cd879458-57d9-4f4c-8214-1f99456b6f42] domain = self.get_connection().defineXML(xml)
2020-09-21 11:13:56.292 3887 ERROR nova.compute.manager [instance: cd879458-57d9-4f4c-8214-1f99456b6f42] File "/openstack/venvs/nova-18.1.5/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit
2020-09-21 11:13:56.292 3887 ERROR nova.compute.manager [instance: cd879458-57d9-4f4c-8214-1f99456b6f42] result = proxy_call(self._autowrap, f, *args, **kwargs)
2020-09-21 11:13:56.292 3887 ER...

Lee Yarwood (lyarwood) wrote :

I've been able to reproduce the `block copy still active` error when attempting to retype multiple volumes attached to the same instance. I'll write up a fix for this now as it's trivial.

As discussed in #openstack-nova you're also seeing your Windows AD instances fail to start after either rebooting themselves *or* being rebooted by users.

I think this is due to Nova not correctly reverting the active domain configuration back to the original volume when we hit this failure, thus leaving the domain writing to the new volume. We eventually disconnect this new volume from the underlying host as well, which I assume would cause the disk within the instance to enter a RO state.

I'm going to try to reproduce this part locally now.

Lee Yarwood (lyarwood) wrote :

I've been able to reproduce and confirm what is leading to the corruption of disk images *after* the `block copy still active` error in the following pastebin:

http://paste.openstack.org/show/798405/

In short, n-cpu doesn't roll back the active domain configuration after successfully pivoting to the new volume if we *then* hit a failure, as is the case here.

As a result, the instance ends up trying to write to a device that n-cpu then disconnects from the underlying host and that c-vol unmaps from the backend, as part of error handling higher up in the compute layer.

I'll work on the following next week to avoid this.

1. Locking _swap_volume by instance.uuid in the compute manager to avoid the race.

2. Refactoring and improving error handling within the libvirt driver swap_volume methods to ensure
   that we rollback the active domain configuration if we hit a late error.

3. Writing functional tests to cover the above and additional failure cases currently missing from
   the codebase.

Fix proposed to branch: master
Review: https://review.opendev.org/754695

Changed in nova:
status: Confirmed → In Progress

Reviewed: https://review.opendev.org/754695
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6cf449bdd0d4beb95cf12311e7d2f8669e625fac
Submitter: Zuul
Branch: master

commit 6cf449bdd0d4beb95cf12311e7d2f8669e625fac
Author: Lee Yarwood <email address hidden>
Date: Mon Sep 28 12:18:29 2020 +0100

    compute: Lock by instance.uuid lock during swap_volume

    The libvirt driver is currently the only virt driver implementing swap
    volume within Nova. While libvirt itself does support moving between
    multiple volumes attached to the same instance at the same time the
    current logic within the libvirt driver makes a call to
    virDomainGetXMLDesc that fails if there are active block jobs against
    any disk attached to the domain.

    This change simply uses an instance.uuid based lock in the compute layer
    to serialise requests to swap_volume to avoid this from being possible.

    Closes-Bug: #1896621
    Change-Id: Ic5ce2580e7638a47f1ffddb4edbb503bf490504c

Changed in nova:
status: In Progress → Fix Released
Craig McIntyre (ceemac) wrote :

Thanks again for your work on this.

This issue was fixed in the openstack/nova 22.1.0 release.

This issue was fixed in the openstack/nova 21.2.0 release.

melanie witt (melwitt) wrote :

https://review.opendev.org/c/openstack/nova/+/758732 has been released in ussuri 21.2.0

melanie witt (melwitt) wrote :

This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.
