clean source instance directory failed in _cleanup_resize when images_type is rbd

Bug #1761062 reported by guolidong on 2018-04-04
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========
When images_type is rbd and an instance is booted from an image, resizing the instance and then confirming the resize does not clean up the source instance directory. As a result, a subsequent live migration of the instance fails because the leftover directory already exists on the destination. The error log follows.

2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server [req-e20743e3-a683-41ba-b47b-ba92e97eff37 4f7cd8bf676d43bc9faf09b2eb41482f 2c3d8251c39545cbb6f77f331b7164f8 - default default] Exception during message handling: DestinationDiskExists: The supplied disk path (/var/lib/nova/instances/d8db3f2a-cd8f-48e1-9951-012d762664b2) already exists, it is expected not to exist.
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 160, in _process_incoming
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 76, in wrapped
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server function_name, call_dict, binary)
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server self.force_reraise()
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 67, in wrapped
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/compute/utils.py", line 880, in decorated_function
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 223, in decorated_function
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server kwargs['instance'], e, sys.exc_info())
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server self.force_reraise()
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 211, in decorated_function
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 5615, in pre_live_migration
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server migrate_data)
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova_patch/virt/libvirt/driver.py", line 1095, in wrap
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server migrate_data=migrate_data)
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7081, in pre_live_migration
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server raise exception.DestinationDiskExists(path=instance_dir)
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server DestinationDiskExists: The supplied disk path (/var/lib/nova/instances/d8db3f2a-cd8f-48e1-9951-012d762664b2) already exists, it is expected not to exist.
2018-03-22 10:05:40.657 20779 ERROR oslo_messaging.rpc.server
2018-03-22 10:05:42.665 20779 INFO nova.virt.libvirt.driver [req-e20743e3-a683-41ba-b47b-ba92e97eff37 4f7cd8bf676d43bc9faf09b2eb41482f 2c3d8251c39545cbb6f77f331b7164f8 - default default] [instance: d8db3f2a-cd8f-48e1-9951-012d762664b2] Instance destroyed successfully.

When images_type is rbd, the resize process treats the storage as shared (see _is_storage_shared_with).
In this environment, however, the instance directory is not actually on shared storage.
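
The failure mode described above can be sketched roughly as follows. This is a simplified model, not nova's actual code: is_storage_shared, cleanup_resize and the local DestinationDiskExists class are hypothetical stand-ins, and the only behaviour taken from this report is that rbd is treated as shared storage during resize and that pre_live_migration raises when the instance directory already exists.

```python
import os
import shutil


class DestinationDiskExists(Exception):
    """Hypothetical stand-in for nova.exception.DestinationDiskExists."""


def is_storage_shared(images_type):
    # Assumption: models the behaviour reported here, where an rbd image
    # backend alone is enough for the resize path to assume shared storage.
    return images_type == "rbd"


def cleanup_resize(instance_dir, images_type):
    # Hypothetical simplification: the source instance directory is only
    # removed when storage is believed NOT to be shared.
    if not is_storage_shared(images_type):
        shutil.rmtree(instance_dir, ignore_errors=True)


def pre_live_migration(instance_dir):
    # Mirrors the check visible in the traceback: an already existing
    # instance directory on the destination is treated as an error.
    if os.path.exists(instance_dir):
        raise DestinationDiskExists(instance_dir)
    os.makedirs(instance_dir)
```

With images_type set to rbd, cleanup_resize leaves the directory in place, so a later pre_live_migration against the same path raises. With a local backend the directory is removed and the call succeeds.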

Environment
===========
nova: origin/stable/pike
libvirt+kvm

guolidong (guolidong) on 2018-04-04
Changed in nova:
assignee: nobody → guolidong (guolidong)
guolidong (guolidong) on 2018-04-09
Changed in nova:
assignee: guolidong (guolidong) → nobody
Matt Riedemann (mriedem) wrote :

Looks like this is the problem: https://review.openstack.org/#/c/327419/

That's been around since Newton. I also assume that if people are using ceph/rbd then their computes are on shared storage.

Why would the two computes here not be on the same shared storage pool? If they aren't, but other computes are on shared storage, then you should probably use host aggregates to define the groups of hosts which are in the same shared storage pools so that the scheduler won't pick a destination host for the resize which the source compute can't reach.

tags: added: ceph libvirt
melanie witt (melwitt) wrote :

You can only hit this error if you're not using shared storage, but if you're using 'images_type = rbd', you should be using shared storage. Can you explain more about your environment and why/how you are not using shared storage with rbd?

Changed in nova:
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired

Reviewed: https://review.openstack.org/618478
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d6c1f6a1032ed2ea99f3d8b70ccf38065163d785
Submitter: Zuul
Branch: master

commit d6c1f6a1032ed2ea99f3d8b70ccf38065163d785
Author: Lee Yarwood <email address hidden>
Date: Mon Dec 3 09:03:26 2018 +0000

    libvirt: Add workaround to cleanup instance dir when using rbd

    At present all virt drivers provide a cleanup method that takes a single
    destroy_disks boolean to indicate when the underlying storage of an
    instance should be destroyed.

    When cleaning up after an evacuation or revert resize the value of
    destroy_disks is determined by the compute layer calling down both into
    the check_instance_shared_storage_local method of the local virt driver
    and remote check_instance_shared_storage method of the virt driver on
    the host now running the instance.

    For the Libvirt driver the initial local call will return None when
    using the shared block RBD imagebackend as it is assumed all instance
    storage is shared resulting in destroy_disks always being False when
    cleaning up. This behaviour is wrong as the instance disks are stored
    separately to the instance directory that still needs to be cleaned up
    on the host. Additionally this directory could also be shared
    independently of the disks on a NFS share for example and would need to
    also be checked before removal.

    This change introduces a backportable workaround configurable for the
    Libvirt driver with which operators can ensure that the instance
    directory is always removed during cleanup when using the RBD
    imagebackend. When enabling this workaround operators will need to
    ensure that the instance directories are not shared between computes.

    Future work will allow for the removal of this workaround by separating
    the shared storage checks from the compute to virt layers between the
    actual instance disks and any additional storage required by the
    specific virt backend.

    Related-Bug: #1761062
    Partial-Bug: #1414895
    Change-Id: I8fd6b9f857a1c4919c3365951e2652d2d477df77
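
A minimal sketch of the decision this commit message describes, under stated assumptions. The function names decide_destroy_disks and cleanup, and the ensure_rbd_dir_cleanup parameter, are hypothetical and only model the logic; they are not nova's API. To my knowledge the option this change added lives in the [workarounds] group of nova.conf and is named ensure_libvirt_rbd_instance_dir_cleanup, but verify that against nova/conf/workarounds.py for your release.

```python
import os
import shutil


def decide_destroy_disks(local_check_data, remote_check):
    # Per the commit message: the local check returns None for the RBD
    # imagebackend, storage is then assumed shared, and destroy_disks
    # ends up False.
    if local_check_data is None:
        return False
    # Otherwise the host now running the instance is asked whether it
    # can see the same storage.
    return not remote_check(local_check_data)


def cleanup(instance_dir, destroy_disks, uses_rbd,
            ensure_rbd_dir_cleanup=False):
    # Hypothetical model of the workaround: with RBD and the flag set,
    # the instance directory is removed even when destroy_disks is False.
    remove_dir = destroy_disks or (uses_rbd and ensure_rbd_dir_cleanup)
    if remove_dir and os.path.exists(instance_dir):
        shutil.rmtree(instance_dir)
    return remove_dir
```

As the commit message warns, a flag like this is only safe when the instance directories themselves are not shared between computes, since the sketch removes the directory without any further shared-storage check.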

Reviewed: https://review.openstack.org/627958
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8c678ae57299076a5013f0be985621e064acfee0
Submitter: Zuul
Branch: stable/rocky

commit 8c678ae57299076a5013f0be985621e064acfee0
Author: Lee Yarwood <email address hidden>
Date: Mon Dec 3 09:03:26 2018 +0000

    libvirt: Add workaround to cleanup instance dir when using rbd

    At present all virt drivers provide a cleanup method that takes a single
    destroy_disks boolean to indicate when the underlying storage of an
    instance should be destroyed.

    When cleaning up after an evacuation or revert resize the value of
    destroy_disks is determined by the compute layer calling down both into
    the check_instance_shared_storage_local method of the local virt driver
    and remote check_instance_shared_storage method of the virt driver on
    the host now running the instance.

    For the Libvirt driver the initial local call will return None when
    using the shared block RBD imagebackend as it is assumed all instance
    storage is shared resulting in destroy_disks always being False when
    cleaning up. This behaviour is wrong as the instance disks are stored
    separately to the instance directory that still needs to be cleaned up
    on the host. Additionally this directory could also be shared
    independently of the disks on a NFS share for example and would need to
    also be checked before removal.

    This change introduces a backportable workaround configurable for the
    Libvirt driver with which operators can ensure that the instance
    directory is always removed during cleanup when using the RBD
    imagebackend. When enabling this workaround operators will need to
    ensure that the instance directories are not shared between computes.

    Future work will allow for the removal of this workaround by separating
    the shared storage checks from the compute to virt layers between the
    actual instance disks and any additional storage required by the
    specific virt backend.

    Related-Bug: #1761062
    Partial-Bug: #1414895
    Change-Id: I8fd6b9f857a1c4919c3365951e2652d2d477df77
    (cherry picked from commit d6c1f6a1032ed2ea99f3d8b70ccf38065163d785)

tags: added: in-stable-rocky

Reviewed: https://review.openstack.org/628726
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b7bf1fbe4917c285f7bb635e791204d67b809049
Submitter: Zuul
Branch: stable/queens

commit b7bf1fbe4917c285f7bb635e791204d67b809049
Author: Lee Yarwood <email address hidden>
Date: Mon Dec 3 09:03:26 2018 +0000

    libvirt: Add workaround to cleanup instance dir when using rbd

    At present all virt drivers provide a cleanup method that takes a single
    destroy_disks boolean to indicate when the underlying storage of an
    instance should be destroyed.

    When cleaning up after an evacuation or revert resize the value of
    destroy_disks is determined by the compute layer calling down both into
    the check_instance_shared_storage_local method of the local virt driver
    and remote check_instance_shared_storage method of the virt driver on
    the host now running the instance.

    For the Libvirt driver the initial local call will return None when
    using the shared block RBD imagebackend as it is assumed all instance
    storage is shared resulting in destroy_disks always being False when
    cleaning up. This behaviour is wrong as the instance disks are stored
    separately to the instance directory that still needs to be cleaned up
    on the host. Additionally this directory could also be shared
    independently of the disks on a NFS share for example and would need to
    also be checked before removal.

    This change introduces a backportable workaround configurable for the
    Libvirt driver with which operators can ensure that the instance
    directory is always removed during cleanup when using the RBD
    imagebackend. When enabling this workaround operators will need to
    ensure that the instance directories are not shared between computes.

    Future work will allow for the removal of this workaround by separating
    the shared storage checks from the compute to virt layers between the
    actual instance disks and any additional storage required by the
    specific virt backend.

    NOTE(lyarwood): Conflicts as If1b6e5f20d2ea82d94f5f0550f13189fc9bc16c4
    only merged in Rocky and the backports of
    Id3c74c019da29070811ffc368351e2238b3f6da5 and
    I217fba9138132b107e9d62895d699d238392e761 have yet to land on
    stable/queens from stable/rocky.

    Conflicts:
            nova/conf/workarounds.py

    Related-Bug: #1761062
    Partial-Bug: #1414895
    Change-Id: I8fd6b9f857a1c4919c3365951e2652d2d477df77
    (cherry picked from commit d6c1f6a1032ed2ea99f3d8b70ccf38065163d785)
    (cherry picked from commit 8c678ae57299076a5013f0be985621e064acfee0)

tags: added: in-stable-queens