failed compute node didn't delete instance's path directory in init_host

Bug #1414895 reported by ChangBo Guo(gcb)
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Low
Lee Yarwood
Queens
Fix Released
Low
Lee Yarwood
Rocky
Fix Released
Low
Lee Yarwood

Bug Description

 It will do clean for evacuated instances in method init_host in nova/compute/manager.py we restart a failed compute node .
  while using volume-backend shared-storage (like ceph) , nova leaves instance path directory in failed compuet node.

The root cause is that we only passed argument "destroy_disks" to driver.destroy, the value will be true while using ceph.
then will not delete instance path .

This may lead live-migration instance failure with exeption DestinationDiskExists

Revision history for this message
Sean Dague (sdague) wrote :

This seems to be a ceph specific issue, marking as low

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
tags: added: ceph
Changed in nova:
assignee: nobody → Bartosz Fic (bartosz-fic)
tags: added: live-migrate
Changed in nova:
assignee: Bartosz Fic (bartosz-fic) → nobody
Paul Murray (pmurray)
tags: added: live-migration
removed: live-migrate
lvmxh (shaohef)
Changed in nova:
assignee: nobody → lvmxh (shaohef)
lvmxh (shaohef)
Changed in nova:
assignee: lvmxh (shaohef) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/281913

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: Confirmed → In Progress
Revision history for this message
Timofey Durakov (tdurakov) wrote :

This problem isn't affect live migration because there are 2 similar methods for checking shared storage, one described here is used during revert resize and cleanup compute after fail. So removing live-migrate tag.
P.S. this check duplication should be refactored.

tags: removed: live-migration
Revision history for this message
Timofey Durakov (tdurakov) wrote :

oh, side effect for live-migration process.

tags: added: live-migration
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Lee Yarwood (<email address hidden>) on branch: master
Review: https://review.openstack.org/281913

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Lee Yarwood: Are you still working on this?

Revision history for this message
Lee Yarwood (lyarwood) wrote :

Yes but I see that @tdurakov also has a review up re-factoring much of this code [1]. I'll review this and reassign if required.

[1] https://review.openstack.org/#/c/280653/

Revision history for this message
Augustina Ragwitz (auggy) wrote :

I left a comment on that review to include this bug in that "closes bug" list. If that's inaccurate then please feel free to pipe up and clarify.

Revision history for this message
Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → Confirmed
assignee: Lee Yarwood (lyarwood) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/477780

Changed in nova:
assignee: nobody → Jeffrey Zhang (jeffrey4l)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/618478

Changed in nova:
assignee: Jeffrey Zhang (jeffrey4l) → Lee Yarwood (lyarwood)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/627958

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/618478
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d6c1f6a1032ed2ea99f3d8b70ccf38065163d785
Submitter: Zuul
Branch: master

commit d6c1f6a1032ed2ea99f3d8b70ccf38065163d785
Author: Lee Yarwood <email address hidden>
Date: Mon Dec 3 09:03:26 2018 +0000

    libvirt: Add workaround to cleanup instance dir when using rbd

    At present all virt drivers provide a cleanup method that takes a single
    destroy_disks boolean to indicate when the underlying storage of an
    instance should be destroyed.

    When cleaning up after an evacuation or revert resize the value of
    destroy_disks is determined by the compute layer calling down both into
    the check_instance_shared_storage_local method of the local virt driver
    and remote check_instance_shared_storage method of the virt driver on
    the host now running the instance.

    For the Libvirt driver the initial local call will return None when
    using the shared block RBD imagebackend as it is assumed all instance
    storage is shared resulting in destroy_disks always being False when
    cleaning up. This behaviour is wrong as the instance disks are stored
    separately to the instance directory that still needs to be cleaned up
    on the host. Additionally this directory could also be shared
    independently of the disks on a NFS share for example and would need to
    also be checked before removal.

    This change introduces a backportable workaround configurable for the
    Libvirt driver with which operators can ensure that the instance
    directory is always removed during cleanup when using the RBD
    imagebackend. When enabling this workaround operators will need to
    ensure that the instance directories are not shared between computes.

    Future work will allow for the removal of this workaround by separating
    the shared storage checks from the compute to virt layers between the
    actual instance disks and any additional storage required by the
    specific virt backend.

    Related-Bug: #1761062
    Partial-Bug: #1414895
    Change-Id: I8fd6b9f857a1c4919c3365951e2652d2d477df77

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/628726

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/627958
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8c678ae57299076a5013f0be985621e064acfee0
Submitter: Zuul
Branch: stable/rocky

commit 8c678ae57299076a5013f0be985621e064acfee0
Author: Lee Yarwood <email address hidden>
Date: Mon Dec 3 09:03:26 2018 +0000

    libvirt: Add workaround to cleanup instance dir when using rbd

    At present all virt drivers provide a cleanup method that takes a single
    destroy_disks boolean to indicate when the underlying storage of an
    instance should be destroyed.

    When cleaning up after an evacuation or revert resize the value of
    destroy_disks is determined by the compute layer calling down both into
    the check_instance_shared_storage_local method of the local virt driver
    and remote check_instance_shared_storage method of the virt driver on
    the host now running the instance.

    For the Libvirt driver the initial local call will return None when
    using the shared block RBD imagebackend as it is assumed all instance
    storage is shared resulting in destroy_disks always being False when
    cleaning up. This behaviour is wrong as the instance disks are stored
    separately to the instance directory that still needs to be cleaned up
    on the host. Additionally this directory could also be shared
    independently of the disks on a NFS share for example and would need to
    also be checked before removal.

    This change introduces a backportable workaround configurable for the
    Libvirt driver with which operators can ensure that the instance
    directory is always removed during cleanup when using the RBD
    imagebackend. When enabling this workaround operators will need to
    ensure that the instance directories are not shared between computes.

    Future work will allow for the removal of this workaround by separating
    the shared storage checks from the compute to virt layers between the
    actual instance disks and any additional storage required by the
    specific virt backend.

    Related-Bug: #1761062
    Partial-Bug: #1414895
    Change-Id: I8fd6b9f857a1c4919c3365951e2652d2d477df77
    (cherry picked from commit d6c1f6a1032ed2ea99f3d8b70ccf38065163d785)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/628726
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b7bf1fbe4917c285f7bb635e791204d67b809049
Submitter: Zuul
Branch: stable/queens

commit b7bf1fbe4917c285f7bb635e791204d67b809049
Author: Lee Yarwood <email address hidden>
Date: Mon Dec 3 09:03:26 2018 +0000

    libvirt: Add workaround to cleanup instance dir when using rbd

    At present all virt drivers provide a cleanup method that takes a single
    destroy_disks boolean to indicate when the underlying storage of an
    instance should be destroyed.

    When cleaning up after an evacuation or revert resize the value of
    destroy_disks is determined by the compute layer calling down both into
    the check_instance_shared_storage_local method of the local virt driver
    and remote check_instance_shared_storage method of the virt driver on
    the host now running the instance.

    For the Libvirt driver the initial local call will return None when
    using the shared block RBD imagebackend as it is assumed all instance
    storage is shared resulting in destroy_disks always being False when
    cleaning up. This behaviour is wrong as the instance disks are stored
    separately to the instance directory that still needs to be cleaned up
    on the host. Additionally this directory could also be shared
    independently of the disks on a NFS share for example and would need to
    also be checked before removal.

    This change introduces a backportable workaround configurable for the
    Libvirt driver with which operators can ensure that the instance
    directory is always removed during cleanup when using the RBD
    imagebackend. When enabling this workaround operators will need to
    ensure that the instance directories are not shared between computes.

    Future work will allow for the removal of this workaround by separating
    the shared storage checks from the compute to virt layers between the
    actual instance disks and any additional storage required by the
    specific virt backend.

    NOTE(lyarwood): Conflicts as If1b6e5f20d2ea82d94f5f0550f13189fc9bc16c4
    only merged in Rocky and the backports of
    Id3c74c019da29070811ffc368351e2238b3f6da5 and
    I217fba9138132b107e9d62895d699d238392e761 have yet to land on
    stable/queens from stable/rocky.

    Conflicts:
            nova/conf/workarounds.py

    Related-Bug: #1761062
    Partial-Bug: #1414895
    Change-Id: I8fd6b9f857a1c4919c3365951e2652d2d477df77
    (cherry picked from commit d6c1f6a1032ed2ea99f3d8b70ccf38065163d785)
    (cherry picked from commit 8c678ae57299076a5013f0be985621e064acfee0)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Jeffrey Zhang (<email address hidden>) on branch: master
Review: https://review.opendev.org/477780

Lee Yarwood (lyarwood)
Changed in nova:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.