Compute leaks volume attachments if we fail in driver.pre_live_migration

Bug #1778206 reported by Matthew Booth
Affects                   Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released   Low         Artom Lifshitz
Queens                    Fix Committed  Low         Artom Lifshitz
Rocky                     Fix Committed  Low         Artom Lifshitz

Bug Description

ComputeManager.pre_live_migration fails to clean up volume attachments if the call to driver.pre_live_migration() fails. There is a try block intended to clean up attachments, but its scope does not cover the driver.pre_live_migration() call. The result is a volume stuck in a perpetual 'attaching' state.
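
The following is a minimal, self-contained sketch of that scoping problem; FakeDriver and cleanup_attachments are hypothetical stand-ins rather than nova's real objects, and the only point is that a failure in the driver call bypasses the except-block cleanup entirely:

    class DestinationDiskExists(Exception):
        pass

    class FakeDriver:
        def pre_live_migration(self, migrate_data):
            # The libvirt driver can fail here, e.g. with DestinationDiskExists.
            raise DestinationDiskExists("instance disk already exists on destination")

    def pre_live_migration(driver, migrate_data, cleanup_attachments):
        # Buggy shape: the driver call sits outside the try block, so when it
        # raises, the cleanup below never runs and the volume stays 'attaching'.
        migrate_data = driver.pre_live_migration(migrate_data)
        try:
            pass  # per-volume setup that the except block knows how to roll back
        except Exception:
            cleanup_attachments()
            raise
        return migrate_data

    try:
        pre_live_migration(FakeDriver(), {}, cleanup_attachments=lambda: print("cleaning up"))
    except DestinationDiskExists:
        print("driver failed and no cleanup ran")  # the leaked-attachment case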

Revision history for this message
Andrey Volkov (avolkov) wrote :

My reading of the bug:
BDM cleanup happens in nova.compute.manager.ComputeManager._rollback_live_migration
https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L6636
based on migrate_data.old_vol_attachment_ids.
If nova.compute.manager.ComputeManager.pre_live_migration fails, that data is never set
in migrate_data, so the BDM cleanup will not happen.
https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L6166
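
In other words, the existing rollback only acts on attachment ids that a successful pre_live_migration recorded in migrate_data. A simplified, hypothetical sketch of that condition (BDM and attachment_delete stand in for nova's BlockDeviceMapping and the Cinder call; this is not the real _rollback_live_migration signature):

    from dataclasses import dataclass

    @dataclass
    class BDM:
        volume_id: str
        attachment_id: str

    def rollback_live_migration(migrate_data, bdms, attachment_delete):
        # Cleanup is keyed off old_vol_attachment_ids, which pre_live_migration
        # only populates when it returns successfully. If it raised instead,
        # the dict is missing/empty and this loop cleans up nothing.
        old_attachments = getattr(migrate_data, 'old_vol_attachment_ids', None) or {}
        for bdm in bdms:
            if bdm.volume_id in old_attachments:
                attachment_delete(bdm.attachment_id)                # drop the new attachment
                bdm.attachment_id = old_attachments[bdm.volume_id]  # restore the old id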

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Changed in nova:
assignee: nobody → Artom Lifshitz (notartom)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/587439
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1a29248d5e688ba1d4f806895dccd45fcb34b833
Submitter: Zuul
Branch: master

commit 1a29248d5e688ba1d4f806895dccd45fcb34b833
Author: Matthew Booth <email address hidden>
Date: Tue Jun 26 14:42:47 2018 +0100

    Ensure attachment cleanup on failure in driver.pre_live_migration

    Previously, if the call to driver.pre_live_migration failed (which in
    libvirt can happen with a DestinationDiskExists exception), the
    compute manager wouldn't rollback/cleanup volume attachments, leading
    to corrupt volume attachment information, and, depending on the
    backend, the instance being unable to access its volume. This patch
    moves the driver.pre_live_migration call inside the existing
    try/except, allowing the compute manager to properly rollback/cleanup
    volume attachments.

    The compute manager has its own _rollback_live_migration() cleanup in
    case the pre_live_migration() RPC call to the destination fails. There
    should be no conflicts between the cleanup in that and the new volume
    cleanup in the except block. The remove_volume_connection() ->
    driver_detach() -> detach_volume() call catches the InstanceNotFound
    exception and warns about the instance disappearing (it was never
    really on the destination in the first place). The attachment_delete()
    in _rollback_live_migration() is contingent on there being an
    old_vol_attachment in migrate_data, which there isn't because
    pre_live_migration() raised instead of returning.

    Change-Id: I67f66e95d69ae6df22e539550a3eac697ea8f5d8
    Closes-bug: 1778206
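
For reference, the fixed pre_live_migration() has roughly the following shape. This is a hedged paraphrase with simplified stand-ins (volume_api, driver, the bdm fields, the created_attachment_ids bookkeeping), not the literal diff:

    def pre_live_migration(context, instance, bdms, migrate_data, volume_api, driver):
        created_attachment_ids = []
        try:
            # Create new attachments on the destination for Cinder-backed BDMs.
            for bdm in bdms:
                if bdm.attachment_id is not None:
                    new_ref = volume_api.attachment_create(
                        context, bdm.volume_id, instance.uuid)
                    created_attachment_ids.append(new_ref['id'])
                    bdm.attachment_id = new_ref['id']

            # The driver call now lives inside the try block (this is the change),
            # so a failure such as DestinationDiskExists hits the cleanup below.
            migrate_data = driver.pre_live_migration(context, instance, migrate_data)
        except Exception:
            # Roll back only the attachments created above; the exception propagates
            # before migrate_data reaches the source, so _rollback_live_migration()
            # has no old_vol_attachment_ids to act on and cannot do this for us.
            for attachment_id in created_attachment_ids:
                volume_api.attachment_delete(context, attachment_id)
            raise
        return migrate_data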

Changed in nova:
status: In Progress → Fix Released
Matt Riedemann (mriedem)
tags: added: compute live-migration volumes
no longer affects: nova/pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/612715

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/612774

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/612715
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6aecf475dbff9c7d477378c021749b987d47deae
Submitter: Zuul
Branch: stable/rocky

commit 6aecf475dbff9c7d477378c021749b987d47deae
Author: Matthew Booth <email address hidden>
Date: Tue Jun 26 14:42:47 2018 +0100

    Ensure attachment cleanup on failure in driver.pre_live_migration

    Previously, if the call to driver.pre_live_migration failed (which in
    libvirt can happen with a DestinationDiskExists exception), the
    compute manager wouldn't rollback/cleanup volume attachments, leading
    to corrupt volume attachment information, and, depending on the
    backend, the instance being unable to access its volume. This patch
    moves the driver.pre_live_migration call inside the existing
    try/except, allowing the compute manager to properly rollback/cleanup
    volume attachments.

    The compute manager has its own _rollback_live_migration() cleanup in
    case the pre_live_migration() RPC call to the destination fails. There
    should be no conflicts between the cleanup in that and the new volume
    cleanup in the except block. The remove_volume_connection() ->
    driver_detach() -> detach_volume() call catches the InstanceNotFound
    exception and warns about the instance disappearing (it was never
    really on the destination in the first place). The attachment_delete()
    in _rollback_live_migration() is contingent on there being an
    old_vol_attachment in migrate_data, which there isn't because
    pre_live_migration() raised instead of returning.

    Change-Id: I67f66e95d69ae6df22e539550a3eac697ea8f5d8
    Closes-bug: 1778206
    (cherry picked from commit 1a29248d5e688ba1d4f806895dccd45fcb34b833)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/612774
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4b53e4edf9f7f8c794316d3a0920d727723d1bab
Submitter: Zuul
Branch: stable/queens

commit 4b53e4edf9f7f8c794316d3a0920d727723d1bab
Author: Matthew Booth <email address hidden>
Date: Tue Jun 26 14:42:47 2018 +0100

    Ensure attachment cleanup on failure in driver.pre_live_migration

    Previously, if the call to driver.pre_live_migration failed (which in
    libvirt can happen with a DestinationDiskExists exception), the
    compute manager wouldn't rollback/cleanup volume attachments, leading
    to corrupt volume attachment information, and, depending on the
    backend, the instance being unable to access its volume. This patch
    moves the driver.pre_live_migration call inside the existing
    try/except, allowing the compute manager to properly rollback/cleanup
    volume attachments.

    The compute manager has its own _rollback_live_migration() cleanup in
    case the pre_live_migration() RPC call to the destination fails. There
    should be no conflicts between the cleanup in that and the new volume
    cleanup in the except block. The remove_volume_connection() ->
    driver_detach() -> detach_volume() call catches the InstanceNotFound
    exception and warns about the instance disappearing (it was never
    really on the destination in the first place). The attachment_delete()
    in _rollback_live_migration() is contingent on there being an
    old_vol_attachment in migrate_data, which there isn't because
    pre_live_migration() raised instead of returning.

    Conflicts in nova/compute/manager.py due to absence of setting
    migrate_data.wait_for_vif_plug from 5aadff75c3a.

    Change-Id: I67f66e95d69ae6df22e539550a3eac697ea8f5d8
    Closes-bug: 1778206
    (cherry picked from commit 1a29248d5e688ba1d4f806895dccd45fcb34b833)
    (cherry picked from commit 6aecf475dbff9c7d477378c021749b987d47deae)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.3

This issue was fixed in the openstack/nova 18.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.9

This issue was fixed in the openstack/nova 17.0.9 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
