[SRU] Error in confirm_migration leaves stale allocations and 'confirming' migration state

Bug #1821594 reported by Rodrigo Barbieri
16
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Matt Riedemann
Pike
Triaged
Medium
Unassigned
Queens
Fix Committed
Medium
Tony Breeds
Rocky
Fix Committed
Medium
Matt Riedemann
Stein
Fix Released
Medium
Matt Riedemann
Ubuntu Cloud Archive
Fix Released
Medium
Unassigned
Queens
Fix Released
Medium
Unassigned
Rocky
Fix Released
Medium
Unassigned
Stein
Fix Released
Medium
Unassigned
Train
Fix Released
Undecided
Unassigned
nova (Ubuntu)
Fix Released
Undecided
Unassigned
Bionic
Fix Released
Medium
Unassigned
Cosmic
Fix Released
Medium
Unassigned
Disco
Fix Released
Medium
Unassigned
Eoan
Fix Released
Undecided
Unassigned

Bug Description

Description:

When performing a cold migration, if an exception is raised by the driver during confirm_migration (this runs in the source node), the migration record is stuck in "confirming" state and the allocations against the source node are not removed.

The instance is fine at the destination in this stage, but the source host has allocations that is not possible to clean without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node is filled with these allocations that prevent new instances from being created or instances migrated to this node.

When confirm_migration fails in this stage, the migrating instance can be saved through a hard reboot or a reset state to active.

Steps to reproduce:

Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration running libvirt driver. However, the stale allocations and migration status problem can be easily reproduced by raising an exception in libvirt driver's confirm_migration method, and it would affect any driver.

Expected results:

Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug.

Actual results:

Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state.

Environment:

I verified this bug on on pike, queens and stein branches. Running libvirt KVM driver.

=======================================================================

[Impact]

If users attempting to perform cold migrations face any issues when
the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database, in migration records in "confirming" state. The stale allocations are not cleaned up by nova, consuming the user's quota indefinitely.

This bug was confirmed from pike to stein release, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from reoccurring.

This fix prevents new stale allocations being left over, by cleaning them up immediately when the failures occur. At the moment, the users affected by this bug have to clean their previous stale allocations manually.

[Test Case]

1. Reproducing the bug

1a. Inject failure

The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised.

An example when using libvirt is to add a line:

raise Exception("TEST")

in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012

1b. Restart nova-compute service: systemctl restart nova-compute

1c. Create a VM

1d. Then, invoke a cold migration: "openstack server migrate {id}"

1e. Wait for instance status: VERIFY_RESIZE

1f. Invoke "openstack server resize {id} --confirm"

1g. Wait for instance status: ERROR

1h. Check migration stuck in "confirming" status: nova migration-list

1i. Check allocations, you should see 2 allocations, one with the VM ID, the other with the migration uuid

export ENDPOINT=<placement_endpoint>
export TOKEN=`openstack token issue| grep ' id '| awk '{print $4}'`
for id in $(curl -k -s -X GET $ENDPOINT/resource_providers -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq -r .resource_providers[].uuid); do curl -k -s -X GET $ENDPOINT/resource_providers/$id/allocations -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq [.allocations]; done

2. Cleanup

2a. Delete the VM

2b. Delete the stale allocation:

export ID=<migration_uuid>
curl -k -s -X DELETE $ENDPOINT/allocations/$ID -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement placement 1.17"

3. Install package that contains the fixed code

4. Confirm bug is fixed

4a. Repeat steps 1a through 1g

4b. Check migration with "error" status: nova migration-list

4c. Check allocations, you should see only 1 allocation with the VM ID

for id in $(curl -k -s -X GET $ENDPOINT/resource_providers -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq -r .resource_providers[].uuid); do curl -k -s -X GET $ENDPOINT/resource_providers/$id/allocations -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq [.allocations]; done

5. Cleanup

5a. Delete the VM

[Regression Potential]

New functional test https://review.opendev.org/#/c/657870/ validated the fix and was backported all the way to Queens. The fix being backported caused no functional test to fail.

[Other Info]

None

Revision history for this message
Matt Riedemann (mriedem) wrote :

IRC discussion here:

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-03-25.log.html#t2019-03-25T13:07:57

Setting the migration record status to error is easy, so I'll start with that. Refactoring the code to handle errors but still cleanup allocations is a bit more complicated, so that can be split into a separate change to close the bug.

Also note related bug 1793569 with some scripts for cleaning up orphaned allocations.

Changed in nova:
status: New → Triaged
importance: Undecided → Medium
tags: added: compute placement resize
Matt Riedemann (mriedem)
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
Matt Riedemann (mriedem) wrote :

This goes back to Pike as noted in the bug description. Before Pike the ResourceTracker.update_available_resource code would at least correct the allocations based on the instance.flavor and instance.host.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/647546

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/647566

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/647741

Changed in nova:
assignee: Matt Riedemann (mriedem) → Eric Fried (efried)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/647546
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=408ef8f84a698f764aa5d769d6d01fd9340de2e5
Submitter: Zuul
Branch: master

commit 408ef8f84a698f764aa5d769d6d01fd9340de2e5
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 13:16:42 2019 -0400

    Error out migration when confirm_resize fails

    If anything fails and raises an exception during
    confirm_resize, the migration status is stuck in
    "confirming" status even though the instance status
    may be "ERROR".

    This change adds the errors_out_migration decorator
    to the confirm_resize method to make sure the migration
    status is "error" if an error is raised.

    In bug 1821594 it was the driver.confirm_migration
    method that raised some exception, so a unit test is
    added here which simulates a similar scenario.

    This only partially closes the bug because we are still
    leaking allocations on the source node resource provider
    since _delete_allocation_after_move is not called. That
    will be dealt with in a separate patch.

    Change-Id: Ic7d78ad43a2bad7f932c22c98944accbbed9e9e2
    Partial-Bug: #1821594

Matt Riedemann (mriedem)
Changed in nova:
assignee: Eric Fried (efried) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.openstack.org/649421

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/647566
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=03a6d26691c1f182224d59190b79f48df278099e
Submitter: Zuul
Branch: master

commit 03a6d26691c1f182224d59190b79f48df278099e
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 14:02:17 2019 -0400

    Delete allocations even if _confirm_resize raises

    When we are confirming a resize, the guest is on the dest
    host and the instance host/node values in the database
    are pointing at the dest host, so the _confirm_resize method
    on the source is really best effort. If something fails, we
    should not leak allocations in placement for the source compute
    node resource provider since the instance is not actually
    consuming the source node provider resources.

    This change refactors the error handling around the _confirm_resize
    call so the big nesting for _error_out_instance_on_exception is
    moved to confirm_resize and then a try/finally is added around
    _confirm_resize so we can be sure to try and cleanup the allocations
    even if _confirm_resize fails in some obscure way. If _confirm_resize
    does fail, the error gets re-raised along with logging a traceback
    and hint about how to correct the instance state in the DB by hard
    rebooting the server on the dest host.

    Change-Id: I29c5f491ec20a71283190a1599e7732541de736f
    Closes-Bug: #1821594

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/647741
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4d0aab16ecac5128a20ab9229d444bc49a9c722b
Submitter: Zuul
Branch: master

commit 4d0aab16ecac5128a20ab9229d444bc49a9c722b
Author: Matt Riedemann <email address hidden>
Date: Tue Mar 26 09:55:23 2019 -0400

    api-ref: add more details to confirmResize troubleshooting

    This does a few things:

    1. Fixes the "migration_status" wording since that does not
       exist and could be confused as a field on the server
       resource which it is not, it is referring to the migration
       resource status.

    2. Fixes the RESIZED status mention since that is not a real
       server status, it probably meant VERIFY_RESIZE (RESIZED is
       the name of the nova.compute.vm_states variable used in the
       code to set the VERIFY_RESIZE status in the API).

    3. Adds words about options to correct the server status from
       ERROR after confirmResize fails, the most obvious being
       an admin resetting the server status to ACTIVE or the user
       hard rebooting the server.

    Change-Id: I7de257ad78031d137616d724bee3fa1cbf40d6fa
    Related-Bug: #1821594

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.openstack.org/650437

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/652127

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/652146

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/652150

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/652153

Revision history for this message
Matt Riedemann (mriedem) wrote : Re: Error in confirm_migration leaves stale allocations and 'confirming' migration state

I'm not sure how practical it will be to try and cleanup the leaked allocation in stable/pike because there was a pretty decent sized refactor of the allocation cleanup code in Queens:

https://review.openstack.org/#/c/498947/

Which would mean the backport to Pike would definitely not be clean. So I'll probably not put effort into backporting I29c5f491ec20a71283190a1599e7732541de736f to Pike but I can still fix the migration status getting stuck in 'confirming' issue with change Ic7d78ad43a2bad7f932c22c98944accbbed9e9e2.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Oh furthermore we'd need https://review.openstack.org/#/c/479802/ (or a part of it) in Pike for Ic7d78ad43a2bad7f932c22c98944accbbed9e9e2, so I probably won't pursue that either for now.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.openstack.org/649421
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=972d4e0eb391e83fe8d3020ff95db0e6a840a224
Submitter: Zuul
Branch: stable/stein

commit 972d4e0eb391e83fe8d3020ff95db0e6a840a224
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 13:16:42 2019 -0400

    Error out migration when confirm_resize fails

    If anything fails and raises an exception during
    confirm_resize, the migration status is stuck in
    "confirming" status even though the instance status
    may be "ERROR".

    This change adds the errors_out_migration decorator
    to the confirm_resize method to make sure the migration
    status is "error" if an error is raised.

    In bug 1821594 it was the driver.confirm_migration
    method that raised some exception, so a unit test is
    added here which simulates a similar scenario.

    This only partially closes the bug because we are still
    leaking allocations on the source node resource provider
    since _delete_allocation_after_move is not called. That
    will be dealt with in a separate patch.

    Change-Id: Ic7d78ad43a2bad7f932c22c98944accbbed9e9e2
    Partial-Bug: #1821594
    (cherry picked from commit 408ef8f84a698f764aa5d769d6d01fd9340de2e5)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/650437
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5f515060f6c79f113f7b8107596e41056445c79f
Submitter: Zuul
Branch: stable/stein

commit 5f515060f6c79f113f7b8107596e41056445c79f
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 14:02:17 2019 -0400

    Delete allocations even if _confirm_resize raises

    When we are confirming a resize, the guest is on the dest
    host and the instance host/node values in the database
    are pointing at the dest host, so the _confirm_resize method
    on the source is really best effort. If something fails, we
    should not leak allocations in placement for the source compute
    node resource provider since the instance is not actually
    consuming the source node provider resources.

    This change refactors the error handling around the _confirm_resize
    call so the big nesting for _error_out_instance_on_exception is
    moved to confirm_resize and then a try/finally is added around
    _confirm_resize so we can be sure to try and cleanup the allocations
    even if _confirm_resize fails in some obscure way. If _confirm_resize
    does fail, the error gets re-raised along with logging a traceback
    and hint about how to correct the instance state in the DB by hard
    rebooting the server on the dest host.

    Conflicts:
          nova/compute/manager.py

    NOTE(mriedem): The conflict is due to not having change
    I49034f87e240775cca7a4d769819dd788b86832a in Stein.

    Change-Id: I29c5f491ec20a71283190a1599e7732541de736f
    Closes-Bug: #1821594
    (cherry picked from commit 03a6d26691c1f182224d59190b79f48df278099e)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/652127
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2a25d1e48bce34c7dd0d5783908fdb0dc1b19113
Submitter: Zuul
Branch: stable/rocky

commit 2a25d1e48bce34c7dd0d5783908fdb0dc1b19113
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 13:16:42 2019 -0400

    Error out migration when confirm_resize fails

    If anything fails and raises an exception during
    confirm_resize, the migration status is stuck in
    "confirming" status even though the instance status
    may be "ERROR".

    This change adds the errors_out_migration decorator
    to the confirm_resize method to make sure the migration
    status is "error" if an error is raised.

    In bug 1821594 it was the driver.confirm_migration
    method that raised some exception, so a unit test is
    added here which simulates a similar scenario.

    This only partially closes the bug because we are still
    leaking allocations on the source node resource provider
    since _delete_allocation_after_move is not called. That
    will be dealt with in a separate patch.

    Conflicts:
          nova/tests/functional/test_servers.py
          nova/tests/unit/compute/test_compute_mgr.py

    NOTE(mriedem): The functional test conflict is due to not
    having change I99427a52676826990d2a2ffc82cf30ad945b939c
    in Rocky. The unit test conflict is due to not having
    change I0851e2d54a1fdc82fe3291fb7e286e790f121e92 in Rocky.
    The source_node attribute on the fake Migration object in
    the unit test is added here because change
    I312d61383345ea0ac1ab0c277b4c468e6aa94656 is not in Rocky.

    Change-Id: Ic7d78ad43a2bad7f932c22c98944accbbed9e9e2
    Partial-Bug: #1821594
    (cherry picked from commit 408ef8f84a698f764aa5d769d6d01fd9340de2e5)
    (cherry picked from commit 972d4e0eb391e83fe8d3020ff95db0e6a840a224)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/652146
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=37ac54a42ec91821d62864d63c486c002608491b
Submitter: Zuul
Branch: stable/rocky

commit 37ac54a42ec91821d62864d63c486c002608491b
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 14:02:17 2019 -0400

    Delete allocations even if _confirm_resize raises

    When we are confirming a resize, the guest is on the dest
    host and the instance host/node values in the database
    are pointing at the dest host, so the _confirm_resize method
    on the source is really best effort. If something fails, we
    should not leak allocations in placement for the source compute
    node resource provider since the instance is not actually
    consuming the source node provider resources.

    This change refactors the error handling around the _confirm_resize
    call so the big nesting for _error_out_instance_on_exception is
    moved to confirm_resize and then a try/finally is added around
    _confirm_resize so we can be sure to try and cleanup the allocations
    even if _confirm_resize fails in some obscure way. If _confirm_resize
    does fail, the error gets re-raised along with logging a traceback
    and hint about how to correct the instance state in the DB by hard
    rebooting the server on the dest host.

    Conflicts:
          nova/compute/manager.py
          nova/tests/unit/compute/test_compute_mgr.py

    NOTE(mriedem): The manager.py conflict is due to not having change
    Ibb8c12fb2799bb5ceb9e3d72a2b86dbb4f14451e in Rocky. In addition,
    since change I0851e2d54a1fdc82fe3291fb7e286e790f121e92 is not in
    Rocky the _delete_allocation_after_move signature needs to be
    handled a bit different in this backport. The test_compute_mgr.py
    conflict and source_node addition in the setUp is due to the same
    two changes.

    Change-Id: I29c5f491ec20a71283190a1599e7732541de736f
    Closes-Bug: #1821594
    (cherry picked from commit 03a6d26691c1f182224d59190b79f48df278099e)
    (cherry picked from commit 5f515060f6c79f113f7b8107596e41056445c79f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/657870

Revision history for this message
Matt Riedemann (mriedem) wrote : Re: Error in confirm_migration leaves stale allocations and 'confirming' migration state

The bug is unfortunately not fixed in rocky as shown in this recreate test:

https://review.opendev.org/#/c/658834/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/658929

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/657870
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=873ac499c50125adc2fb49728d936922f9acf4a9
Submitter: Zuul
Branch: master

commit 873ac499c50125adc2fb49728d936922f9acf4a9
Author: Rodrigo Barbieri <email address hidden>
Date: Wed May 8 16:01:25 2019 -0300

    Add functional confirm_migration_error test

    This test checks if allocations have been
    successfully cleaned up upon the driver failing
    during "confirm_migration".

    Change-Id: I9d6478f492351b58aa87b8f56e907ee633d6d1c6
    Related-bug: #1821594

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/658929
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d7d7f115430c7ffeb88ec9dcd155ac69b29d7513
Submitter: Zuul
Branch: stable/stein

commit d7d7f115430c7ffeb88ec9dcd155ac69b29d7513
Author: Rodrigo Barbieri <email address hidden>
Date: Wed May 8 16:01:25 2019 -0300

    Add functional confirm_migration_error test

    This test checks if allocations have been
    successfully cleaned up upon the driver failing
    during "confirm_migration".

    Change-Id: I9d6478f492351b58aa87b8f56e907ee633d6d1c6
    Related-bug: #1821594
    (cherry picked from commit 873ac499c50125adc2fb49728d936922f9acf4a9)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/659338

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/652150
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c6a42cd3b38528bbf410b33c3bfc384600ae053f
Submitter: Zuul
Branch: stable/queens

commit c6a42cd3b38528bbf410b33c3bfc384600ae053f
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 13:16:42 2019 -0400

    Error out migration when confirm_resize fails

    If anything fails and raises an exception during
    confirm_resize, the migration status is stuck in
    "confirming" status even though the instance status
    may be "ERROR".

    This change adds the errors_out_migration decorator
    to the confirm_resize method to make sure the migration
    status is "error" if an error is raised.

    In bug 1821594 it was the driver.confirm_migration
    method that raised some exception, so a unit test is
    added here which simulates a similar scenario.

    This only partially closes the bug because we are still
    leaking allocations on the source node resource provider
    since _delete_allocation_after_move is not called. That
    will be dealt with in a separate patch.

    Conflicts:
          nova/tests/unit/compute/test_compute_mgr.py

    NOTE(mriedem): The conflict is due to not having change
    Ia05525058e47efb806cf8820410c8bc80eccca25 in Queens.

    Change-Id: Ic7d78ad43a2bad7f932c22c98944accbbed9e9e2
    Partial-Bug: #1821594
    (cherry picked from commit 408ef8f84a698f764aa5d769d6d01fd9340de2e5)
    (cherry picked from commit 972d4e0eb391e83fe8d3020ff95db0e6a840a224)
    (cherry picked from commit e3f69c8af0d13f0aa60e9a267f25050729f7766c)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/661349

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/659338
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dac3239e92fc1865cacc17bbfbd2316072a9d26e
Submitter: Zuul
Branch: stable/rocky

commit dac3239e92fc1865cacc17bbfbd2316072a9d26e
Author: Matt Riedemann <email address hidden>
Date: Wed May 15 12:27:17 2019 -0400

    [stable-only] Delete allocations even if _confirm_resize raises (part 2)

    The backport 37ac54a42ec91821d62864d63c486c002608491b to fix
    bug 1821594 did not account for how the _delete_allocation_after_move
    method before Stein is tightly coupled to the migration status
    being set to "confirmed" which is what the _confirm_resize method
    does after self.driver.confirm_migration returns.

    However, if self.driver.confirm_migration raises an exception
    we still want to cleanup the allocations held on the source node
    and for that we call _delete_allocation_after_move. But because
    of that tight coupling before Stein, we need to temporarily
    mutate the migration status to "confirmed" to get the cleanup
    method to do what we want.

    This isn't a problem starting in Stein because change
    I0851e2d54a1fdc82fe3291fb7e286e790f121e92 removed that
    tight coupling on the migration status, so this is a stable
    branch only change.

    Note that we don't call self.reportclient.delete_allocation_for_instance
    directly since before Stein we still need to account for a
    migration that does not move the source node allocations to the
    migration record, and that logic is in _delete_allocation_after_move.

    A simple unit test assertion is added here but the functional
    test added in change I9d6478f492351b58aa87b8f56e907ee633d6d1c6
    will assert the bug is fixed properly before Stein.

    Change-Id: I933687891abef4878de09481937d576ce5899511
    Closes-Bug: #1821594

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/658834
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=01bfb7863c80c43538632952ec9f1728f0b412d6
Submitter: Zuul
Branch: stable/rocky

commit 01bfb7863c80c43538632952ec9f1728f0b412d6
Author: Rodrigo Barbieri <email address hidden>
Date: Wed May 8 16:01:25 2019 -0300

    Add functional confirm_migration_error test

    This test checks if allocations have been
    successfully cleaned up upon the driver failing
    during "confirm_migration".

    This backport is not clean due to change
    If6aa37d9b6b48791e070799ab026c816fda4441c, which
    refactored the testing framework. Within the refactor,
    new assertion methods were added and method
    "assertFlavorMatchesAllocation" was modified.
    This backport needed to be adapted in order to
    be compatible with the testing framework prior to
    If6aa37d9b6b48791e070799ab026c816fda4441c.

    Change-Id: I9d6478f492351b58aa87b8f56e907ee633d6d1c6
    Related-bug: #1821594
    (cherry picked from commit 873ac499c50125adc2fb49728d936922f9acf4a9)
    (cherry picked from commit d7d7f115430c7ffeb88ec9dcd155ac69b29d7513)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/652153
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b8435d6001c30e67ac277e414b11708d4d660555
Submitter: Zuul
Branch: stable/queens

commit b8435d6001c30e67ac277e414b11708d4d660555
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 14:02:17 2019 -0400

    Delete allocations even if _confirm_resize raises

    When we are confirming a resize, the guest is on the dest
    host and the instance host/node values in the database
    are pointing at the dest host, so the _confirm_resize method
    on the source is really best effort. If something fails, we
    should not leak allocations in placement for the source compute
    node resource provider since the instance is not actually
    consuming the source node provider resources.

    This change refactors the error handling around the _confirm_resize
    call so the big nesting for _error_out_instance_on_exception is
    moved to confirm_resize and then a try/finally is added around
    _confirm_resize so we can be sure to try and cleanup the allocations
    even if _confirm_resize fails in some obscure way. If _confirm_resize
    does fail, the error gets re-raised along with logging a traceback
    and hint about how to correct the instance state in the DB by hard
    rebooting the server on the dest host.

    Change-Id: I29c5f491ec20a71283190a1599e7732541de736f
    Closes-Bug: #1821594
    (cherry picked from commit 03a6d26691c1f182224d59190b79f48df278099e)
    (cherry picked from commit 5f515060f6c79f113f7b8107596e41056445c79f)
    (cherry picked from commit 37ac54a42ec91821d62864d63c486c002608491b)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/661349
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5600309a1fc38f0f98629608d3f41eacf66d714a
Submitter: Zuul
Branch: stable/queens

commit 5600309a1fc38f0f98629608d3f41eacf66d714a
Author: Matt Riedemann <email address hidden>
Date: Wed May 15 12:27:17 2019 -0400

    [stable-only] Delete allocations even if _confirm_resize raises (part 2)

    The backport https://review.opendev.org/#/c/652153/ to fix
    bug 1821594 did not account for how the _delete_allocation_after_move
    method before Stein is tightly coupled to the migration status
    being set to "confirmed" which is what the _confirm_resize method
    does after self.driver.confirm_migration returns.

    However, if self.driver.confirm_migration raises an exception
    we still want to cleanup the allocations held on the source node
    and for that we call _delete_allocation_after_move. But because
    of that tight coupling before Stein, we need to temporarily
    mutate the migration status to "confirmed" to get the cleanup
    method to do what we want.

    This isn't a problem starting in Stein because change
    I0851e2d54a1fdc82fe3291fb7e286e790f121e92 removed that
    tight coupling on the migration status, so this is a stable
    branch only change.

    Note that we don't call self.reportclient.delete_allocation_for_instance
    directly since before Stein we still need to account for a
    migration that does not move the source node allocations to the
    migration record, and that logic is in _delete_allocation_after_move.

    A simple unit test assertion is added here but the functional
    test added in change I9d6478f492351b58aa87b8f56e907ee633d6d1c6
    will assert the bug is fixed properly before Stein.

    Change-Id: I933687891abef4878de09481937d576ce5899511
    Closes-Bug: #1821594
    (cherry picked from commit dac3239e92fc1865cacc17bbfbd2316072a9d26e)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/658136
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=944b24f462f8785a5c039efb74eec5310ef7729e
Submitter: Zuul
Branch: stable/queens

commit 944b24f462f8785a5c039efb74eec5310ef7729e
Author: Rodrigo Barbieri <email address hidden>
Date: Wed May 8 16:01:25 2019 -0300

    Add functional confirm_migration_error test

    This test checks if allocations have been
    successfully cleaned up upon the driver failing
    during "confirm_migration".

    NOTE(mriedem): The _wait_until_deleted helper method
    is modified here to include the changes from patch
    I4ced19bd9259f0b5a50b89dd5908abe35ca73894 in Rocky
    otherwise the test fails.

    Change-Id: I9d6478f492351b58aa87b8f56e907ee633d6d1c6
    Related-bug: #1821594
    (cherry picked from commit 873ac499c50125adc2fb49728d936922f9acf4a9)
    (cherry picked from commit d7d7f115430c7ffeb88ec9dcd155ac69b29d7513)
    (cherry picked from commit 01bfb7863c80c43538632952ec9f1728f0b412d6)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.1

This issue was fixed in the openstack/nova 19.0.1 release.

description: updated
description: updated
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This will be fixed in the 19.0.1 stein stable point release via https://bugs.launchpad.net/cloud-archive/+bug/1831754.

tags: added: sts
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Fyi I'm currently working on a train snapshot (and blocked atm) which is needed prior to SRU of any other releases.

tags: added: sts-sru-needed
description: updated
summary: - Error in confirm_migration leaves stale allocations and 'confirming'
- migration state
+ [SRU] Error in confirm_migration leaves stale allocations and
+ 'confirming' migration state
Changed in nova (Ubuntu Eoan):
status: New → Fix Committed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Rodrigo, thank you for the patches and debdiffs. I've uploaded your changes to cosmic and bionic unapproved queues where they are awaiting review from the SRU team.
https://launchpad.net/ubuntu/cosmic/+queue?queue_state=1&queue_text=nova
https://launchpad.net/ubuntu/bionic/+queue?queue_state=1&queue_text=nova

Changed in nova (Ubuntu Disco):
importance: Undecided → Medium
status: New → Triaged
Changed in nova (Ubuntu Cosmic):
importance: Undecided → Medium
status: New → Triaged
Changed in nova (Ubuntu Bionic):
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.2.1

This issue was fixed in the openstack/nova 18.2.1 release.

Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Rodrigo, or anyone else affected,

Accepted nova into cosmic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/2:18.2.0-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-cosmic to verification-done-cosmic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-cosmic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in nova (Ubuntu Cosmic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-cosmic
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Rodrigo, or anyone else affected,

Accepted nova into rocky-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:rocky-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-rocky-needed to verification-rocky-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-rocky-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-rocky-needed
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Improved test case with more detailed steps in order to make it easier to script and validate the output

description: updated
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Cosmic Rocky verified using [Test Case]. Attached output.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Bionic Rocky verified using [Test Case]. Attached output.

tags: added: verification-done-cosmic
removed: verification-needed-cosmic
tags: added: verification-rocky-done
removed: verification-rocky-needed
tags: added: verification-done
removed: verification-needed
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2:18.2.0-0ubuntu2

---------------
nova (2:18.2.0-0ubuntu2) cosmic; urgency=medium

  * Cherry-picked from upstream to ensure no stale
    allocations are left over on failed cold
    migrations (LP: #1821594).
    - d/p/bug_1821594_1.patch: Fix migration record status
    - d/p/bug_1821594_2.patch: Delete failed allocation part 1
    - d/p/bug_1821594_3.patch: Delete failed allocation part 2
    - d/p/bug_1821594_4.patch: New functional test

nova (2:18.2.0-0ubuntu1) cosmic; urgency=medium

  [Sahid Orentino Ferdjaoui]
  * New stable point release for OpenStack Rocky (LP: #1830695).
  * d/p/ensure-rbd-auth-fallback-uses-matching-credentials.patch:
    Dropped. Fixed upstream in 18.2.0.

  [Corey Bryant]
  * d/p/skip-openssl-1.1.1-tests.patch: Dropped as this is now properly fixed
    by xenapi-agent-change-openssl-error-handling.patch.

 -- Rodrigo Barbieri <email address hidden> Mon, 10 Jun 2019 12:36:21 +0000

Changed in nova (Ubuntu Cosmic):
status: Fix Committed → Fix Released
Revision history for this message
James Page (james-page) wrote :

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
James Page (james-page) wrote :

This bug was fixed in the package nova - 2:18.2.0-0ubuntu2~cloud0
---------------

 nova (2:18.2.0-0ubuntu2~cloud0) bionic-rocky; urgency=medium
 .
   * New upstream release for the Ubuntu Cloud Archive.
 .
 nova (2:18.2.0-0ubuntu2) cosmic; urgency=medium
 .
   * Cherry-picked from upstream to ensure no stale
     allocations are left over on failed cold
     migrations (LP: #1821594).
     - d/p/bug_1821594_1.patch: Fix migration record status
     - d/p/bug_1821594_2.patch: Delete failed allocation part 1
     - d/p/bug_1821594_3.patch: Delete failed allocation part 2
     - d/p/bug_1821594_4.patch: New functional test
 .
 nova (2:18.2.0-0ubuntu1) cosmic; urgency=medium
 .
   [Sahid Orentino Ferdjaoui]
   * New stable point release for OpenStack Rocky (LP: #1830695).
   * d/p/ensure-rbd-auth-fallback-uses-matching-credentials.patch:
     Dropped. Fixed upstream in 18.2.0.
 .
   [Corey Bryant]
   * d/p/skip-openssl-1.1.1-tests.patch: Dropped as this is now properly fixed
     by xenapi-agent-change-openssl-error-handling.patch.

Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Rodrigo, or anyone else affected,

Accepted nova into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/2:17.0.10-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in nova (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-bionic
removed: verification-done
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Bionic Queens verified using [Test Case]. Attached output above ^.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Marking Disco/Stein as Fix Committed since it is the same in bug 1831754

Changed in nova (Ubuntu Disco):
status: Triaged → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.11

This issue was fixed in the openstack/nova 17.0.11 release.

Revision history for this message
Corey Bryant (corey.bryant) wrote : Please test proposed package

Hello Rodrigo, or anyone else affected,

Accepted nova into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Xenial Queens verified using [Test Case]. Attached output.

tags: added: verification-queens-done
removed: verification-queens-needed
Changed in nova (Ubuntu Disco):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2:17.0.10-0ubuntu2

---------------
nova (2:17.0.10-0ubuntu2) bionic; urgency=medium

  * Cherry-picked from upstream to ensure no stale
    allocations are left over on failed cold
    migrations (LP: #1821594).
    - d/p/bug_1821594_1.patch: Fix migration record status
    - d/p/bug_1821594_2.patch: Delete failed allocation part 1
    - d/p/bug_1821594_3.patch: Delete failed allocation part 2
    - d/p/bug_1821594_4.patch: New functional test

nova (2:17.0.10-0ubuntu1) bionic; urgency=medium

  * New stable point release for OpenStack Queens (LP: #1830341).
  * d/p/ensure-rbd-auth-fallback-uses-matching-credentials.patch:
    Dropped. Fixed upstream in 17.0.10.

 -- Rodrigo Barbieri <email address hidden> Mon, 10 Jun 2019 13:20:41 +0000

Changed in nova (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This was fixed in queens nova package version 2:17.0.10-0ubuntu2~cloud0

Changed in nova (Ubuntu Eoan):
status: Fix Committed → Fix Released
tags: added: sts-sru-done
removed: sts-sru-needed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.