Bug #1821594 “[SRU] Error in confirm_migration leaves stale allo...” : Series pike : Bugs : OpenStack Compute (nova)

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-03-25:

#1

IRC discussion here:

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-03-25.log.html#t2019-03-25T13:07:57

Setting the migration record status to error is easy, so I'll start with that. Refactoring the code to handle errors but still cleanup allocations is a bit more complicated, so that can be split into a separate change to close the bug.

Also note related bug 1793569 with some scripts for cleaning up orphaned allocations.

Changed in nova:
status:	New → Triaged
importance:	Undecided → Medium
tags:	added: compute placement resize

Matt Riedemann (mriedem) on 2019-03-25

Changed in nova:
assignee:	nobody → Matt Riedemann (mriedem)

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-03-25:

#2

This goes back to Pike as noted in the bug description. Before Pike the ResourceTracker.update_available_resource code would at least correct the allocations based on the instance.flavor and instance.host.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-03-25: Fix proposed to nova (master)

#3

Fix proposed to branch: master
Review: https://review.openstack.org/647546

Changed in nova:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-03-25:

#4

Fix proposed to branch: master
Review: https://review.openstack.org/647566

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-03-26: Related fix proposed to nova (master)

#5

Related fix proposed to branch: master
Review: https://review.openstack.org/647741

OpenStack Infra (hudson-openstack) on 2019-03-28

Changed in nova:
assignee:	Matt Riedemann (mriedem) → Eric Fried (efried)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-03-29: Fix merged to nova (master)

#6

Reviewed: https://review.openstack.org/647546
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=408ef8f84a698f764aa5d769d6d01fd9340de2e5
Submitter: Zuul
Branch: master

commit 408ef8f84a698f764aa5d769d6d01fd9340de2e5
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 13:16:42 2019 -0400

Error out migration when confirm_resize fails

    If anything fails and raises an exception during
    confirm_resize, the migration status is stuck in
    "confirming" status even though the instance status
    may be "ERROR".

    This change adds the errors_out_migration decorator
    to the confirm_resize method to make sure the migration
    status is "error" if an error is raised.

    In bug 1821594 it was the driver.confirm_migration
    method that raised some exception, so a unit test is
    added here which simulates a similar scenario.

    This only partially closes the bug because we are still
    leaking allocations on the source node resource provider
    since _delete_allocation_after_move is not called. That
    will be dealt with in a separate patch.

Change-Id: Ic7d78ad43a2bad7f932c22c98944accbbed9e9e2
Partial-Bug: #1821594

Matt Riedemann (mriedem) on 2019-03-29

Changed in nova:
assignee:	Eric Fried (efried) → Matt Riedemann (mriedem)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-02: Fix proposed to nova (stable/stein)

#7

Fix proposed to branch: stable/stein
Review: https://review.openstack.org/649421

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-05: Fix merged to nova (master)

#8

Reviewed: https://review.openstack.org/647566
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=03a6d26691c1f182224d59190b79f48df278099e
Submitter: Zuul
Branch: master

commit 03a6d26691c1f182224d59190b79f48df278099e
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 14:02:17 2019 -0400

Delete allocations even if _confirm_resize raises

    When we are confirming a resize, the guest is on the dest
    host and the instance host/node values in the database
    are pointing at the dest host, so the _confirm_resize method
    on the source is really best effort. If something fails, we
    should not leak allocations in placement for the source compute
    node resource provider since the instance is not actually
    consuming the source node provider resources.

    This change refactors the error handling around the _confirm_resize
    call so the big nesting for _error_out_instance_on_exception is
    moved to confirm_resize and then a try/finally is added around
    _confirm_resize so we can be sure to try and cleanup the allocations
    even if _confirm_resize fails in some obscure way. If _confirm_resize
    does fail, the error gets re-raised along with logging a traceback
    and hint about how to correct the instance state in the DB by hard
    rebooting the server on the dest host.

Change-Id: I29c5f491ec20a71283190a1599e7732541de736f
Closes-Bug: #1821594

Changed in nova:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-05: Related fix merged to nova (master)

#9

Reviewed: https://review.openstack.org/647741
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4d0aab16ecac5128a20ab9229d444bc49a9c722b
Submitter: Zuul
Branch: master

commit 4d0aab16ecac5128a20ab9229d444bc49a9c722b
Author: Matt Riedemann <email address hidden>
Date: Tue Mar 26 09:55:23 2019 -0400

api-ref: add more details to confirmResize troubleshooting

This does a few things:

    1. Fixes the "migration_status" wording since that does not
       exist and could be confused as a field on the server
       resource which it is not, it is referring to the migration
       resource status.

    2. Fixes the RESIZED status mention since that is not a real
       server status, it probably meant VERIFY_RESIZE (RESIZED is
       the name of the nova.compute.vm_states variable used in the
       code to set the VERIFY_RESIZE status in the API).

    3. Adds words about options to correct the server status from
       ERROR after confirmResize fails, the most obvious being
       an admin resetting the server status to ACTIVE or the user
       hard rebooting the server.

Change-Id: I7de257ad78031d137616d724bee3fa1cbf40d6fa
Related-Bug: #1821594

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-05: Fix proposed to nova (stable/stein)

#10

Fix proposed to branch: stable/stein
Review: https://review.openstack.org/650437

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-12: Fix proposed to nova (stable/rocky)

#11

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/652127

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-12:

#12

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/652146

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-12: Fix proposed to nova (stable/queens)

#13

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/652150

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-12:

#14

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/652153

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-04-12: Re: Error in confirm_migration leaves stale allocations and 'confirming' migration state

#15

I'm not sure how practical it will be to try and cleanup the leaked allocation in stable/pike because there was a pretty decent sized refactor of the allocation cleanup code in Queens:

https://review.openstack.org/#/c/498947/

Which would mean the backport to Pike would definitely not be clean. So I'll probably not put effort into backporting I29c5f491ec20a71283190a1599e7732541de736f to Pike but I can still fix the migration status getting stuck in 'confirming' issue with change Ic7d78ad43a2bad7f932c22c98944accbbed9e9e2.

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-04-12:

#16

Oh furthermore we'd need https://review.openstack.org/#/c/479802/ (or a part of it) in Pike for Ic7d78ad43a2bad7f932c22c98944accbbed9e9e2, so I probably won't pursue that either for now.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-12: Fix merged to nova (stable/stein)

#17

Reviewed: https://review.openstack.org/649421
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=972d4e0eb391e83fe8d3020ff95db0e6a840a224
Submitter: Zuul
Branch: stable/stein

commit 972d4e0eb391e83fe8d3020ff95db0e6a840a224
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 13:16:42 2019 -0400

Error out migration when confirm_resize fails

    If anything fails and raises an exception during
    confirm_resize, the migration status is stuck in
    "confirming" status even though the instance status
    may be "ERROR".

    This change adds the errors_out_migration decorator
    to the confirm_resize method to make sure the migration
    status is "error" if an error is raised.

    In bug 1821594 it was the driver.confirm_migration
    method that raised some exception, so a unit test is
    added here which simulates a similar scenario.

    This only partially closes the bug because we are still
    leaking allocations on the source node resource provider
    since _delete_allocation_after_move is not called. That
    will be dealt with in a separate patch.

    Change-Id: Ic7d78ad43a2bad7f932c22c98944accbbed9e9e2
    Partial-Bug: #1821594
    (cherry picked from commit 408ef8f84a698f764aa5d769d6d01fd9340de2e5)

tags:

added: in-stable-stein

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-12:

#18

Reviewed: https://review.openstack.org/650437
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5f515060f6c79f113f7b8107596e41056445c79f
Submitter: Zuul
Branch: stable/stein

commit 5f515060f6c79f113f7b8107596e41056445c79f
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 14:02:17 2019 -0400

Delete allocations even if _confirm_resize raises

    When we are confirming a resize, the guest is on the dest
    host and the instance host/node values in the database
    are pointing at the dest host, so the _confirm_resize method
    on the source is really best effort. If something fails, we
    should not leak allocations in placement for the source compute
    node resource provider since the instance is not actually
    consuming the source node provider resources.

    This change refactors the error handling around the _confirm_resize
    call so the big nesting for _error_out_instance_on_exception is
    moved to confirm_resize and then a try/finally is added around
    _confirm_resize so we can be sure to try and cleanup the allocations
    even if _confirm_resize fails in some obscure way. If _confirm_resize
    does fail, the error gets re-raised along with logging a traceback
    and hint about how to correct the instance state in the DB by hard
    rebooting the server on the dest host.

Conflicts:
nova/compute/manager.py

NOTE(mriedem): The conflict is due to not having change
I49034f87e240775cca7a4d769819dd788b86832a in Stein.

    Change-Id: I29c5f491ec20a71283190a1599e7732541de736f
    Closes-Bug: #1821594
    (cherry picked from commit 03a6d26691c1f182224d59190b79f48df278099e)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-16: Fix merged to nova (stable/rocky)

#19

Reviewed: https://review.openstack.org/652127
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2a25d1e48bce34c7dd0d5783908fdb0dc1b19113
Submitter: Zuul
Branch: stable/rocky

commit 2a25d1e48bce34c7dd0d5783908fdb0dc1b19113
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 13:16:42 2019 -0400

Error out migration when confirm_resize fails

    If anything fails and raises an exception during
    confirm_resize, the migration status is stuck in
    "confirming" status even though the instance status
    may be "ERROR".

    This change adds the errors_out_migration decorator
    to the confirm_resize method to make sure the migration
    status is "error" if an error is raised.

    In bug 1821594 it was the driver.confirm_migration
    method that raised some exception, so a unit test is
    added here which simulates a similar scenario.

    This only partially closes the bug because we are still
    leaking allocations on the source node resource provider
    since _delete_allocation_after_move is not called. That
    will be dealt with in a separate patch.

    Conflicts:
          nova/tests/functional/test_servers.py
          nova/tests/unit/compute/test_compute_mgr.py

    NOTE(mriedem): The functional test conflict is due to not
    having change I99427a52676826990d2a2ffc82cf30ad945b939c
    in Rocky. The unit test conflict is due to not having
    change I0851e2d54a1fdc82fe3291fb7e286e790f121e92 in Rocky.
    The source_node attribute on the fake Migration object in
    the unit test is added here because change
    I312d61383345ea0ac1ab0c277b4c468e6aa94656 is not in Rocky.

    Change-Id: Ic7d78ad43a2bad7f932c22c98944accbbed9e9e2
    Partial-Bug: #1821594
    (cherry picked from commit 408ef8f84a698f764aa5d769d6d01fd9340de2e5)
    (cherry picked from commit 972d4e0eb391e83fe8d3020ff95db0e6a840a224)

tags:

added: in-stable-rocky

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-04-16:

#20

Reviewed: https://review.openstack.org/652146
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=37ac54a42ec91821d62864d63c486c002608491b
Submitter: Zuul
Branch: stable/rocky

commit 37ac54a42ec91821d62864d63c486c002608491b
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 14:02:17 2019 -0400

Delete allocations even if _confirm_resize raises

    When we are confirming a resize, the guest is on the dest
    host and the instance host/node values in the database
    are pointing at the dest host, so the _confirm_resize method
    on the source is really best effort. If something fails, we
    should not leak allocations in placement for the source compute
    node resource provider since the instance is not actually
    consuming the source node provider resources.

    This change refactors the error handling around the _confirm_resize
    call so the big nesting for _error_out_instance_on_exception is
    moved to confirm_resize and then a try/finally is added around
    _confirm_resize so we can be sure to try and cleanup the allocations
    even if _confirm_resize fails in some obscure way. If _confirm_resize
    does fail, the error gets re-raised along with logging a traceback
    and hint about how to correct the instance state in the DB by hard
    rebooting the server on the dest host.

    Conflicts:
          nova/compute/manager.py
          nova/tests/unit/compute/test_compute_mgr.py

    NOTE(mriedem): The manager.py conflict is due to not having change
    Ibb8c12fb2799bb5ceb9e3d72a2b86dbb4f14451e in Rocky. In addition,
    since change I0851e2d54a1fdc82fe3291fb7e286e790f121e92 is not in
    Rocky the _delete_allocation_after_move signature needs to be
    handled a bit different in this backport. The test_compute_mgr.py
    conflict and source_node addition in the setUp is due to the same
    two changes.

    Change-Id: I29c5f491ec20a71283190a1599e7732541de736f
    Closes-Bug: #1821594
    (cherry picked from commit 03a6d26691c1f182224d59190b79f48df278099e)
    (cherry picked from commit 5f515060f6c79f113f7b8107596e41056445c79f)

Reviewed:  https://review.openstack.org/652146
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=37ac54a42ec91821d62864d63c486c002608491b
Submitter: Zuul
Branch:    stable/rocky

commit 37ac54a42ec91821d62864d63c486c002608491b
Author: Matt Riedemann <mriedem.os@gmail.com>
Date:   Mon Mar 25 14:02:17 2019 -0400

Delete allocations even if _confirm_resize raises
    
    When we are confirming a resize, the guest is on the dest
    host and the instance host/node values in the database
    are pointing at the dest host, so the _confirm_resize method
    on the source is really best effort. If something fails, we
    should not leak allocations in placement for the source compute
    node resource provider since the instance is not actually
    consuming the source node provider resources.
    
    This change refactors the error handling around the _confirm_resize
    call so the big nesting for _error_out_instance_on_exception is
    moved to confirm_resize and then a try/finally is added around
    _confirm_resize so we can be sure to try and cleanup the allocations
    even if _confirm_resize fails in some obscure way. If _confirm_resize
    does fail, the error gets re-raised along with logging a traceback
    and hint about how to correct the instance state in the DB by hard
    rebooting the server on the dest host.
    
    Conflicts:
          nova/compute/manager.py
          nova/tests/unit/compute/test_compute_mgr.py
    
    NOTE(mriedem): The manager.py conflict is due to not having change
    Ibb8c12fb2799bb5ceb9e3d72a2b86dbb4f14451e in Rocky. In addition,
    since change I0851e2d54a1fdc82fe3291fb7e286e790f121e92 is not in
    Rocky the _delete_allocation_after_move signature needs to be
    handled a bit different in this backport. The test_compute_mgr.py
    conflict and source_node addition in the setUp is due to the same
    two changes.
    
    Change-Id: I29c5f491ec20a71283190a1599e7732541de736f
    Closes-Bug: #1821594
    (cherry picked from commit 03a6d26691c1f182224d59190b79f48df278099e)
    (cherry picked from commit 5f515060f6c79f113f7b8107596e41056445c79f)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-08: Related fix proposed to nova (master)

#21

Related fix proposed to branch: master
Review: https://review.opendev.org/657870

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-05-13: Re: Error in confirm_migration leaves stale allocations and 'confirming' migration state

#22

The bug is unfortunately not fixed in rocky as shown in this recreate test:

https://review.opendev.org/#/c/658834/

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-13: Related fix proposed to nova (stable/stein)

#23

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/658929

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-14: Related fix merged to nova (master)

#24

Reviewed: https://review.opendev.org/657870
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=873ac499c50125adc2fb49728d936922f9acf4a9
Submitter: Zuul
Branch: master

commit 873ac499c50125adc2fb49728d936922f9acf4a9
Author: Rodrigo Barbieri <email address hidden>
Date: Wed May 8 16:01:25 2019 -0300

Add functional confirm_migration_error test

    This test checks if allocations have been
    successfully cleaned up upon the driver failing
    during "confirm_migration".

Change-Id: I9d6478f492351b58aa87b8f56e907ee633d6d1c6
Related-bug: #1821594

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-14: Related fix merged to nova (stable/stein)

#25

Reviewed: https://review.opendev.org/658929
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d7d7f115430c7ffeb88ec9dcd155ac69b29d7513
Submitter: Zuul
Branch: stable/stein

commit d7d7f115430c7ffeb88ec9dcd155ac69b29d7513
Author: Rodrigo Barbieri <email address hidden>
Date: Wed May 8 16:01:25 2019 -0300

Add functional confirm_migration_error test

    This test checks if allocations have been
    successfully cleaned up upon the driver failing
    during "confirm_migration".

    Change-Id: I9d6478f492351b58aa87b8f56e907ee633d6d1c6
    Related-bug: #1821594
    (cherry picked from commit 873ac499c50125adc2fb49728d936922f9acf4a9)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-15: Fix proposed to nova (stable/rocky)

#26

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/659338

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-24: Fix merged to nova (stable/queens)

#27

Reviewed: https://review.opendev.org/652150
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c6a42cd3b38528bbf410b33c3bfc384600ae053f
Submitter: Zuul
Branch: stable/queens

commit c6a42cd3b38528bbf410b33c3bfc384600ae053f
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 13:16:42 2019 -0400

Error out migration when confirm_resize fails

    If anything fails and raises an exception during
    confirm_resize, the migration status is stuck in
    "confirming" status even though the instance status
    may be "ERROR".

    This change adds the errors_out_migration decorator
    to the confirm_resize method to make sure the migration
    status is "error" if an error is raised.

    In bug 1821594 it was the driver.confirm_migration
    method that raised some exception, so a unit test is
    added here which simulates a similar scenario.

    This only partially closes the bug because we are still
    leaking allocations on the source node resource provider
    since _delete_allocation_after_move is not called. That
    will be dealt with in a separate patch.

Conflicts:
nova/tests/unit/compute/test_compute_mgr.py

NOTE(mriedem): The conflict is due to not having change
Ia05525058e47efb806cf8820410c8bc80eccca25 in Queens.

    Change-Id: Ic7d78ad43a2bad7f932c22c98944accbbed9e9e2
    Partial-Bug: #1821594
    (cherry picked from commit 408ef8f84a698f764aa5d769d6d01fd9340de2e5)
    (cherry picked from commit 972d4e0eb391e83fe8d3020ff95db0e6a840a224)
    (cherry picked from commit e3f69c8af0d13f0aa60e9a267f25050729f7766c)

tags:

added: in-stable-queens

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-24: Fix proposed to nova (stable/queens)

#28

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/661349

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-24: Fix merged to nova (stable/rocky)

#29

Reviewed: https://review.opendev.org/659338
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dac3239e92fc1865cacc17bbfbd2316072a9d26e
Submitter: Zuul
Branch: stable/rocky

commit dac3239e92fc1865cacc17bbfbd2316072a9d26e
Author: Matt Riedemann <email address hidden>
Date: Wed May 15 12:27:17 2019 -0400

[stable-only] Delete allocations even if _confirm_resize raises (part 2)

    The backport 37ac54a42ec91821d62864d63c486c002608491b to fix
    bug 1821594 did not account for how the _delete_allocation_after_move
    method before Stein is tightly coupled to the migration status
    being set to "confirmed" which is what the _confirm_resize method
    does after self.driver.confirm_migration returns.

    However, if self.driver.confirm_migration raises an exception
    we still want to cleanup the allocations held on the source node
    and for that we call _delete_allocation_after_move. But because
    of that tight coupling before Stein, we need to temporarily
    mutate the migration status to "confirmed" to get the cleanup
    method to do what we want.

    This isn't a problem starting in Stein because change
    I0851e2d54a1fdc82fe3291fb7e286e790f121e92 removed that
    tight coupling on the migration status, so this is a stable
    branch only change.

    Note that we don't call self.reportclient.delete_allocation_for_instance
    directly since before Stein we still need to account for a
    migration that does not move the source node allocations to the
    migration record, and that logic is in _delete_allocation_after_move.

    A simple unit test assertion is added here but the functional
    test added in change I9d6478f492351b58aa87b8f56e907ee633d6d1c6
    will assert the bug is fixed properly before Stein.

Change-Id: I933687891abef4878de09481937d576ce5899511
Closes-Bug: #1821594

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-24: Related fix merged to nova (stable/rocky)

#30

Reviewed: https://review.opendev.org/658834
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=01bfb7863c80c43538632952ec9f1728f0b412d6
Submitter: Zuul
Branch: stable/rocky

commit 01bfb7863c80c43538632952ec9f1728f0b412d6
Author: Rodrigo Barbieri <email address hidden>
Date: Wed May 8 16:01:25 2019 -0300

Add functional confirm_migration_error test

    This test checks if allocations have been
    successfully cleaned up upon the driver failing
    during "confirm_migration".

    This backport is not clean due to change
    If6aa37d9b6b48791e070799ab026c816fda4441c, which
    refactored the testing framework. Within the refactor,
    new assertion methods were added and method
    "assertFlavorMatchesAllocation" was modified.
    This backport needed to be adapted in order to
    be compatible with the testing framework prior to
    If6aa37d9b6b48791e070799ab026c816fda4441c.

    Change-Id: I9d6478f492351b58aa87b8f56e907ee633d6d1c6
    Related-bug: #1821594
    (cherry picked from commit 873ac499c50125adc2fb49728d936922f9acf4a9)
    (cherry picked from commit d7d7f115430c7ffeb88ec9dcd155ac69b29d7513)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-30: Fix merged to nova (stable/queens)

#31

Reviewed: https://review.opendev.org/652153
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b8435d6001c30e67ac277e414b11708d4d660555
Submitter: Zuul
Branch: stable/queens

commit b8435d6001c30e67ac277e414b11708d4d660555
Author: Matt Riedemann <email address hidden>
Date: Mon Mar 25 14:02:17 2019 -0400

Delete allocations even if _confirm_resize raises

    When we are confirming a resize, the guest is on the dest
    host and the instance host/node values in the database
    are pointing at the dest host, so the _confirm_resize method
    on the source is really best effort. If something fails, we
    should not leak allocations in placement for the source compute
    node resource provider since the instance is not actually
    consuming the source node provider resources.

    This change refactors the error handling around the _confirm_resize
    call so the big nesting for _error_out_instance_on_exception is
    moved to confirm_resize and then a try/finally is added around
    _confirm_resize so we can be sure to try and cleanup the allocations
    even if _confirm_resize fails in some obscure way. If _confirm_resize
    does fail, the error gets re-raised along with logging a traceback
    and hint about how to correct the instance state in the DB by hard
    rebooting the server on the dest host.

    Change-Id: I29c5f491ec20a71283190a1599e7732541de736f
    Closes-Bug: #1821594
    (cherry picked from commit 03a6d26691c1f182224d59190b79f48df278099e)
    (cherry picked from commit 5f515060f6c79f113f7b8107596e41056445c79f)
    (cherry picked from commit 37ac54a42ec91821d62864d63c486c002608491b)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-30:

#32

Reviewed: https://review.opendev.org/661349
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5600309a1fc38f0f98629608d3f41eacf66d714a
Submitter: Zuul
Branch: stable/queens

commit 5600309a1fc38f0f98629608d3f41eacf66d714a
Author: Matt Riedemann <email address hidden>
Date: Wed May 15 12:27:17 2019 -0400

[stable-only] Delete allocations even if _confirm_resize raises (part 2)

    The backport https://review.opendev.org/#/c/652153/ to fix
    bug 1821594 did not account for how the _delete_allocation_after_move
    method before Stein is tightly coupled to the migration status
    being set to "confirmed" which is what the _confirm_resize method
    does after self.driver.confirm_migration returns.

    However, if self.driver.confirm_migration raises an exception
    we still want to cleanup the allocations held on the source node
    and for that we call _delete_allocation_after_move. But because
    of that tight coupling before Stein, we need to temporarily
    mutate the migration status to "confirmed" to get the cleanup
    method to do what we want.

    This isn't a problem starting in Stein because change
    I0851e2d54a1fdc82fe3291fb7e286e790f121e92 removed that
    tight coupling on the migration status, so this is a stable
    branch only change.

    Note that we don't call self.reportclient.delete_allocation_for_instance
    directly since before Stein we still need to account for a
    migration that does not move the source node allocations to the
    migration record, and that logic is in _delete_allocation_after_move.

    A simple unit test assertion is added here but the functional
    test added in change I9d6478f492351b58aa87b8f56e907ee633d6d1c6
    will assert the bug is fixed properly before Stein.

    Change-Id: I933687891abef4878de09481937d576ce5899511
    Closes-Bug: #1821594
    (cherry picked from commit dac3239e92fc1865cacc17bbfbd2316072a9d26e)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-30: Related fix merged to nova (stable/queens)

#33

Reviewed: https://review.opendev.org/658136
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=944b24f462f8785a5c039efb74eec5310ef7729e
Submitter: Zuul
Branch: stable/queens

commit 944b24f462f8785a5c039efb74eec5310ef7729e
Author: Rodrigo Barbieri <email address hidden>
Date: Wed May 8 16:01:25 2019 -0300

Add functional confirm_migration_error test

    This test checks if allocations have been
    successfully cleaned up upon the driver failing
    during "confirm_migration".

    NOTE(mriedem): The _wait_until_deleted helper method
    is modified here to include the changes from patch
    I4ced19bd9259f0b5a50b89dd5908abe35ca73894 in Rocky
    otherwise the test fails.

    Change-Id: I9d6478f492351b58aa87b8f56e907ee633d6d1c6
    Related-bug: #1821594
    (cherry picked from commit 873ac499c50125adc2fb49728d936922f9acf4a9)
    (cherry picked from commit d7d7f115430c7ffeb88ec9dcd155ac69b29d7513)
    (cherry picked from commit 01bfb7863c80c43538632952ec9f1728f0b412d6)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-06-06: Fix included in openstack/nova 19.0.1

#34

This issue was fixed in the openstack/nova 19.0.1 release.

Rodrigo Barbieri (rodrigo-barbieri2010) on 2019-06-07

description:

updated

Rodrigo Barbieri (rodrigo-barbieri2010) on 2019-06-07

description:

updated

Revision history for this message

Rodrigo Barbieri (rodrigo-barbieri2010) wrote on 2019-06-10:

#37

bug_1821594_cosmic_rocky.debdiff Edit (31.2 KiB, text/plain)

Revision history for this message

Rodrigo Barbieri (rodrigo-barbieri2010) wrote on 2019-06-10:

#38

bug_1821594_bionic_queens.debdiff Edit (31.3 KiB, text/plain)

Revision history for this message

Corey Bryant (corey.bryant) wrote on 2019-06-10:

#39

This will be fixed in the 19.0.1 stein stable point release via https://bugs.launchpad.net/cloud-archive/+bug/1831754.

Rodrigo Barbieri (rodrigo-barbieri2010) on 2019-06-11

tags:

added: sts

Revision history for this message

Corey Bryant (corey.bryant) wrote on 2019-06-11:

#40

Fyi I'm currently working on a train snapshot (and blocked atm) which is needed prior to SRU of any other releases.

Edward Hope-Morley (hopem) on 2019-06-12

tags:

added: sts-sru-needed

Rodrigo Barbieri (rodrigo-barbieri2010) on 2019-06-12

description:

updated

Edward Hope-Morley (hopem) on 2019-06-12

summary:	- Error in confirm_migration leaves stale allocations and 'confirming' - migration state + [SRU] Error in confirm_migration leaves stale allocations and + 'confirming' migration state
Changed in nova (Ubuntu Eoan):
status:	New → Fix Committed

Revision history for this message

Corey Bryant (corey.bryant) wrote on 2019-06-17:

#41

Rodrigo, thank you for the patches and debdiffs. I've uploaded your changes to cosmic and bionic unapproved queues where they are awaiting review from the SRU team.
https://launchpad.net/ubuntu/cosmic/+queue?queue_state=1&queue_text=nova
https://launchpad.net/ubuntu/bionic/+queue?queue_state=1&queue_text=nova

Changed in nova (Ubuntu Disco):
importance:	Undecided → Medium
status:	New → Triaged
Changed in nova (Ubuntu Cosmic):
importance:	Undecided → Medium
status:	New → Triaged
Changed in nova (Ubuntu Bionic):
importance:	Undecided → Medium
status:	New → Triaged

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-06-18: Fix included in openstack/nova 18.2.1

#42

This issue was fixed in the openstack/nova 18.2.1 release.

Revision history for this message

Brian Murray (brian-murray) wrote on 2019-06-18: Please test proposed package

#44

Hello Rodrigo, or anyone else affected,

Accepted nova into cosmic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/2:18.2.0-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-cosmic to verification-done-cosmic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-cosmic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in nova (Ubuntu Cosmic):
status:	Triaged → Fix Committed
tags:	added: verification-needed verification-needed-cosmic

Revision history for this message

Corey Bryant (corey.bryant) wrote on 2019-06-26:

#45

Hello Rodrigo, or anyone else affected,

Accepted nova into rocky-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

sudo add-apt-repository cloud-archive:rocky-proposed
sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-rocky-needed to verification-rocky-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-rocky-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags:

added: verification-rocky-needed

Revision history for this message

Rodrigo Barbieri (rodrigo-barbieri2010) wrote on 2019-06-26:

#46

Improved test case with more detailed steps in order to make it easier to script and validate the output

description:

updated

Revision history for this message

Rodrigo Barbieri (rodrigo-barbieri2010) wrote on 2019-06-27:

#47

cosmic-rocky-validation.txt Edit (30.8 KiB, text/plain)

Cosmic Rocky verified using [Test Case]. Attached output.

Revision history for this message

Rodrigo Barbieri (rodrigo-barbieri2010) wrote on 2019-06-27:

#48

bionic-rocky-validation.txt Edit (32.3 KiB, text/plain)

Bionic Rocky verified using [Test Case]. Attached output.

Rodrigo Barbieri (rodrigo-barbieri2010) on 2019-07-01

tags:

added: verification-done-cosmic
removed: verification-needed-cosmic

Rodrigo Barbieri (rodrigo-barbieri2010) on 2019-07-01

tags:	added: verification-rocky-done removed: verification-rocky-needed
tags:	added: verification-done removed: verification-needed

Revision history for this message

Brian Murray (brian-murray) wrote on 2019-07-02: Update Released

#49

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message

Launchpad Janitor (janitor) wrote on 2019-07-02:

#50

This bug was fixed in the package nova - 2:18.2.0-0ubuntu2

---------------
nova (2:18.2.0-0ubuntu2) cosmic; urgency=medium

  * Cherry-picked from upstream to ensure no stale
    allocations are left over on failed cold
    migrations (LP: #1821594).
    - d/p/bug_1821594_1.patch: Fix migration record status
    - d/p/bug_1821594_2.patch: Delete failed allocation part 1
    - d/p/bug_1821594_3.patch: Delete failed allocation part 2
    - d/p/bug_1821594_4.patch: New functional test

nova (2:18.2.0-0ubuntu1) cosmic; urgency=medium

  [Sahid Orentino Ferdjaoui]
  * New stable point release for OpenStack Rocky (LP: #1830695).
  * d/p/ensure-rbd-auth-fallback-uses-matching-credentials.patch:
    Dropped. Fixed upstream in 18.2.0.

  [Corey Bryant]
  * d/p/skip-openssl-1.1.1-tests.patch: Dropped as this is now properly fixed
    by xenapi-agent-change-openssl-error-handling.patch.

-- Rodrigo Barbieri <email address hidden> Mon, 10 Jun 2019 12:36:21 +0000

Changed in nova (Ubuntu Cosmic):
status:	Fix Committed → Fix Released

Revision history for this message

James Page (james-page) wrote on 2019-07-03:

#51

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message

James Page (james-page) wrote on 2019-07-03:

#52

This bug was fixed in the package nova - 2:18.2.0-0ubuntu2~cloud0
---------------

nova (2:18.2.0-0ubuntu2~cloud0) bionic-rocky; urgency=medium
.
   * New upstream release for the Ubuntu Cloud Archive.
.
nova (2:18.2.0-0ubuntu2) cosmic; urgency=medium
.
   * Cherry-picked from upstream to ensure no stale
     allocations are left over on failed cold
     migrations (LP: #1821594).
     - d/p/bug_1821594_1.patch: Fix migration record status
     - d/p/bug_1821594_2.patch: Delete failed allocation part 1
     - d/p/bug_1821594_3.patch: Delete failed allocation part 2
     - d/p/bug_1821594_4.patch: New functional test
.
nova (2:18.2.0-0ubuntu1) cosmic; urgency=medium
.
   [Sahid Orentino Ferdjaoui]
   * New stable point release for OpenStack Rocky (LP: #1830695).
   * d/p/ensure-rbd-auth-fallback-uses-matching-credentials.patch:
     Dropped. Fixed upstream in 18.2.0.
.
   [Corey Bryant]
   * d/p/skip-openssl-1.1.1-tests.patch: Dropped as this is now properly fixed
     by xenapi-agent-change-openssl-error-handling.patch.

Revision history for this message

Brian Murray (brian-murray) wrote on 2019-07-03: Please test proposed package

#53

Hello Rodrigo, or anyone else affected,

Accepted nova into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/2:17.0.10-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in nova (Ubuntu Bionic):
status:	Triaged → Fix Committed
tags:	added: verification-needed verification-needed-bionic removed: verification-done

Revision history for this message

Rodrigo Barbieri (rodrigo-barbieri2010) wrote on 2019-07-05:

#54

bionic-queens-validation.txt Edit (31.1 KiB, text/plain)

Revision history for this message

Rodrigo Barbieri (rodrigo-barbieri2010) wrote on 2019-07-05:

#55

Bionic Queens verified using [Test Case]. Attached output above ^.

tags:

added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic

Revision history for this message

Edward Hope-Morley (hopem) wrote on 2019-07-08:

#56

Marking Disco/Stein as Fix Committed since it is the same in bug 1831754

Changed in nova (Ubuntu Disco):
status:	Triaged → Fix Committed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-07-10: Fix included in openstack/nova 17.0.11

#57

This issue was fixed in the openstack/nova 17.0.11 release.

Revision history for this message

Corey Bryant (corey.bryant) wrote on 2019-07-11: Please test proposed package

#59

Hello Rodrigo, or anyone else affected,

Accepted nova into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

sudo add-apt-repository cloud-archive:queens-proposed
sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags:

added: verification-queens-needed

Revision history for this message

Rodrigo Barbieri (rodrigo-barbieri2010) wrote on 2019-07-11:

#60

xenial-queens-validation.txt Edit (30.7 KiB, text/plain)

Xenial Queens verified using [Test Case]. Attached output.

tags:

added: verification-queens-done
removed: verification-queens-needed

Edward Hope-Morley (hopem) on 2019-07-15

Changed in nova (Ubuntu Disco):
status:	Fix Committed → Fix Released

Revision history for this message

Launchpad Janitor (janitor) wrote on 2019-08-06:

#61

This bug was fixed in the package nova - 2:17.0.10-0ubuntu2

---------------
nova (2:17.0.10-0ubuntu2) bionic; urgency=medium

  * Cherry-picked from upstream to ensure no stale
    allocations are left over on failed cold
    migrations (LP: #1821594).
    - d/p/bug_1821594_1.patch: Fix migration record status
    - d/p/bug_1821594_2.patch: Delete failed allocation part 1
    - d/p/bug_1821594_3.patch: Delete failed allocation part 2
    - d/p/bug_1821594_4.patch: New functional test

nova (2:17.0.10-0ubuntu1) bionic; urgency=medium

  * New stable point release for OpenStack Queens (LP: #1830341).
  * d/p/ensure-rbd-auth-fallback-uses-matching-credentials.patch:
    Dropped. Fixed upstream in 17.0.10.

-- Rodrigo Barbieri <email address hidden> Mon, 10 Jun 2019 13:20:41 +0000

Changed in nova (Ubuntu Bionic):
status:	Fix Committed → Fix Released

Revision history for this message

Corey Bryant (corey.bryant) wrote on 2019-08-29:

#62

This was fixed in queens nova package version 2:17.0.10-0ubuntu2~cloud0

Changed in nova (Ubuntu Eoan):
status:	Fix Committed → Fix Released

Edward Hope-Morley (hopem) on 2019-09-09

tags:

added: sts-sru-done
removed: sts-sru-needed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-27: Fix included in openstack/nova 20.0.0.0rc1

#63

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

OpenStack Compute (nova)

[SRU] Error in confirm_migration leaves stale allocations and 'confirming' migration state

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

	Status	Importance	Assigned to
OpenStack Compute (nova)	Fix Released	Medium	Matt Riedemann
Pike	Triaged	Medium	Unassigned
Queens	Fix Committed	Medium	Tony Breeds
Rocky	Fix Committed	Medium	Matt Riedemann
Stein	Fix Released	Medium	Matt Riedemann
Ubuntu Cloud Archive	Fix Released	Medium	Unassigned
Queens	Fix Released	Medium	Unassigned
Rocky	Fix Released	Medium	Unassigned
Stein	Fix Released	Medium	Unassigned
Train	Fix Released	Undecided	Unassigned
nova (Ubuntu)	Fix Released	Undecided	Unassigned
Bionic	Fix Released	Medium	Unassigned
Cosmic	Fix Released	Medium	Unassigned
Disco	Fix Released	Medium	Unassigned
Eoan	Fix Released	Undecided	Unassigned