_poll_unconfirmed_resizes may not retry later if confirm_resize fails in API

Bug #1855927 reported by Matt Riedemann
This bug affects 1 person
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  Low         Lee Yarwood
Ussuri                    Fix Released  Undecided   Unassigned

Bug Description

This is based on code inspection. Let's say I have configured my computes with resize_confirm_window=3600 so that a resized server is automatically confirmed after 1 hour, and within that hour the source compute service goes down.

The periodic task gets the unconfirmed migrations with status='finished' whose last update is older than the configured window:

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/compute/manager.py#L8793

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/db/sqlalchemy/api.py#L4342
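
To make the selection rule concrete, here is a minimal sketch of the filter described above. It is a simplified model, not nova's DB code: the Migration dataclass and the get_unconfirmed function are hypothetical names, and the real query may apply further filters.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Migration:
    """Simplified stand-in for a migration record (not nova's model)."""
    id: int
    status: str
    updated_at: datetime


def get_unconfirmed(migrations, confirm_window, now):
    """Return 'finished' migrations last updated at least confirm_window seconds ago."""
    cutoff = now - timedelta(seconds=confirm_window)
    return [
        m for m in migrations
        if m.status == "finished" and m.updated_at <= cutoff
    ]


now = datetime(2019, 12, 10, 12, 0, 0)
records = [
    Migration(1, "finished", now - timedelta(hours=2)),    # old enough
    Migration(2, "finished", now - timedelta(minutes=5)),  # still inside the window
    Migration(3, "confirming", now - timedelta(hours=2)),  # wrong status
]
selected_ids = [m.id for m in get_unconfirmed(records, 3600, now)]
```

Only migration 1 is selected. Migration 3 is old enough but is skipped because its status is not 'finished', which is the property the rest of this report depends on.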

The periodic task then calls the compute API code to confirm the resize:

https://github.com/openstack/nova/blob/c295e395d/nova/compute/manager.py#L7160

which changes the migration status to 'confirming':

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/compute/api.py#L3684

And casts off to the source compute:

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/compute/rpcapi.py#L600

Now if the source compute is down and the cast fails, the periodic task code in the compute manager handles the error and says it will retry later:

https://github.com/openstack/nova/blob/c295e395d/nova/compute/manager.py#L7163

However, because the migration status was changed from 'finished' to 'confirming', the task will not retry: the DB query no longer finds the migration. Trying to confirm the resize via the API fails as well, with MigrationNotFoundByStatus, since the migration status is no longer 'finished':

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/compute/api.py#L3681
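
The stuck state can be reproduced with a small in-memory model. This is only a sketch of the sequence described above, assuming the failed cast surfaces as an exception in the caller. SourceComputeDown, api_confirm_resize and poll_unconfirmed_resizes are hypothetical names, and MigrationNotFoundByStatus is a local stand-in for the nova exception of the same name.

```python
class SourceComputeDown(Exception):
    """Hypothetical error for a cast that cannot reach the source compute."""


class MigrationNotFoundByStatus(Exception):
    """Local stand-in for the nova exception of the same name."""


def api_confirm_resize(migration, source_up):
    # The API only accepts migrations that are still 'finished'.
    if migration["status"] != "finished":
        raise MigrationNotFoundByStatus(migration["id"])
    # The status is changed before the request is sent to the source compute.
    migration["status"] = "confirming"
    if not source_up:
        raise SourceComputeDown(migration["source_compute"])
    migration["status"] = "confirmed"


def poll_unconfirmed_resizes(migrations, source_up):
    """Try to confirm every 'finished' migration; return the ids attempted."""
    attempted = []
    for migration in migrations:
        if migration["status"] != "finished":
            continue  # the periodic task's query never returns these
        attempted.append(migration["id"])
        try:
            api_confirm_resize(migration, source_up)
        except SourceComputeDown:
            pass  # "will retry later", but the status is already 'confirming'
    return attempted


migrations = [{"id": 1, "status": "finished", "source_compute": "host1"}]
first_run = poll_unconfirmed_resizes(migrations, source_up=False)
second_run = poll_unconfirmed_resizes(migrations, source_up=True)
```

The first run attempts migration 1 and leaves it in 'confirming'. The second run attempts nothing even though the source compute is back up, and a direct call to api_confirm_resize raises MigrationNotFoundByStatus.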

The compute manager code should probably set the migration status back to 'finished' if it's really going to retry later, or otherwise mark the migration status as 'error'. Note that the confirm_resize method in the compute manager doesn't mark the migration status as 'error' if something fails there either:

https://github.com/openstack/nova/blob/c295e395d/nova/compute/manager.py#L3807
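
As an illustration of the first option (putting the status back so a later retry can find the migration), a minimal sketch could look like this. It is not nova code: confirm_with_rollback and unreachable_source are hypothetical names, and the migration is modelled as a plain dict.

```python
def confirm_with_rollback(migration, cast_to_source):
    """Set 'confirming', and restore the old status if the cast fails."""
    previous_status = migration["status"]
    migration["status"] = "confirming"
    try:
        cast_to_source(migration)
    except Exception:
        # Restore the status so the periodic task's query matches again.
        migration["status"] = previous_status
        raise


def unreachable_source(migration):
    # Simulates a source compute that cannot be reached.
    raise ConnectionError("source compute is down")


migration = {"id": 1, "status": "finished"}
try:
    confirm_with_rollback(migration, unreachable_source)
except ConnectionError:
    pass  # the caller would log the failure and retry later
```

After the failed attempt the migration is 'finished' again, so both the periodic task and a manual confirm through the API would still be able to act on it.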

Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/699045

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/699291

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/699045
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=95c9d710dd6a06b8d2fcd941343075872cdf95d5
Submitter: Zuul
Branch: master

commit 95c9d710dd6a06b8d2fcd941343075872cdf95d5
Author: Matt Riedemann <email address hidden>
Date: Fri Dec 13 17:47:42 2019 -0500

    Add recreate test for bug 1855927

    When _poll_unconfirmed_resizes runs, if the actual resize
    confirm fails on the source host the migration status may
    be stuck in "confirming" status if it never reached the
    source compute. Subsequent runs of _poll_unconfirmed_resizes
    will not be able to auto-confirm the resize nor will the user
    be able to manually confirm the resize. An admin could reset
    the status on the server to ACTIVE or ERROR but that means
    the source compute never gets cleaned up since you can only
    confirm or revert a resize on a server with VERIFY_RESIZE status.

    This adds a functional recreate test for the bug.

    Note that if the request does get to the source compute and fails,
    the _error_out_instance_on_exception context manager should set
    the instance to ERROR status and the errors_out_migration decorator
    should set the migration status to "error". The problem is if we
    never get that far we can't recover automatically or via the API
    without operator intervention.

    Change-Id: I93e27d907a237ae16b6b4c7202ad157b067aa30e
    Related-Bug: #1855927

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/701756

Changed in nova:
assignee: Matt Riedemann (mriedem) → Lee Yarwood (lyarwood)
Changed in nova:
assignee: Lee Yarwood (lyarwood) → Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Lee Yarwood (lyarwood)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/699291
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e4601c77fb3b90638a6a56ec1a8e0e9eb7c91777
Submitter: Zuul
Branch: master

commit e4601c77fb3b90638a6a56ec1a8e0e9eb7c91777
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 16 16:15:24 2019 -0500

    Ensure source compute is up when confirming a resize

    When _poll_unconfirmed_resizes runs or a user tries to confirm
    a resize in the API, if the source compute service is down the
    migration status will be stuck in "confirming" status if it never
    reached the source compute. Subsequent runs of
    _poll_unconfirmed_resizes will not be able to auto-confirm the
    resize nor will the user be able to manually confirm the resize.
    An admin could reset the status on the server to ACTIVE or ERROR
    but that means the source compute never gets cleaned up since you
    can only confirm or revert a resize on a server with VERIFY_RESIZE
    status.

    This adds a check in the API before updating the migration record
    such that if the source compute service is down the API returns a
    409 response as an indication to try again later.

    SingleCellSimple._fake_target_cell is updated so that tests using
    it can assert when a context was targeted without having to stub
    nova.context.target_cell. As a result some HostManager unit tests
    needed to be updated.

    Change-Id: I33aa5e32cb321e5a16da51e227af2f67ed9e6713
    Closes-Bug: #1855927
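
The ordering that this commit message describes (check the source compute service before touching the migration record, and answer 409 if it is down) can be sketched as follows. This is a simplified model under stated assumptions, not the merged change: SourceComputeDown, confirm_resize, handle_confirm_request and the is_service_up callback are hypothetical names, and 204 is assumed as the success code.

```python
class SourceComputeDown(Exception):
    """Hypothetical error; not necessarily the exception nova raises."""


def confirm_resize(migration, is_service_up):
    # Check the source compute first, so the record is untouched on failure.
    if not is_service_up(migration["source_compute"]):
        raise SourceComputeDown(migration["source_compute"])
    migration["status"] = "confirming"


def handle_confirm_request(migration, is_service_up):
    """Return an HTTP status code for a confirm-resize request."""
    try:
        confirm_resize(migration, is_service_up)
    except SourceComputeDown:
        return 409  # conflict: the caller can try again later
    return 204  # success code assumed for this sketch


migration = {"id": 1, "status": "finished", "source_compute": "host1"}
down_code = handle_confirm_request(migration, lambda host: False)
status_after_down = migration["status"]
up_code = handle_confirm_request(migration, lambda host: True)
```

While the source compute is down the request is rejected with 409 and the migration stays 'finished', so a later attempt, automatic or manual, can still find it.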

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/748369

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/748369
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5f3c33a4f12a40e5b5d9fef2a1b5345f5ff5cd04
Submitter: Zuul
Branch: stable/ussuri

commit 5f3c33a4f12a40e5b5d9fef2a1b5345f5ff5cd04
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 16 16:15:24 2019 -0500

    Ensure source compute is up when confirming a resize

    When _poll_unconfirmed_resizes runs or a user tries to confirm
    a resize in the API, if the source compute service is down the
    migration status will be stuck in "confirming" status if it never
    reached the source compute. Subsequent runs of
    _poll_unconfirmed_resizes will not be able to auto-confirm the
    resize nor will the user be able to manually confirm the resize.
    An admin could reset the status on the server to ACTIVE or ERROR
    but that means the source compute never gets cleaned up since you
    can only confirm or revert a resize on a server with VERIFY_RESIZE
    status.

    This adds a check in the API before updating the migration record
    such that if the source compute service is down the API returns a
    409 response as an indication to try again later.

    SingleCellSimple._fake_target_cell is updated so that tests using
    it can assert when a context was targeted without having to stub
    nova.context.target_cell. As a result some HostManager unit tests
    needed to be updated.

    Change-Id: I33aa5e32cb321e5a16da51e227af2f67ed9e6713
    Closes-Bug: #1855927
    (cherry picked from commit e4601c77fb3b90638a6a56ec1a8e0e9eb7c91777)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/train)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/701756
Reason: The stable/train branch of the nova projects has been tagged as End of Life. All open patches have to be abandoned so that the branch can be deleted.
