API allows source compute service/node deletion while instances are pending a resize confirm/revert

Bug #1852610 reported by Matt Riedemann on 2019-11-14
Affects                     Importance   Assigned to
OpenStack Compute (nova)    Medium       Matt Riedemann
Queens                      Low          Lee Yarwood
Rocky                       Medium       Matt Riedemann
Stein                       Medium       Matt Riedemann
Train                       Medium       Matt Riedemann

Bug Description

This is split off from bug 1829479, which is about deleting a compute service that had servers evacuated from it, which orphans resource providers in placement.

A similar problem exists here: the API allows deleting a source compute service while the source node resource provider still has migration-based allocations and there are pending instance resizes involving the source node. A simple scenario is:

1. create a server on host1
2. resize or cold migrate it to a dest host2
3. delete the compute service for host1

At this point the resource provider for host1 is orphaned.

4. try to confirm/revert the resize of the server. The confirm fails because the compute node for host1 is gone, and the server goes to ERROR status (per the revert recreate test later in this report, a revert happens to succeed).
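The first three steps can be sketched with a toy in-memory model. This is not nova or placement code: ToyCloud and all of its methods are hypothetical names, and the rule that a provider which still holds allocations is left behind when its service is deleted is a simplifying assumption taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCloud:
    """Hypothetical stand-in for nova services and placement providers."""

    services: set = field(default_factory=set)
    # provider name -> {consumer id: allocated VCPUs}
    providers: dict = field(default_factory=dict)

    def add_host(self, host):
        self.services.add(host)
        self.providers[host] = {}

    def create_server(self, server, host, vcpus):
        self.providers[host][server] = vcpus

    def cold_migrate(self, server, migration, source, dest):
        # The source allocation moves to the migration record and the
        # server gets a new allocation on the destination.
        vcpus = self.providers[source].pop(server)
        self.providers[source][migration] = vcpus
        self.providers[dest][server] = vcpus

    def delete_service_unguarded(self, host):
        # Assumed pre-fix behaviour: the service goes away, but a provider
        # that still holds allocations is left behind.
        self.services.discard(host)
        if not self.providers[host]:
            del self.providers[host]

    def orphaned_providers(self):
        return sorted(p for p in self.providers if p not in self.services)


cloud = ToyCloud()
cloud.add_host("host1")
cloud.add_host("host2")
cloud.create_server("server-1", "host1", vcpus=2)                # step 1
cloud.cold_migrate("server-1", "migration-1", "host1", "host2")  # step 2
cloud.delete_service_unguarded("host1")                          # step 3
print(cloud.orphaned_providers())  # ['host1']
```

In this model host1 no longer has a service, yet its provider still carries the allocation held by migration-1, which is the orphaned state described above.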

Based on the discussion in this mailing list thread:

http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010843.html

We should probably have the DELETE /os-services/{service_id} API block attempts to delete a service that has pending migrations.
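A minimal sketch of such a guard, assuming a simplified data model. The Migration class, the Conflict exception, the delete_compute_service function and the FINAL_STATUSES set are all hypothetical and chosen for illustration; they are not nova's objects or its actual list of migration statuses.

```python
from dataclasses import dataclass

# Assumption: any status outside this set counts as "in progress".
FINAL_STATUSES = {"confirmed", "reverted", "error", "failed", "completed"}


class Conflict(Exception):
    """Hypothetical stand-in for an HTTP 409 response."""

    status_code = 409


@dataclass(frozen=True)
class Migration:
    source_host: str
    dest_host: str
    status: str


def delete_compute_service(host, services, migrations):
    """Remove host from services unless a migration still involves it."""
    blocking = [
        m for m in migrations
        if host in (m.source_host, m.dest_host)
        and m.status not in FINAL_STATUSES
    ]
    if blocking:
        raise Conflict(
            "%s has %d in-progress migration(s)" % (host, len(blocking)))
    services.remove(host)


services = {"host1", "host2"}

# A resized server awaiting confirm/revert: the delete is refused.
pending = [Migration("host1", "host2", "finished")]
try:
    delete_compute_service("host1", services, pending)
except Conflict as exc:
    print(exc.status_code, exc)

# Once the resize is confirmed, the delete goes through.
confirmed = [Migration("host1", "host2", "confirmed")]
delete_compute_service("host1", services, confirmed)
print(sorted(services))
```

The check looks at the host as either source or destination, so a destination service with an unconfirmed resize is protected in the same way.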

Matt Riedemann (mriedem) on 2019-11-14
Changed in nova:
status: New → Triaged
importance: Undecided → Medium

Related fix proposed to branch: master
Review: https://review.opendev.org/694364

Matt Riedemann (mriedem) wrote :

This goes back further than Rocky, but since Queens is in extended maintenance mode upstream, I figure it's best to just focus on Rocky+ for now.

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)

Fix proposed to branch: master
Review: https://review.opendev.org/694389

Changed in nova:
status: Triaged → In Progress

Reviewed: https://review.opendev.org/694351
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=94d3743b185d22c07504f5d878dff2f9ef42cee3
Submitter: Zuul
Branch: master

commit 94d3743b185d22c07504f5d878dff2f9ef42cee3
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 11:38:07 2019 -0500

    Add functional recreate test for bug 1852610

    It is possible to delete a source compute service which has
    pending migration-based allocations and servers in VERIFY_RESIZE
    status. Doing so deletes the compute service and compute node
    but orphans the source node resource provider along with its
    resource allocations held by the migration record while there
    is a pending resized server.

    This adds a simple cold migrate test which deletes the source
    compute service while the server is in VERIFY_RESIZE status and
    then tries to confirm the resize which fails.

    Change-Id: I644608b4e197ddea31c5f264adb492f2c8931f04
    Related-Bug: #1852610

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/694545

Reviewed: https://review.opendev.org/694364
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f7dde6054e559752d8e9be044c32c0741f8f39c5
Submitter: Zuul
Branch: master

commit f7dde6054e559752d8e9be044c32c0741f8f39c5
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 12:16:53 2019 -0500

    Add functional recreate revert resize test for bug 1852610

    This builds on I644608b4e197ddea31c5f264adb492f2c8931f04 and
    adds a revert resize test which deletes the source compute service
    while the server is in VERIFY_RESIZE status and then reverts the
    resize. The results are a bit different from the confirm scenario
    because the confirm fails while the revert actually works which
    is more dumb luck based on where the compute service drops the
    move claim during the revert process (on the dest which still exists
    rather than the source).

    Change-Id: I2dcb1cb3e1f8ed469a4c5bf81ca5ca2fcf1fa73c
    Related-Bug: #1852610

Reviewed: https://review.opendev.org/694389
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=92fed026103b47fa2a76ea09204a4ba24c21e191
Submitter: Zuul
Branch: master

commit 92fed026103b47fa2a76ea09204a4ba24c21e191
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 14:19:26 2019 -0500

    Block deleting compute services with in-progress migrations

    This builds on I0bd63b655ad3d3d39af8d15c781ce0a45efc8e3a
    which made DELETE /os-services/{service_id} fail with a 409
    response if the host has instances on it. This change checks
    for in-progress migrations involving the nodes on the host,
    either as the source or destination nodes, and returns a 409
    error response if any are found.

    Failing to do this can lead to orphaned resource providers
    in placement and also failing to properly confirm or revert
    a pending resize or cold migration.

    A release note is included for the (justified) behavior
    change in the API. A new microversion should not be required
    for this since admins should not have to opt out of broken
    behavior.

    Change-Id: I70e06c607045a1c0842f13069e51fef438012a9c
    Closes-Bug: #1852610

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/694544
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=28d76cc7ae5c86d251915392b5b961a975b343ae
Submitter: Zuul
Branch: stable/train

commit 28d76cc7ae5c86d251915392b5b961a975b343ae
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 11:38:07 2019 -0500

    Add functional recreate test for bug 1852610

    It is possible to delete a source compute service which has
    pending migration-based allocations and servers in VERIFY_RESIZE
    status. Doing so deletes the compute service and compute node
    but orphans the source node resource provider along with its
    resource allocations held by the migration record while there
    is a pending resized server.

    This adds a simple cold migrate test which deletes the source
    compute service while the server is in VERIFY_RESIZE status and
    then tries to confirm the resize which fails.

    Change-Id: I644608b4e197ddea31c5f264adb492f2c8931f04
    Related-Bug: #1852610
    (cherry picked from commit 94d3743b185d22c07504f5d878dff2f9ef42cee3)

tags: added: in-stable-train

Reviewed: https://review.opendev.org/694545
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3774952410f98bfde014bd9fdc0897d4a9a6c50f
Submitter: Zuul
Branch: stable/train

commit 3774952410f98bfde014bd9fdc0897d4a9a6c50f
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 12:16:53 2019 -0500

    Add functional recreate revert resize test for bug 1852610

    This builds on I644608b4e197ddea31c5f264adb492f2c8931f04 and
    adds a revert resize test which deletes the source compute service
    while the server is in VERIFY_RESIZE status and then reverts the
    resize. The results are a bit different from the confirm scenario
    because the confirm fails while the revert actually works which
    is more dumb luck based on where the compute service drops the
    move claim during the revert process (on the dest which still exists
    rather than the source).

    Change-Id: I2dcb1cb3e1f8ed469a4c5bf81ca5ca2fcf1fa73c
    Related-Bug: #1852610
    (cherry picked from commit f7dde6054e559752d8e9be044c32c0741f8f39c5)

Reviewed: https://review.opendev.org/694546
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a9650b3cbfc674e283964090fb64ac6297be5b78
Submitter: Zuul
Branch: stable/train

commit a9650b3cbfc674e283964090fb64ac6297be5b78
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 14:19:26 2019 -0500

    Block deleting compute services with in-progress migrations

    This builds on I0bd63b655ad3d3d39af8d15c781ce0a45efc8e3a
    which made DELETE /os-services/{service_id} fail with a 409
    response if the host has instances on it. This change checks
    for in-progress migrations involving the nodes on the host,
    either as the source or destination nodes, and returns a 409
    error response if any are found.

    Failing to do this can lead to orphaned resource providers
    in placement and also failing to properly confirm or revert
    a pending resize or cold migration.

    A release note is included for the (justified) behavior
    change in the API. A new microversion should not be required
    for this since admins should not have to opt out of broken
    behavior.

    Conflicts:
          nova/tests/functional/integrated_helpers.py

    NOTE(mriedem): The conflict is due to change
    Iec61f56c05e06924def814a3a6e09ceb91a15894 which is not in Train.

    NOTE(mriedem): services.py had to be updated to add the LOG
    variable since change I8403a841f21a624a546ae5f26bb9ba19318ece6a
    is not in Train.

    Change-Id: I70e06c607045a1c0842f13069e51fef438012a9c
    Closes-Bug: #1852610
    (cherry picked from commit 92fed026103b47fa2a76ea09204a4ba24c21e191)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/695938

Reviewed: https://review.opendev.org/695935
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7d673872462f53d0ce5e651263253ec4057a2138
Submitter: Zuul
Branch: stable/stein

commit 7d673872462f53d0ce5e651263253ec4057a2138
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 11:38:07 2019 -0500

    Add functional recreate test for bug 1852610

    It is possible to delete a source compute service which has
    pending migration-based allocations and servers in VERIFY_RESIZE
    status. Doing so deletes the compute service and compute node
    but orphans the source node resource provider along with its
    resource allocations held by the migration record while there
    is a pending resized server.

    This adds a simple cold migrate test which deletes the source
    compute service while the server is in VERIFY_RESIZE status and
    then tries to confirm the resize which fails.

    Conflicts:
          nova/tests/functional/wsgi/test_services.py

    NOTE(mriedem): The conflict is due to change
    If32bca070185937ef83f689b7163d965a89ec10a which is not in Stein.

    Change-Id: I644608b4e197ddea31c5f264adb492f2c8931f04
    Related-Bug: #1852610
    (cherry picked from commit 94d3743b185d22c07504f5d878dff2f9ef42cee3)
    (cherry picked from commit 28d76cc7ae5c86d251915392b5b961a975b343ae)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/695938
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9983b2462401176db0c332f21bc8b24ba1d81503
Submitter: Zuul
Branch: stable/stein

commit 9983b2462401176db0c332f21bc8b24ba1d81503
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 12:16:53 2019 -0500

    Add functional recreate revert resize test for bug 1852610

    This builds on I644608b4e197ddea31c5f264adb492f2c8931f04 and
    adds a revert resize test which deletes the source compute service
    while the server is in VERIFY_RESIZE status and then reverts the
    resize. The results are a bit different from the confirm scenario
    because the confirm fails while the revert actually works which
    is more dumb luck based on where the compute service drops the
    move claim during the revert process (on the dest which still exists
    rather than the source).

    Conflicts:
          nova/tests/functional/wsgi/test_services.py

    NOTE(mriedem): The conflict is due to change
    If32bca070185937ef83f689b7163d965a89ec10a which is not in Stein.

    Change-Id: I2dcb1cb3e1f8ed469a4c5bf81ca5ca2fcf1fa73c
    Related-Bug: #1852610
    (cherry picked from commit f7dde6054e559752d8e9be044c32c0741f8f39c5)
    (cherry picked from commit 3774952410f98bfde014bd9fdc0897d4a9a6c50f)

Reviewed: https://review.opendev.org/695940
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a0290858b717b4cefd0d6fc17acc2b143ab12ac4
Submitter: Zuul
Branch: stable/stein

commit a0290858b717b4cefd0d6fc17acc2b143ab12ac4
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 14:19:26 2019 -0500

    Block deleting compute services with in-progress migrations

    This builds on I0bd63b655ad3d3d39af8d15c781ce0a45efc8e3a
    which made DELETE /os-services/{service_id} fail with a 409
    response if the host has instances on it. This change checks
    for in-progress migrations involving the nodes on the host,
    either as the source or destination nodes, and returns a 409
    error response if any are found.

    Failing to do this can lead to orphaned resource providers
    in placement and also failing to properly confirm or revert
    a pending resize or cold migration.

    A release note is included for the (justified) behavior
    change in the API. A new microversion should not be required
    for this since admins should not have to opt out of broken
    behavior.

    Conflicts:
          nova/api/openstack/compute/services.py
          nova/tests/functional/integrated_helpers.py
          nova/tests/functional/wsgi/test_services.py

    NOTE(mriedem): The conflict in services.py is due to not
    having change I9d257a003d315b84b937dcef91f3cb41f3e24b53 in Stein.
    The conflict in integrated_helpers.py is due to not having change
    I4aac187283c2f341b5c2712be85f722156e14f63 or change
    Ibeb16ce16263c43bad9f148480bbebca413d8117 in Stein. As a result
    test_services does not use _confirm_resize but just inlines the
    call and wait for ACTIVE status in the test. The conflict in
    test_services.py is due to not having change
    If32bca070185937ef83f689b7163d965a89ec10a in Stein.

    Change-Id: I70e06c607045a1c0842f13069e51fef438012a9c
    Closes-Bug: #1852610
    (cherry picked from commit 92fed026103b47fa2a76ea09204a4ba24c21e191)
    (cherry picked from commit a9650b3cbfc674e283964090fb64ac6297be5b78)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/698110

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/699708

Reviewed: https://review.opendev.org/698108
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1563a15c8b4bcf1a602f72a549c8d9c56ed7da4e
Submitter: Zuul
Branch: stable/rocky

commit 1563a15c8b4bcf1a602f72a549c8d9c56ed7da4e
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 11:38:07 2019 -0500

    Add functional recreate test for bug 1852610

    It is possible to delete a source compute service which has
    pending migration-based allocations and servers in VERIFY_RESIZE
    status. Doing so deletes the compute service and compute node
    but orphans the source node resource provider along with its
    resource allocations held by the migration record while there
    is a pending resized server.

    This adds a simple cold migrate test which deletes the source
    compute service while the server is in VERIFY_RESIZE status and
    then tries to confirm the resize which fails.

    NOTE(mriedem): A couple of methods are lifted from ServerMovingTests
    since change Ie991d4b53e9bb5e7ec26da99219178ab7695abf6 is not in Rocky.

    Change-Id: I644608b4e197ddea31c5f264adb492f2c8931f04
    Related-Bug: #1852610
    (cherry picked from commit 94d3743b185d22c07504f5d878dff2f9ef42cee3)
    (cherry picked from commit 28d76cc7ae5c86d251915392b5b961a975b343ae)
    (cherry picked from commit 7d673872462f53d0ce5e651263253ec4057a2138)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/698110
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b6b2b3a35e1b95463fac3bbb2acd51d905328a2c
Submitter: Zuul
Branch: stable/rocky

commit b6b2b3a35e1b95463fac3bbb2acd51d905328a2c
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 12:16:53 2019 -0500

    Add functional recreate revert resize test for bug 1852610

    This builds on I644608b4e197ddea31c5f264adb492f2c8931f04 and
    adds a revert resize test which deletes the source compute service
    while the server is in VERIFY_RESIZE status and then reverts the
    resize. The results are a bit different from the confirm scenario
    because the confirm fails while the revert actually works which
    is more dumb luck based on where the compute service drops the
    move claim during the revert process (on the dest which still exists
    rather than the source).

    Conflicts:
          nova/tests/functional/integrated_helpers.py

    NOTE(mriedem): The conflict is due to not having change
    Ie991d4b53e9bb5e7ec26da99219178ab7695abf6 in Rocky.

    Change-Id: I2dcb1cb3e1f8ed469a4c5bf81ca5ca2fcf1fa73c
    Related-Bug: #1852610
    (cherry picked from commit f7dde6054e559752d8e9be044c32c0741f8f39c5)
    (cherry picked from commit 3774952410f98bfde014bd9fdc0897d4a9a6c50f)
    (cherry picked from commit 9983b2462401176db0c332f21bc8b24ba1d81503)

Reviewed: https://review.opendev.org/698113
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=30a635068512be558acf0f9c83185dc1aaad560f
Submitter: Zuul
Branch: stable/rocky

commit 30a635068512be558acf0f9c83185dc1aaad560f
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 14:19:26 2019 -0500

    Block deleting compute services with in-progress migrations

    This builds on I0bd63b655ad3d3d39af8d15c781ce0a45efc8e3a
    which made DELETE /os-services/{service_id} fail with a 409
    response if the host has instances on it. This change checks
    for in-progress migrations involving the nodes on the host,
    either as the source or destination nodes, and returns a 409
    error response if any are found.

    Failing to do this can lead to orphaned resource providers
    in placement and also failing to properly confirm or revert
    a pending resize or cold migration.

    A release note is included for the (justified) behavior
    change in the API. A new microversion should not be required
    for this since admins should not have to opt out of broken
    behavior.

    Conflicts:
          nova/tests/functional/integrated_helpers.py

    NOTE(mriedem): The conflict is due to not having change
    Ie991d4b53e9bb5e7ec26da99219178ab7695abf6 in Rocky.

    Change-Id: I70e06c607045a1c0842f13069e51fef438012a9c
    Closes-Bug: #1852610
    (cherry picked from commit 92fed026103b47fa2a76ea09204a4ba24c21e191)
    (cherry picked from commit a9650b3cbfc674e283964090fb64ac6297be5b78)
    (cherry picked from commit a0290858b717b4cefd0d6fc17acc2b143ab12ac4)

This issue was fixed in the openstack/nova 20.1.0 release.

This issue was fixed in the openstack/nova 19.1.0 release.

Reviewed: https://review.opendev.org/699705
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=922098044b37c66c51e83f4879c1f37ae999f196
Submitter: Zuul
Branch: stable/queens

commit 922098044b37c66c51e83f4879c1f37ae999f196
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 11:38:07 2019 -0500

    Add functional recreate test for bug 1852610

    It is possible to delete a source compute service which has
    pending migration-based allocations and servers in VERIFY_RESIZE
    status. Doing so deletes the compute service and compute node
    but orphans the source node resource provider along with its
    resource allocations held by the migration record while there
    is a pending resized server.

    This adds a simple cold migrate test which deletes the source
    compute service while the server is in VERIFY_RESIZE status and
    then tries to confirm the resize which fails.

    Conflicts:
          nova/tests/functional/integrated_helpers.py

    NOTE(mriedem): The conflict is due to not having change
    Iea283322124cb35fc0bc6d25f35548621e8c8c2f in Queens. As a result
    the helper methods are moved from ServerMovingTests to
    ProviderUsageBaseTestCase within test_servers.py.

    Change-Id: I644608b4e197ddea31c5f264adb492f2c8931f04
    Related-Bug: #1852610
    (cherry picked from commit 94d3743b185d22c07504f5d878dff2f9ef42cee3)
    (cherry picked from commit 28d76cc7ae5c86d251915392b5b961a975b343ae)
    (cherry picked from commit 7d673872462f53d0ce5e651263253ec4057a2138)
    (cherry picked from commit 1563a15c8b4bcf1a602f72a549c8d9c56ed7da4e)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/699708
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=917b5d383829851c6cbf7583cd0f3640c8ba2b9a
Submitter: Zuul
Branch: stable/queens

commit 917b5d383829851c6cbf7583cd0f3640c8ba2b9a
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 12:16:53 2019 -0500

    Add functional recreate revert resize test for bug 1852610

    This builds on I644608b4e197ddea31c5f264adb492f2c8931f04 and
    adds a revert resize test which deletes the source compute service
    while the server is in VERIFY_RESIZE status and then reverts the
    resize. The results are a bit different from the confirm scenario
    because the confirm fails while the revert actually works which
    is more dumb luck based on where the compute service drops the
    move claim during the revert process (on the dest which still exists
    rather than the source).

    Conflicts:
          nova/tests/functional/integrated_helpers.py

    NOTE(mriedem): The conflict is due to not having change
    Iea283322124cb35fc0bc6d25f35548621e8c8c2f in Queens. As a result
    the _resize_and_check_allocations method is added to
    ProviderUsageBaseTestCase within test_servers.py.

    Change-Id: I2dcb1cb3e1f8ed469a4c5bf81ca5ca2fcf1fa73c
    Related-Bug: #1852610
    (cherry picked from commit f7dde6054e559752d8e9be044c32c0741f8f39c5)
    (cherry picked from commit 3774952410f98bfde014bd9fdc0897d4a9a6c50f)
    (cherry picked from commit 9983b2462401176db0c332f21bc8b24ba1d81503)
    (cherry picked from commit b6b2b3a35e1b95463fac3bbb2acd51d905328a2c)

Reviewed: https://review.opendev.org/699718
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d88f353796813bf0ad5ec79ba4714af35e04e591
Submitter: Zuul
Branch: stable/queens

commit d88f353796813bf0ad5ec79ba4714af35e04e591
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 14 14:19:26 2019 -0500

    Block deleting compute services with in-progress migrations

    This builds on I0bd63b655ad3d3d39af8d15c781ce0a45efc8e3a
    which made DELETE /os-services/{service_id} fail with a 409
    response if the host has instances on it. This change checks
    for in-progress migrations involving the nodes on the host,
    either as the source or destination nodes, and returns a 409
    error response if any are found.

    Failing to do this can lead to orphaned resource providers
    in placement and also failing to properly confirm or revert
    a pending resize or cold migration.

    A release note is included for the (justified) behavior
    change in the API. A new microversion should not be required
    for this since admins should not have to opt out of broken
    behavior.

    Conflicts:
          nova/tests/functional/integrated_helpers.py

    NOTE(mriedem): The conflict is due to not having change
    Iea283322124cb35fc0bc6d25f35548621e8c8c2f in Queens so
    _revert_resize is added to ProviderUsageBaseTestCase
    within test_servers.py.

    Change-Id: I70e06c607045a1c0842f13069e51fef438012a9c
    Closes-Bug: #1852610
    (cherry picked from commit 92fed026103b47fa2a76ea09204a4ba24c21e191)
    (cherry picked from commit a9650b3cbfc674e283964090fb64ac6297be5b78)
    (cherry picked from commit a0290858b717b4cefd0d6fc17acc2b143ab12ac4)
    (cherry picked from commit 30a635068512be558acf0f9c83185dc1aaad560f)

This issue was fixed in the openstack/nova 18.3.0 release.
