DELETE /os-services/{service_id} does not block for hosted instances

Bug #1763183 reported by Matt Riedemann
This bug affects 1 person
Affects                   Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released   High        Matt Riedemann
Pike                      Fix Committed  High        Matt Riedemann
Queens                    Fix Committed  High        Matt Riedemann

Bug Description

This came up while reviewing the fix for bug 1756179:

https://review.openstack.org/#/c/554920/6/nova/api/openstack/compute/services.py@226

Full IRC conversation is here:

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-04-11.log.html#t2018-04-11T20:32:13

The summary is that it's possible to delete a compute service and its associated compute node record even if that compute node has instances on it.

Before placement, this wasn't a huge problem because you could evacuate the instances to another host, or, if you brought the host back up, it would recreate the service and compute node and the resource tracker would "heal" itself by finding the instances running on that host and node combination:

https://github.com/openstack/nova/blob/2c5da2212c3fa3e589c4af171486a2097fd8c54e/nova/compute/resource_tracker.py#L714

The problem started in Pike, when we began requiring placement and creating allocations in the scheduler: those allocations are made against compute_nodes.uuid, which identifies the compute node resource provider. If the service and its related compute node record are deleted, restarting the service creates a new service and compute node record with a new UUID, which results in a new resource provider in placement, while the instances running on that host still have allocations against the now-orphaned resource provider. The new resource provider will report incorrect consumption, so scheduling is also affected.
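
To make the failure mode concrete, here is a minimal toy model of it. This is not nova or placement code: ToyPlacement, start_compute_service and simulate are hypothetical names invented for illustration, and the "compute_nodes table" is just a dict mapping host name to UUID.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyPlacement:
    """Hypothetical stand-in for placement: provider UUID -> {instance: vcpus}."""
    allocations: dict = field(default_factory=dict)

    def ensure_provider(self, rp_uuid):
        # Create the resource provider if it does not exist yet.
        self.allocations.setdefault(rp_uuid, {})

    def allocate(self, rp_uuid, instance, vcpus):
        self.allocations[rp_uuid][instance] = vcpus

    def used(self, rp_uuid):
        return sum(self.allocations[rp_uuid].values())


def start_compute_service(compute_nodes, placement, host):
    """Create the compute node record if missing, as a service restart would."""
    if host not in compute_nodes:
        compute_nodes[host] = str(uuid.uuid4())  # new record means new UUID
    placement.ensure_provider(compute_nodes[host])
    return compute_nodes[host]


def simulate():
    compute_nodes = {}  # toy compute_nodes table: host -> uuid
    placement = ToyPlacement()

    old_rp = start_compute_service(compute_nodes, placement, "host1")
    placement.allocate(old_rp, "instance-a", vcpus=4)

    # Deleting the service drops the compute node record while the
    # instance keeps running on the host.
    del compute_nodes["host1"]

    # Restarting the service creates a new record and a new provider.
    new_rp = start_compute_service(compute_nodes, placement, "host1")
    return placement, old_rp, new_rp


placement, old_rp, new_rp = simulate()
print(placement.used(old_rp), placement.used(new_rp))
```

In this sketch the instance's 4 VCPUs stay recorded against the old provider, while the new provider for the same host reports nothing in use, which is the incorrect consumption described above.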

So we should block deleting a compute service (and its node) when that host (node) has instances on it. The block belongs here:

https://github.com/openstack/nova/blob/2c5da2212c3fa3e589c4af171486a2097fd8c54e/nova/api/openstack/compute/services.py#L213

This problem goes back to Pike. Ocata is OK in that the resource tracker on Ocata computes will "heal" allocations during the update_available_resource periodic task (and when the compute service starts up), and in Ocata the FilterScheduler does not create allocations in Placement.

Matt Riedemann (mriedem)
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/560674

Changed in nova:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/560626
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a80ac96362c8fafba1bfe71244b52ba2f082c86e
Submitter: Zuul
Branch: master

commit a80ac96362c8fafba1bfe71244b52ba2f082c86e
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 16:00:59 2018 -0400

    Add functional test for deleting a compute service

    This adds a functional test which asserts the things
    related to bug 1756179 where deleting a compute service
    does not also delete the related host mapping or resource
    provider resources.

    Also related to bug 1763183 in that it should not be
    possible to delete a compute service that has instances
    running on it since that will mess up resource tracking
    in Placement.

    Change-Id: I519c5abfe24b154998f481c8a86db239a75d4729
    Related-Bug: #1756179
    Related-Bug: #1763183

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/560674
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=42f62f1ed2ad76829eb9d40a8b9646a523f6381f
Submitter: Zuul
Branch: master

commit 42f62f1ed2ad76829eb9d40a8b9646a523f6381f
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 18:00:39 2018 -0400

    Block deleting compute services which are hosting instances

    This change makes "DELETE /os-services/{service_id}" fail
    with a 409 response when attempting to delete a nova-compute
    service which is still hosting instances.

    Deleting a compute service also results in deleting the
    related compute_nodes table entry for that service host.
    The compute node resource provider in placement is tied
    to the compute node via the UUID, and if we allow deleting
    the compute service and node then the resource provider for
    that node is effectively orphaned in Placement, along with
    the instances which have allocations against that resource
    provider.

    Furthermore, restarting the compute service will create a
    new service and compute_nodes record, and the compute node
    would have a new UUID and resource provider. This will
    affect scheduling for that host since Placement will be
    reporting it as having available capacity which in reality
    is not accurate.

    A release note is included for the (justified) behavior
    change in the API. A new microversion should not be required
    for this since admins should not have to opt out of broken
    behavior. Since this API did not previously expect to return
    a 409 response, the "expected_errors" decorator is updated
    and again, should not require a microversion per the
    guidelines:

    https://docs.openstack.org/nova/latest/contributor/microversions.html#when-a-microversion-is-not-needed

    Change-Id: I0bd63b655ad3d3d39af8d15c781ce0a45efc8e3a
    Closes-Bug: #1763183
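
The behavior described in the commit message above can be sketched roughly as follows. This is a simplified illustration, not the code from the change: Conflict, delete_service and the dict-based services and instances_by_host arguments are hypothetical names standing in for the real API handler and database lookups.

```python
class Conflict(Exception):
    """Hypothetical error type that the API layer would map to HTTP 409."""
    status_code = 409


def delete_service(services, instances_by_host, service_id):
    """Delete a service unless it is a nova-compute still hosting instances.

    services: dict of service_id -> {"binary": ..., "host": ...}
    instances_by_host: dict of host name -> list of instance identifiers
    """
    service = services[service_id]
    if service["binary"] == "nova-compute":
        hosted = instances_by_host.get(service["host"], [])
        if hosted:
            # Refuse the delete so the compute node record, and with it
            # the resource provider UUID, is not lost.
            raise Conflict(
                "Unable to delete compute service on host %s: it is "
                "hosting %d instance(s)." % (service["host"], len(hosted))
            )
    del services[service_id]


services = {
    1: {"binary": "nova-compute", "host": "host1"},
    2: {"binary": "nova-compute", "host": "host2"},
}
instances_by_host = {"host1": ["instance-a"]}

delete_service(services, instances_by_host, 2)  # empty host: allowed
try:
    delete_service(services, instances_by_host, 1)
except Conflict as exc:
    print(exc.status_code, exc)
```

Under this sketch an operator would first migrate or evacuate the instances off host1 and then retry the delete.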

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b1

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/563229

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/563234

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/563229
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bcd462e49b4d51e78e9f31c60cd9e4d9fd8f99f9
Submitter: Zuul
Branch: stable/queens

commit bcd462e49b4d51e78e9f31c60cd9e4d9fd8f99f9
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 16:00:59 2018 -0400

    Add functional test for deleting a compute service

    This adds a functional test which asserts the things
    related to bug 1756179 where deleting a compute service
    does not also delete the related host mapping or resource
    provider resources.

    Also related to bug 1763183 in that it should not be
    possible to delete a compute service that has instances
    running on it since that will mess up resource tracking
    in Placement.

    NOTE(mriedem): There are two changes in this backport:

    1. The ResourceClass fields moved in Rocky via change
       Iea182341f9419cb514a044f76864d6bec60a3683.

    2. The _get_provider_inventory method was added in change
       I5ee11274816cd9e4f0669e9e52468a29262c9020 in Rocky.

    Change-Id: I519c5abfe24b154998f481c8a86db239a75d4729
    Related-Bug: #1756179
    Related-Bug: #1763183
    (cherry picked from commit a80ac96362c8fafba1bfe71244b52ba2f082c86e)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/563234
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a817b78dc44cf2cb4157531b2d92b03a4d0ca7d1
Submitter: Zuul
Branch: stable/queens

commit a817b78dc44cf2cb4157531b2d92b03a4d0ca7d1
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 18:00:39 2018 -0400

    Block deleting compute services which are hosting instances

    This change makes "DELETE /os-services/{service_id}" fail
    with a 409 response when attempting to delete a nova-compute
    service which is still hosting instances.

    Deleting a compute service also results in deleting the
    related compute_nodes table entry for that service host.
    The compute node resource provider in placement is tied
    to the compute node via the UUID, and if we allow deleting
    the compute service and node then the resource provider for
    that node is effectively orphaned in Placement, along with
    the instances which have allocations against that resource
    provider.

    Furthermore, restarting the compute service will create a
    new service and compute_nodes record, and the compute node
    would have a new UUID and resource provider. This will
    affect scheduling for that host since Placement will be
    reporting it as having available capacity which in reality
    is not accurate.

    A release note is included for the (justified) behavior
    change in the API. A new microversion should not be required
    for this since admins should not have to opt out of broken
    behavior. Since this API did not previously expect to return
    a 409 response, the "expected_errors" decorator is updated
    and again, should not require a microversion per the
    guidelines:

    https://docs.openstack.org/nova/latest/contributor/microversions.html#when-a-microversion-is-not-needed

    Conflicts:
          nova/tests/functional/wsgi/test_services.py

    NOTE(mriedem): This is due to the rc_fields move from
    change Iea182341f9419cb514a044f76864d6bec60a3683 in Rocky.

    Change-Id: I0bd63b655ad3d3d39af8d15c781ce0a45efc8e3a
    Closes-Bug: #1763183
    (cherry picked from commit 42f62f1ed2ad76829eb9d40a8b9646a523f6381f)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.5

This issue was fixed in the openstack/nova 17.0.5 release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/580496

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/580491
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c8dd4c6195434a1e271ad8cd7fbd66d4801a4cba
Submitter: Zuul
Branch: stable/pike

commit c8dd4c6195434a1e271ad8cd7fbd66d4801a4cba
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 16:00:59 2018 -0400

    Add functional test for deleting a compute service

    This adds a functional test which asserts the things
    related to bug 1756179 where deleting a compute service
    does not also delete the related host mapping or resource
    provider resources.

    Also related to bug 1763183 in that it should not be
    possible to delete a compute service that has instances
    running on it since that will mess up resource tracking
    in Placement.

    Conflicts:
          nova/tests/functional/test_servers.py

    NOTE(mriedem): The conflict is due to not having changes
    Ica5453b3c5418df75cad8505efc37686b57bc6ff or
    Iacb9808ef7188e3419abfac9e8c5fb5a46c71c05 in Pike.

    Change-Id: I519c5abfe24b154998f481c8a86db239a75d4729
    Related-Bug: #1756179
    Related-Bug: #1763183
    (cherry picked from commit a80ac96362c8fafba1bfe71244b52ba2f082c86e)
    (cherry picked from commit bcd462e49b4d51e78e9f31c60cd9e4d9fd8f99f9)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/580496
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8cd1204873287d9f0196cbc48d8c408448c67c43
Submitter: Zuul
Branch: stable/pike

commit 8cd1204873287d9f0196cbc48d8c408448c67c43
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 18:00:39 2018 -0400

    Block deleting compute services which are hosting instances

    This change makes "DELETE /os-services/{service_id}" fail
    with a 409 response when attempting to delete a nova-compute
    service which is still hosting instances.

    Deleting a compute service also results in deleting the
    related compute_nodes table entry for that service host.
    The compute node resource provider in placement is tied
    to the compute node via the UUID, and if we allow deleting
    the compute service and node then the resource provider for
    that node is effectively orphaned in Placement, along with
    the instances which have allocations against that resource
    provider.

    Furthermore, restarting the compute service will create a
    new service and compute_nodes record, and the compute node
    would have a new UUID and resource provider. This will
    affect scheduling for that host since Placement will be
    reporting it as having available capacity which in reality
    is not accurate.

    A release note is included for the (justified) behavior
    change in the API. A new microversion should not be required
    for this since admins should not have to opt out of broken
    behavior. Since this API did not previously expect to return
    a 409 response, the "expected_errors" decorator is updated
    and again, should not require a microversion per the
    guidelines:

    https://docs.openstack.org/nova/latest/contributor/microversions.html#when-a-microversion-is-not-needed

    Conflicts:
          nova/api/openstack/compute/services.py

    NOTE(mriedem): The conflict is due to not having change
    I4802c5b38001a756448d4feb9ca336908821f591 in Pike.

    Change-Id: I0bd63b655ad3d3d39af8d15c781ce0a45efc8e3a
    Closes-Bug: #1763183
    (cherry picked from commit 42f62f1ed2ad76829eb9d40a8b9646a523f6381f)
    (cherry picked from commit a817b78dc44cf2cb4157531b2d92b03a4d0ca7d1)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.5

This issue was fixed in the openstack/nova 16.1.5 release.
