deleting a nova-compute service leaves orphaned records in placement and host mapping

Bug #1756179 reported by Chris Friesen
This bug affects 4 people
Affects                   Status         Importance  Assigned to        Milestone
OpenStack Compute (nova)  Fix Released   Medium      Surya Seetharaman
Ocata                     Fix Released   Medium      huanhongda
Pike                      Fix Committed  Medium      Matt Riedemann
Queens                    Confirmed      Medium      Surya Seetharaman

Bug Description

Currently when deleting a nova-compute service via the API, we will delete the service and compute_node records in the DB, but the placement resource provider and host mapping records will be orphaned.

The orphaned resource provider records have been found to cause scheduler failures if you re-create the compute node with the same name (but a different UUID). It has been theorized that the stale host mapping records could end up pointing at the wrong cell.

In discussions on IRC (http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-03-15.log.html#t2018-03-15T19:30:13) it was proposed that we should do the following, in this order:

1. delete the RP in placement
2. delete the host mapping
3. delete the service/node

Optionally we could delete the compute node prior to deleting the service to make it explicit and because the ordering is slightly more logical, but this is not a requirement since it will be done implicitly as part of deleting the service.
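
The ordering proposed above could be sketched roughly as follows. This is only an in-memory illustration, not nova code: FakeDeployment, its dictionaries, and delete_compute_service are hypothetical names, and the model assumes the resource provider shares the compute node's UUID.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDeployment:
    """In-memory stand-in for the data stores involved (hypothetical)."""
    resource_providers: dict = field(default_factory=dict)  # rp uuid -> name
    host_mappings: dict = field(default_factory=dict)       # host -> cell name
    compute_nodes: dict = field(default_factory=dict)       # node uuid -> host
    services: dict = field(default_factory=dict)            # service id -> host


def delete_compute_service(dep, service_id):
    """Remove a compute service and every record that refers to its host."""
    host = dep.services[service_id]
    node_uuids = [u for u, h in dep.compute_nodes.items() if h == host]

    # 1. Delete the resource provider of each compute node in placement.
    for node_uuid in node_uuids:
        dep.resource_providers.pop(node_uuid, None)

    # 2. Delete the host mapping so the host no longer points at a cell.
    dep.host_mappings.pop(host, None)

    # 3. Delete the compute nodes, then the service record itself.
    for node_uuid in node_uuids:
        del dep.compute_nodes[node_uuid]
    del dep.services[service_id]
```

Doing the placement and host mapping cleanup first means that a failure part-way through still leaves the service record in place, so the delete can be retried.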

Chris Friesen (cbf123)
summary: - deleting a nova-compute service leaves orphaned records in placement
+ deleting a nova-compute service leaves orphaned records in placement and
+ host mapping
Matt Riedemann (mriedem)
tags: added: api cells placement
removed: compute
Changed in nova:
importance: Undecided → Medium
status: New → Triaged
Changed in nova:
assignee: nobody → Surya Seetharaman (tssurya)
Revision history for this message
Matt Riedemann (mriedem) wrote :

The change will also require a release note: today nova-api does not require the [placement] section of nova.conf to be configured, but with this change it will need that section so that nova-api can talk to placement-api to delete these records.
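
As a rough way to check for the dependency described above, an operator could test whether a nova.conf used by nova-api carries a [placement] section at all. The sketch below is an assumption-laden illustration: placement_configured is a hypothetical helper, and treating the presence of auth_url as "configured" is a simplification of the real authentication options.

```python
import configparser


def placement_configured(conf_text):
    """Return True if the text has a [placement] section that sets auth_url."""
    parser = configparser.ConfigParser(
        interpolation=None,           # nova.conf values may contain '%'
        strict=False,                 # tolerate repeated options
        default_section="__unused__"  # keep [DEFAULT] from leaking into others
    )
    parser.read_string(conf_text)
    return parser.has_option("placement", "auth_url")


SAMPLE = """
[DEFAULT]
debug = false

[placement]
auth_type = password
auth_url = http://controller/identity
"""

if __name__ == "__main__":
    print(placement_configured(SAMPLE))
```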

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/554920

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/560626

Changed in nova:
assignee: Surya Seetharaman (tssurya) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Surya Seetharaman (tssurya)
Changed in nova:
assignee: Surya Seetharaman (tssurya) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/560626
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a80ac96362c8fafba1bfe71244b52ba2f082c86e
Submitter: Zuul
Branch: master

commit a80ac96362c8fafba1bfe71244b52ba2f082c86e
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 16:00:59 2018 -0400

    Add functional test for deleting a compute service

    This adds a functional test which asserts the things
    related to bug 1756179 where deleting a compute service
    does not also delete the related host mapping or resource
    provider resources.

    Also related to bug 1763183 in that it should not be
    possible to delete a compute service that has instances
    running on it since that will mess up resource tracking
    in Placement.

    Change-Id: I519c5abfe24b154998f481c8a86db239a75d4729
    Related-Bug: #1756179
    Related-Bug: #1763183

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/560706
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ea9d0af31395fbe1686fa681cd91226ee580796e
Submitter: Zuul
Branch: master

commit ea9d0af31395fbe1686fa681cd91226ee580796e
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 21:24:43 2018 -0400

    Delete allocations from API if nova-compute is down

    When performing a "local delete" of an instance, we
    need to delete the allocations that the instance has
    against any resource providers in Placement.

    It should be noted that without this change, restarting
    the nova-compute service will delete the allocations
    for its compute node (assuming the compute node UUID
    is the same as before the instance was deleted). That
    is shown in the existing functional test modified here.

    The more important reason for this change is that in
    order to fix bug 1756179, we need to make sure the
    resource provider allocations for a given compute node
    are gone by the time the compute service is deleted.

    This adds a new functional test and a release note for
    the new behavior and need to configure nova-api for
    talking to placement, which is idempotent if
    not configured thanks to the @safe_connect decorator
    used in SchedulerReportClient.

    Closes-Bug: #1679750
    Related-Bug: #1756179

    Change-Id: If507e23f0b7e5fa417041c3870d77786498f741d
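
The behaviour the commit message attributes to @safe_connect (the call quietly does nothing when nova-api is not configured for placement) can be pictured with a simplified decorator. This is a sketch of the general pattern only, not nova's implementation: safe_call, PlacementUnavailable, and delete_allocations are hypothetical names.

```python
import functools
import logging

LOG = logging.getLogger(__name__)


class PlacementUnavailable(Exception):
    """Hypothetical error for a missing endpoint or missing credentials."""


def safe_call(func):
    """Log a warning and return None when placement cannot be reached."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except PlacementUnavailable as exc:
            LOG.warning("Placement is not usable, skipping %s: %s",
                        func.__name__, exc)
            return None
    return wrapper


@safe_call
def delete_allocations(instance_uuid, configured):
    # Stand-in for a REST call to placement; fails when not configured.
    if not configured:
        raise PlacementUnavailable("no [placement] credentials configured")
    return "deleted allocations for %s" % instance_uuid
```

With this pattern the caller sees None plus a logged warning instead of an exception, so the instance delete itself still succeeds on an unconfigured nova-api.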

Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Surya Seetharaman (tssurya)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/563229

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/563236

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/554920
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=589c495c1ae62129e20ab5e2641e330541eee01f
Submitter: Zuul
Branch: master

commit 589c495c1ae62129e20ab5e2641e330541eee01f
Author: Surya Seetharaman <email address hidden>
Date: Wed Mar 21 14:16:24 2018 +0100

    Cleanup RP and HM records while deleting a compute service.

    Currently when deleting a nova-compute service via the API, we will
    (soft) delete the service and compute_node records in the DB, but the
    placement resource provider and host mapping records will be orphaned.
    This patch deletes the resource provider and host_mapping records
    before deleting the service/compute node.

    Change-Id: I7b8622b178d5043ed1556d7bdceaf60f47e5ac80
    Closes-Bug: #1756179

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/563698

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/568925

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/568925
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3abd5f5797737b54c10ea85d4b833aff054d1bee
Submitter: Zuul
Branch: master

commit 3abd5f5797737b54c10ea85d4b833aff054d1bee
Author: Matt Riedemann <email address hidden>
Date: Wed May 16 14:28:26 2018 -0400

    Update placement upgrade docs for nova-api dependency on placement

    Change If507e23f0b7e5fa417041c3870d77786498f741d makes nova-api
    dependent on placement for deleting an instance when the nova-compute
    service on which that instance is running is down, also known as
    "local delete".

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 makes nova-api
    dependent on placement for deleting a nova-compute service record.

    Both changes are idempotent if nova-api isn't configured to use
    placement, but warnings will show up in the logs.

    This change updates the upgrade docs to mention the new dependency.

    Change-Id: I941a8f4b321e4c90a45f7d9fccb74489fae0d62d
    Related-Bug: #1679750
    Related-Bug: #1756179

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/563229
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bcd462e49b4d51e78e9f31c60cd9e4d9fd8f99f9
Submitter: Zuul
Branch: stable/queens

commit bcd462e49b4d51e78e9f31c60cd9e4d9fd8f99f9
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 16:00:59 2018 -0400

    Add functional test for deleting a compute service

    This adds a functional test which asserts the things
    related to bug 1756179 where deleting a compute service
    does not also delete the related host mapping or resource
    provider resources.

    Also related to bug 1763183 in that it should not be
    possible to delete a compute service that has instances
    running on it since that will mess up resource tracking
    in Placement.

    NOTE(mriedem): There are two changes in this backport:

    1. The ResourceClass fields moved in Rocky via change
       Iea182341f9419cb514a044f76864d6bec60a3683.

    2. The _get_provider_inventory method was added in change
       I5ee11274816cd9e4f0669e9e52468a29262c9020 in Rocky.

    Change-Id: I519c5abfe24b154998f481c8a86db239a75d4729
    Related-Bug: #1756179
    Related-Bug: #1763183
    (cherry picked from commit a80ac96362c8fafba1bfe71244b52ba2f082c86e)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/563236
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cba1a3e2c1b161204a3662a0d9fbf33da38aa7d3
Submitter: Zuul
Branch: stable/queens

commit cba1a3e2c1b161204a3662a0d9fbf33da38aa7d3
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 21:24:43 2018 -0400

    Delete allocations from API if nova-compute is down

    When performing a "local delete" of an instance, we
    need to delete the allocations that the instance has
    against any resource providers in Placement.

    It should be noted that without this change, restarting
    the nova-compute service will delete the allocations
    for its compute node (assuming the compute node UUID
    is the same as before the instance was deleted). That
    is shown in the existing functional test modified here.

    The more important reason for this change is that in
    order to fix bug 1756179, we need to make sure the
    resource provider allocations for a given compute node
    are gone by the time the compute service is deleted.

    This adds a new functional test and a release note for
    the new behavior and need to configure nova-api for
    talking to placement, which is idempotent if
    not configured thanks to the @safe_connect decorator
    used in SchedulerReportClient.

    Closes-Bug: #1679750
    Related-Bug: #1756179

    Change-Id: If507e23f0b7e5fa417041c3870d77786498f741d
    (cherry picked from commit ea9d0af31395fbe1686fa681cd91226ee580796e)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/563698
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dede2de2bd482d0378a7acd81b65d93b1635e825
Submitter: Zuul
Branch: stable/queens

commit dede2de2bd482d0378a7acd81b65d93b1635e825
Author: Surya Seetharaman <email address hidden>
Date: Wed Mar 21 14:16:24 2018 +0100

    Cleanup RP and HM records while deleting a compute service.

    Currently when deleting a nova-compute service via the API, we will
    (soft) delete the service and compute_node records in the DB, but the
    placement resource provider and host mapping records will be orphaned.
    This patch deletes the resource provider and host_mapping records
    before deleting the service/compute node.

    Change-Id: I7b8622b178d5043ed1556d7bdceaf60f47e5ac80
    Closes-Bug: #1756179
    (cherry picked from commit 589c495c1ae62129e20ab5e2641e330541eee01f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.5

This issue was fixed in the openstack/nova 17.0.5 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b2

This issue was fixed in the openstack/nova 18.0.0.0b2 development milestone.

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

Is it possible to backport these fixes to stable/pike, too?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/580498

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/580499

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/580491
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c8dd4c6195434a1e271ad8cd7fbd66d4801a4cba
Submitter: Zuul
Branch: stable/pike

commit c8dd4c6195434a1e271ad8cd7fbd66d4801a4cba
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 16:00:59 2018 -0400

    Add functional test for deleting a compute service

    This adds a functional test which asserts the things
    related to bug 1756179 where deleting a compute service
    does not also delete the related host mapping or resource
    provider resources.

    Also related to bug 1763183 in that it should not be
    possible to delete a compute service that has instances
    running on it since that will mess up resource tracking
    in Placement.

    Conflicts:
          nova/tests/functional/test_servers.py

    NOTE(mriedem): The conflict is due to not having changes
    Ica5453b3c5418df75cad8505efc37686b57bc6ff or
    Iacb9808ef7188e3419abfac9e8c5fb5a46c71c05 in Pike.

    Change-Id: I519c5abfe24b154998f481c8a86db239a75d4729
    Related-Bug: #1756179
    Related-Bug: #1763183
    (cherry picked from commit a80ac96362c8fafba1bfe71244b52ba2f082c86e)
    (cherry picked from commit bcd462e49b4d51e78e9f31c60cd9e4d9fd8f99f9)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/580498
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cd50dcaf3e51722c9510d417c1724d8cdafe450b
Submitter: Zuul
Branch: stable/pike

commit cd50dcaf3e51722c9510d417c1724d8cdafe450b
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 21:24:43 2018 -0400

    Delete allocations from API if nova-compute is down

    When performing a "local delete" of an instance, we
    need to delete the allocations that the instance has
    against any resource providers in Placement.

    It should be noted that without this change, restarting
    the nova-compute service will delete the allocations
    for its compute node (assuming the compute node UUID
    is the same as before the instance was deleted). That
    is shown in the existing functional test modified here.

    The more important reason for this change is that in
    order to fix bug 1756179, we need to make sure the
    resource provider allocations for a given compute node
    are gone by the time the compute service is deleted.

    This adds a new functional test and a release note for
    the new behavior and need to configure nova-api for
    talking to placement, which is idempotent if
    not configured thanks to the @safe_connect decorator
    used in SchedulerReportClient.

    Closes-Bug: #1679750
    Related-Bug: #1756179

    Conflicts:
          nova/compute/api.py

    NOTE(mriedem): The compute/api conflict is due to not
    having change I393118861d1f921cc2d71011ddedaf43a2e8dbdf
    in Pike. In addition to this, the call to
    delete_allocation_for_instance() does not include the
    context parameter which was introduced in change
    If38e4a6d49910f0aa5016e1bcb61aac2be416fa7 which is
    also not in Pike.

    Change-Id: If507e23f0b7e5fa417041c3870d77786498f741d
    (cherry picked from commit ea9d0af31395fbe1686fa681cd91226ee580796e)
    (cherry picked from commit cba1a3e2c1b161204a3662a0d9fbf33da38aa7d3)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/580499
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9fe847bdca39627b4d1741d2c5807ebca7101d2e
Submitter: Zuul
Branch: stable/pike

commit 9fe847bdca39627b4d1741d2c5807ebca7101d2e
Author: Surya Seetharaman <email address hidden>
Date: Wed Mar 21 14:16:24 2018 +0100

    Cleanup RP and HM records while deleting a compute service.

    Currently when deleting a nova-compute service via the API, we will
    (soft) delete the service and compute_node records in the DB, but the
    placement resource provider and host mapping records will be orphaned.
    This patch deletes the resource provider and host_mapping records
    before deleting the service/compute node.

    Change-Id: I7b8622b178d5043ed1556d7bdceaf60f47e5ac80
    Closes-Bug: #1756179
    (cherry picked from commit 589c495c1ae62129e20ab5e2641e330541eee01f)
    (cherry picked from commit dede2de2bd482d0378a7acd81b65d93b1635e825)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/603749

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.5

This issue was fixed in the openstack/nova 16.1.5 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/603749
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=94848ca40f9bce9c53b820bbfe66eb3d2d181afc
Submitter: Zuul
Branch: stable/ocata

commit 94848ca40f9bce9c53b820bbfe66eb3d2d181afc
Author: Surya Seetharaman <email address hidden>
Date: Wed Mar 21 14:16:24 2018 +0100

    Cleanup RP and HM records while deleting a compute service.

    Currently when deleting a nova-compute service via the API, we will
    (soft) delete the service and compute_node records in the DB, but the
    placement resource provider and host mapping records will be orphaned.
    This patch deletes the resource provider and host_mapping records
    before deleting the service/compute node.

    Change-Id: I7b8622b178d5043ed1556d7bdceaf60f47e5ac80
    Closes-Bug: #1756179
    (cherry picked from commit 589c495c1ae62129e20ab5e2641e330541eee01f)
    (cherry picked from commit dede2de2bd482d0378a7acd81b65d93b1635e825)
    (cherry picked from commit 9fe847bdca39627b4d1741d2c5807ebca7101d2e)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.1.5

This issue was fixed in the openstack/nova 15.1.5 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/657021

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/657021
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8921a470ee797cd75def387ce184ee0ffaae6517
Submitter: Zuul
Branch: master

commit 8921a470ee797cd75def387ce184ee0ffaae6517
Author: Matt Riedemann <email address hidden>
Date: Fri May 3 15:40:30 2019 -0400

    Avoid unnecessary joins in delete_resource_provider

    If cascade=True, we're getting all of the instances on the
    compute node just to use the uuid, which will by default
    join on the info_cache and security_groups for the instances.
    This is a simple optimization to avoid those unnecessary joins.

    A TODO is left to further optimize this with a new InstanceList
    query method to just get the instance uuids on a given host/node
    combo, but that requires an RPC change which we can't backport.

    Change-Id: Ie121210456a240c257979d3269db115ddae2d23b
    Related-Bug: #1811726
    Related-Bug: #1756179
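
The idea behind this optimization, reading only the instance uuids rather than loading joined data that is never used, can be shown in plain SQL terms. The sketch below uses sqlite3 with a toy schema; the tables are heavily simplified and the helper names are hypothetical, and the real change works at the nova object layer rather than with raw queries.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database with instances and a related table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE instances (uuid TEXT PRIMARY KEY, host TEXT, node TEXT);
        CREATE TABLE instance_info_caches (instance_uuid TEXT, network_info TEXT);
        INSERT INTO instances VALUES ('i-1', 'compute-0', 'node-0');
        INSERT INTO instances VALUES ('i-2', 'compute-0', 'node-0');
        INSERT INTO instances VALUES ('i-3', 'compute-1', 'node-1');
        INSERT INTO instance_info_caches VALUES ('i-1', '[]');
        INSERT INTO instance_info_caches VALUES ('i-2', '[]');
    """)
    return conn


def instance_uuids_on_node(conn, host, node):
    """Fetch only the uuid column; the related table is never touched."""
    rows = conn.execute(
        "SELECT uuid FROM instances WHERE host = ? AND node = ? ORDER BY uuid",
        (host, node),
    )
    return [row[0] for row in rows]
```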
