Bug #1811726 “Deleting compute service only deletes “first” iron...” : Bugs : OpenStack Compute (nova)

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-01-14:

#1

I don't think [3] is doing what you'd expect for cleaning up orphans since it won't return compute nodes records from the database if the API deleted them:

https://github.com/openstack/nova/blob/da98f4ba4554139b3901103aa0d26876b11e1d9a/nova/compute/manager.py#L7777

So that for loop here:

https://github.com/openstack/nova/blob/da98f4ba4554139b3901103aa0d26876b11e1d9a/nova/compute/manager.py#L7758

Could just be an empty list, right?

tags:

added: api compute ironic placement

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-01-14:

#2

Re comment 1, in other words, that orphan cleanup code is meant (I think) for cases like the admin deleting ironic nodes via the ironic API so the ironic driver no longer reports them, and nova needs to cleanup the related compute nodes records in the DB (so the admin didn't do anything about deleting the compute service record in the API).

Changed in nova:
status:	New → Triaged
importance:	Undecided → Medium

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-01-14:

#3

FYI, this applies all the way back to Ocata:

https://review.openstack.org/#/q/I7b8622b178d5043ed1556d7bdceaf60f47e5ac80

summary:	- Deleting compute service only deletes "first" ironic node from placement + Deleting compute service only deletes "first" ironic node resource + provider from placement
Changed in nova:
importance:	Medium → Low

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-01-14:

#4

I'd say this is medium/low severity since we do correlate compute_nodes / resource providers / ironic node resources via the same uuid, which comes from the ironic node uuid, so although we orphan any other ironic node resource providers in this case, if a compute service is managing those compute node records (ironic nodes) anywhere, it should re-use the same resource providers based on the same uuid. The possible downside is if you have the scheduler configured with a low value for limiting placement allocation candidates and placement is reporting resource providers that aren't actually being managed by any compute service - in that case you could get NoValidHost errors.

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-01-14:

#5

Related IRC discussion:

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-01-14.log.html#t2019-01-14T21:35:45

Changed in nova:
importance:	Low → Medium

Revision history for this message

Matt Riedemann (mriedem) wrote on 2019-05-03:

#6

Bug 1817833 could be related and my comment 4 might not be accurate if we can't recreate the providers (or fail if they already exist).

Matt Riedemann (mriedem) on 2019-05-03

Changed in nova:
assignee:	nobody → Matt Riedemann (mriedem)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-03: Fix proposed to nova (master)

#7

Fix proposed to branch: master
Review: https://review.opendev.org/657016

Changed in nova:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-03: Related fix proposed to nova (master)

#8

Related fix proposed to branch: master
Review: https://review.opendev.org/657021

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-05-03:

#9

Related fix proposed to branch: master
Review: https://review.opendev.org/657070

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-06-21: Fix merged to nova (master)

#10

Reviewed: https://review.opendev.org/657016
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=650fe118d128f09f78552b82abc114bb4b84930e
Submitter: Zuul
Branch: master

commit 650fe118d128f09f78552b82abc114bb4b84930e
Author: Matt Riedemann <email address hidden>
Date: Fri May 3 15:23:57 2019 -0400

Delete resource providers for all nodes when deleting compute service

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting the
    compute node resource provider associated with a compute node when
    deleting a nova-compute service. However, it would only delete the
    first compute node associated with the service which means for an
    ironic compute service that is managing multiple nodes, the resource
    providers were not cleaned up in placement. This fixes the issue by
    iterating all the compute nodes and cleaning up their providers.
    Note this could be potentially a lot of nodes, but we don't really
    have many good options here but to iterate them and clean them up
    one at a time.

    Note that this is best-effort but because of how the
    SchedulerReportClient.delete_resource_provider method ignores
    ResourceProviderInUse errors, and we could have stale allocations
    on the host for which delete_resource_provider is not accounting,
    namely allocations from evacuated instances (or incomplete migrations
    though you can't migrate baremetal instances today), we could still
    delete the compute service and orphan those in-use providers. That,
    however, is no worse than before this change where we did not try
    to cleanup all providers. The issue described above is being tracked
    with bug 1829479 and will be dealt with separately.

Change-Id: I9e852e25ea89f32bf19cdaeb1f5dac8f749f5dbc
Closes-Bug: #1811726

Changed in nova:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-06-21: Related fix merged to nova (master)

#11

Reviewed: https://review.opendev.org/657021
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8921a470ee797cd75def387ce184ee0ffaae6517
Submitter: Zuul
Branch: master

commit 8921a470ee797cd75def387ce184ee0ffaae6517
Author: Matt Riedemann <email address hidden>
Date: Fri May 3 15:40:30 2019 -0400

Avoid unnecessary joins in delete_resource_provider

    If cascade=True, we're getting all of the instances on the
    compute node just to use the uuid, which will by default
    join on the info_cache and security_groups for the instances.
    This is a simple optimization to avoid those unnecessary joins.

    A TODO is left to further optimize this with a new InstanceList
    query method to just get the instance uuids on a given host/node
    combo, but that requires an RPC change which we can't backport.

    Change-Id: Ie121210456a240c257979d3269db115ddae2d23b
    Related-Bug: #1811726
    Related-Bug: #1756179

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-06-21: Fix proposed to nova (stable/stein)

#12

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/666840

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-06-21: Fix proposed to nova (stable/rocky)

#13

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/666854

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-06-21: Fix proposed to nova (stable/queens)

#14

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/666862

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-06-21: Fix proposed to nova (stable/pike)

#15

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/666865

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-06-21: Fix proposed to nova (stable/ocata)

#16

Fix proposed to branch: stable/ocata
Review: https://review.opendev.org/666874

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-07-01: Fix merged to nova (stable/stein)

#17

Reviewed: https://review.opendev.org/666840
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e9889be94ee58a6da78ef96d09d8f5282b86f648
Submitter: Zuul
Branch: stable/stein

commit e9889be94ee58a6da78ef96d09d8f5282b86f648
Author: Matt Riedemann <email address hidden>
Date: Fri May 3 15:23:57 2019 -0400

Delete resource providers for all nodes when deleting compute service

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting the
    compute node resource provider associated with a compute node when
    deleting a nova-compute service. However, it would only delete the
    first compute node associated with the service which means for an
    ironic compute service that is managing multiple nodes, the resource
    providers were not cleaned up in placement. This fixes the issue by
    iterating all the compute nodes and cleaning up their providers.
    Note this could be potentially a lot of nodes, but we don't really
    have many good options here but to iterate them and clean them up
    one at a time.

    Note that this is best-effort but because of how the
    SchedulerReportClient.delete_resource_provider method ignores
    ResourceProviderInUse errors, and we could have stale allocations
    on the host for which delete_resource_provider is not accounting,
    namely allocations from evacuated instances (or incomplete migrations
    though you can't migrate baremetal instances today), we could still
    delete the compute service and orphan those in-use providers. That,
    however, is no worse than before this change where we did not try
    to cleanup all providers. The issue described above is being tracked
    with bug 1829479 and will be dealt with separately.

    Change-Id: I9e852e25ea89f32bf19cdaeb1f5dac8f749f5dbc
    Closes-Bug: #1811726
    (cherry picked from commit 650fe118d128f09f78552b82abc114bb4b84930e)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-07-04: Fix merged to nova (stable/rocky)

#18

Reviewed: https://review.opendev.org/666854
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=64d52788831f0e676f0b9bf9780eeadb38232def
Submitter: Zuul
Branch: stable/rocky

commit 64d52788831f0e676f0b9bf9780eeadb38232def
Author: Matt Riedemann <email address hidden>
Date: Fri May 3 15:23:57 2019 -0400

Delete resource providers for all nodes when deleting compute service

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting the
    compute node resource provider associated with a compute node when
    deleting a nova-compute service. However, it would only delete the
    first compute node associated with the service which means for an
    ironic compute service that is managing multiple nodes, the resource
    providers were not cleaned up in placement. This fixes the issue by
    iterating all the compute nodes and cleaning up their providers.
    Note this could be potentially a lot of nodes, but we don't really
    have many good options here but to iterate them and clean them up
    one at a time.

    Note that this is best-effort but because of how the
    SchedulerReportClient.delete_resource_provider method ignores
    ResourceProviderInUse errors, and we could have stale allocations
    on the host for which delete_resource_provider is not accounting,
    namely allocations from evacuated instances (or incomplete migrations
    though you can't migrate baremetal instances today), we could still
    delete the compute service and orphan those in-use providers. That,
    however, is no worse than before this change where we did not try
    to cleanup all providers. The issue described above is being tracked
    with bug 1829479 and will be dealt with separately.

    Change-Id: I9e852e25ea89f32bf19cdaeb1f5dac8f749f5dbc
    Closes-Bug: #1811726
    (cherry picked from commit 650fe118d128f09f78552b82abc114bb4b84930e)
    (cherry picked from commit e9889be94ee58a6da78ef96d09d8f5282b86f648)

Reviewed:  https://review.opendev.org/666854
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=64d52788831f0e676f0b9bf9780eeadb38232def
Submitter: Zuul
Branch:    stable/rocky

commit 64d52788831f0e676f0b9bf9780eeadb38232def
Author: Matt Riedemann <mriedem.os@gmail.com>
Date:   Fri May 3 15:23:57 2019 -0400

Delete resource providers for all nodes when deleting compute service
    
    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting the
    compute node resource provider associated with a compute node when
    deleting a nova-compute service. However, it would only delete the
    first compute node associated with the service which means for an
    ironic compute service that is managing multiple nodes, the resource
    providers were not cleaned up in placement. This fixes the issue by
    iterating all the compute nodes and cleaning up their providers.
    Note this could be potentially a lot of nodes, but we don't really
    have many good options here but to iterate them and clean them up
    one at a time.
    
    Note that this is best-effort but because of how the
    SchedulerReportClient.delete_resource_provider method ignores
    ResourceProviderInUse errors, and we could have stale allocations
    on the host for which delete_resource_provider is not accounting,
    namely allocations from evacuated instances (or incomplete migrations
    though you can't migrate baremetal instances today), we could still
    delete the compute service and orphan those in-use providers. That,
    however, is no worse than before this change where we did not try
    to cleanup all providers. The issue described above is being tracked
    with bug 1829479 and will be dealt with separately.
    
    Change-Id: I9e852e25ea89f32bf19cdaeb1f5dac8f749f5dbc
    Closes-Bug: #1811726
    (cherry picked from commit 650fe118d128f09f78552b82abc114bb4b84930e)
    (cherry picked from commit e9889be94ee58a6da78ef96d09d8f5282b86f648)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-07-20: Related fix merged to nova (master)

#19

Reviewed: https://review.opendev.org/657070
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4adface8b9c1cdb04f9401d4bbd1c2dc795f94a5
Submitter: Zuul
Branch: master

commit 4adface8b9c1cdb04f9401d4bbd1c2dc795f94a5
Author: Matt Riedemann <email address hidden>
Date: Fri May 3 18:03:39 2019 -0400

Optimize SchedulerReportClient.delete_resource_provider

    This is used to optimize the SchedulerReportClient
    delete_resource_provider method when deleting a compute
    node and its related resource provider which happens
    in both the API when a compute service is deleted and
    in the compute service in the update_available_resource
    method for orphan nodes (so the new InstanceList query
    method needs to be remotable).

The actual DB query stuff in here is tested in the
nova.tests.functional.wsgi.test_services code.

Change-Id: Id033e8f95f853ccfc34bd350be6df11c1bfb1b7d
Related-Bug: #1811726

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-09: Fix merged to nova (stable/queens)

#20

Reviewed: https://review.opendev.org/666862
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b2f438bab4a0fdaa37b1a57cff0b27fb2a3f3437
Submitter: Zuul
Branch: stable/queens

commit b2f438bab4a0fdaa37b1a57cff0b27fb2a3f3437
Author: Matt Riedemann <email address hidden>
Date: Fri May 3 15:23:57 2019 -0400

Delete resource providers for all nodes when deleting compute service

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting the
    compute node resource provider associated with a compute node when
    deleting a nova-compute service. However, it would only delete the
    first compute node associated with the service which means for an
    ironic compute service that is managing multiple nodes, the resource
    providers were not cleaned up in placement. This fixes the issue by
    iterating all the compute nodes and cleaning up their providers.
    Note this could be potentially a lot of nodes, but we don't really
    have many good options here but to iterate them and clean them up
    one at a time.

    Note that this is best-effort but because of how the
    SchedulerReportClient.delete_resource_provider method ignores
    ResourceProviderInUse errors, and we could have stale allocations
    on the host for which delete_resource_provider is not accounting,
    namely allocations from evacuated instances (or incomplete migrations
    though you can't migrate baremetal instances today), we could still
    delete the compute service and orphan those in-use providers. That,
    however, is no worse than before this change where we did not try
    to cleanup all providers. The issue described above is being tracked
    with bug 1829479 and will be dealt with separately.

Conflicts:
nova/tests/unit/api/openstack/compute/test_services.py

NOTE(mriedem): The conflict is due to not having change
Ie1aa45a508fbba16e969f216398dee15b37a9569 in Queens.

    Change-Id: I9e852e25ea89f32bf19cdaeb1f5dac8f749f5dbc
    Closes-Bug: #1811726
    (cherry picked from commit 650fe118d128f09f78552b82abc114bb4b84930e)
    (cherry picked from commit e9889be94ee58a6da78ef96d09d8f5282b86f648)
    (cherry picked from commit 64d52788831f0e676f0b9bf9780eeadb38232def)

Reviewed:  https://review.opendev.org/666862
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b2f438bab4a0fdaa37b1a57cff0b27fb2a3f3437
Submitter: Zuul
Branch:    stable/queens

commit b2f438bab4a0fdaa37b1a57cff0b27fb2a3f3437
Author: Matt Riedemann <mriedem.os@gmail.com>
Date:   Fri May 3 15:23:57 2019 -0400

Delete resource providers for all nodes when deleting compute service
    
    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting the
    compute node resource provider associated with a compute node when
    deleting a nova-compute service. However, it would only delete the
    first compute node associated with the service which means for an
    ironic compute service that is managing multiple nodes, the resource
    providers were not cleaned up in placement. This fixes the issue by
    iterating all the compute nodes and cleaning up their providers.
    Note this could be potentially a lot of nodes, but we don't really
    have many good options here but to iterate them and clean them up
    one at a time.
    
    Note that this is best-effort but because of how the
    SchedulerReportClient.delete_resource_provider method ignores
    ResourceProviderInUse errors, and we could have stale allocations
    on the host for which delete_resource_provider is not accounting,
    namely allocations from evacuated instances (or incomplete migrations
    though you can't migrate baremetal instances today), we could still
    delete the compute service and orphan those in-use providers. That,
    however, is no worse than before this change where we did not try
    to cleanup all providers. The issue described above is being tracked
    with bug 1829479 and will be dealt with separately.
    
    Conflicts:
          nova/tests/unit/api/openstack/compute/test_services.py
    
    NOTE(mriedem): The conflict is due to not having change
    Ie1aa45a508fbba16e969f216398dee15b37a9569 in Queens.
    
    Change-Id: I9e852e25ea89f32bf19cdaeb1f5dac8f749f5dbc
    Closes-Bug: #1811726
    (cherry picked from commit 650fe118d128f09f78552b82abc114bb4b84930e)
    (cherry picked from commit e9889be94ee58a6da78ef96d09d8f5282b86f648)
    (cherry picked from commit 64d52788831f0e676f0b9bf9780eeadb38232def)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-12: Fix included in openstack/nova 19.0.2

#21

This issue was fixed in the openstack/nova 19.0.2 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-12: Fix included in openstack/nova 18.2.2

#22

This issue was fixed in the openstack/nova 18.2.2 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-19: Fix included in openstack/nova 17.0.12

#23

This issue was fixed in the openstack/nova 17.0.12 release.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-27: Fix included in openstack/nova 20.0.0.0rc1

#24

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-12-18: Fix merged to nova (stable/pike)

#25

Reviewed: https://review.opendev.org/666865
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bd2e0972993ffd732cde0a3a6154424e74d44f0a
Submitter: Zuul
Branch: stable/pike

commit bd2e0972993ffd732cde0a3a6154424e74d44f0a
Author: Matt Riedemann <email address hidden>
Date: Fri May 3 15:23:57 2019 -0400

Delete resource providers for all nodes when deleting compute service

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting the
    compute node resource provider associated with a compute node when
    deleting a nova-compute service. However, it would only delete the
    first compute node associated with the service which means for an
    ironic compute service that is managing multiple nodes, the resource
    providers were not cleaned up in placement. This fixes the issue by
    iterating all the compute nodes and cleaning up their providers.
    Note this could be potentially a lot of nodes, but we don't really
    have many good options here but to iterate them and clean them up
    one at a time.

    Note that this is best-effort but because of how the
    SchedulerReportClient.delete_resource_provider method ignores
    ResourceProviderInUse errors, and we could have stale allocations
    on the host for which delete_resource_provider is not accounting,
    namely allocations from evacuated instances (or incomplete migrations
    though you can't migrate baremetal instances today), we could still
    delete the compute service and orphan those in-use providers. That,
    however, is no worse than before this change where we did not try
    to cleanup all providers. The issue described above is being tracked
    with bug 1829479 and will be dealt with separately.

    Change-Id: I9e852e25ea89f32bf19cdaeb1f5dac8f749f5dbc
    Closes-Bug: #1811726
    (cherry picked from commit 650fe118d128f09f78552b82abc114bb4b84930e)
    (cherry picked from commit e9889be94ee58a6da78ef96d09d8f5282b86f648)
    (cherry picked from commit 64d52788831f0e676f0b9bf9780eeadb38232def)
    (cherry picked from commit b2f438bab4a0fdaa37b1a57cff0b27fb2a3f3437)

Reviewed:  https://review.opendev.org/666865
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bd2e0972993ffd732cde0a3a6154424e74d44f0a
Submitter: Zuul
Branch:    stable/pike

commit bd2e0972993ffd732cde0a3a6154424e74d44f0a
Author: Matt Riedemann <mriedem.os@gmail.com>
Date:   Fri May 3 15:23:57 2019 -0400

Delete resource providers for all nodes when deleting compute service
    
    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting the
    compute node resource provider associated with a compute node when
    deleting a nova-compute service. However, it would only delete the
    first compute node associated with the service which means for an
    ironic compute service that is managing multiple nodes, the resource
    providers were not cleaned up in placement. This fixes the issue by
    iterating all the compute nodes and cleaning up their providers.
    Note this could be potentially a lot of nodes, but we don't really
    have many good options here but to iterate them and clean them up
    one at a time.
    
    Note that this is best-effort but because of how the
    SchedulerReportClient.delete_resource_provider method ignores
    ResourceProviderInUse errors, and we could have stale allocations
    on the host for which delete_resource_provider is not accounting,
    namely allocations from evacuated instances (or incomplete migrations
    though you can't migrate baremetal instances today), we could still
    delete the compute service and orphan those in-use providers. That,
    however, is no worse than before this change where we did not try
    to cleanup all providers. The issue described above is being tracked
    with bug 1829479 and will be dealt with separately.
    
    Change-Id: I9e852e25ea89f32bf19cdaeb1f5dac8f749f5dbc
    Closes-Bug: #1811726
    (cherry picked from commit 650fe118d128f09f78552b82abc114bb4b84930e)
    (cherry picked from commit e9889be94ee58a6da78ef96d09d8f5282b86f648)
    (cherry picked from commit 64d52788831f0e676f0b9bf9780eeadb38232def)
    (cherry picked from commit b2f438bab4a0fdaa37b1a57cff0b27fb2a3f3437)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-06-15: Change abandoned on nova (stable/ocata)

#26

Change abandoned by "Lee Yarwood <email address hidden>" on branch: stable/ocata
Review: https://review.opendev.org/c/openstack/nova/+/666874

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2022-08-01: Fix included in openstack/nova pike-eol

#27

This issue was fixed in the openstack/nova pike-eol release.

OpenStack Compute (nova)

Deleting compute service only deletes "first" ironic node resource provider from placement

Bug Description

Other bug subscribers

Remote bug watches

	Status	Importance	Assigned to
OpenStack Compute (nova)	Fix Released	Medium	Matt Riedemann
Ocata	In Progress	Medium	Matt Riedemann
Pike	Fix Released	Medium	Matt Riedemann
Queens	Fix Committed	Medium	Matt Riedemann
Rocky	Fix Committed	Medium	Matt Riedemann
Stein	Fix Committed	Medium	Matt Riedemann