Scheduler doesn't filter out deleted compute node records based on placement RP UUIDs

Bug #1793533 reported by Mohammed Naser on 2018-09-20
This bug affects 2 people
Affects                       Importance   Assigned to
OpenStack Compute (nova)      Medium       Dan Smith
  Ocata                       Medium       Mohammed Naser
  Pike                        Medium       Mohammed Naser
  Queens                      Medium       Mohammed Naser
  Rocky                       Medium       Mohammed Naser

Bug Description

If you are permanently decommissioning a nova-compute service, the logical steps would be:

1) Take down the service
2) Delete it from the service list (nova service-delete <uuid>)

However, this does not delete the compute node record, which stays around forever and causes the scheduler to keep complaining about it:

2018-09-20 13:15:45.312 131035 WARNING nova.scheduler.host_manager [req-c4a7c383-c606-48a7-b870-cc143710114a 234412d3482f4707877ca696e105bf5b acb15d2ffaae4eda98580c7b874d7f89 - default default] No compute service record found for host <snip>.vexxhost.net

https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L716-L720

We should delete the compute node record when a nova-compute service is deleted, or that code path should clean up the stale record automatically while warning (since service records can be rebuilt anyway?).
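A minimal sketch of the behavior behind that warning, assuming compute nodes and service records are plain dicts keyed by host name. `match_nodes_to_services` is a hypothetical helper, not the code in `nova/scheduler/host_manager.py`; it only illustrates why the message repeats: an orphaned compute node is skipped on every pass, but nothing ever removes it.

```python
import logging

LOG = logging.getLogger(__name__)


def match_nodes_to_services(compute_nodes, services_by_host):
    """Pair each compute node with its service record (hypothetical helper).

    compute_nodes: list of dicts with at least a "host" key.
    services_by_host: dict mapping host name -> service record dict.
    Returns (paired, orphans).
    """
    paired = []
    orphans = []
    for node in compute_nodes:
        service = services_by_host.get(node["host"])
        if service is None:
            # The node is skipped and logged, but left in place, so the
            # same warning fires again on the next scheduling pass.
            LOG.warning("No compute service record found for host %s",
                        node["host"])
            orphans.append(node)
            continue
        paired.append((node, service))
    return paired, orphans


if __name__ == "__main__":
    logging.basicConfig()
    nodes = [{"host": "cmp1"}, {"host": "cmp2"}]
    services = {"cmp1": {"host": "cmp1", "binary": "nova-compute"}}
    paired, orphans = match_nodes_to_services(nodes, services)
    print(len(paired), [n["host"] for n in orphans])  # 1 ['cmp2']
```

The suggestion above would amount to deleting (or soft-deleting) each entry in `orphans` instead of only logging it.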

Matt Riedemann (mriedem) wrote:

Are you sure you're stopping the nova-compute service before deleting the actual service record via the API?

https://developer.openstack.org/api-ref/compute/#delete-compute-service

Otherwise the ResourceTracker in the compute process will recreate the compute node.
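To illustrate why the order matters, here is a toy model, not nova code: `FakeCellDB`, `resource_tracker_tick`, and `delete_service` are hypothetical names. It assumes, as described above, that a still-running compute process recreates a missing compute node record but not the service record, which leaves exactly the kind of orphan the scheduler warns about.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCellDB:
    """Hypothetical stand-in for a cell database, keyed by host name."""
    services: dict = field(default_factory=dict)
    compute_nodes: dict = field(default_factory=dict)


def resource_tracker_tick(db, host):
    """One periodic update in a running nova-compute process (assumed).

    If the compute node record is missing it is recreated; the service
    record is not touched in this sketch.
    """
    db.compute_nodes.setdefault(host, {"host": host})


def delete_service(db, host):
    """Model of the API delete: drop the service and its compute node."""
    db.services.pop(host, None)
    db.compute_nodes.pop(host, None)


def orphaned_nodes(db):
    """Compute nodes whose host has no service record."""
    return sorted(h for h in db.compute_nodes if h not in db.services)


def decommission(host, stop_first):
    db = FakeCellDB(services={host: {"host": host}},
                    compute_nodes={host: {"host": host}})
    delete_service(db, host)
    if not stop_first:
        # The process is still running, so its next periodic update fires.
        resource_tracker_tick(db, host)
    return orphaned_nodes(db)


print(decommission("cmp1", stop_first=True))   # []
print(decommission("cmp1", stop_first=False))  # ['cmp1']
```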

Service.destroy() is called from the API here:

https://github.com/openstack/nova/blob/d87852ae6a1987b6faa3cb5851f9758b47ef4636/nova/api/openstack/compute/services.py#L251

That eventually calls the DB API, which deletes the associated compute node record:

https://github.com/openstack/nova/blob/d87852ae6a1987b6faa3cb5851f9758b47ef4636/nova/db/sqlalchemy/api.py#L404
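A sqlite3 sketch of that cascade, under stated assumptions: the DB API performs a soft delete, setting the `deleted` column to the row's own `id` (the oslo.db convention nova uses), so the rows stay in the table; and compute nodes are matched to the service by `service_id` or `host`. The schema and `service_destroy` below are simplified stand-ins, not nova's actual models.

```python
import sqlite3


def make_db():
    """Create a tiny, simplified schema (hypothetical, not nova's)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE services (
            id INTEGER PRIMARY KEY, host TEXT, deleted INTEGER DEFAULT 0);
        CREATE TABLE compute_nodes (
            id INTEGER PRIMARY KEY, uuid TEXT, host TEXT,
            service_id INTEGER, deleted INTEGER DEFAULT 0);
    """)
    return conn


def service_destroy(conn, service_id):
    """Soft-delete a service and the compute nodes that belong to it."""
    row = conn.execute(
        "SELECT host FROM services WHERE id = ? AND deleted = 0",
        (service_id,)).fetchone()
    if row is None:
        raise LookupError(f"service {service_id} not found")
    host = row[0]
    with conn:  # commit both updates together
        conn.execute("UPDATE services SET deleted = id WHERE id = ?",
                     (service_id,))
        # Rows are flagged, not removed: deleted = id marks them as gone.
        conn.execute(
            "UPDATE compute_nodes SET deleted = id "
            "WHERE deleted = 0 AND (service_id = ? OR host = ?)",
            (service_id, host))


if __name__ == "__main__":
    conn = make_db()
    with conn:
        conn.execute("INSERT INTO services (id, host) VALUES (1, 'cmp1')")
        conn.execute("INSERT INTO compute_nodes (id, uuid, host, service_id) "
                     "VALUES (7, 'rp-1', 'cmp1', 1)")
    service_destroy(conn, 1)
    print(conn.execute("SELECT id, uuid, deleted FROM compute_nodes").fetchall())
    # [(7, 'rp-1', 7)]
```

The compute node row is still present after the delete, which is why any query that forgets to check `deleted` can still find it.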

Changed in nova:
status: New → Invalid
Matt Riedemann (mriedem) wrote:

Sounds like this was a duplicate of bug 1756179.

Matt Riedemann (mriedem) wrote:

The related issue is that the scheduler was not filtering out deleted compute node records when pulling them from the cell DB:

https://github.com/openstack/nova/blob/d87852ae6a1987b6faa3cb5851f9758b47ef4636/nova/objects/compute_node.py#L443

That query doesn't filter out deleted records. Granted, if the resource provider record in placement had been cleaned up properly, we wouldn't have gotten that far anyway, but it's still an issue.
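A small sqlite3 reproduction of the problem, using a simplified, hypothetical `compute_nodes` table. Placement answers the scheduler with resource provider UUIDs, which for compute nodes are the compute node UUIDs; a lookup that matches on UUID alone still returns a soft-deleted node whenever placement's answer is stale. `get_all_by_uuids_unfiltered` is an illustrative name for the buggy query shape, not nova's code.

```python
import sqlite3


def make_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compute_nodes ("
        "id INTEGER PRIMARY KEY, uuid TEXT, host TEXT, deleted INTEGER DEFAULT 0)")
    conn.executemany(
        "INSERT INTO compute_nodes (id, uuid, host, deleted) VALUES (?, ?, ?, ?)",
        [(1, "rp-live", "cmp1", 0),
         (2, "rp-stale", "cmp2", 2)])  # cmp2 was soft-deleted (deleted = id)
    return conn


def get_all_by_uuids_unfiltered(conn, uuids):
    """The buggy shape: match on uuid only and ignore the deleted column."""
    if not uuids:
        return []
    marks = ", ".join("?" for _ in uuids)
    sql = f"SELECT host FROM compute_nodes WHERE uuid IN ({marks}) ORDER BY id"
    return [host for (host,) in conn.execute(sql, list(uuids))]


conn = make_demo_db()
# Hypothetical stale placement answer: rp-stale's provider was never removed.
from_placement = ["rp-live", "rp-stale"]
print(get_all_by_uuids_unfiltered(conn, from_placement))  # ['cmp1', 'cmp2']
```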

Changed in nova:
status: Invalid → Triaged
importance: Undecided → Medium
summary: - Deleting a service with nova-compute binary doesn't remove compute node
+ Scheduler doesn't filter out deleted compute node records based on placement RP UUIDs
Matt Riedemann (mriedem) on 2018-09-20
Changed in nova:
assignee: nobody → Dan Smith (danms)

Fix proposed to branch: master
Review: https://review.openstack.org/604108

Changed in nova:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/604108
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=37f3444c32ccb72076a1a6549c183f40c33fe684
Submitter: Zuul
Branch: master

commit 37f3444c32ccb72076a1a6549c183f40c33fe684
Author: Dan Smith <email address hidden>
Date: Thu Sep 20 07:15:25 2018 -0700

    Filter deleted computes from get_all_by_uuids()

    Fix ComputeNodeList.get_all_by_uuids() to use model_query() so that
    deleted compute nodes are filtered from the results. Without this,
    a stale result from placement could cause us to choose a compute
    node as a scheduling destination that has since been deleted.

    Change-Id: I811e84af46d678c3fdbf94ee400eabe659fc3d4e
    Closes-Bug: #1793533
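The fix routes the query through `model_query()`, which in nova hides soft-deleted rows by default (it accepts a `read_deleted` argument). The sketch below imitates that idea with sqlite3; `model_query` and `get_all_by_uuids` here are hypothetical stand-ins, not nova's SQLAlchemy implementation.

```python
import sqlite3

# Extra WHERE clause for each read_deleted mode (mirrors the idea only).
_DELETED_FILTER = {
    "no": " AND deleted = 0",     # default: hide soft-deleted rows
    "yes": "",                    # return everything
    "only": " AND deleted != 0",  # return only soft-deleted rows
}


def model_query(conn, table, columns, where, params=(), read_deleted="no"):
    """Hypothetical model_query-like helper that filters deleted rows."""
    if read_deleted not in _DELETED_FILTER:
        raise ValueError(f"unrecognized read_deleted value: {read_deleted!r}")
    # table/columns are trusted constants in this sketch; values use params.
    sql = (f"SELECT {columns} FROM {table} "
           f"WHERE ({where}){_DELETED_FILTER[read_deleted]} ORDER BY id")
    return conn.execute(sql, tuple(params)).fetchall()


def get_all_by_uuids(conn, uuids):
    """Fixed shape: the deleted filter comes from model_query by default."""
    if not uuids:
        return []
    marks = ", ".join("?" for _ in uuids)
    rows = model_query(conn, "compute_nodes", "host",
                       f"uuid IN ({marks})", uuids)
    return [host for (host,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compute_nodes (id INTEGER PRIMARY KEY, "
             "uuid TEXT, host TEXT, deleted INTEGER DEFAULT 0)")
conn.executemany(
    "INSERT INTO compute_nodes (id, uuid, host, deleted) VALUES (?, ?, ?, ?)",
    [(1, "rp-live", "cmp1", 0), (2, "rp-stale", "cmp2", 2)])

print(get_all_by_uuids(conn, ["rp-live", "rp-stale"]))  # ['cmp1']
```

Putting the filter in the shared query helper, rather than in each caller, means a stale UUID from placement can no longer surface a deleted compute node as a scheduling candidate.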

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/604367
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=518fdc7a22dd3dabbc9dd032e816e0aecdd75150
Submitter: Zuul
Branch: stable/rocky

commit 518fdc7a22dd3dabbc9dd032e816e0aecdd75150
Author: Dan Smith <email address hidden>
Date: Thu Sep 20 07:15:25 2018 -0700

    Filter deleted computes from get_all_by_uuids()

    Fix ComputeNodeList.get_all_by_uuids() to use model_query() so that
    deleted compute nodes are filtered from the results. Without this,
    a stale result from placement could cause us to choose a compute
    node as a scheduling destination that has since been deleted.

    Change-Id: I811e84af46d678c3fdbf94ee400eabe659fc3d4e
    Closes-Bug: #1793533
    (cherry picked from commit 37f3444c32ccb72076a1a6549c183f40c33fe684)

Reviewed: https://review.openstack.org/604448
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d01336a332a4080f26ac70d131ecf3ed66e129a9
Submitter: Zuul
Branch: stable/queens

commit d01336a332a4080f26ac70d131ecf3ed66e129a9
Author: Dan Smith <email address hidden>
Date: Thu Sep 20 07:15:25 2018 -0700

    Filter deleted computes from get_all_by_uuids()

    Fix ComputeNodeList.get_all_by_uuids() to use model_query() so that
    deleted compute nodes are filtered from the results. Without this,
    a stale result from placement could cause us to choose a compute
    node as a scheduling destination that has since been deleted.

    Change-Id: I811e84af46d678c3fdbf94ee400eabe659fc3d4e
    Closes-Bug: #1793533
    (cherry picked from commit 37f3444c32ccb72076a1a6549c183f40c33fe684)
    (cherry picked from commit 518fdc7a22dd3dabbc9dd032e816e0aecdd75150)

Reviewed: https://review.openstack.org/604449
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=976e52a49c7b43a5bb9ba86bf7970bf7185ef285
Submitter: Zuul
Branch: stable/pike

commit 976e52a49c7b43a5bb9ba86bf7970bf7185ef285
Author: Dan Smith <email address hidden>
Date: Thu Sep 20 07:15:25 2018 -0700

    Filter deleted computes from get_all_by_uuids()

    Fix ComputeNodeList.get_all_by_uuids() to use model_query() so that
    deleted compute nodes are filtered from the results. Without this,
    a stale result from placement could cause us to choose a compute
    node as a scheduling destination that has since been deleted.

    Change-Id: I811e84af46d678c3fdbf94ee400eabe659fc3d4e
    Closes-Bug: #1793533
    (cherry picked from commit 37f3444c32ccb72076a1a6549c183f40c33fe684)
    (cherry picked from commit 518fdc7a22dd3dabbc9dd032e816e0aecdd75150)
    (cherry picked from commit d01336a332a4080f26ac70d131ecf3ed66e129a9)

Reviewed: https://review.openstack.org/604451
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=81fe10ffbce67977577e04c72cd48bdf9b0e498c
Submitter: Zuul
Branch: stable/ocata

commit 81fe10ffbce67977577e04c72cd48bdf9b0e498c
Author: Dan Smith <email address hidden>
Date: Thu Sep 20 07:15:25 2018 -0700

    Filter deleted computes from get_all_by_uuids()

    Fix ComputeNodeList.get_all_by_uuids() to use model_query() so that
    deleted compute nodes are filtered from the results. Without this,
    a stale result from placement could cause us to choose a compute
    node as a scheduling destination that has since been deleted.

    Conflicts:
            nova/tests/functional/db/test_compute_node.py

    test_compute_node.test_get_by_hypervisor_type did not exist in
    Ocata so had to be cleared out.

    Change-Id: I811e84af46d678c3fdbf94ee400eabe659fc3d4e
    Closes-Bug: #1793533
    (cherry picked from commit 37f3444c32ccb72076a1a6549c183f40c33fe684)
    (cherry picked from commit 518fdc7a22dd3dabbc9dd032e816e0aecdd75150)
    (cherry picked from commit d01336a332a4080f26ac70d131ecf3ed66e129a9)
    (cherry picked from commit 976e52a49c7b43a5bb9ba86bf7970bf7185ef285)

This issue was fixed in the openstack/nova 18.0.2 release.

This issue was fixed in the openstack/nova 16.1.6 release.

This issue was fixed in the openstack/nova 15.1.5 release.

This issue was fixed in the openstack/nova 17.0.7 release.

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
