Check compute_id existence when nova-compute reports info to placement

Bug #1817833 reported by xulei on 2019-02-27
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Matt Riedemann

Bug Description

Description
===========
According to https://bugs.launchpad.net/nova/+bug/1756179, deleting a nova-compute service now also deletes its compute_node records, resource provider records and host mapping records in the DB. I found that if the service is deleted while the nova-compute process is still running, the compute_node and resource_provider records are deleted without problems, but nova-compute continues to report with the old resource_provider uuid. So when we restart nova-compute to recover the service, it raises ResourceProviderCreationFailed.

Steps to reproduce
==================
1. Check the environment and the resource_providers table.
# nova service-list | grep 'nova-compute'
| 3d9092b0-e164-4094-8672-1c855971218d | nova-compute | devstack-q | nova | enabled | up |
MariaDB [placement]> select uuid,name from resource_providers;
+--------------------------------------+------------+
| uuid | name |
+--------------------------------------+------------+
| edfff022-c19f-4720-85f9-fd947ae36b07 | devstack-q |
+--------------------------------------+------------+

2. Delete the compute service while the nova-compute process is running, then check the resource_providers table.
# nova service-delete 3d9092b0-e164-4094-8672-1c855971218d
MariaDB [placement]> select * from resource_providers;
Empty set (0.00 sec)

3. Wait a minute, then restart the nova-compute process.
# systemctl restart devstack@n-cpu

Expected result
===============
nova-compute works properly and reports to resource_provider with a new uuid.

Actual result
=============
nova-compute gets a 409 when it tries to create a resource_provider with a new uuid, and reports 'No resource provider with uuid 52943fd2-d700-416f-9e16-7fe4744979b3 found'.

I found that while nova-compute is running, it re-creates the resource provider with the old uuid once that record is gone. So the current resource_provider uuid in the DB is still 'edfff022-c19f-4720-85f9-fd947ae36b07'. After the restart, nova-compute tries to create a new resource provider with the name 'devstack-q'. Unfortunately, the name column in the table is unique.
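The sequence above can be replayed with a toy in-memory model. This is only a sketch: FakePlacement, Conflict and their methods are hypothetical names invented for illustration, not nova or placement code. The only behaviour taken from this report is that provider names must be unique.

```python
# Toy stand-in for placement's resource_providers table.
# Every name below is hypothetical; this is not nova or placement code.


class Conflict(Exception):
    """Stands in for an HTTP 409 response from placement."""


class FakePlacement:
    def __init__(self):
        self.providers = {}  # uuid -> name

    def create_provider(self, uuid, name):
        # The name column is unique, so a provider with a new uuid but an
        # existing name is rejected.
        if uuid in self.providers or name in self.providers.values():
            raise Conflict(f"provider {name!r} / {uuid} already exists")
        self.providers[uuid] = name

    def delete_provider(self, uuid):
        self.providers.pop(uuid, None)


placement = FakePlacement()
old_uuid = "edfff022-c19f-4720-85f9-fd947ae36b07"
new_uuid = "52943fd2-d700-416f-9e16-7fe4744979b3"

placement.create_provider(old_uuid, "devstack-q")  # state in step 1
placement.delete_provider(old_uuid)                # step 2: service-delete
placement.create_provider(old_uuid, "devstack-q")  # running nova-compute re-creates it

try:
    placement.create_provider(new_uuid, "devstack-q")  # step 3: after restart
    outcome = "created"
except Conflict:
    outcome = "conflict"

print(outcome)
```
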

So I think we should check that the compute_id exists first, and only then update the resource_provider_tree. If it does not exist, raise ComputeHostNotFound instead of reporting.
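A rough sketch of the proposed check, assuming a dict of known compute node ids in place of the DB and a callback in place of the placement update. The ComputeHostNotFound class here is a local stand-in, and report_inventory and its parameters are hypothetical, not the actual nova code path.

```python
# Hypothetical illustration of "check existence before reporting".


class ComputeHostNotFound(Exception):
    """Local stand-in for the exception named in this report."""


def report_inventory(compute_nodes, compute_id, update_provider_tree):
    """Report to placement only if the compute node record still exists.

    compute_nodes: dict of compute_id -> hostname (stands in for the DB).
    update_provider_tree: callable that would push data to placement.
    """
    if compute_id not in compute_nodes:
        # The record was deleted underneath us; do not re-create the
        # resource provider with the old uuid.
        raise ComputeHostNotFound(f"compute node {compute_id} not found")
    update_provider_tree(compute_nodes[compute_id])


reported = []
nodes = {1: "devstack-q"}

report_inventory(nodes, 1, reported.append)   # node exists: report goes out

del nodes[1]                                  # service and node deleted
try:
    report_inventory(nodes, 1, reported.append)
    skipped = False
except ComputeHostNotFound:
    skipped = True
```

With a check like this the running process would stop re-creating the old provider, though as noted later in this thread, stopping nova-compute before deleting the service is the documented procedure.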

xulei (605423512-j) on 2019-02-27
Changed in nova:
assignee: nobody → xulei (605423512-j)
tags: added: placement
Matt Riedemann (mriedem) wrote :

Which release are you testing this against? master (stein)?

xulei (605423512-j) wrote :

I found the problem in Pike (our product is based on Pike), and it also affects the master branch.

Fix proposed to branch: master
Review: https://review.openstack.org/641899

Changed in nova:
status: New → In Progress
Matt Riedemann (mriedem) wrote :

The docs explicitly say that the nova-compute service needs to be stopped before you delete the resource:

https://developer.openstack.org/api-ref/compute/?expanded=delete-compute-service-detail#delete-compute-service

Otherwise the running compute service will try to recreate the compute_nodes and resource provider records.

Matt Riedemann (mriedem) wrote :

Having said that, I see something missed in the fix for bug 1756179 here:

https://github.com/openstack/nova/blob/b9bcbab86b8314fbaaeb2d2af6282d4a612aeb8d/nova/api/openstack/compute/services.py#L270

That does not account for ironic where the compute service could be managing more than one node and will only delete the resource provider in placement for the first compute node in the list:

https://github.com/openstack/nova/blob/b9bcbab86b8314fbaaeb2d2af6282d4a612aeb8d/nova/objects/service.py#L313
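To illustrate the gap described above, here is a simplified sketch in which one service manages several nodes. The function names and data shapes are assumptions made for the example and do not mirror the linked nova code.

```python
# Hypothetical model: one compute service, several (ironic) nodes, each
# with its own resource provider keyed by node uuid.


def delete_first_provider_only(node_uuids, providers):
    """Mimics the behaviour described above: only the first node is handled."""
    if node_uuids:
        providers.pop(node_uuids[0], None)


def delete_all_providers(node_uuids, providers):
    """Handles every node managed by the service."""
    for node_uuid in node_uuids:
        providers.pop(node_uuid, None)


node_uuids = ["node-a", "node-b", "node-c"]

providers = {u: f"rp-{u}" for u in node_uuids}
delete_first_provider_only(node_uuids, providers)
leftover_first_only = sorted(providers)   # providers that were orphaned

providers = {u: f"rp-{u}" for u in node_uuids}
delete_all_providers(node_uuids, providers)
leftover_all = sorted(providers)
```
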

Matt Riedemann (mriedem) wrote :

Ah there is a bug for the issue I mentioned in comment 5:

https://bugs.launchpad.net/nova/+bug/1811726

Matt Riedemann (mriedem) wrote :

Bug 1829479 might be related somehow.

Reviewed: https://review.opendev.org/663737
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2629d65fbc15d8698f98117e0d6072810f70da03
Submitter: Zuul
Branch: master

commit 2629d65fbc15d8698f98117e0d6072810f70da03
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in an
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
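The orphaning scenario in the commit message above can be sketched as follows. This is a toy model with hypothetical data structures; only the exception name ResourceProviderInUse and the "error is ignored" behaviour are taken from the commit message.

```python
# Toy model of a service delete where the provider still has allocations.


class ResourceProviderInUse(Exception):
    """Local stand-in for the error named in the commit message."""


def _delete_provider(providers, allocations, rp_uuid):
    if allocations.get(rp_uuid):
        raise ResourceProviderInUse(rp_uuid)
    providers.pop(rp_uuid, None)


def delete_resource_provider(providers, allocations, rp_uuid):
    # The in-use error is swallowed, as the commit message describes.
    try:
        _delete_provider(providers, allocations, rp_uuid)
    except ResourceProviderInUse:
        pass


services = {"devstack-q"}
providers = {"rp-1": "devstack-q"}
# Allocation left behind, e.g. by an evacuated server or an unconfirmed
# migration (hypothetical consumer id).
allocations = {"rp-1": ["consumer-1"]}

services.discard("devstack-q")                           # service is deleted
delete_resource_provider(providers, allocations, "rp-1")  # provider is not

orphaned = "rp-1" in providers and "devstack-q" not in services
```
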

Fix proposed to branch: master
Review: https://review.opendev.org/678100

Changed in nova:
assignee: xulei (605423512-j) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2019-08-22
Changed in nova:
importance: Undecided → Medium