Orphan allocations cause orphan resource providers and prevent compute service deletion

Bug #2054329 reported by Robert Franzke
This bug affects 4 people
Affects: OpenStack Compute (nova)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========
Orphan allocations can exist against a resource provider, for example when something goes wrong during a migration.

When a nova-compute-service is deleted, the nova-api also tries to delete the corresponding resource provider in placement.
If the resource provider still has allocations against it, deleting the resource provider fails, but deleting the nova-compute-service still succeeds.
This leaves an orphan resource provider behind.

The cause is the try/except around the deletion of the resource provider:
https://opendev.org/openstack/nova/src/commit/6e510eb62e00c34e98a5245a6de2dd2955ffb57a/nova/api/openstack/compute/services.py#L321
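The effect of that try/except can be illustrated with a small standalone model. This is only a sketch, not nova's code: `FakePlacement`, `ResourceProviderInUse` and `delete_compute_service` are hypothetical names, and the service and its resource provider are assumed to share one UUID for simplicity.

```python
class ResourceProviderInUse(Exception):
    """Hypothetical stand-in for placement refusing to delete a provider."""


class FakePlacement:
    """Toy in-memory model of placement (an assumption, not the real API)."""

    def __init__(self):
        # provider uuid -> {"name": str, "allocations": dict}
        self.providers = {}

    def delete_provider(self, rp_uuid):
        provider = self.providers[rp_uuid]
        if provider["allocations"]:
            # Providers with allocations cannot be deleted.
            raise ResourceProviderInUse(rp_uuid)
        del self.providers[rp_uuid]


def delete_compute_service(services, placement, service_uuid):
    """Delete a service; a placement failure is logged but not propagated."""
    try:
        placement.delete_provider(service_uuid)
    except ResourceProviderInUse as exc:
        print(f"warning: provider {exc} still has allocations, skipping")
    # The service record is removed regardless of the placement outcome.
    del services[service_uuid]


placement = FakePlacement()
placement.providers["uuid-old"] = {
    "name": "compute-1",
    "allocations": {"consumer-x": {"VCPU": 2}},  # the orphan allocation
}
services = {"uuid-old": "compute-1"}

delete_compute_service(services, placement, "uuid-old")

# Providers that no longer belong to any service are orphans.
orphaned = [u for u in placement.providers if u not in services]
print(orphaned)
```

In this model the service is gone after the call while the provider `uuid-old` remains, which is the orphan state described above.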

If a new nova-compute-service with the same hostname is then created, no new resource provider is created, because one with that hostname already exists.
As a result, the ID of the nova-compute-service no longer matches the ID of the resource provider.

If you now try to delete the new nova-compute-service, the request fails with a 'ValueError' because of this mismatch.
The same failure affects every other request to placement that references the resource provider by UUID instead of by name.
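The mismatch can be sketched in the same spirit. The helpers below are hypothetical and only model the behaviour described in this report (a provider is reused when its name matches, while later lookups go by UUID); they are not nova or placement functions.

```python
import uuid


class ProviderNotFound(Exception):
    """Hypothetical error for a UUID lookup that finds no provider."""


def ensure_provider(providers_by_uuid, hostname, new_uuid):
    """Return the UUID of the provider named hostname.

    A new provider is registered under new_uuid only if no provider
    with that name exists yet.
    """
    for rp_uuid, name in providers_by_uuid.items():
        if name == hostname:
            return rp_uuid  # the existing (orphan) provider is reused
    providers_by_uuid[new_uuid] = hostname
    return new_uuid


def get_provider_by_uuid(providers_by_uuid, rp_uuid):
    """Look a provider up by UUID, as UUID-based requests would."""
    if rp_uuid not in providers_by_uuid:
        raise ProviderNotFound(rp_uuid)
    return providers_by_uuid[rp_uuid]


old_uuid = str(uuid.uuid4())  # UUID of the deleted service
new_uuid = str(uuid.uuid4())  # UUID of the recreated service
providers = {old_uuid: "compute-1"}  # orphan left behind by the first delete

rp_uuid = ensure_provider(providers, "compute-1", new_uuid)
mismatch = rp_uuid != new_uuid

try:
    get_provider_by_uuid(providers, new_uuid)
    lookup_failed = False
except ProviderNotFound:
    lookup_failed = True

print(mismatch, lookup_failed)
```

Here the recreated service ends up associated with the old provider by name, so any lookup under the new UUID fails.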

Steps to reproduce
==================
1. Generate orphaned allocations on a resource provider
This can be done by creating an allocation for a random consumer UUID:
```
openstack resource provider allocation set <random-uuid> --allocation="rp=<your-resource-provider-id>,VCPU=2" --project-id <your-project-id> --user-id <your-user-id>
```
2. Delete the nova-compute-service via the nova-api
3. Restart the nova-compute service, so a new nova-compute-service is created
4. Errors start to appear in the placement/nova-api logs, saying that the resource provider with the old UUID cannot be found
5. Delete the nova-compute-service via the nova-api again. This returns a 500 error and the nova-compute-service is not deleted.
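Before step 2, it may help to check whether the resource provider still has allocations. The sketch below only inspects a response body and makes no HTTP calls. It assumes the body has the shape placement returns when listing a provider's allocations (an `allocations` mapping keyed by consumer UUID); `consumers_with_allocations` and the sample data are hypothetical.

```python
def consumers_with_allocations(body):
    """Return the consumer UUIDs that hold resources on a provider.

    `body` is assumed to look like a placement allocation listing for
    one resource provider: {"allocations": {consumer_uuid: {"resources": {...}}}}.
    """
    allocations = body.get("allocations", {})
    return sorted(
        consumer
        for consumer, alloc in allocations.items()
        if alloc.get("resources")
    )


# Illustrative data only, not taken from a real deployment.
sample_body = {
    "allocations": {
        "11111111-2222-3333-4444-555555555555": {"resources": {"VCPU": 2}},
    },
    "resource_provider_generation": 7,
}

blocking = consumers_with_allocations(sample_body)
safe_to_delete = not blocking
print(blocking, safe_to_delete)
```

If `blocking` is non-empty, deleting the compute service at this point would be expected to leave the provider behind, as in the steps above.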

Expected result
===============
No errors in the logs about a resource provider not being found by its ID.
Deleting the recreated nova-compute-service should succeed.

Actual result
=============
We see errors in the log about the resource provider not being found:
```
An error occurred while updating COMPUTE_STATUS_DISABLED trait on compute node resource provider d5d7cf1c-51ea-4139-9fc3-6007ba58441e. The trait will be synchronized when the update_available_resource periodic task runs. Error: Failed to get traits for resource provider with UUID d5d7cf1c-51ea-4139-9fc3-6007ba58441e
```
We cannot delete the newly created nova-compute-service: the request fails with a ValueError because no resource provider can be found under the nova-compute-service UUID.

Environment
===========
We are running OpenStack Zed, but based on the code the issue should still be present on the master branch.

Sylvain Bauza (sylvain-bauza) wrote:

This is a known issue that we recently fixed by ensuring that you can't change the hostname silently: https://specs.openstack.org/openstack/nova-specs/specs/2023.1/implemented/stable-compute-uuid.html

That series won't be backported to Zed, so I'd recommend upgrading to Antelope. In the meantime, you can clean up the orphaned resources by using the 'nova-manage placement audit' command, which will tell you which placement resources are zombies.

Changed in nova:
status: New → Won't Fix