OpenStack Compute (nova)

orphan allocations cause orphan resource providers and prevents compute service deletion

Bug #2054329 reported by Robert Franzke on 2024-02-19

This bug affects 4 people

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Won't Fix	Undecided	Unassigned

Bug Description

Description
===========
It can happen, that there are orphan allocations against a resource provider.
E.g. when something went wrong during a migration.

During the deletion of a nova-compute-service, the nova-api tries to delete the resource-provider in placement aswell.
When the resource provider has still allocations against it, the deletion of the resource-provider will fail but the deletion of the nova-compute-service will be successfull.
This causes orphan resource-providers.

This is based on the try-catch around the deletion of the resource-provider:
https://opendev.org/openstack/nova/src/commit/6e510eb62e00c34e98a5245a6de2dd2955ffb57a/nova/api/openstack/compute/services.py#L321

If a new nova-compute-service with the same hostname gets created, it will not create a new resource provider as there is already one with the correct hostname.
This causes a mismatch between the ID of the nova-compute-service and the ID of the resource-provider.

If you now try to delete the new nova-compute-service, it will generate an 'ValueError', due to this mismatch.
This also happens for all other requests to placement, where the resource_provider is referenced via the UUID instead of the name.

Steps to reproduce
==================
1. Generate orphaned allocations on a resource provider
Can be done by generating a random allocation:
```
openstack resource provider allocation set <random-uuid> --allocation="rp=<your-resource-provider-id>,VCPU=2" --project-id <your-project-id> --user-id <your-user-id>
```
2. Delete the nova-compute-service via the nova-api
3. Restart the nova-compute service, so a new nova-compute-service is created
4. You will start to see erros in the logs of placement/nova-api, regarding not finding the resource provider with the old UUID
5. Delete the nova-compute-service via the nova-api, this will generate a 500 error and the nova-compute-service is not deleted.

Expected result
===============
No erros in the logs regarding not finding a resource-provider based on its ID.
The deletion of the recreated nova-compute-service should be succesfull.

Actual result
=============
We see erros in the log regarding not finding the resource provider:
```
An error occurred while updating COMPUTE_STATUS_DISABLED trait on compute node resource provider d5d7cf1c-51ea-4139-9fc3-6007ba58441e. The trait will be synchronized when the update_available_resource periodic task runs. Error: Failed to get traits for resource provider with UUID d5d7cf1c-51ea-4139-9fc3-6007ba58441e
```
We are not able to delete the newly created nova-compute-service, due to a ValueError as it is not able to find the resource-provider based on the nova-compute-service UUID.

Environment
===========
We are running Openstack Zed, but based on the Code the issue should be still present on the master branch.

See original description

Robert Franzke (robertfranzke) on 2024-02-19

description:

updated

Revision history for this message

Sylvain Bauza (sylvain-bauza) wrote on 2024-03-19:

This is a known issue that we recently fixed by ensuring that you can't change the hostname silently : https://specs.openstack.org/openstack/nova-specs/specs/2023.1/implemented/stable-compute-uuid.html

That series won't be backported to Zed so I'd recommend you to upgrade to Antelope. In the meantime, you can do some janitory on the orphaned resources by using the 'nova-manage placement audit' command which will tell you which placement resources are zombies.

Changed in nova:
status:	New → Won't Fix

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.