OpenStack Compute (nova)

Activity log for bug #2054329

Date	Who	What changed	Old value	New value	Message
2024-02-19 16:23:12	Robert Franzke	bug			added bug
2024-02-19 16:24:45	Robert Franzke	description	Description =========== It can happen that there are orphan allocations against a resource provider, e.g. when something went wrong during a migration. During the deletion of a nova-compute-service, the nova-api tries to delete the resource-provider in placement as well. When the resource provider still has allocations against it, the deletion of the resource-provider will fail but the deletion of the nova-compute-service will be successful. This causes orphan resource-providers. This is based on the try-except around the deletion of the resource-provider: https://opendev.org/openstack/nova/src/commit/6e510eb62e00c34e98a5245a6de2dd2955ffb57a/nova/api/openstack/compute/services.py#L321 If a new nova-compute-service with the same hostname gets created, it will not create a new resource provider as there is already one with the correct hostname. This causes a mismatch between the ID of the nova-compute-service and the resource provider. If you now try to delete the new nova-compute-service, it will generate a 'ValueError' due to this mismatch. This also happens for all other requests to placement where the resource_provider is referenced via the UUID instead of the name. Steps to reproduce ================== 1. Generate orphaned allocations on a resource provider. This can be done by generating a random allocation: ``` openstack resource provider allocation set <random-uuid> --allocation="rp=<your-resource-provider-id>,VCPU=2" --project-id <your-project-id> --user-id <your-user-id> ``` 2. Delete the nova-compute-service via the nova-api 3. Restart the nova-compute service, so a new nova-compute-service is created 4. You will start to see errors in the logs of placement/nova-api regarding not finding the resource provider with the old UUID 5. Delete the nova-compute-service via the nova-api; this will generate a 500 error and the nova-compute-service is not deleted. 
Expected result =============== No errors in the logs regarding not finding a resource-provider based on its ID. The deletion of the recreated nova-compute-service should be successful. Actual result ============= We see errors in the log regarding not finding the resource provider: ``` An error occurred while updating COMPUTE_STATUS_DISABLED trait on compute node resource provider d5d7cf1c-51ea-4139-9fc3-6007ba58441e. The trait will be synchronized when the update_available_resource periodic task runs. Error: Failed to get traits for resource provider with UUID d5d7cf1c-51ea-4139-9fc3-6007ba58441e ``` We are not able to delete the newly created nova-compute-service, due to a ValueError as it is not able to find the resource-provider based on the nova-compute-service UUID. Environment =========== We are running OpenStack Zed, but based on the code the issue should still be present on the master branch.	Description =========== It can happen that there are orphan allocations against a resource provider, e.g. when something went wrong during a migration. During the deletion of a nova-compute-service, the nova-api tries to delete the resource-provider in placement as well. When the resource provider still has allocations against it, the deletion of the resource-provider will fail but the deletion of the nova-compute-service will be successful. This causes orphan resource-providers. This is based on the try-except around the deletion of the resource-provider: https://opendev.org/openstack/nova/src/commit/6e510eb62e00c34e98a5245a6de2dd2955ffb57a/nova/api/openstack/compute/services.py#L321 If a new nova-compute-service with the same hostname gets created, it will not create a new resource provider as there is already one with the correct hostname. This causes a mismatch between the ID of the nova-compute-service and the ID of the resource-provider. If you now try to delete the new nova-compute-service, it will generate a 'ValueError' due to this mismatch. 
This also happens for all other requests to placement where the resource_provider is referenced via the UUID instead of the name. Steps to reproduce ================== 1. Generate orphaned allocations on a resource provider. This can be done by generating a random allocation: ``` openstack resource provider allocation set <random-uuid> --allocation="rp=<your-resource-provider-id>,VCPU=2" --project-id <your-project-id> --user-id <your-user-id> ``` 2. Delete the nova-compute-service via the nova-api 3. Restart the nova-compute service, so a new nova-compute-service is created 4. You will start to see errors in the logs of placement/nova-api regarding not finding the resource provider with the old UUID 5. Delete the nova-compute-service via the nova-api; this will generate a 500 error and the nova-compute-service is not deleted. Expected result =============== No errors in the logs regarding not finding a resource-provider based on its ID. The deletion of the recreated nova-compute-service should be successful. Actual result ============= We see errors in the log regarding not finding the resource provider: ``` An error occurred while updating COMPUTE_STATUS_DISABLED trait on compute node resource provider d5d7cf1c-51ea-4139-9fc3-6007ba58441e. The trait will be synchronized when the update_available_resource periodic task runs. Error: Failed to get traits for resource provider with UUID d5d7cf1c-51ea-4139-9fc3-6007ba58441e ``` We are not able to delete the newly created nova-compute-service, due to a ValueError as it is not able to find the resource-provider based on the nova-compute-service UUID. Environment =========== We are running OpenStack Zed, but based on the code the issue should still be present on the master branch.
2024-03-19 15:10:42	Sylvain Bauza	nova: status	New	Won't Fix
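
The failure sequence in the description can be illustrated with a small in-memory model. This is only a sketch of the reported behaviour, not nova or placement code: `ToyPlacement`, `ensure_provider`, `delete_provider` and `delete_compute_service` are hypothetical names invented for illustration, and the choice of which exception is swallowed is an assumption made to mirror the report (the first deletion succeeds despite the provider error, the second fails with a ValueError).

```python
import uuid


class ToyPlacement:
    """Hypothetical in-memory stand-in for placement; not the real API."""

    def __init__(self):
        self.providers = {}    # provider UUID -> hostname
        self.allocations = {}  # provider UUID -> set of consumer UUIDs

    def ensure_provider(self, rp_uuid, hostname):
        # As in the report: a provider with the same hostname is reused,
        # so no provider is created under the new UUID.
        for existing_uuid, name in self.providers.items():
            if name == hostname:
                return existing_uuid
        self.providers[rp_uuid] = hostname
        self.allocations[rp_uuid] = set()
        return rp_uuid

    def delete_provider(self, rp_uuid):
        if rp_uuid not in self.providers:
            raise ValueError(f"no resource provider with UUID {rp_uuid}")
        if self.allocations[rp_uuid]:
            raise RuntimeError("resource provider still has allocations")
        del self.providers[rp_uuid]
        del self.allocations[rp_uuid]


def delete_compute_service(placement, services, service_uuid):
    """Toy deletion: a failed provider delete is swallowed (assumption)."""
    try:
        placement.delete_provider(service_uuid)
    except RuntimeError:
        pass  # the service is removed anyway, leaving an orphan provider
    del services[service_uuid]


placement = ToyPlacement()
services = {}

# Step 1: a service whose provider carries an orphan allocation.
old_id = str(uuid.uuid4())
services[old_id] = "compute-1"
placement.ensure_provider(old_id, "compute-1")
placement.allocations[old_id].add(str(uuid.uuid4()))

# Step 2: the service is deleted, but the provider survives.
delete_compute_service(placement, services, old_id)
orphan_survives = old_id in placement.providers

# Step 3: a new service with the same hostname reuses the old provider.
new_id = str(uuid.uuid4())
services[new_id] = "compute-1"
reused_id = placement.ensure_provider(new_id, "compute-1")

# Step 5: deleting the new service looks up the provider by the new UUID.
try:
    delete_compute_service(placement, services, new_id)
    second_delete_error = None
except ValueError as exc:
    second_delete_error = exc
```

In this model the second deletion fails because the only provider for the hostname is still registered under the old UUID, which matches the ID mismatch the description reports.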