Nova doesn't clean up claims after evacuation

Bug #2077009 reported by Viktor Křivák
Affects: OpenStack Compute (nova) | Status: In Progress | Importance: Undecided | Assigned to: Unassigned

Bug Description

When a VM is evacuated from a failed server without an explicit destination (i.e. letting the scheduler decide), the claims against the old hypervisor in placement are never deleted.

How to reproduce:
- Place a VM on a hypervisor
- Check that the claims are OK:
+--------------------------------------+------------+-------------------------------+----------------------------------+----------------------------------+
| resource_provider | generation | resources | project_id | user_id |
+--------------------------------------+------------+-------------------------------+----------------------------------+----------------------------------+
| 229cce5f-3b87-438a-baa9-539be0fc9bd8 | 5 | {'VCPU': 1, 'MEMORY_MB': 256} | 4facfb06808a4621b4f47123a0184a4a | 15da82817e56446198fcdd870a45d8f4 |
+--------------------------------------+------------+-------------------------------+----------------------------------+----------------------------------+
- Stop the hypervisor and, once Nova marks it as down, run the evacuation without specifying a destination
- Check the claims again:
+--------------------------------------+------------+-------------------------------+----------------------------------+----------------------------------+
| resource_provider | generation | resources | project_id | user_id |
+--------------------------------------+------------+-------------------------------+----------------------------------+----------------------------------+
| 229cce5f-3b87-438a-baa9-539be0fc9bd8 | 6 | {'VCPU': 1, 'MEMORY_MB': 256} | 4facfb06808a4621b4f47123a0184a4a | 15da82817e56446198fcdd870a45d8f4 |
| 5395932e-b5e0-4a0c-be6a-7328af751642 | 14 | {'VCPU': 1, 'MEMORY_MB': 256} | 4facfb06808a4621b4f47123a0184a4a | 15da82817e56446198fcdd870a45d8f4 |
+--------------------------------------+------------+-------------------------------+----------------------------------+----------------------------------+

Result: the claims against the old hypervisor have not been deleted.
Expected result: only claims for the new hypervisor exist.

This is possibly a regression of https://bugs.launchpad.net/nova/+bug/1896463.
It probably appeared when the resource tracker was improved and the whole migration procedure was rewritten. Migration/resize work because claim deletion happens in the confirm/revert action; evacuation, however, has no equivalent step, so the source claim is never deleted.
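The difference between the two flows can be sketched with a toy allocation model (all names and structures here are illustrative, not Nova's actual code):

```python
# Toy model of placement allocations: consumer -> {resource_provider: resources}.
# "claim", "confirm_move", "old-rp", etc. are hypothetical names.
allocations = {"instance-1": {"old-rp": {"VCPU": 1, "MEMORY_MB": 256}}}

def claim(consumer, rp, resources):
    """Claim resources for a consumer on a resource provider."""
    allocations[consumer][rp] = resources

def confirm_move(consumer, source_rp):
    """Resize/migrate confirm: drop the source-host allocation
    (revert would instead drop the destination one)."""
    allocations[consumer].pop(source_rp, None)

# Resize/migrate: claim on the destination, then confirm cleans the source.
claim("instance-1", "new-rp", {"VCPU": 1, "MEMORY_MB": 256})
confirm_move("instance-1", "old-rp")
assert list(allocations["instance-1"]) == ["new-rp"]

# Evacuation: claim on the destination, but no confirm step ever runs,
# so the source allocation stays behind -- the leak reported here.
allocations = {"instance-1": {"old-rp": {"VCPU": 1, "MEMORY_MB": 256}}}
claim("instance-1", "new-rp", {"VCPU": 1, "MEMORY_MB": 256})
assert set(allocations["instance-1"]) == {"old-rp", "new-rp"}
```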

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/926292

Changed in nova:
status: New → In Progress
Revision history for this message
sean mooney (sean-k-mooney) wrote :

They are cleaned up when the failed compute is fixed, as part of init_host, today.

That is the currently expected behavior.

Revision history for this message
Viktor Křivák (viktor-krivak) wrote :

But this will break any additional evacuation.
Assume you have 10 hypervisors and one dies; you evacuate, say, 9 VMs and they are spread evenly. Now another hypervisor dies and you have a problem, because those VMs cannot be evacuated anymore. It will fail with a message like: "Instance has a complex allocation and cannot be moved" (or something like that).

Another possibility is that you never replace the failed hypervisor. Say it is old hardware and it just failed completely; you don't need to replace it because you have enough capacity. In that case you will need to manually fix placement for everything that was evacuated.
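The capacity impact of a leftover claim can be illustrated with a small sketch (provider names and inventory numbers are made up): placement counts a stale allocation against capacity exactly like a live one.

```python
# Hypothetical inventories; a stale allocation on a dead host still
# consumes capacity in placement's accounting until it is cleaned up.
inventory = {"rp-alive": {"VCPU": 8}, "rp-dead": {"VCPU": 8}}

allocations = [
    ("inst-1", "rp-alive", 1),  # live allocation on the evacuation target
    ("inst-1", "rp-dead", 1),   # stale allocation left on the dead host
]

def free_vcpus(rp):
    """Free capacity as placement would see it."""
    used = sum(n for _, provider, n in allocations if provider == rp)
    return inventory[rp]["VCPU"] - used

# The dead host still appears to consume capacity even though nothing runs there.
print(free_vcpus("rp-dead"))  # -> 7
```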

Revision history for this message
Michal Arbet (michalarbet) wrote :

+1 Good point

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Yep, it's possible that you never replace it, and then it can prevent additional evacuations.

I'm just pointing out that the way this is meant to be cleaned up today is by fixing the broken host and letting it clean up the allocations.

If you do a compute service delete, we also remove any allocations related to that compute node in placement as part of that process, and we have the heal_allocations command to fix them too.

So while I'm not against improving the situation, we need to be careful not to break the existing flow that operators expect, which is that the allocations are correctly cleaned up when the failed node is put back in service.

That means we need functional test coverage for both workflows.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

If the error is along the lines of
> Instance has a complex allocation and cannot be moved (or something like that).

then it is either a bug in the logic that decides what can be evacuated, or the computes you evacuated to in the first place have a provider tree that blocks the further evacuation.

Could you reproduce it and collect the instance's allocations from placement for all computes? Could you also provide the exact logs Nova prints when rejecting the evacuation?

Having allocations on multiple computes alone should not result in the complex allocation error.

Also, do you happen to force the destination host with an old microversion? See https://github.com/openstack/nova/blob/61f44e992ee2e64e81999c2d57b57e357c2b6c32/releasenotes/notes/remove-live-migrate-evacuate-force-flag-cb50608d5930585c.yaml#L4-L8

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I was not able to reproduce the subsequent-evacuation issue in my environment. I was able to evacuate the same instance twice without recovering any of the computes.

https://paste.opendev.org/show/bIu4YQb7rA7MV6QiV7xM/

So I suspect something is missing from the reproduction steps, as the existing multi-node allocation does not prevent the evacuation.

I tried this with the Antelope version.

Revision history for this message
Viktor Křivák (viktor-krivak) wrote :

You need to specify the destination host for the evacuation:

openstack --os-compute-api-version 2.29 server evacuate --host devstack-compute1 7b21820d-31db-4927-833e-d8f065e40da8

This results in the following error:
...
Sep 03 14:41:17 devstack-control nova-conductor[255596]: ERROR oslo_messaging.rpc.server nova.exception.NoValidHost: No valid host was found. Unable to move instance 7b21820d-31db-4927-833e-d8f065e40da8 to host devstack-compute1. The instance has complex allocations on the source host so move cannot be forced.

Sorry, when I wrote the bug description I thought the duplicate claims themselves were the bug and didn't include how to fully reproduce the whole behaviour.
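Based on the error message above, the rejection can be sketched roughly like this (a simplification, not Nova's real check): a move to a requested host with an old microversion copies the source allocation to the target without consulting the scheduler, which only works when the allocation sits on a single resource provider.

```python
def can_force_move(allocation):
    """allocation: {resource_provider_uuid: resources} for one instance.
    A forced move copies the source allocation to the requested host
    without running the scheduler, so it is refused when the allocation
    spans more than one provider (simplified, hypothetical logic)."""
    return len(allocation) == 1

# Healthy case: one provider, the move can be forced.
assert can_force_move({"rp-src": {"VCPU": 1, "MEMORY_MB": 256}})

# After the leak: allocations on both the dead source and the first
# evacuation target -> "complex allocations", NoValidHost.
leftover = {
    "rp-dead-source": {"VCPU": 1, "MEMORY_MB": 256},
    "rp-first-dest": {"VCPU": 1, "MEMORY_MB": 256},
}
assert not can_force_move(leftover)
```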
