OpenStack Compute (nova)

Comment 15 for bug 1896463

Balazs Gibizer (balazs-gibizer) wrote on 2020-09-30:

I did some analysis of the other move operations to see if they are affected by the same issue.

During cold migrate and resize, instance.host is set to the dest host when the instance reaches the VERIFY_RESIZE state. At this point the migration status is set to 'finished', but 'finished' is not an end state of the migration: a migration in the 'finished' state is still considered in progress by the resource tracker. Only later, when the resize is confirmed (or reverted), is the migration status set to 'confirmed' (or 'reverted'), which is an end state. So a simple delay in the resource tracker is not enough to trigger the race. The delay in the RT has to be long enough for the user to confirm the migration by hand, or for the automatic confirmation task to be configured and actually run. This makes the race pretty unlikely.
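To make the status distinction concrete, here is a minimal sketch. It is a toy model, not nova code: the function names are hypothetical, and the status set lists only the values named in this comment, not nova's full list of migration statuses.

```python
# Toy model only. END_STATES lists just the end states named in this
# comment; nova has further migration statuses that are not modelled here.
END_STATES = frozenset({"confirmed", "reverted"})


def tracked_as_in_progress(status):
    """Return True if this toy resource tracker still counts the migration."""
    return status not in END_STATES


def resize_statuses(confirm):
    """Statuses a resize passes through from VERIFY_RESIZE onward (hypothetical helper)."""
    return ["finished", "confirmed" if confirm else "reverted"]


if __name__ == "__main__":
    for status in resize_statuses(confirm=True):
        # 'finished' is still tracked; only 'confirmed' ends the tracking.
        print(status, tracked_as_in_progress(status))
```

In this model the migration stops being tracked only at the confirm (or revert) step, which is why the resource tracker delay would have to span that step for the race to be hit.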

In the case of live migration the process is a lot more complex. There are multiple steps in the live migration process that grab the COMPUTE_RESOURCE_SEMAPHORE in the resource tracker (e.g. the PCI claim), so simply starting and then slowing down the update_available_resource task in the resource tracker does not work as a reproduction. I gave up on trying to reproduce the same problem, but I'm not convinced that such a fault does not exist for live migration. A reproduction would need to trigger the periodic task after the last live migration step that grabs the COMPUTE_RESOURCE_SEMAPHORE but before the live migration finishes.

Unshelve (after shelve offload) is not affected by the bug, as it uses the same instance_claim as the boot process. The instance_claim sets instance.host under the COMPUTE_RESOURCE_SEMAPHORE lock, so there is no chance of an overlapping update_available_resource periodic run.
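The locking argument can be illustrated with a small self-contained sketch. This is not nova's implementation: ToyTracker, its attributes, and run_demo are hypothetical, and a plain threading.Lock stands in for COMPUTE_RESOURCE_SEMAPHORE. The only point is that when the claim and the host assignment happen under the same lock as the periodic run, the periodic run never observes a claim without its host.

```python
import threading

# Stand-in for nova's COMPUTE_RESOURCE_SEMAPHORE; a plain lock in this sketch.
COMPUTE_RESOURCE_SEMAPHORE = threading.Lock()


class ToyTracker:
    """Hypothetical, heavily simplified resource tracker."""

    def __init__(self, host):
        self.host = host
        self.claims = {}          # instance uuid -> claimed amount
        self.instance_hosts = {}  # instance uuid -> host (toy instance.host)

    def instance_claim(self, uuid, amount):
        # The claim and the host assignment happen atomically under the lock.
        with COMPUTE_RESOURCE_SEMAPHORE:
            self.claims[uuid] = amount
            self.instance_hosts[uuid] = self.host

    def update_available_resource(self):
        # Periodic run: drop claims whose instance is not on this host.
        with COMPUTE_RESOURCE_SEMAPHORE:
            on_host = {u for u, h in self.instance_hosts.items()
                       if h == self.host}
            self.claims = {u: a for u, a in self.claims.items()
                           if u in on_host}
            return sum(self.claims.values())


def run_demo(count=20):
    """Run claims and periodic updates concurrently; return the final usage."""
    tracker = ToyTracker("dest-host")
    threads = []
    for i in range(count):
        threads.append(threading.Thread(
            target=tracker.instance_claim, args=("inst-%d" % i, 1)))
        threads.append(threading.Thread(
            target=tracker.update_available_resource))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return tracker.update_available_resource()


if __name__ == "__main__":
    print(run_demo())
```

However the threads interleave, no claim is dropped in this model, because the periodic run can only execute entirely before or entirely after each claim.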