_init_instance recovery from failed in-progress resize leaves old records and allocations

Bug #1836369 reported by Matt Riedemann on 2019-07-12
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: Low
Assigned to: Unassigned

Bug Description

This came up in review here:

https://review.opendev.org/#/c/667177/8/nova/virt/libvirt/driver.py@9175

And was confirmed in a functional test patch here:

https://review.opendev.org/#/c/670393/

There are some known things that get left behind when recovering a guest on the source host after the source compute service host crashes while the virt driver's migrate_disk_and_power_off is running:

- the instance.migration_context
- the instance.new_flavor
- the instance.system_metadata 'old_vm_state' key

The migration_context might be the only real issue there since it's used in the API for routing os-server-external-events. Those fields all get set during _prep_resize on the dest host.
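
To make the leftover state concrete, here is a minimal sketch of what checking for and clearing those three fields could look like. FakeInstance is a hypothetical stand-in, not the real nova Instance object, and both helper functions are illustrative names that do not exist in nova; only the field names and the 'old_vm_state' key come from the list above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeInstance:
    """Hypothetical stand-in for a nova Instance; not the real object."""
    host: str
    migration_context: Optional[dict] = None
    new_flavor: Optional[str] = None
    system_metadata: dict = field(default_factory=dict)


def leftover_resize_fields(instance):
    """Return the names of resize-related fields that are still set."""
    leftovers = []
    if instance.migration_context is not None:
        leftovers.append("migration_context")
    if instance.new_flavor is not None:
        leftovers.append("new_flavor")
    if "old_vm_state" in instance.system_metadata:
        leftovers.append("old_vm_state")
    return leftovers


def clear_resize_leftovers(instance):
    """Reset the fields that were set during prep_resize (illustrative only)."""
    instance.migration_context = None
    instance.new_flavor = None
    # pop() with a default so a missing key is not an error.
    instance.system_metadata.pop("old_vm_state", None)
```

A test for the recovery path could assert that leftover_resize_fields() returns an empty list once the guest is back on the source host.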

Probably the bigger issue is that the migration-based allocations don't get cleaned up. This means the source host allocations are still tracked against the migration record for the old flavor, and the dest host allocations for the new flavor are tracked by the instance, even though the instance isn't running on the dest host.
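
The stale allocation state can be modelled with plain dictionaries. The sketch below is not nova or placement code: the consumer IDs, provider names, and resource amounts are made up, and revert_allocations is a hypothetical helper showing what a cleanup would have to achieve (source host allocations back on the instance, dest host allocations gone).

```python
import copy


def revert_allocations(allocations, instance_id, migration_id):
    """Return a copy where the migration's allocations move back to the instance.

    allocations maps consumer ID -> provider name -> resource amounts.
    """
    reverted = copy.deepcopy(allocations)
    source_allocs = reverted.pop(migration_id, None)
    if source_allocs is None:
        # Nothing is held by the migration record, so there is nothing to swap.
        return reverted
    # Overwrites the instance's dest host allocations for the new flavor.
    reverted[instance_id] = source_allocs
    return reverted


# State left behind after the crash, as described above (values are made up).
stale = {
    "migration-1": {"source-host": {"VCPU": 1, "MEMORY_MB": 512}},
    "instance-1": {"dest-host": {"VCPU": 2, "MEMORY_MB": 2048}},
}
cleaned = revert_allocations(stale, "instance-1", "migration-1")
```

After the call, cleaned holds a single consumer, the instance, with the old-flavor allocations against the source host, and stale is left unmodified.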

Matt Riedemann (mriedem) wrote :

Marked as Low severity since this is extremely latent behavior.

Changed in nova:
importance: Undecided → Low

Reviewed: https://review.opendev.org/670393
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8db712fe040b15f2b8bc5538338658d3aac246e3
Submitter: Zuul
Branch: master

commit 8db712fe040b15f2b8bc5538338658d3aac246e3
Author: Matt Riedemann <email address hidden>
Date: Thu Jul 11 17:11:32 2019 -0400

    Add functional test for resize crash compute restart revert

    During review of change I51673e58fc8d5f051df911630f6d7a928d123a5b
    there was discussion about the RESIZE_MIGRATING crashed resize
    cleanup on restart of the compute service and how it may or may
    not work but is likely missing some things to cleanup like fields
    set on the instance during prep_resize and resource allocations
    in placement.

    This adds a functional test to hit that code and make assertions
    about what it does and does not cleanup after the crashed resize.

    Related-Bug: #1836369

    Change-Id: I107d842520c088b4859a3b36621ce6bd8e970475

Matt Riedemann (mriedem) on 2019-08-13
tags: added: placement