_init_instance recovery from failed in-progress resize leaves old records and allocations

Bug #1836369 reported by Matt Riedemann
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

This came up in review here:

https://review.opendev.org/#/c/667177/8/nova/virt/libvirt/driver.py@9175

And was confirmed in a functional test patch here:

https://review.opendev.org/#/c/670393/

Some state is known to be left behind when a guest is recovered on the source host after the source compute service crashes while the virt driver's migrate_disk_and_power_off is running:

- the instance.migration_context
- the instance.new_flavor
- the instance.system_metadata 'old_vm_state' key

The migration_context might be the only real issue there since it's used in the API for routing os-server-external-events. Those fields all get set during _prep_resize on the dest host.
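
To make the leftover state concrete, here is a minimal sketch of what clearing those three fields could look like. It uses a plain stand-in class, not nova's real Instance object; FakeInstance and clear_resize_leftovers are hypothetical names chosen for illustration, and this is not presented as the fix nova adopted.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FakeInstance:
    """Hypothetical stand-in holding only the fields named in this report."""
    migration_context: Optional[Any] = None
    new_flavor: Optional[Any] = None
    system_metadata: Dict[str, str] = field(default_factory=dict)


def clear_resize_leftovers(instance: FakeInstance) -> List[str]:
    """Reset the fields set during _prep_resize and return which were stale."""
    stale = []
    if instance.migration_context is not None:
        instance.migration_context = None
        stale.append("migration_context")
    if instance.new_flavor is not None:
        instance.new_flavor = None
        stale.append("new_flavor")
    # pop() removes the key if present and leaves other metadata untouched.
    if instance.system_metadata.pop("old_vm_state", None) is not None:
        stale.append("system_metadata['old_vm_state']")
    return stale
```

Calling the helper a second time on the same object returns an empty list, which is one way a test could assert that recovery left nothing behind.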

The bigger issue is probably that the migration-based allocations don't get cleaned up. This means the source host allocations are still tracked against the migration record for the old flavor, and the dest host allocations for the new flavor are tracked by the instance, even though the instance isn't running on the dest host.
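
The stale allocation layout described above can be sketched as a small check. This is an illustration only: allocations are modeled as plain nested dicts keyed by consumer UUID and then resource provider UUID, and find_stale_resize_allocations is a hypothetical helper, not a nova or placement API.

```python
from typing import Dict, List

# consumer UUID -> {resource provider UUID -> {resource class -> amount}}
Allocations = Dict[str, Dict[str, Dict[str, int]]]


def find_stale_resize_allocations(
    allocations: Allocations,
    instance_uuid: str,
    migration_uuid: str,
    source_rp: str,
    dest_rp: str,
) -> List[str]:
    """Describe allocations that should not remain once the guest is
    recovered on the source host (assumed layout, see lead-in)."""
    problems = []
    # Old-flavor allocations on the source host are held by the migration.
    if source_rp in allocations.get(migration_uuid, {}):
        problems.append("migration consumer holds source host allocations")
    # New-flavor allocations on the dest host are held by the instance.
    if dest_rp in allocations.get(instance_uuid, {}):
        problems.append("instance consumer holds dest host allocations")
    return problems


# Example state after the crashed resize, as described in this report.
leftover = {
    "migration-1": {"source-rp": {"VCPU": 1, "MEMORY_MB": 512}},
    "instance-1": {"dest-rp": {"VCPU": 2, "MEMORY_MB": 2048}},
}
print(find_stale_resize_allocations(
    leftover, "instance-1", "migration-1", "source-rp", "dest-rp"))
```

In a clean recovery, the instance would be the consumer of the old-flavor allocations on the source host and the check would return an empty list.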

Matt Riedemann (mriedem) wrote :

Marked as Low severity since this is extremely latent behavior.

Changed in nova:
importance: Undecided → Low
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/670393
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8db712fe040b15f2b8bc5538338658d3aac246e3
Submitter: Zuul
Branch: master

commit 8db712fe040b15f2b8bc5538338658d3aac246e3
Author: Matt Riedemann <email address hidden>
Date: Thu Jul 11 17:11:32 2019 -0400

    Add functional test for resize crash compute restart revert

    During review of change I51673e58fc8d5f051df911630f6d7a928d123a5b
    there was discussion about the RESIZE_MIGRATING crashed resize
    cleanup on restart of the compute service and how it may or may
    not work but is likely missing some things to cleanup like fields
    set on the instance during prep_resize and resource allocations
    in placement.

    This adds a functional test to hit that code and make assertions
    about what it does and does not cleanup after the crashed resize.

    Related-Bug: #1836369

    Change-Id: I107d842520c088b4859a3b36621ce6bd8e970475

Matt Riedemann (mriedem)
tags: added: placement
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/687532

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/687563

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/687874

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.opendev.org/687913

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/687532
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c565cab47945146cbbd1f1a75f42fef0fc531483
Submitter: Zuul
Branch: stable/stein

commit c565cab47945146cbbd1f1a75f42fef0fc531483
Author: Matt Riedemann <email address hidden>
Date: Thu Jul 11 17:11:32 2019 -0400

    Add functional test for resize crash compute restart revert

    During review of change I51673e58fc8d5f051df911630f6d7a928d123a5b
    there was discussion about the RESIZE_MIGRATING crashed resize
    cleanup on restart of the compute service and how it may or may
    not work but is likely missing some things to cleanup like fields
    set on the instance during prep_resize and resource allocations
    in placement.

    This adds a functional test to hit that code and make assertions
    about what it does and does not cleanup after the crashed resize.

    Related-Bug: #1836369

    Change-Id: I107d842520c088b4859a3b36621ce6bd8e970475
    (cherry picked from commit 8db712fe040b15f2b8bc5538338658d3aac246e3)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/687563
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f04fc63ffcde238f31c28c0fc5886716b669d718
Submitter: Zuul
Branch: stable/rocky

commit f04fc63ffcde238f31c28c0fc5886716b669d718
Author: Matt Riedemann <email address hidden>
Date: Thu Jul 11 17:11:32 2019 -0400

    Add functional test for resize crash compute restart revert

    During review of change I51673e58fc8d5f051df911630f6d7a928d123a5b
    there was discussion about the RESIZE_MIGRATING crashed resize
    cleanup on restart of the compute service and how it may or may
    not work but is likely missing some things to cleanup like fields
    set on the instance during prep_resize and resource allocations
    in placement.

    This adds a functional test to hit that code and make assertions
    about what it does and does not cleanup after the crashed resize.

    stable/rocky:
    * The integrated_helper change is needed as
    Ic062446e5c620c89aec3065b34bcdc6bf5966275 is missing from rocky.

    Related-Bug: #1836369

    Change-Id: I107d842520c088b4859a3b36621ce6bd8e970475
    (cherry picked from commit 8db712fe040b15f2b8bc5538338658d3aac246e3)
    (cherry picked from commit c565cab47945146cbbd1f1a75f42fef0fc531483)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/687874
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dae6fbfce61b91d44e514d73b6e9979ff0c15fae
Submitter: Zuul
Branch: stable/queens

commit dae6fbfce61b91d44e514d73b6e9979ff0c15fae
Author: Matt Riedemann <email address hidden>
Date: Thu Jul 11 17:11:32 2019 -0400

    Add functional test for resize crash compute restart revert

    During review of change I51673e58fc8d5f051df911630f6d7a928d123a5b
    there was discussion about the RESIZE_MIGRATING crashed resize
    cleanup on restart of the compute service and how it may or may
    not work but is likely missing some things to cleanup like fields
    set on the instance during prep_resize and resource allocations
    in placement.

    This adds a functional test to hit that code and make assertions
    about what it does and does not cleanup after the crashed resize.

    Conflicts:
        nova/tests/functional/integrated_helpers.py
    Note: file conflict is due to not having patch
    Iea283322124cb35fc0bc6d25f35548621e8c8c2f in queens, and this is why
    the test_init_host change is needed as well.

    Related-Bug: #1836369

    Change-Id: I107d842520c088b4859a3b36621ce6bd8e970475
    (cherry picked from commit 8db712fe040b15f2b8bc5538338658d3aac246e3)
    (cherry picked from commit c565cab47945146cbbd1f1a75f42fef0fc531483)
    (cherry picked from commit f04fc63ffcde238f31c28c0fc5886716b669d718)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.opendev.org/687913
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4ddb1bddd6b2c3f5c317471fe736f936158dfeff
Submitter: Zuul
Branch: stable/pike

commit 4ddb1bddd6b2c3f5c317471fe736f936158dfeff
Author: Matt Riedemann <email address hidden>
Date: Thu Jul 11 17:11:32 2019 -0400

    Add functional test for resize crash compute restart revert

    During review of change I51673e58fc8d5f051df911630f6d7a928d123a5b
    there was discussion about the RESIZE_MIGRATING crashed resize
    cleanup on restart of the compute service and how it may or may
    not work but is likely missing some things to cleanup like fields
    set on the instance during prep_resize and resource allocations
    in placement.

    This adds a functional test to hit that code and make assertions
    about what it does and does not cleanup after the crashed resize.

    Needed changes in test_init_host.py compared to Queens:
    * The last assert, self.assertNotIn(server['id'], source_allocations),
      in the functional reproducer does not fail, so it is removed.
      The rest of the test asserts still show that at least part of the
      bug exists in pike.

    Related-Bug: #1836369

    Change-Id: I107d842520c088b4859a3b36621ce6bd8e970475
    (cherry picked from commit 8db712fe040b15f2b8bc5538338658d3aac246e3)
    (cherry picked from commit c565cab47945146cbbd1f1a75f42fef0fc531483)
    (cherry picked from commit f04fc63ffcde238f31c28c0fc5886716b669d718)
    (cherry picked from commit dae6fbfce61b91d44e514d73b6e9979ff0c15fae)

tags: added: in-stable-pike