Hypervisor resource usage on source still shows old flavor usage after resize confirm until update_available_resource periodic runs

Bug #1818914 reported by Matt Riedemann on 2019-03-06
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Low
Matt Riedemann

Bug Description

I actually uncovered this due to some failing functional tests for cross-cell resize:

https://review.openstack.org/#/c/641176/2/nova/compute/resource_tracker.py@503

But this goes back to https://review.openstack.org/#/c/370374/ for bug 1641750 and StarlingX has already fixed it:

https://github.com/starlingx-staging/stx-nova/blob/master/nova/compute/resource_tracker.py#L728

The issue is that if you set the "update_resources_interval" to some higher value, let's say 10 minutes, and resize an instance and immediately confirm it, because let's say "resize_confirm_window" is set to 1 second, then the GET /os-hypervisors/{hypervisor_id} results for things like "local_gb_used", "memory_mb_used" and "vcpus_used" will still show usage for the old flavor even though the instance is running on the dest host with the new flavor and is gone from the source host. That doesn't get fixed until the update_available_resource periodic task runs on the source again (or the source nova-compute service is restarted).

This is because the source compute resource tracker is not tracking the migration in it's "tracked_migrations" dict. The resize claim happens on the dest host and that's where the migration is "tracked". The ResourceTracker on the source is tracking the instance in 'tracked_instances' rather than 'tracked_migrations'.

On the source host when the RT code is called here:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L1063

"tracked = uuid in self.tracked_instances" will be True because the instance is on the source until it gets resized to the dest and then the host value changes here:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/manager.py#L4500

But in the RT this means we won't get the itype here:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L1125

So the source RT doesn't track the migration here:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L1146

This is important because later in confirm_resize (on the source host) when it calls RT.drop_move_claim:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/manager.py#L4014

That will only update resource usage and decrement the old flavor if it's a tracked migration:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L478

As noted from the TODO in the elif block below:

https://github.com/openstack/nova/blob/eaa29f71ef01f5da2edfa79886a302f8a5f352ae/nova/compute/resource_tracker.py#L489

This is semi-low priority given how latent it is and the fact it's self-healing since the next run of the update_available_resource periodic will fix the resource usage on the source host, but in a busy cloud it could mean the difference between a server being able to build on that source host or not based on it's tracked resource usage.

Matt Riedemann (mriedem) on 2019-03-06
Changed in nova:
status: New → Triaged
tags: added: starlingx
Matt Riedemann (mriedem) wrote :

Part of this is also sort of reported in a very old bug 1498126.

Fix proposed to branch: master
Review: https://review.openstack.org/641806

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress

Reviewed: https://review.opendev.org/641521
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=54877e06f13b13134f2030557cd779f947a43c24
Submitter: Zuul
Branch: master

commit 54877e06f13b13134f2030557cd779f947a43c24
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 6 18:46:22 2019 -0500

    Add functional recreate test for bug 1818914

    The confirm resize flow in the compute manager
    runs on the source host. It calls RT.drop_move_claim
    to drop resource usage from the source host for the
    old flavor. The problem with drop_move_claim is it
    only decrements the old flavor from the reported usage
    if the instance is in RT.tracked_migrations, which will
    only be there on the source host if the update_available_resource
    periodic task runs before the resize is confirmed, otherwise
    the instance is still just tracked in RT.tracked_instances on
    the source host. This leaves the source compute incorrectly
    reporting resource usage for the old flavor until the next
    periodic runs, which could be a large window if resizes are
    configured to automatically confirm, e.g. resize_confirm_window=1,
    and the periodic interval is big, e.g. update_resources_interval=600.

    This change adds a functional recreate test for the bug which will
    be updated in the change that fixes the bug.

    Change-Id: I4aac187283c2f341b5c2712be85f722156e14f63
    Related-Bug: #1818914
    Related-Bug: #1498126

Reviewed: https://review.opendev.org/641806
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ad9f37350ad1f4e598a9a5df559b9160db1a11c1
Submitter: Zuul
Branch: master

commit ad9f37350ad1f4e598a9a5df559b9160db1a11c1
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 7 16:07:18 2019 -0500

    Update usage in RT.drop_move_claim during confirm resize

    The confirm resize flow in the compute manager
    runs on the source host. It calls RT.drop_move_claim
    to drop resource usage from the source host for the
    old flavor. The problem with drop_move_claim is it
    only decrements the old flavor from the reported usage
    if the instance is in RT.tracked_migrations, which will
    only be there on the source host if the update_available_resource
    periodic task runs before the resize is confirmed, otherwise
    the instance is still just tracked in RT.tracked_instances on
    the source host. This leaves the source compute incorrectly
    reporting resource usage for the old flavor until the next
    periodic runs, which could be a large window if resizes are
    configured to automatically confirm, e.g. resize_confirm_window=1,
    and the periodic interval is big, e.g. update_resources_interval=600.

    This fixes the issue by also updating usage in drop_move_claim
    when the instance is not in tracked_migrations but is in
    tracked_instances.

    Because of the tight coupling with the instance.migration_context
    we need to ensure the migration_context still exists before
    drop_move_claim is called during confirm_resize, so a test wrinkle
    is added to enforce that.

    test_drop_move_claim_on_revert also needed some updating for
    reality because of how drop_move_claim is called during
    revert_resize.

    And finally, the functional recreate test is updated to show the
    bug is fixed.

    Change-Id: Ia6d8a7909081b0b856bd7e290e234af7e42a2b38
    Closes-Bug: #1818914
    Related-Bug: #1641750
    Related-Bug: #1498126

Changed in nova:
status: In Progress → Fix Released

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers