Rebuild to the same host with a different image erroneously results in a Claim

Bug #1750618 reported by Chris Friesen on 2018-02-20
This bug affects 4 people
Affects                    Importance   Assigned to
OpenStack Compute (nova)   High         Matt Riedemann
Ocata                      High         Tony Breeds
Pike                       High         Matt Riedemann
Queens                     High         Matt Riedemann

Bug Description

As of stable/pike if we do a rebuild-to-same-node with a new image, it results in ComputeManager.rebuild_instance() being called with "scheduled_node=<hostname>" and "recreate=False". This results in a new Claim, which seems wrong since we're not changing the flavor and that claim could fail if the compute node is already full.

The comments in ComputeManager.rebuild_instance() make it appear that it expects both "recreate" and "scheduled_node" to be None for the rebuild-to-same-host case otherwise it will do a Claim. However, if we rebuild to a different image it ends up going through the scheduler which means that "scheduled_node" is not None.
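The pre-fix decision can be sketched roughly as follows. This is a hypothetical simplification, not the actual nova code; the function name and signature are illustrative. The point is that conductor started passing a non-None scheduled_node for a same-host rebuild with a new image (because the scheduler filters were re-run), which tripped the claim path:

```python
# Hypothetical simplification of the pre-fix decision in
# ComputeManager.rebuild_instance() (illustrative, not nova's code).
# A claim was attempted whenever conductor passed a scheduled_node,
# even though a same-host rebuild never changes the flavor.

def should_claim_pre_fix(recreate, scheduled_node):
    """Pre-fix behavior: claim if evacuating OR if the scheduler ran."""
    return recreate or scheduled_node is not None

# Same-host rebuild with a new image: conductor re-runs the scheduler,
# so scheduled_node is set and a needless claim is attempted.
print(should_claim_pre_fix(recreate=False, scheduled_node="compute-1"))  # True

# Plain same-host rebuild with the same image: no scheduler run, no claim.
print(should_claim_pre_fix(recreate=False, scheduled_node=None))  # False
```

The needless claim is what can fail on an already-full compute node even though no additional resources are actually being consumed.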

Matt Riedemann (mriedem) on 2018-02-20
tags: added: rebuild
Matt Riedemann (mriedem) wrote:

This is a regression introduced with change I11746d1ea996a0f18b7c54b4c9c21df58cc4714b which was backported all the way to stable/newton upstream:

https://review.openstack.org/#/q/I11746d1ea996a0f18b7c54b4c9c21df58cc4714b

Changed in nova:
importance: Undecided → High
status: New → Triaged

Fix proposed to branch: master
Review: https://review.openstack.org/546268

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress

Does this extra Claim affect only the rebuild action, or can it leak and affect the project quota or the compute capacity after the rebuild is completed?

Chris Friesen (cbf123) wrote:

It looks like it's mostly only affecting the rebuild action. In the compute_nodes table in the nova DB I'm seeing "memory_mb_used" at 1024 when it should be 512, but the CPU/disk usage is where it should be, so I'm not sure what's going on.

Chris Friesen (cbf123) wrote:

Actually, it's showing as consuming 512MB under "memory_mb_used" even when the instance is gone, so I think this might be intentional.

Reviewed: https://review.openstack.org/546268
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a39029076c7997236a7f999682fb1e998c474204
Submitter: Zuul
Branch: master

commit a39029076c7997236a7f999682fb1e998c474204
Author: Matt Riedemann <email address hidden>
Date: Tue Feb 20 13:48:12 2018 -0500

    Only attempt a rebuild claim for an evacuation to a new host

    Change I11746d1ea996a0f18b7c54b4c9c21df58cc4714b changed the
    behavior of the API and conductor when rebuilding an instance
    with a new image such that the image is run through the scheduler
    filters again to see if it will work on the existing host that
    the instance is running on.

    As a result, conductor started passing 'scheduled_node' to the
    compute which was using it for logic to tell if a claim should be
    attempted. We don't need to do a claim for a rebuild since we're
    on the same host.

    This removes the scheduled_node logic from the claim code, as we
    should only ever attempt a claim if we're evacuating, which we
    can determine based on the 'recreate' parameter.

    Change-Id: I7fde8ce9dea16679e76b0cb2db1427aeeec0c222
    Closes-Bug: #1750618

Changed in nova:
status: In Progress → Fix Released
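The corrected decision from the commit above can be sketched like this. Again a hypothetical simplification rather than the actual nova code: after the fix, scheduled_node no longer factors into whether a claim is attempted; only the 'recreate' (evacuate) flag does.

```python
# Hypothetical simplification of the post-fix decision (illustrative,
# not nova's code). A claim is attempted only for an evacuation to a
# new host, signalled by recreate=True; scheduled_node is ignored for
# the claim decision.

def should_claim_post_fix(recreate, scheduled_node=None):
    """Post-fix behavior: claim only when evacuating."""
    return recreate

# Same-host rebuild with a new image: scheduler ran, but no claim.
print(should_claim_post_fix(recreate=False, scheduled_node="compute-1"))  # False

# Evacuation to a new host: claim is attempted as before.
print(should_claim_post_fix(recreate=True, scheduled_node="compute-2"))  # True
```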

Reviewed: https://review.openstack.org/550545
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3c5e519a8875a9766a0d3a06cb76cceba26634e6
Submitter: Zuul
Branch: stable/queens

commit 3c5e519a8875a9766a0d3a06cb76cceba26634e6
Author: Matt Riedemann <email address hidden>
Date: Tue Feb 20 13:48:12 2018 -0500

    Only attempt a rebuild claim for an evacuation to a new host

    Change I11746d1ea996a0f18b7c54b4c9c21df58cc4714b changed the
    behavior of the API and conductor when rebuilding an instance
    with a new image such that the image is run through the scheduler
    filters again to see if it will work on the existing host that
    the instance is running on.

    As a result, conductor started passing 'scheduled_node' to the
    compute which was using it for logic to tell if a claim should be
    attempted. We don't need to do a claim for a rebuild since we're
    on the same host.

    This removes the scheduled_node logic from the claim code, as we
    should only ever attempt a claim if we're evacuating, which we
    can determine based on the 'recreate' parameter.

    Change-Id: I7fde8ce9dea16679e76b0cb2db1427aeeec0c222
    Closes-Bug: #1750618
    (cherry picked from commit a39029076c7997236a7f999682fb1e998c474204)

Reviewed: https://review.openstack.org/550555
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9890f3f69622489a0fd57cb1df354d7aa60161e0
Submitter: Zuul
Branch: stable/pike

commit 9890f3f69622489a0fd57cb1df354d7aa60161e0
Author: Matt Riedemann <email address hidden>
Date: Tue Feb 20 13:48:12 2018 -0500

    Only attempt a rebuild claim for an evacuation to a new host

    Change I11746d1ea996a0f18b7c54b4c9c21df58cc4714b changed the
    behavior of the API and conductor when rebuilding an instance
    with a new image such that the image is run through the scheduler
    filters again to see if it will work on the existing host that
    the instance is running on.

    As a result, conductor started passing 'scheduled_node' to the
    compute which was using it for logic to tell if a claim should be
    attempted. We don't need to do a claim for a rebuild since we're
    on the same host.

    This removes the scheduled_node logic from the claim code, as we
    should only ever attempt a claim if we're evacuating, which we
    can determine based on the 'recreate' parameter.

    Conflicts:
          nova/compute/manager.py

    NOTE(mriedem): The conflict is due to change
    I0883c2ba1989c5d5a46e23bcbcda53598707bcbc in Queens.

    Change-Id: I7fde8ce9dea16679e76b0cb2db1427aeeec0c222
    Closes-Bug: #1750618
    (cherry picked from commit a39029076c7997236a7f999682fb1e998c474204)
    (cherry picked from commit 3c5e519a8875a9766a0d3a06cb76cceba26634e6)

This issue was fixed in the openstack/nova 17.0.2 release.

This issue was fixed in the openstack/nova 16.1.1 release.

Reviewed: https://review.openstack.org/550560
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=286dd2c23c7c361c6538be65941e3c19e83e6d52
Submitter: Zuul
Branch: stable/ocata

commit 286dd2c23c7c361c6538be65941e3c19e83e6d52
Author: Matt Riedemann <email address hidden>
Date: Tue Feb 20 13:48:12 2018 -0500

    Only attempt a rebuild claim for an evacuation to a new host

    Change I11746d1ea996a0f18b7c54b4c9c21df58cc4714b changed the
    behavior of the API and conductor when rebuilding an instance
    with a new image such that the image is run through the scheduler
    filters again to see if it will work on the existing host that
    the instance is running on.

    As a result, conductor started passing 'scheduled_node' to the
    compute which was using it for logic to tell if a claim should be
    attempted. We don't need to do a claim for a rebuild since we're
    on the same host.

    This removes the scheduled_node logic from the claim code, as we
    should only ever attempt a claim if we're evacuating, which we
    can determine based on the 'recreate' parameter.

    Conflicts:
          nova/tests/functional/test_servers.py

    NOTE(mriedem): test_rebuild_with_new_image does not exist in
    Ocata and does not apply to Ocata since it is primarily
    testing allocations getting created in Placement via the
    FilterScheduler, which was new in Pike. As a result the change
    to that test is not part of this backport, but a similar assertion
    is added to an existing rebuild unit test.

    Change-Id: I7fde8ce9dea16679e76b0cb2db1427aeeec0c222
    Closes-Bug: #1750618
    (cherry picked from commit a39029076c7997236a7f999682fb1e998c474204)
    (cherry picked from commit 3c5e519a8875a9766a0d3a06cb76cceba26634e6)
    (cherry picked from commit 9890f3f69622489a0fd57cb1df354d7aa60161e0)

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

This issue was fixed in the openstack/nova 15.1.1 release.
