conductor rebuild_instance does not properly handle image_ref if request_spec is not provided

Bug #1727855 reported by Matt Riedemann
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Matt Riedemann
Ocata
Fix Committed
Medium
Matt Riedemann
Pike
Fix Committed
Medium
Matt Riedemann

Bug Description

Maybe this doesn't actually matter for rebuild, but the image_ref used in this code:

https://github.com/openstack/nova/blob/d36dcd52c24c32418fd358d245688c86664025d5/nova/conductor/manager.py#L830

Is a string image id, it's not a dict or ImageMeta object, it comes through the rebuild action API from the user.

It's important, however, for how scheduler_utils.build_request_spec uses it here:

https://github.com/openstack/nova/blob/d36dcd52c24c32418fd358d245688c86664025d5/nova/scheduler/utils.py#L79

Because I was trying to figure out if the image parameter to build_request_spec is an ImageMeta object, dict or string - since the code appears to assume it's a dict if not provided.

Conductor will then call RequestSpec.from_primitives and the image is used here:

https://github.com/openstack/nova/blob/d36dcd52c24c32418fd358d245688c86664025d5/nova/objects/request_spec.py#L250

And eventually ignored here since it's an unexpected type:

https://github.com/openstack/nova/blob/d36dcd52c24c32418fd358d245688c86664025d5/nova/objects/request_spec.py#L135

Matt Riedemann (mriedem)
tags: added: scheduler
removed: sch
Changed in nova:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
Matt Riedemann (mriedem) wrote :

The other thing is, this code in the conductor is wrong anyway since it's using the "new" image_ref requested for the rebuild:

https://github.com/openstack/nova/blob/d36dcd52c24c32418fd358d245688c86664025d5/nova/conductor/manager.py#L830

But if we're trying to reconstruct a request_spec for the instance, we should be using the orig_image_ref parameter since that's coming off the instance:

https://github.com/openstack/nova/blob/d36dcd52c24c32418fd358d245688c86664025d5/nova/compute/api.py#L2866

Before the instance.image_ref is updated from the API request:

https://github.com/openstack/nova/blob/d36dcd52c24c32418fd358d245688c86664025d5/nova/compute/api.py#L2918

Note that if we're getting here because of evacuate, the image_ref parameter isn't even passed:

https://github.com/openstack/nova/blob/d36dcd52c24c32418fd358d245688c86664025d5/nova/compute/api.py#L3978

So maybe conductor's rebuild_instance method should be passing None to scheduler_utils.build_request_spec and if image is not provided, build_request_spec can get the image from the instance using instance.image_meta.

Anyway, as noted, rebuild doesn't even care about the RequestSpec.image field so I might be overthinking this.

Revision history for this message
Matt Riedemann (mriedem) wrote :

In fact, the ComputeManager.rebuild_instance method doesn't even take a request_spec parameter.

The request_spec is only used in the conductor rebuild_instance method if we're evacuating and need to ask the scheduler for another host:

https://github.com/openstack/nova/blob/d36dcd52c24c32418fd358d245688c86664025d5/nova/conductor/manager.py#L843

Which means the scheduler needs to pick a host that satisfies the original instance image metadata constraints for things like host aggregates.

So if you're evacuating and there is no request spec for the instance, we're just picking a host randomly without actually considering the image meta at all.

Changed in nova:
importance: Low → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/515530

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Two things :

#1 the conditional branch with "if request_spec" makes that code to be run if and only if the existing instance isn't having a RequestSpec record. There are two reasons for that : either it's a very old instance that was created post-Mitaka (IIRC when we started to persist the ReqSpec) or we're in a cellsv1 case where we could have the API DB not containing the record because of another cell. Both cases are very unlikely to happen because we wrote an online data migration method in Mitaka that operators needed to run for creating all RequestSpec records for existing instances.
So in theory, we could remove that section of code and just assume that all instances are having a ReqSpec record - but I'd love to see operators confirming that doesn't break their clouds.

#2 That's actually a good catch, we pass image_ref=None when we evacuate (so that means evacuations of old instances not having a ReqSpec record don't verify the correct image, hence the bug) but for the rebuild case (I mean the rebuild API) we need to check the new image that is passed by the user, right? In that case, I can see the API passing image_ref=image_href to the conductor API (which is the new image reference), so I don't see why we should use the nested image from the instance ?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/516400

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/516404

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/515530
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d2690d6b038e200efed05bf7773898a0a8bb01d7
Submitter: Zuul
Branch: master

commit d2690d6b038e200efed05bf7773898a0a8bb01d7
Author: Matt Riedemann <email address hidden>
Date: Thu Oct 26 17:33:35 2017 -0400

    Pass the correct image to build_request_spec in conductor.rebuild_instance

    If we're calling build_request_spec in conductor.rebuild_instance,
    it's because we are evacuating and the instance is so old it does
    not have a request spec. We need the request_spec to pass to the
    scheduler to pick a destination host for the evacuation.

    For evacuate, nova-api does not pass any image reference parameters,
    and even if it did, those are image IDs, not an image meta dict that
    build_request_spec expects, so this code has just always been wrong.

    This change fixes the problem by passing a primitive version of
    the instance.image_meta which build_request_spec will then return
    back to conductor and that gets used to build a RequestSpec object
    from primitives.

    It's important to use the correct image meta so that the scheduler
    can properly filter hosts using things like the
    AggregateImagePropertiesIsolation and ImagePropertiesFilter filters.

    Change-Id: I0c8ce65016287de7be921c312493667a8c7f762e
    Closes-Bug: #1727855

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.0.0b2

This issue was fixed in the openstack/nova 17.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/516400
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dc44c48943f8ce66bbbdc2050ed2dc47778cf477
Submitter: Zuul
Branch: stable/pike

commit dc44c48943f8ce66bbbdc2050ed2dc47778cf477
Author: Matt Riedemann <email address hidden>
Date: Thu Oct 26 17:33:35 2017 -0400

    Pass the correct image to build_request_spec in conductor.rebuild_instance

    If we're calling build_request_spec in conductor.rebuild_instance,
    it's because we are evacuating and the instance is so old it does
    not have a request spec. We need the request_spec to pass to the
    scheduler to pick a destination host for the evacuation.

    For evacuate, nova-api does not pass any image reference parameters,
    and even if it did, those are image IDs, not an image meta dict that
    build_request_spec expects, so this code has just always been wrong.

    This change fixes the problem by passing a primitive version of
    the instance.image_meta which build_request_spec will then return
    back to conductor and that gets used to build a RequestSpec object
    from primitives.

    It's important to use the correct image meta so that the scheduler
    can properly filter hosts using things like the
    AggregateImagePropertiesIsolation and ImagePropertiesFilter filters.

    Change-Id: I0c8ce65016287de7be921c312493667a8c7f762e
    Closes-Bug: #1727855
    (cherry picked from commit d2690d6b038e200efed05bf7773898a0a8bb01d7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.0

This issue was fixed in the openstack/nova 16.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/516404
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ea53d9f1ea8bab797250f7add535fe16236f9a34
Submitter: Zuul
Branch: stable/ocata

commit ea53d9f1ea8bab797250f7add535fe16236f9a34
Author: Matt Riedemann <email address hidden>
Date: Thu Oct 26 17:33:35 2017 -0400

    Pass the correct image to build_request_spec in conductor.rebuild_instance

    If we're calling build_request_spec in conductor.rebuild_instance,
    it's because we are evacuating and the instance is so old it does
    not have a request spec. We need the request_spec to pass to the
    scheduler to pick a destination host for the evacuation.

    For evacuate, nova-api does not pass any image reference parameters,
    and even if it did, those are image IDs, not an image meta dict that
    build_request_spec expects, so this code has just always been wrong.

    This change fixes the problem by passing a primitive version of
    the instance.image_meta which build_request_spec will then return
    back to conductor and that gets used to build a RequestSpec object
    from primitives.

    It's important to use the correct image meta so that the scheduler
    can properly filter hosts using things like the
    AggregateImagePropertiesIsolation and ImagePropertiesFilter filters.

    Conflicts:
          nova/conductor/manager.py

    NOTE(mriedem): Conflict is due to e211fca55a11c80058d5d78e31dc3ad466d7edfd
    not being in Ocata.

    Change-Id: I0c8ce65016287de7be921c312493667a8c7f762e
    Closes-Bug: #1727855
    (cherry picked from commit d2690d6b038e200efed05bf7773898a0a8bb01d7)
    (cherry picked from commit dc44c48943f8ce66bbbdc2050ed2dc47778cf477)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.1.1

This issue was fixed in the openstack/nova 15.1.1 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.