Report client can try to ensure a standard resource class

Bug #1746615 reported by Eric Fried
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Undecided
Eric Fried

Bug Description

SchedulerReportClient.set_inventory_for_provider uses this logic [1] to pre-create custom resource classes found in the input inventory.

        list(map(self._ensure_resource_class,
                 (rc_name for rc_name in inv_data
                  if rc_name not in fields.ResourceClass.STANDARD)))

The problem is that this relies on the local system's notion of the set of standard resource classes. If the placement service is running newer code, standard resource classes may have been added that the compute service doesn't know about yet. If a consumer attempts to use such a resource class, the above block will attempt to _ensure_resource_class on it, which attempts to create it in placement, which results in a 400 (because the schema requires it to begin with CUSTOM_), which results in InvalidResourceClass exception, and the consumer can't use that resource class.

We should never be relying on placement code directly on the compute node. (This includes introspecting os-traits.) We should always be asking the placement service.

[1] https://github.com/openstack/nova/blob/894dd5b87674d396c7221f9d1bca3a9a983dfeb8/nova/scheduler/client/report.py#L983-L985

Tags: placement
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/539732

Changed in nova:
assignee: nobody → Eric Fried (efried)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/539738

Revision history for this message
Chris Dent (cdent) wrote :

> We should never be relying on placement code directly on the compute node. (This includes introspecting os-traits.) We should always be asking the placement service.

While this statement might be correct, it isn't aligned with how we've discussed os-traits in the past, now how we've made use of the 'fields.ResourceClass.VCPU' (and others) fields in practice.

The idea has been that the os-traits package itself is the authority and which version of that something has available to it is the controlling factor. This does present the challenge you describe but the mechanics of fixing it so that placement becomes the authority for resource classes or traits becomes weird: where do the constants that are used in "client" code come from?

It's not always the case that the client is retrieving something from placement. Sometimes it just wants to say "I have _this_ resource class and _these_ traits".

What's your expectation there?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/539732
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=caeab76468258c1c9b3661df04e72988ec98b316
Submitter: Zuul
Branch: master

commit caeab76468258c1c9b3661df04e72988ec98b316
Author: Eric Fried <email address hidden>
Date: Wed Jan 31 17:37:37 2018 -0600

    Test case: new standard resource class unusable

    If the placement service code is newer than that of the compute service,
    new standard resource classes may have been introduced in placement
    which compute doesn't yet know about. These resource classes will be
    unusable due to the noted bug.

    This change set adds a test case demonstrating the bug.

    Change-Id: I71ce6796beb23c4bb3d637dab5ca0e620274b2fc
    Partial-Bug: #1746615

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/539738
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=02a1551f6455792ee050d6009c94514402b7df04
Submitter: Zuul
Branch: master

commit 02a1551f6455792ee050d6009c94514402b7df04
Author: Eric Fried <email address hidden>
Date: Wed Jan 31 18:06:17 2018 -0600

    Ensure resource classes correctly

    Previously, SchedulerReportClient.set_inventory_for_provider used this
    logic to pre-create custom resource classes found in the input
    inventory.

            list(map(self._ensure_resource_class,
                     (rc_name for rc_name in inv_data
                      if rc_name not in fields.ResourceClass.STANDARD)))

    ...where _ensure_resource_class would always attempt to create the
    resource class it was given, and raise an exception if that failed for
    any reason. Note in particular that attempting to create an
    already-extant "standard" resource class will fail just the same as
    attempting to create any nonexistent resource class that doesn't conform
    to the schema (i.e. beginning with "CUSTOM_").

    The problem is that this relies on the local system's notion of the set
    of standard resource classes. If the placement service is running newer
    code, standard resource classes may have been added that the compute
    service doesn't know about yet.

    This change set solves the problem by only attempting to create resource
    classes with the 'CUSTOM_' prefix.

    self._ensure_resource_class (which worked on a single resource class) is
    replaced with self._ensure_resource_classes (plural) which accepts a
    list of *all* the desired resource classes (including the standard
    ones), filters down to the CUSTOM_* ones, and attempts to create the
    remainder.

    Note that if placement ever decides to start allowing creation of
    resource classes with prefixes other than CUSTOM_, it will have to do so
    under a new microversion, so we can't future-proof this by e.g.
    prefetching all the known resource classes from placement and attempting
    to create any not found in that list.

    In the process of doing this rework, obsolete code paths relying on
    pre-1.7 placement microversions are removed.

    Change-Id: I600ed262e1b5d11facbf461e28291193665280cf
    Closes-Bug: #1746615

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
Eric Fried (efried) wrote :

> The idea has been that the os-traits package itself is the authority and which version of that
> something has available to it is the controlling factor. This does present the challenge you
> describe but the mechanics of fixing it so that placement becomes the authority for resource
> classes or traits becomes weird: where do the constants that are used in "client" code come from?
>
> It's not always the case that the client is retrieving something from placement. Sometimes it
> just wants to say "I have _this_ resource class and _these_ traits".
>
> What's your expectation there?

For the sake of closure, I wanted to note that cdent and I discussed this (somewhere). I eventually conceded that, whereas the hole I describe is theoretically possible, it's unlikely enough to come up in real life, and would be difficult enough to close, that it's not worth worrying about.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b1

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.