Debugging why NoValidHost occurs with placement is challenging

Bug #1786519 reported by Chris Dent
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Eric Fried

Bug Description

With the advent of placement, the FilterScheduler no longer provides granular information about which class of resource (disk, VCPU, RAM) is not available in sufficient quantities to allow a host to be found.

This is because placement now makes those choices and does not (yet) break down the results of its queries into easy-to-understand chunks. If it returns zero results, all you know is "we didn't have enough resources"; you learn nothing about which resources fell short.

This can be fixed by changing the way queries are made so that they run as a series of queries, with a report after each one of how many candidate providers remain.

While this is relatively straightforward for the (currently) common case of simple, non-nested, non-sharing providers, it will be more difficult for the non-simple cases. It therefore makes sense to have different code paths for simple and non-simple allocation candidate queries, which will also yield performance gains for the common case.
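
To make the proposed approach concrete, below is a minimal sketch of the series-of-queries idea in Python. The providers_with_resource callable and the shape of the request mapping are illustrative assumptions for this sketch, not nova's actual internal API:

    import logging

    LOG = logging.getLogger(__name__)

    def find_candidate_providers(request, providers_with_resource):
        """Narrow candidates one resource class at a time, logging progress.

        ``request`` maps resource class name to requested amount;
        ``providers_with_resource(rc, amount)`` is assumed to return the
        set of provider IDs with that amount of the resource available.
        """
        candidates = None
        for rc, amount in request.items():
            matched = providers_with_resource(rc, amount)
            LOG.debug("found %d providers with available %d %s",
                      len(matched), amount, rc)
            if candidates is None:
                candidates = matched
            else:
                candidates &= matched
                LOG.debug("found %d providers after filtering by "
                          "previous result", len(candidates))
            if not candidates:
                # The first step that drops to zero names the exhausted
                # resource class, which is the debuggability win.
                break
        return candidates or set()

Because each step logs its own count, a zero at any step identifies exactly which resource class eliminated the remaining hosts.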

See this email thread for additional discussion and reports of problems in the wild: http://lists.openstack.org/pipermail/openstack-dev/2018-August/132735.html

Changed in nova:
assignee: Jay Pipes (jaypipes) → Chris Dent (cdent)
status: Confirmed → In Progress
Chris Dent (cdent)
Changed in nova:
assignee: Chris Dent (cdent) → Jay Pipes (jaypipes)
Revision history for this message
Matt Riedemann (mriedem) wrote :

This is not a regression in Rocky, and I don't think the rocky-rc-potential tag is appropriate. This has been a latent issue since Pike and not something we should rush into RC2 as a bug fix.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/590150
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dc26780ef88a256aa9b581d4e6fe710af0afe0a1
Submitter: Zuul
Branch: master

commit dc26780ef88a256aa9b581d4e6fe710af0afe0a1
Author: Tetsuro Nakamura <email address hidden>
Date: Tue Aug 7 23:38:07 2018 +0900

    Adds a test for _get_provider_ids_matching()

    This patch adds a test for _get_provider_ids_matching()
    to verify it works correctly with required traits.

    Related-Bug: #1786519
    Change-Id: I2512e361f5eaa4e60701be7c8bf57b2e0a02a146

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/590388
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9ea340eb0d3bdb103bd64ca40b999bd2b10b80aa
Submitter: Zuul
Branch: master

commit 9ea340eb0d3bdb103bd64ca40b999bd2b10b80aa
Author: Jay Pipes <email address hidden>
Date: Thu Aug 9 10:46:20 2018 -0400

    placement: use simple code paths when possible

    Somewhere in the past release, we started using extremely complex code
    paths involving sharing providers, anchor providers, and nested resource
    provider calculations when we absolutely don't need to do so.

    There was a _has_provider_trees() function in the
    nova/api/openstack/placement/objects/resource_provider.py file that used
    to be used for top-level switching between a faster, simpler approach to
    finding allocation candidates for a simple search of resources and
    traits when no sharing providers and no nesting was used. That was
    removed at some point and all code paths -- even for simple "get me
    these amounts of these resources" when no trees or sharing providers are
    present (which is the vast majority of OpenStack deployments) -- were
    going through the complex tree-search-and-match queries and algorithms.

    This patch changes that so that when there's a request for some
    resources and there's no trees or sharing providers, we do the simple
    code path. Hopefully this gets our performance for the simple, common
    cases back to where we were pre-Rocky.

    This change is a prerequisite for the following change which adds
    debugging output to help diagnose which resource classes are running
    out of inventory when GET /allocation_candidates returns 0 results.
    That code is not possible without the changes here as they only
    work if we can identify when a "simpler approach" is possible and
    call that simpler code.

    Related-Bug: #1786055
    Partial-Bug: #1786519
    Change-Id: I1fdbcdb7a1dd51e738924c8a30238237d7ac74e1
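
To illustrate the top-level switching the commit message describes, here is a rough sketch. _has_provider_trees is named in the message above, but the other helpers and the signatures are invented for illustration only:

    def get_allocation_candidates(ctx, request_groups):
        """Dispatch to a fast path when no nesting or sharing is involved."""
        simple = (
            len(request_groups) == 1
            and not _has_provider_trees(ctx)       # no nested providers
            and not _has_sharing_providers(ctx)    # no sharing providers
        )
        if simple:
            # Plain per-resource-class queries; also the hook where the
            # follow-up change adds its step-by-step debug logging.
            return _get_candidates_simple(ctx, request_groups[0])
        # Fall back to the full tree-search-and-match algorithm.
        return _get_candidates_complex(ctx, request_groups)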

Changed in nova:
assignee: Jay Pipes (jaypipes) → Eric Fried (efried)
Changed in nova:
assignee: Eric Fried (efried) → Jay Pipes (jaypipes)
Matt Riedemann (mriedem)
tags: added: serviceability
removed: rocky-rc-potential
Changed in nova:
assignee: Jay Pipes (jaypipes) → Eric Fried (efried)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/590041
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b5ab9f5acec172d16e46876f60ca338434483905
Submitter: Zuul
Branch: master

commit b5ab9f5acec172d16e46876f60ca338434483905
Author: Jay Pipes <email address hidden>
Date: Wed Aug 8 17:11:25 2018 -0400

    [placement] split gigantor SQL query, add logging

    This patch modifies the code paths for the non-granular request group
    allocation candidates processing. It removes the giant multi-join SQL
    query and replaces it with multiple calls to
    _get_providers_with_resource(), logging the number of matched providers
    for each resource class requested and filter (on required traits,
    forbidden traits, and aggregate membership).

    Here are some examples of the debug output:

    - A request for three resources with no aggregate or trait filters:

     found 7 providers with available 5 VCPU
     found 9 providers with available 1024 MEMORY_MB
     found 5 providers after filtering by previous result
     found 8 providers with available 1500 DISK_GB
     found 2 providers after filtering by previous result

    - The same request, but with a required trait that nobody has, shorts
      out quickly:

     found 0 providers after applying required traits filter (['HW_CPU_X86_AVX2'])

    - A request for one resource with aggregates and forbidden (but no
      required) traits:

     found 2 providers after applying aggregates filter ([['3ed8fb2f-4793-46ee-a55b-fdf42cb392ca']])
     found 1 providers after applying forbidden traits filter ([u'CUSTOM_TWO', u'CUSTOM_THREE'])
     found 3 providers with available 4 VCPU
     found 1 providers after applying initial aggregate and trait filters

    Co-authored-by: Eric Fried <email address hidden>
    Closes-Bug: #1786519
    Change-Id: If9ddb8a6d2f03392f3cc11136c4a0b026212b95b
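
A sketch of how the aggregate and trait pre-filters can short-circuit a request before any resource queries run, mirroring the debug output above (all helper names and data shapes here are illustrative, not nova's actual code):

    import logging

    LOG = logging.getLogger(__name__)

    def apply_initial_filters(providers, required_traits, forbidden_traits,
                              member_of, provider_traits, provider_aggs):
        """Apply aggregate and trait filters before any resource queries.

        ``provider_traits`` and ``provider_aggs`` map a provider ID to
        its set of trait names / aggregate UUIDs.
        """
        if member_of:
            providers = {p for p in providers
                         if provider_aggs[p] & set(member_of)}
            LOG.debug("found %d providers after applying aggregates "
                      "filter (%s)", len(providers), member_of)
        if required_traits:
            providers = {p for p in providers
                         if set(required_traits) <= provider_traits[p]}
            LOG.debug("found %d providers after applying required traits "
                      "filter (%s)", len(providers), required_traits)
            if not providers:
                return providers  # short-circuit, as in the second example
        if forbidden_traits:
            providers = {p for p in providers
                         if not set(forbidden_traits) & provider_traits[p]}
            LOG.debug("found %d providers after applying forbidden traits "
                      "filter (%s)", len(providers), forbidden_traits)
        return providers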

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/600447

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/602202

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/600447
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1cae9b8d4392ef597d93f2934ce18ead8828da98
Submitter: Zuul
Branch: stable/rocky

commit 1cae9b8d4392ef597d93f2934ce18ead8828da98
Author: Jay Pipes <email address hidden>
Date: Thu Aug 9 10:46:20 2018 -0400

    placement: use simple code paths when possible

    Somewhere in the past release, we started using extremely complex code
    paths involving sharing providers, anchor providers, and nested resource
    provider calculations when we absolutely don't need to do so.

    There was a _has_provider_trees() function in the
    nova/api/openstack/placement/objects/resource_provider.py file that used
    to be used for top-level switching between a faster, simpler approach to
    finding allocation candidates for a simple search of resources and
    traits when no sharing providers and no nesting was used. That was
    removed at some point and all code paths -- even for simple "get me
    these amounts of these resources" when no trees or sharing providers are
    present (which is the vast majority of OpenStack deployments) -- were
    going through the complex tree-search-and-match queries and algorithms.

    This patch changes that so that when there's a request for some
    resources and there's no trees or sharing providers, we do the simple
    code path. Hopefully this gets our performance for the simple, common
    cases back to where we were pre-Rocky.

    This change is a prerequisite for the following change which adds
    debugging output to help diagnose which resource classes are running
    out of inventory when GET /allocation_candidates returns 0 results.
    That code is not possible without the changes here as they only
    work if we can identify when a "simpler approach" is possible and
    call that simpler code.

    Related-Bug: #1786055
    Partial-Bug: #1786519
    Change-Id: I1fdbcdb7a1dd51e738924c8a30238237d7ac74e1
    (cherry picked from commit 9ea340eb0d3bdb103bd64ca40b999bd2b10b80aa)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/602202
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d026d4d5c58d5e41eccc53edeadd292e6436deab
Submitter: Zuul
Branch: stable/rocky

commit d026d4d5c58d5e41eccc53edeadd292e6436deab
Author: Jay Pipes <email address hidden>
Date: Wed Aug 8 17:11:25 2018 -0400

    [placement] split gigantor SQL query, add logging

    This patch modifies the code paths for the non-granular request group
    allocation candidates processing. It removes the giant multi-join SQL
    query and replaces it with multiple calls to
    _get_providers_with_resource(), logging the number of matched providers
    for each resource class requested and filter (on required traits,
    forbidden traits, and aggregate membership).

    Here are some examples of the debug output:

    - A request for three resources with no aggregate or trait filters:

     found 7 providers with available 5 VCPU
     found 9 providers with available 1024 MEMORY_MB
     found 5 providers after filtering by previous result
     found 8 providers with available 1500 DISK_GB
     found 2 providers after filtering by previous result

    - The same request, but with a required trait that nobody has, shorts
      out quickly:

     found 0 providers after applying required traits filter (['HW_CPU_X86_AVX2'])

    - A request for one resource with aggregates and forbidden (but no
      required) traits:

     found 2 providers after applying aggregates filter ([['3ed8fb2f-4793-46ee-a55b-fdf42cb392ca']])
     found 1 providers after applying forbidden traits filter ([u'CUSTOM_TWO', u'CUSTOM_THREE'])
     found 3 providers with available 4 VCPU
     found 1 providers after applying initial aggregate and trait filters

    Co-authored-by: Eric Fried <email address hidden>
    Closes-Bug: #1786519
    Change-Id: If9ddb8a6d2f03392f3cc11136c4a0b026212b95b
    (cherry picked from commit b5ab9f5acec172d16e46876f60ca338434483905)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.1

This issue was fixed in the openstack/nova 18.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
