Debugging why NoValidHost occurs with placement is challenging

Bug #1786519 reported by Chris Dent
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Eric Fried

Bug Description

With the advent of placement, the FilterScheduler no longer provides granular information about which class of resource (disk, VCPU, RAM) is not available in sufficient quantities to allow a host to be found.

This is because placement now makes those choices and does not (yet) break down the results of its queries into easy-to-understand chunks. If it returns zero results, all you know is "we didn't have enough resources"; you learn nothing about which resources fell short.

This can be fixed by changing the way queries are made so that they run as a series of queries, with a report after each one of how many candidate providers remain.

While this is relatively straightforward for the (currently) common case of simple, non-nested, non-sharing providers, it will be more difficult for the non-simple cases. It therefore makes sense to have different code paths for simple and non-simple allocation candidate queries, which will also yield performance gains for the common case.
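
To make the proposed approach concrete, below is a minimal sketch of the series-of-queries idea in Python. The providers_with_resource callable and the shape of the request mapping are illustrative assumptions for this sketch, not nova's actual internal API:

    import logging

    LOG = logging.getLogger(__name__)

    def find_candidate_providers(request, providers_with_resource):
        """Narrow candidates one resource class at a time, logging progress.

        ``request`` maps resource class name to requested amount;
        ``providers_with_resource(rc, amount)`` is assumed to return the
        set of provider IDs with that amount of the resource available.
        """
        candidates = None
        for rc, amount in request.items():
            matched = providers_with_resource(rc, amount)
            LOG.debug("found %d providers with available %d %s",
                      len(matched), amount, rc)
            if candidates is None:
                candidates = matched
            else:
                candidates &= matched
                LOG.debug("found %d providers after filtering by "
                          "previous result", len(candidates))
            if not candidates:
                # The first step that drops to zero names the exhausted
                # resource class, which is the debuggability win.
                break
        return candidates or set()

Because each step logs its own count, a zero at any step identifies exactly which resource class eliminated the remaining hosts.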

See this email thread for additional discussion and reports of problems in the wild: http://lists.openstack.org/pipermail/openstack-dev/2018-August/132735.html

Changed in nova:
assignee: Jay Pipes (jaypipes) → Chris Dent (cdent)
status: Confirmed → In Progress
Chris Dent (cdent)
Changed in nova:
assignee: Chris Dent (cdent) → Jay Pipes (jaypipes)
Revision history for this message
Matt Riedemann (mriedem) wrote :

This is not a regression in Rocky, and I don't think the rocky-rc-potential tag is appropriate. This has been a latent issue since Pike and not something we should rush into RC2 as a bug fix.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/590150
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dc26780ef88a256aa9b581d4e6fe710af0afe0a1
Submitter: Zuul
Branch: master

commit dc26780ef88a256aa9b581d4e6fe710af0afe0a1
Author: Tetsuro Nakamura <email address hidden>
Date: Tue Aug 7 23:38:07 2018 +0900

    Adds a test for _get_provider_ids_matching()

    This patch adds a test for _get_provider_ids_matching()
    to verify it works correctly with required traits.

    Related-Bug: #1786519
    Change-Id: I2512e361f5eaa4e60701be7c8bf57b2e0a02a146

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/590388
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9ea340eb0d3bdb103bd64ca40b999bd2b10b80aa
Submitter: Zuul
Branch: master

commit 9ea340eb0d3bdb103bd64ca40b999bd2b10b80aa
Author: Jay Pipes <email address hidden>
Date: Thu Aug 9 10:46:20 2018 -0400

    placement: use simple code paths when possible

    Somewhere in the past release, we started using extremely complex code
    paths involving sharing providers, anchor providers, and nested resource
    provider calculations when we absolutely don't need to do so.

    There was a _has_provider_trees() function in the
    nova/api/openstack/placement/objects/resource_provider.py file that used
    to be used for top-level switching between a faster, simpler approach to
    finding allocation candidates for a simple search of resources and
    traits when no sharing providers and no nesting was used. That was
    removed at some point and all code paths -- even for simple "get me
    these amounts of these resources" when no trees or sharing providers are
    present (which is the vast majority of OpenStack deployments) -- were
    going through the complex tree-search-and-match queries and algorithms.

    This patch changes that so that when there's a request for some
    resources and there's no trees or sharing providers, we do the simple
    code path. Hopefully this gets our performance for the simple, common
    cases back to where we were pre-Rocky.

    This change is a prerequisite for the following change which adds
    debugging output to help diagnose which resource classes are running
    out of inventory when GET /allocation_candidates returns 0 results.
    That code is not possible without the changes here as they only
    work if we can identify when a "simpler approach" is possible and
    call that simpler code.

    Related-Bug: #1786055
    Partial-Bug: #1786519
    Change-Id: I1fdbcdb7a1dd51e738924c8a30238237d7ac74e1
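
To illustrate the top-level switching the commit message describes, here is a rough sketch. _has_provider_trees is named in the message above, but the other helpers and the signatures are invented for illustration only:

    def get_allocation_candidates(ctx, request_groups):
        """Dispatch to a fast path when no nesting or sharing is involved."""
        simple = (
            len(request_groups) == 1
            and not _has_provider_trees(ctx)       # no nested providers
            and not _has_sharing_providers(ctx)    # no sharing providers
        )
        if simple:
            # Plain per-resource-class queries; also the hook where the
            # follow-up change adds its step-by-step debug logging.
            return _get_candidates_simple(ctx, request_groups[0])
        # Fall back to the full tree-search-and-match algorithm.
        return _get_candidates_complex(ctx, request_groups)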

Changed in nova:
assignee: Jay Pipes (jaypipes) → Eric Fried (efried)
Changed in nova:
assignee: Eric Fried (efried) → Jay Pipes (jaypipes)
Matt Riedemann (mriedem)
tags: added: serviceability
removed: rocky-rc-potential
Changed in nova:
assignee: Jay Pipes (jaypipes) → Eric Fried (efried)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/590041
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b5ab9f5acec172d16e46876f60ca338434483905
Submitter: Zuul
Branch: master

commit b5ab9f5acec172d16e46876f60ca338434483905
Author: Jay Pipes <email address hidden>
Date: Wed Aug 8 17:11:25 2018 -0400

    [placement] split gigantor SQL query, add logging

    This patch modifies the code paths for the non-granular request group
    allocation candidates processing. It removes the giant multi-join SQL
    query and replaces it with multiple calls to
    _get_providers_with_resource(), logging the number of matched providers
    for each resource class requested and filter (on required traits,
    forbidden traits, and aggregate membership).

    Here are some examples of the debug output:

    - A request for three resources with no aggregate or trait filters:

     found 7 providers with available 5 VCPU
     found 9 providers with available 1024 MEMORY_MB
     found 5 providers after filtering by previous result
     found 8 providers with available 1500 DISK_GB
     found 2 providers after filtering by previous result

    - The same request, but with a required trait that nobody has, shorts
      out quickly:

     found 0 providers after applying required traits filter (['HW_CPU_X86_AVX2'])

    - A request for one resource with aggregates and forbidden (but no
      required) traits:

     found 2 providers after applying aggregates filter ([['3ed8fb2f-4793-46ee-a55b-fdf42cb392ca']])
     found 1 providers after applying forbidden traits filter ([u'CUSTOM_TWO', u'CUSTOM_THREE'])
     found 3 providers with available 4 VCPU
     found 1 providers after applying initial aggregate and trait filters

    Co-authored-by: Eric Fried <email address hidden>
    Closes-Bug: #1786519
    Change-Id: If9ddb8a6d2f03392f3cc11136c4a0b026212b95b
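
A sketch of how the aggregate and trait pre-filters can short-circuit a request before any resource queries run, mirroring the debug output above (all helper names and data shapes here are illustrative, not nova's actual code):

    import logging

    LOG = logging.getLogger(__name__)

    def apply_initial_filters(providers, required_traits, forbidden_traits,
                              member_of, provider_traits, provider_aggs):
        """Apply aggregate and trait filters before any resource queries.

        ``provider_traits`` and ``provider_aggs`` map a provider ID to
        its set of trait names / aggregate UUIDs.
        """
        if member_of:
            providers = {p for p in providers
                         if provider_aggs[p] & set(member_of)}
            LOG.debug("found %d providers after applying aggregates "
                      "filter (%s)", len(providers), member_of)
        if required_traits:
            providers = {p for p in providers
                         if set(required_traits) <= provider_traits[p]}
            LOG.debug("found %d providers after applying required traits "
                      "filter (%s)", len(providers), required_traits)
            if not providers:
                return providers  # short-circuit, as in the second example
        if forbidden_traits:
            providers = {p for p in providers
                         if not set(forbidden_traits) & provider_traits[p]}
            LOG.debug("found %d providers after applying forbidden traits "
                      "filter (%s)", len(providers), forbidden_traits)
        return providers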

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/600447

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/602202

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/600447
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1cae9b8d4392ef597d93f2934ce18ead8828da98
Submitter: Zuul
Branch: stable/rocky

commit 1cae9b8d4392ef597d93f2934ce18ead8828da98
Author: Jay Pipes <email address hidden>
Date: Thu Aug 9 10:46:20 2018 -0400

    placement: use simple code paths when possible

    Somewhere in the past release, we started using extremely complex code
    paths involving sharing providers, anchor providers, and nested resource
    provider calculations when we absolutely don't need to do so.

    There was a _has_provider_trees() function in the
    nova/api/openstack/placement/objects/resource_provider.py file that used
    to be used for top-level switching between a faster, simpler approach to
    finding allocation candidates for a simple search of resources and
    traits when no sharing providers and no nesting was used. That was
    removed at some point and all code paths -- even for simple "get me
    these amounts of these resources" when no trees or sharing providers are
    present (which is the vast majority of OpenStack deployments) -- were
    going through the complex tree-search-and-match queries and algorithms.

    This patch changes that so that when there's a request for some
    resources and there's no trees or sharing providers, we do the simple
    code path. Hopefully this gets our performance for the simple, common
    cases back to where we were pre-Rocky.

    This change is a prerequisite for the following change which adds
    debugging output to help diagnose which resource classes are running
    out of inventory when GET /allocation_candidates returns 0 results.
    That code is not possible without the changes here as they only
    work if we can identify when a "simpler approach" is possible and
    call that simpler code.

    Related-Bug: #1786055
    Partial-Bug: #1786519
    Change-Id: I1fdbcdb7a1dd51e738924c8a30238237d7ac74e1
    (cherry picked from commit 9ea340eb0d3bdb103bd64ca40b999bd2b10b80aa)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/602202
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d026d4d5c58d5e41eccc53edeadd292e6436deab
Submitter: Zuul
Branch: stable/rocky

commit d026d4d5c58d5e41eccc53edeadd292e6436deab
Author: Jay Pipes <email address hidden>
Date: Wed Aug 8 17:11:25 2018 -0400

    [placement] split gigantor SQL query, add logging

    This patch modifies the code paths for the non-granular request group
    allocation candidates processing. It removes the giant multi-join SQL
    query and replaces it with multiple calls to
    _get_providers_with_resource(), logging the number of matched providers
    for each resource class requested and filter (on required traits,
    forbidden traits, and aggregate membership).

    Here are some examples of the debug output:

    - A request for three resources with no aggregate or trait filters:

     found 7 providers with available 5 VCPU
     found 9 providers with available 1024 MEMORY_MB
     found 5 providers after filtering by previous result
     found 8 providers with available 1500 DISK_GB
     found 2 providers after filtering by previous result

    - The same request, but with a required trait that nobody has, shorts
      out quickly:

     found 0 providers after applying required traits filter (['HW_CPU_X86_AVX2'])

    - A request for one resource with aggregates and forbidden (but no
      required) traits:

     found 2 providers after applying aggregates filter ([['3ed8fb2f-4793-46ee-a55b-fdf42cb392ca']])
     found 1 providers after applying forbidden traits filter ([u'CUSTOM_TWO', u'CUSTOM_THREE'])
     found 3 providers with available 4 VCPU
     found 1 providers after applying initial aggregate and trait filters

    Co-authored-by: Eric Fried <email address hidden>
    Closes-Bug: #1786519
    Change-Id: If9ddb8a6d2f03392f3cc11136c4a0b026212b95b
    (cherry picked from commit b5ab9f5acec172d16e46876f60ca338434483905)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.1

This issue was fixed in the openstack/nova 18.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
