commit bf65fdd59ea7d12f7420fcfe2602a5f00cc61055
Author: melanie witt <email address hidden>
Date: Sat Jan 13 21:49:54 2018 +0000

Stop globally caching host states in scheduler HostManager

Currently, in the scheduler HostManager, we cache host states in
a map global to all requests. This used to be okay because we were
always querying the entire compute node list for every request to
pass on to filtering. So we cached the host states globally and
updated them per request and removed "dead nodes" from the cache
(compute nodes still in the cache that were not returned from
ComputeNodeList.get_all).
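
To make the failure mode concrete, here is a minimal Python sketch of that shared-cache pattern. It is illustrative only, not nova's actual code: HostManager, HostState, ComputeNode, and get_host_states below are toy stand-ins for the real objects (the real method is get_host_states_by_uuids).

import collections

# Toy stand-ins for nova's objects; names are illustrative only.
ComputeNode = collections.namedtuple('ComputeNode',
                                     ['host', 'hypervisor_hostname'])

class HostState(object):
    """Minimal stand-in for nova's HostState."""
    def __init__(self, node):
        self.node = node

    def update(self, node):
        self.node = node

class HostManager(object):
    def __init__(self):
        # One map shared by every scheduling request in this process.
        self.host_state_map = {}

    def get_host_states(self, compute_nodes):
        seen = set()
        for node in compute_nodes:
            key = (node.host, node.hypervisor_hostname)
            seen.add(key)
            if key in self.host_state_map:
                self.host_state_map[key].update(node)
            else:
                self.host_state_map[key] = HostState(node)
        # Prune "dead nodes": anything cached that this request's
        # query did not return.
        for dead in set(self.host_state_map) - seen:
            del self.host_state_map[dead]
        # Lazily hand out states from the *shared* map.
        return (self.host_state_map[key] for key in seen
                if key in self.host_state_map)
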
As of Ocata, we started filtering our ComputeNodeList query based on
an answer from placement about which resource providers could satisfy
the request, instead of querying the entire compute node list every
time. This is much more efficient (don't consider compute nodes that
can't possibly fulfill the request) BUT it doesn't play well with the
global host state cache. We started seeing "Removing dead compute node"
messages in the logs, signaling removal of compute nodes from the
global cache when compute nodes were actually available.

If request A comes in and all compute nodes can satisfy it, and
request B arrives concurrently with a request that no compute node
can satisfy, request B will remove all the compute nodes from the
global host state cache. Request A will then get "no valid hosts" at
the filtering stage, because get_host_states_by_uuids returns a
generator that hands out hosts from the global host state cache.
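
Continuing the same toy sketch above (not nova's real code paths), the interleaving can be replayed directly; request A's generator is consumed only after request B's empty query has pruned the shared map:

mgr = HostManager()
all_nodes = [ComputeNode('host1', 'node1'),
             ComputeNode('host2', 'node2')]

# Request A: placement says every compute node can satisfy it.
gen_a = mgr.get_host_states(all_nodes)  # generator; nothing consumed yet

# Request B arrives concurrently: placement returns no matching
# providers, so B's (empty) query prunes everything as "dead".
list(mgr.get_host_states([]))

# Request A finally runs its filters: the shared map is now empty,
# so filtering sees zero hosts -> "no valid hosts".
assert list(gen_a) == []
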
This removes the global host state cache from the scheduler HostManager
and instead generates a fresh host state map per request and uses that
to return hosts from the generator. Because we're filtering the
ComputeNodeList based on a placement query per request, each request
can have a completely different set of compute nodes that can fulfill
it, so we're not gaining much by caching host states anyway.
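
In terms of the same toy model, the fix builds the host state map locally per request, so the generator closes over request-local state and one request's result can never be pruned by another. Again a sketch under the assumptions above, not the actual patch:

class HostManagerFixed(HostManager):
    def get_host_states(self, compute_nodes):
        # Fresh map per request; no cross-request sharing, and no
        # "dead node" pruning needed, since the map only ever holds
        # this request's placement-filtered nodes.
        host_state_map = {}
        for node in compute_nodes:
            key = (node.host, node.hypervisor_hostname)
            host_state_map[key] = HostState(node)
        return iter(host_state_map.values())

# Replaying the interleaving from the earlier sketch: request B's
# empty query no longer disturbs request A's generator.
mgr = HostManagerFixed()
gen_a = mgr.get_host_states(all_nodes)
list(mgr.get_host_states([]))
assert len(list(gen_a)) == 2  # request A is unaffected
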
Co-Authored-By: Dan Smith <email address hidden>
Closes-Bug: #1742827
Related-Bug: #1739323
Change-Id: I40c17ed88f50ecbdedc4daf368fff10e90e7be11
(cherry picked from commit c98ac6adc561d70d34c724703a437b8435e6ddfa)

Reviewed: https://review.openstack.org/539005
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bf65fdd59ea7d12f7420fcfe2602a5f00cc61055
Submitter: Zuul
Branch: stable/pike