KeyError in host_manager for _get_host_states

Bug #1739323 reported by Mohammed Naser
This bug affects 2 people
Affects                    Status         Importance   Assigned to      Milestone
OpenStack Compute (nova)   Fix Released   High         Matthew Booth
  Ocata                    Fix Committed  High         Matt Riedemann
  Pike                     Fix Committed  High         Matt Riedemann

Bug Description

https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L674-L718

In _get_host_states, a list of all compute nodes is retrieved with a `state_key` of `(host, node)`.

https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L692
https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L708

This small piece of code removes all of the dead compute nodes from host_state_map:

https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L708

However, the result is returned by iterating over all seen nodes and using each of those keys to index host_state_map; some of those keys may have been deleted by the code above, resulting in a KeyError.

https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L718
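
For readers without the source open, here is a heavily condensed sketch of the structure described above. It is illustrative only: HostManagerSketch and the string host states are made up, and this is not the actual nova code.

== cut ==
class HostManagerSketch(object):
    def __init__(self):
        # cache shared across scheduling requests, keyed by (host, node)
        self.host_state_map = {}

    def _get_host_states(self, compute_nodes):
        seen_nodes = set()
        for host, node in compute_nodes:
            state_key = (host, node)
            # ensure a cached state exists for this compute node
            self.host_state_map.setdefault(state_key, 'HostState(%s)' % node)
            seen_nodes.add(state_key)
        # prune "dead" compute nodes that were not seen this time around
        for dead_key in set(self.host_state_map) - seen_nodes:
            del self.host_state_map[dead_key]
        # lazily index the shared map with the keys captured above; if an
        # entry is pruned before this generator is consumed, the lookup
        # raises KeyError
        return (self.host_state_map[key] for key in seen_nodes)
== cut ==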

Revision history for this message
Matt Riedemann (mriedem) wrote :
tags: added: scheduler
Revision history for this message
Mohammed Naser (mnaser) wrote :

Correct!

Stack trace added:

== cut ==
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server [req-1a597b99-3f7d-4ecd-be77-56e7c6d39d00 ab901f6a03e64d56a78bf9de0a382709 12c36e260d8e4bb2913965203b1b491f - default default] Exception during message handling: KeyError: (u'compute-21-ca-ymq-2.vexxhost.net', u'compute-21-ca-ymq-2.vexxhost.net')
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 160, in _process_incoming
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 232, in inner
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server return func(*args, **kwargs)
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/scheduler/manager.py", line 149, in select_destinations
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server alloc_reqs_by_rp_uuid, provider_summaries)
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 88, in select_destinations
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server alloc_reqs_by_rp_uuid, provider_summaries)
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 174, in _schedule
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server hosts = self._get_sorted_hosts(spec_obj, hosts, num)
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 301, in _get_sorted_hosts
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server spec_obj, index)
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/scheduler/host_manager.py", line 589, in get_filtered_hosts
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server hosts, spec_obj, index)
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/filters.py", line 69, in get_filtered_objects
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server list_objs = list(objs)
2017-12-20 00:38:58.539 9457 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/scheduler/host_m...

Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

Actually I'm not sure how you're hitting this, unless a concurrent thread is modifying the host_state_map dict at the same time.

When we're looping, we add the state_key (host, node) entry to host_state_map if it's not already there:

https://github.com/openstack/nova/blob/3ebc358002d3f62a9ff0a3b0f50fea4f2a5c2c91/nova/scheduler/host_manager.py#L698

and then we add that key to seen_nodes:

https://github.com/openstack/nova/blob/3ebc358002d3f62a9ff0a3b0f50fea4f2a5c2c91/nova/scheduler/host_manager.py#L708

so everything in seen_nodes should also be in host_state_map when we return the tuple of HostState objects:

https://github.com/openstack/nova/blob/3ebc358002d3f62a9ff0a3b0f50fea4f2a5c2c91/nova/scheduler/host_manager.py#L718
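
A standalone illustration of that reasoning (made-up keys and values, not nova code): within a single call the invariant holds, so the pending generator can only fail if the shared map is pruned again before it is consumed, for example by a subsequent call, which is what the fix below describes.

== cut ==
host_state_map = {}
seen_nodes = set()
for state_key in [('host-a', 'host-a'), ('host-b', 'host-b')]:
    # every key added to seen_nodes is first ensured in host_state_map
    host_state_map.setdefault(state_key, 'HostState')
    seen_nodes.add(state_key)

gen = (host_state_map[key] for key in seen_nodes)
assert all(key in host_state_map for key in seen_nodes)  # holds at this point

# ... but a later call can prune an entry while gen is still pending
del host_state_map[('host-b', 'host-b')]
list(gen)  # KeyError: ('host-b', 'host-b')
== cut ==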

Revision history for this message
Matt Riedemann (mriedem) wrote :

There is nothing else in the HostManager code that modifies the host_state_map so I don't see how it could be getting modified concurrently.

Revision history for this message
Matt Riedemann (mriedem) wrote :

You aren't by chance running with some multiple scheduler thread workers patch, are you?

no longer affects: nova/ocata
no longer affects: nova/pike
Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Matt Riedemann (mriedem) wrote :

I think the HostManagerChangedNodesTestCase unit tests would have also caught a breakage here.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/529352

Changed in nova:
assignee: nobody → Matthew Booth (mbooth-9)
status: Incomplete → In Progress
Changed in nova:
assignee: Matthew Booth (mbooth-9) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Matthew Booth (mbooth-9)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/529626

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/529635

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/529352
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d72b33b986525a9b2c7aa08b609ae386de1d0e89
Submitter: Zuul
Branch: master

commit d72b33b986525a9b2c7aa08b609ae386de1d0e89
Author: Matthew Booth <email address hidden>
Date: Sat Dec 16 20:27:08 2017 +0000

    Fix an error in _get_host_states when deleting a compute node

    _get_host_states returns a generator which closes over seen_nodes, which
    is local, and self.host_state_map, which is global. It also modifies
    self.host_state_map, and will remove entries whose compute nodes are no
    longer present.

    If a compute node is deleted while a filter is still evaluating the
    generator returned by _get_host_states, the entry in self.host_state_map
    will be deleted if _get_host_states is called again. This will cause a
    KeyError when the first generator comes to evaluate the entry for the
    deleted compute node.

    We fix this by modifying the returned generator expression to check
    that a host_state_map entry still exists before returning it. An
    existing unit test is modified to exhibit the bug.

    Change-Id: Ibb7c43a0abc433f93fc3de71146263e6d5923666
    Closes-Bug: #1739323
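
A minimal sketch of the guarded generator expression the commit message describes (illustrative only; host_states_for and the sample data are made up, not the actual nova diff):

== cut ==
def host_states_for(host_state_map, seen_nodes):
    # skip any seen key whose entry has since been pruned from the shared
    # map, instead of raising KeyError when the generator is consumed
    return (host_state_map[key] for key in seen_nodes
            if key in host_state_map)

state_map = {('host-a', 'host-a'): 'HostState-a'}
gen = host_states_for(state_map, {('host-a', 'host-a'), ('host-b', 'host-b')})
print(list(gen))  # ['HostState-a']; the pruned ('host-b', 'host-b') is skipped
== cut ==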

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/529626
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b6a1a192955da3d18f50dfa3130c95fc15d7a66a
Submitter: Zuul
Branch: stable/pike

commit b6a1a192955da3d18f50dfa3130c95fc15d7a66a
Author: Matthew Booth <email address hidden>
Date: Sat Dec 16 20:27:08 2017 +0000

    Fix an error in _get_host_states when deleting a compute node

    _get_host_states returns a generator which closes over seen_nodes, which
    is local, and self.host_state_map, which is global. It also modifies
    self.host_state_map, and will remove entries whose compute nodes are no
    longer present.

    If a compute node is deleted while a filter is still evaluating the
    generator returned by _get_host_states, the entry in self.host_state_map
    will be deleted if _get_host_states is called again. This will cause a
    KeyError when the first generator comes to evaluate the entry for the
    deleted compute node.

    We fix this by modifying the returned generator expression to check
    that a host_state_map entry still exists before returning it. An
    existing unit test is modified to exhibit the bug.

    Change-Id: Ibb7c43a0abc433f93fc3de71146263e6d5923666
    Closes-Bug: #1739323
    (cherry picked from commit d72b33b986525a9b2c7aa08b609ae386de1d0e89)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/529635
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=52679374a404b14578c834f47d6d4d35e85df2e1
Submitter: Zuul
Branch: stable/ocata

commit 52679374a404b14578c834f47d6d4d35e85df2e1
Author: Matthew Booth <email address hidden>
Date: Sat Dec 16 20:27:08 2017 +0000

    Fix an error in _get_host_states when deleting a compute node

    _get_host_states returns a generator which closes over seen_nodes, which
    is local, and self.host_state_map, which is global. It also modifies
    self.host_state_map, and will remove entries whose compute nodes are no
    longer present.

    If a compute node is deleted while a filter is still evaluating the
    generator returned by _get_host_states, the entry in self.host_state_map
    will be deleted if _get_host_states is called again. This will cause a
    KeyError when the first generator comes to evaluate the entry for the
    deleted compute node.

    We fix this by modifying the returned generator expression to check
    that a host_state_map entry still exists before returning it. An
    existing unit test is modified to exhibit the bug.

    Conflicts:
          nova/scheduler/filter_scheduler.py
          nova/scheduler/host_manager.py

    NOTE(mriedem): The conflict in filter_scheduler.py is due to
    48268c73e3f43fa763d071422816942942987f4a in Pike which is not
    needed here. The conflict in host_manager.py is due to
    d1de5385233ce4379b17a7404557c6724dc37cd4 in Pike which is not
    needed here.

    Change-Id: Ibb7c43a0abc433f93fc3de71146263e6d5923666
    Closes-Bug: #1739323
    (cherry picked from commit d72b33b986525a9b2c7aa08b609ae386de1d0e89)
    (cherry picked from commit b6a1a192955da3d18f50dfa3130c95fc15d7a66a)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.0.0b3

This issue was fixed in the openstack/nova 17.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/533371
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c98ac6adc561d70d34c724703a437b8435e6ddfa
Submitter: Zuul
Branch: master

commit c98ac6adc561d70d34c724703a437b8435e6ddfa
Author: melanie witt <email address hidden>
Date: Sat Jan 13 21:49:54 2018 +0000

    Stop globally caching host states in scheduler HostManager

    Currently, in the scheduler HostManager, we cache host states in
    a map global to all requests. This used to be okay because we were
    always querying the entire compute node list for every request to
    pass on to filtering. So we cached the host states globally and
    updated them per request and removed "dead nodes" from the cache
    (compute nodes still in the cache that were not returned from
    ComputeNodeList.get_all).

    As of Ocata, we started filtering our ComputeNodeList query based on
    an answer from placement about which resource providers could satisfy
    the request, instead of querying the entire compute node list every
    time. This is much more efficient (don't consider compute nodes that
    can't possibly fulfill the request) BUT it doesn't play well with the
    global host state cache. We started seeing "Removing dead compute node"
    messages in the logs, signaling removal of compute nodes from the
    global cache when compute nodes were actually available.

    If request A comes in and all compute nodes can satisfy its request,
    then request B arrives concurrently and no compute nodes can satisfy
    its request, request B will remove all the compute nodes
    from the global host state cache and then request A will get "no valid
    hosts" at the filtering stage because get_host_states_by_uuids returns
    a generator that hands out hosts from the global host state cache.

    This removes the global host state cache from the scheduler HostManager
    and instead generates a fresh host state map per request and uses that
    to return hosts from the generator. Because we're filtering the
    ComputeNodeList based on a placement query per request, each request
    can have a completely different set of compute nodes that can fulfill
    it, so we're not gaining much by caching host states anyway.

    Co-Authored-By: Dan Smith <email address hidden>

    Closes-Bug: #1742827
    Related-Bug: #1739323

    Change-Id: I40c17ed88f50ecbdedc4daf368fff10e90e7be11
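
A minimal sketch of the per-request pattern this change describes (illustrative only; get_host_states_sketch is a made-up name, not the actual get_host_states_by_uuids implementation):

== cut ==
def get_host_states_sketch(compute_nodes):
    # build a fresh map for every request instead of mutating a cache
    # shared by concurrent requests; there are no "dead nodes" to prune
    host_state_map = {}
    for host, node in compute_nodes:
        host_state_map[(host, node)] = 'HostState(%s)' % node
    return (state for state in host_state_map.values())

print(list(get_host_states_sketch([('c1', 'c1'), ('c2', 'c2')])))
# ['HostState(c1)', 'HostState(c2)']
== cut ==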

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/539005

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/539013

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/539005
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bf65fdd59ea7d12f7420fcfe2602a5f00cc61055
Submitter: Zuul
Branch: stable/pike

commit bf65fdd59ea7d12f7420fcfe2602a5f00cc61055
Author: melanie witt <email address hidden>
Date: Sat Jan 13 21:49:54 2018 +0000

    Stop globally caching host states in scheduler HostManager

    Currently, in the scheduler HostManager, we cache host states in
    a map global to all requests. This used to be okay because we were
    always querying the entire compute node list for every request to
    pass on to filtering. So we cached the host states globally and
    updated them per request and removed "dead nodes" from the cache
    (compute nodes still in the cache that were not returned from
    ComputeNodeList.get_all).

    As of Ocata, we started filtering our ComputeNodeList query based on
    an answer from placement about which resource providers could satisfy
    the request, instead of querying the entire compute node list every
    time. This is much more efficient (don't consider compute nodes that
    can't possibly fulfill the request) BUT it doesn't play well with the
    global host state cache. We started seeing "Removing dead compute node"
    messages in the logs, signaling removal of compute nodes from the
    global cache when compute nodes were actually available.

    If request A comes in and all compute nodes can satisfy its request,
    then request B arrives concurrently and no compute nodes can satisfy
    its request, request B will remove all the compute nodes
    from the global host state cache and then request A will get "no valid
    hosts" at the filtering stage because get_host_states_by_uuids returns
    a generator that hands out hosts from the global host state cache.

    This removes the global host state cache from the scheduler HostManager
    and instead generates a fresh host state map per request and uses that
    to return hosts from the generator. Because we're filtering the
    ComputeNodeList based on a placement query per request, each request
    can have a completely different set of compute nodes that can fulfill
    it, so we're not gaining much by caching host states anyway.

    Co-Authored-By: Dan Smith <email address hidden>

    Closes-Bug: #1742827
    Related-Bug: #1739323

    Change-Id: I40c17ed88f50ecbdedc4daf368fff10e90e7be11
    (cherry picked from commit c98ac6adc561d70d34c724703a437b8435e6ddfa)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.0

This issue was fixed in the openstack/nova 16.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.1.1

This issue was fixed in the openstack/nova 15.1.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ocata)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/539013
Reason: Looks dead. I'm cleaning up old changes to get ready for ocata-em.
