KeyError when booting multi-stagger-instances

Bug #1809061 reported by John Smith
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
New
Undecided
Unassigned

Bug Description

Description
===========
Bulk-booting multiple instances within a short time may trigger a KeyError in nova-scheduler.log
when the instances request different amounts of resources and the compute nodes also have
different capacities.

Steps to reproduce
==================
For example, I have four compute nodes:
host1 to host3: 24 CPUs and 120G RAM each
host4: 12 CPUs and 40G RAM

I then boot 12 instances at the same time from separate commands: one of them needs 16 CPUs and
48G RAM, the others each need 1 CPU and 1G RAM.

The fault then appears: some of the instances go to ERROR state.
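
A minimal repro sketch, assuming python-novaclient and placeholder auth settings, image, flavor
and network IDs (all of these values are hypothetical and must be replaced with real ones):

    from concurrent.futures import ThreadPoolExecutor

    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from novaclient import client

    # Hypothetical credentials and endpoint; adjust for the target cloud.
    auth = v3.Password(auth_url='http://controller:5000/v3',
                       username='admin', password='secret',
                       project_name='admin',
                       user_domain_name='Default',
                       project_domain_name='Default')
    nova = client.Client('2.1', session=session.Session(auth=auth))

    def boot(name, flavor_id):
        # Each request goes through nova-scheduler; racing many of them
        # at once is what triggers the KeyError described above.
        return nova.servers.create(name=name, image='IMAGE_UUID',
                                   flavor=flavor_id,
                                   nics=[{'net-id': 'NET_UUID'}])

    with ThreadPoolExecutor(max_workers=12) as pool:
        # One large instance (16 CPUs / 48G RAM) plus eleven small ones
        # (1 CPU / 1G RAM), submitted concurrently; flavor IDs are placeholders.
        futures = [pool.submit(boot, 'big-0', 'FLAVOR_ID_16C_48G')]
        futures += [pool.submit(boot, 'small-%d' % i, 'FLAVOR_ID_1C_1G')
                    for i in range(11)]
        for f in futures:
            f.result()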

Expected result
===============
All instances boot successfully.

Actual result
=============
Some instances go to ERROR state.

Environment
===========
OpenStack version:
Queens

Hypervisor:
Libvirt + KVM

Storage:
LVM

Networking:
Neutron with Open vSwitch

Logs & Configs
==============
In nova-scheduler.log

2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server [req-7051b4f3-bfdc-4ca0-9436-8fc4448867c8 c3dba5032e49416896c7050ef6c3cad4 de45f83097b64290923d871f7350fd6e - default default] Exception during message handling: KeyError: (u'host4', u'host4')
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 160, in _process_incoming
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 232, in inner
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server return func(*args, **kwargs)
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/scheduler/manager.py", line 179, in select_destinations
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server alloc_reqs_by_rp_uuid, provider_summaries)
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 88, in select_destinations
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server alloc_reqs_by_rp_uuid, provider_summaries)
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 167, in _schedule
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server placement_return_available_hosts = list(hosts)
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/scheduler/host_manager.py", line 794, in <genexpr>
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server return (self.host_state_map[host] for host in seen_nodes)
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server KeyError: (u'host4', u'host4')

Tags: scheduler
Revision history for this message
Matt Riedemann (mriedem) wrote :

Are you able to reproduce with debug logging enabled and then provide (attach) the full scheduler log so that we can get the details about which filters and such are used?

tags: added: scheduler
Revision history for this message
Matt Riedemann (mriedem) wrote :

Also, I'm not seeing this code in stable/queens:

2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/scheduler/filter_scheduler.py", line 167, in _schedule
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server placement_return_available_hosts = list(hosts)
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/scheduler/host_manager.py", line 794, in <genexpr>
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server return (self.host_state_map[host] for host in seen_nodes)
2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server KeyError: (u'host4', u'host4')

Can you give a more specific version? 17.0.8? 17.0.0?

Do you have any customizations to the scheduler code?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Oh I guess it's the generator here:

https://github.com/openstack/nova/blob/17.0.0/nova/scheduler/host_manager.py#L710

Hmm, this change sounds related to this:

https://github.com/openstack/nova/commit/c98ac6adc561d70d34c724703a437b8435e6ddfa#diff-978b9f8734365934eaf8fbb01f11a7d7

But that's in the 17.0.0 queens release. So I don't see how we could hit a KeyError there when that host_state_map is not global.
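
For illustration, a toy Python snippet (not nova code) showing how a generator built over a shared
host_state_map can raise exactly this KeyError when the map is mutated before the generator is
consumed:

    # Stand-in for HostManager.host_state_map, keyed by (host, node).
    host_state_map = {('host4', 'host4'): 'HostState for host4'}
    seen_nodes = {('host4', 'host4')}

    # Same shape as host_manager.py's
    #   return (self.host_state_map[host] for host in seen_nodes)
    hosts = (host_state_map[host] for host in seen_nodes)

    # A concurrent request handled in the same scheduler process drops host4
    # from the map before the generator above is iterated.
    del host_state_map[('host4', 'host4')]

    list(hosts)  # KeyError: ('host4', 'host4'), matching the traceback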

Revision history for this message
melanie witt (melwitt) wrote :

Yeah, judging from the code line in the backtrace:

  2018-12-10 15:05:15.029 26837 ERROR oslo_messaging.rpc.server return (self.host_state_map[host] for host in seen_nodes)

they're running a version earlier than Queens or had pulled Queens from master before fixes landed.

This bug looks like a duplicate of:

  https://bugs.launchpad.net/nova/+bug/1739323

which was fixed by this change:

  https://review.openstack.org/529352

which was backported to Pike and Ocata at the time.
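
For reference, a hedged sketch of the kind of guard such a change implies (see the review above
for the exact patch; this is only an illustration): skip any host that has since disappeared from
the map, so consuming the generator no longer raises KeyError.

    def get_host_states(host_state_map, seen_nodes):
        # Only yield hosts that are still in the map at consumption time.
        return (host_state_map[host] for host in seen_nodes
                if host in host_state_map)

    host_state_map = {('host4', 'host4'): 'HostState for host4'}
    hosts = get_host_states(host_state_map, {('host4', 'host4')})
    del host_state_map[('host4', 'host4')]
    print(list(hosts))  # prints [] instead of raising KeyError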

Revision history for this message
John Smith (wang-zengzhi) wrote :

Dear melanie witt, thank you for your reply.
I'm sorry, I reported the wrong version of our environment.
Yes, it is from an earlier master commit before Queens.

This is really the same as bug #1739323.
