keyerror when refreshing instance security groups

Bug #1484738 reported by Sam Morrison
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
High
Matt Riedemann
Juno
High
Matt Riedemann
Kilo
High
Matt Riedemann

Bug Description

On a clean kilo install using source security groups I am seeing the following trace on boot and delete

a2413f7] Deallocating network for instance _deallocate_network /usr/lib/python2.7/dist-packages/nova/compute/manager.py:2098
2015-08-14 09:46:06.688 11618 ERROR oslo_messaging.rpc.dispatcher [req-b8f44d34-96b2-4e40-ac22-15ccc6e44e59 - - - - -] Exception during message handling: 'metadata'
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher executor_callback))
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher executor_callback)
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher result = func(ctxt, **new_args)
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6772, in refresh_instance_security_rules
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher return self.manager.refresh_instance_security_rules(ctxt, instance)
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 434, in decorated_function
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher args = (_load_instance(args[0]),) + args[1:]
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 425, in _load_instance
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher expected_attrs=metas)
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/objects/instance.py", line 506, in _from_db_object
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher instance['metadata'] = utils.instance_meta(db_inst)
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 817, in instance_meta
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher if isinstance(instance['metadata'], dict):
2015-08-14 09:46:06.688 11618 TRACE oslo_messaging.rpc.dispatcher KeyError: 'metadata'

Changed in nova:
assignee: nobody → Zhenzan Zhou (zhenzan-zhou)
tags: added: security-groups
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Sam Morrison:
Could you please add more information about how this can reproduced? What does your setup look like? Are all services running on the Kilo release or is it mixed Kilo/Liberty code? Can you reproduce it on Liberty as well?

Revision history for this message
Zhenzan Zhou (zhenzan-zhou) wrote :

I tried on my devstack environment with latest Liberty code, cannot reproduce it. Need more info as asked by Markus.

Changed in nova:
status: New → Incomplete
Revision history for this message
Sam Morrison (sorrison) wrote :

This is using stable/kilo and happens when I boot or delete an instance with source security groups and an instance with the source exists already.

Liberty isn't released yet so haven't looked at that.

Revision history for this message
Sam Morrison (sorrison) wrote :

info provided

Changed in nova:
status: Incomplete → New
Revision history for this message
Andrew Bogott (andrewbogott) wrote :

This happens to me each time I try to create or delete an instance. It appears to be preventing any notifications from being sent, which is playing hell with designate.

Revision history for this message
Andrew Bogott (andrewbogott) wrote :

Running a kilo/stable install which I just now upgraded from (working) Juno.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks similar to bug 1452510 which was invalidated for some reason, no explanation.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Andrews trace from comment 6:

2015-09-10 02:20:20.408 6436 INFO nova.virt.libvirt.driver [req-b79a0b96-5fee-4178-8db6-6516305307c5 novaadmin testlabs - - -] [instance: d8ebcc1b-b1bf-461b-9c13-592b02848ff5] Creating image
2015-09-10 02:20:20.559 6436 ERROR oslo_messaging.rpc.dispatcher [req-dd34a875-8f3d-4dbd-bf85-87ddddc53189 - - - - -] Exception during message handling: 'metadata'
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher executor_callback))
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher executor_callback)
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher result = func(ctxt, **new_args)
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6772, in refresh_instance_security_rules
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher return self.manager.refresh_instance_security_rules(ctxt, instance)
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 434, in decorated_function
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher args = (_load_instance(args[0]),) + args[1:]
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 425, in _load_instance
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher expected_attrs=metas)
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/objects/instance.py", line 506, in _from_db_object
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher instance['metadata'] = utils.instance_meta(db_inst)
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 817, in instance_meta
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher if isinstance(instance['metadata'], dict):
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher KeyError: 'metadata'
2015-09-10 02:20:20.559 6436 TRACE oslo_messaging.rpc.dispatcher

Revision history for this message
Matt Riedemann (mriedem) wrote :

Andrew/Sam - which version of kilo? trunk stable/kilo or 2015.1? 2015.1.1?

Also, using neutron or nova-network?

Revision history for this message
Matt Riedemann (mriedem) wrote :

(9:32:31 PM) mriedem: andrewbogott: btw, are you on trunk stable/kilo or like 2015.1?
(9:32:43 PM) andrewbogott: I’m using the cloud archive
(9:32:58 PM) andrewbogott: nova-compute 1:2015.1.1-0ubuntu1~cloud2
(9:34:00 PM) mriedem: nova-network?
(9:34:14 PM) andrewbogott: afraid so :(
(9:35:43 PM) mriedem: i was wondering if it was https://review.openstack.org/#/c/211162/
(9:36:32 PM) mriedem: i think that was actually an AttributeError though, not a KeyError

Changed in nova:
status: New → Confirmed
importance: Undecided → High
tags: added: kilo-backport-potential
tags: added: network
Matt Riedemann (mriedem)
tags: added: unified-objects
Revision history for this message
Matt Riedemann (mriedem) wrote :

The code obviously looks wrong, if you look at the call chain, we're in:

https://github.com/openstack/nova/blob/stable/kilo/nova/compute/manager.py#L6798

Which is calling the method in the compute manager which has the object_compat decorator:

https://github.com/openstack/nova/blob/stable/kilo/nova/compute/manager.py#L1345

Then we get here:

https://github.com/openstack/nova/blob/stable/kilo/nova/compute/manager.py#L421

And metas is:

metas = ['metadata', 'system_metadata'] and used above:

expected_attrs=metas

But metas wouldn't be set there?

So we get the KeyError and then try this:

args = (_load_instance(args[0]),) + args[1:]

but again metas isn't set, right?

I guess it must be in scope though otherwise there would be a different error.

Plus we get here:

https://github.com/openstack/nova/blob/stable/kilo/nova/objects/instance.py#L506

Which means metadata was asked for, but it's not in the instance from the database so it blows up. Which means that whatever in the nova-api was calling refresh_instance_security_rules wasn't originally asking for 'metadata' so it's not in the db instance passed to refresh_instance_security_rules which goes into the object_compat decorator and we fail.

Revision history for this message
Sam Morrison (sorrison) wrote :

We're using stable/kilo, haven't resynced from upstream for about a week but having a look I can't see anything that we're missing that would fix it.

Revision history for this message
Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/187537/ is the reason why this isn't a problem in liberty - because we pass an instance object from compute API to the compute manager so when the object_compat decorator is hit, we already have an instance object and we don't need to call _from_db_object.

That change has RPC API version changes though so we can't backport it to stable/kilo.

We should just have checks in the object_compat decorator to not try and load attributes that aren't actually in the db instance (dict).

Matt Riedemann (mriedem)
Changed in nova:
status: Confirmed → In Progress
assignee: Zhenzan Zhou (zhenzan-zhou) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/222022

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/222023

Matt Riedemann (mriedem)
tags: removed: kilo-backport-potential
tags: added: compute
removed: network security-groups unified-objects
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/222026

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/222022
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9369aab04e37b7818d49b00e65857be8b3564e9e
Submitter: Jenkins
Branch: master

commit 9369aab04e37b7818d49b00e65857be8b3564e9e
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 9 20:29:09 2015 -0700

    Don't expect meta attributes in object_compat that aren't in the db obj

    The object_compat decorator expects to get the Instance object with
    'metadata' and 'system_metadata' attributes but if those aren't in the
    db instance dict object, Instance._from_db_object will fail with a
    KeyError.

    In Kilo this happens per refresh_instance_security_rules because in the
    compute API code, the instance passed to refresh_instance_security_rules
    comes from the call to get the security group(s) which joins on the
    instances column, but that doesn't join on the metadata/system_metadata
    fields for the instances. So when the instances get to object_compat in
    the compute manager and the db instance dict is converted to the
    Instance object, it expects fields that aren't in the dict and we get
    the KeyError.

    The refresh_instance_security_rules case is fixed in Liberty per commit
    12fbe6f082ef9b70b89302e15daa12e851e507a7 - in that case the compute API
    passes Instance objects to the compute manager so object_compat doesn't
    have anything to do, _load_instance just sees instance_or_dict isn't a
    dict and ignores it.

    We're making this change since (1) it's an obviously wrong assumption in
    object_compat and should be fixed and (2) we need to backport this fix
    to stable/kilo since it's an upgrade impact for users there.

    Closes-Bug: #1484738

    Change-Id: I36a954c095a9aa35879200784dc18e35edf689e6

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/222026
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dc5ced25c61fc742c513abc03b92e2d8f8042d8c
Submitter: Jenkins
Branch: stable/juno

commit dc5ced25c61fc742c513abc03b92e2d8f8042d8c
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 9 20:29:09 2015 -0700

    Don't expect meta attributes in object_compat that aren't in the db obj

    The object_compat decorator expects to get the Instance object with
    'metadata' and 'system_metadata' attributes but if those aren't in the
    db instance dict object, Instance._from_db_object will fail with a
    KeyError.

    In Kilo this happens per refresh_instance_security_rules because in the
    compute API code, the instance passed to refresh_instance_security_rules
    comes from the call to get the security group(s) which joins on the
    instances column, but that doesn't join on the metadata/system_metadata
    fields for the instances. So when the instances get to object_compat in
    the compute manager and the db instance dict is converted to the
    Instance object, it expects fields that aren't in the dict and we get
    the KeyError.

    The refresh_instance_security_rules case is fixed in Liberty per commit
    12fbe6f082ef9b70b89302e15daa12e851e507a7 - in that case the compute API
    passes Instance objects to the compute manager so object_compat doesn't
    have anything to do, _load_instance just sees instance_or_dict isn't a
    dict and ignores it.

    We're making this change since (1) it's an obviously wrong assumption in
    object_compat and should be fixed and (2) we need to backport this fix
    to stable/kilo since it's an upgrade impact for users there.

    Closes-Bug: #1484738

    Conflicts:
            nova/tests/unit/compute/test_compute.py

    NOTE(mriedem): The conflict is due to the unit tests being moved in
    kilo, otherwise this is unchanged.

    Change-Id: I36a954c095a9aa35879200784dc18e35edf689e6
    (cherry picked from commit 9369aab04e37b7818d49b00e65857be8b3564e9e)
    (cherry picked from commit 08d1153d3be9f8d59aa0acc03eedd45a1697ed7b)

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/kilo)

Reviewed: https://review.openstack.org/222023
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=08d1153d3be9f8d59aa0acc03eedd45a1697ed7b
Submitter: Jenkins
Branch: stable/kilo

commit 08d1153d3be9f8d59aa0acc03eedd45a1697ed7b
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 9 20:29:09 2015 -0700

    Don't expect meta attributes in object_compat that aren't in the db obj

    The object_compat decorator expects to get the Instance object with
    'metadata' and 'system_metadata' attributes but if those aren't in the
    db instance dict object, Instance._from_db_object will fail with a
    KeyError.

    In Kilo this happens per refresh_instance_security_rules because in the
    compute API code, the instance passed to refresh_instance_security_rules
    comes from the call to get the security group(s) which joins on the
    instances column, but that doesn't join on the metadata/system_metadata
    fields for the instances. So when the instances get to object_compat in
    the compute manager and the db instance dict is converted to the
    Instance object, it expects fields that aren't in the dict and we get
    the KeyError.

    The refresh_instance_security_rules case is fixed in Liberty per commit
    12fbe6f082ef9b70b89302e15daa12e851e507a7 - in that case the compute API
    passes Instance objects to the compute manager so object_compat doesn't
    have anything to do, _load_instance just sees instance_or_dict isn't a
    dict and ignores it.

    We're making this change since (1) it's an obviously wrong assumption in
    object_compat and should be fixed and (2) we need to backport this fix
    to stable/kilo since it's an upgrade impact for users there.

    Closes-Bug: #1484738

    Change-Id: I36a954c095a9aa35879200784dc18e35edf689e6
    (cherry picked from commit 9369aab04e37b7818d49b00e65857be8b3564e9e)

Thierry Carrez (ttx)
Changed in nova:
milestone: none → liberty-rc1
status: Fix Committed → Fix Released
Revision history for this message
Matt Riedemann (mriedem) wrote :

OSSA bug 1491307 is fixed by the same fix for this bug.

tags: added: security-groups
Thierry Carrez (ttx)
Changed in nova:
milestone: liberty-rc1 → 12.0.0
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers