nova-compute: instance created in self-referencing secgroup produces KeyError

Bug #1182131 reported by Brano Zarnovican
This bug affects 4 people
Affects                    Status        Importance  Assigned to
OpenStack Compute (nova)   Fix Released  High        Dan Smith
nova (Havana)              Fix Released  Undecided   Unassigned
nova (Icehouse)            Fix Released  Undecided   Unassigned

Bug Description

Hi,

Steps to reproduce:

1) create a security group that references itself, for example:
euca-create-group test2
euca-authorize test2 -P tcp -p 22 -s 0.0.0.0/0
euca-authorize test2 -P tcp -p 6666 -o test2

2) create any instance in this security group

euca-run-instances .. -g test2 ..

Expected result:
No stacktrace is thrown.
Actual result:
A stacktrace with a KeyError appears in the log. The iptables rules are created correctly and the instance ends up in the running state.

File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 390, in refresh_instance_security_rules
  return self.driver.refresh_instance_security_rules(instance)
File "/usr/lib/python2.6/site-packages/nova/virt/libvirt/driver.py", line 2269, in refresh_instance_security_rules
  self.firewall_driver.refresh_instance_security_rules(instance)
File "/usr/lib/python2.6/site-packages/nova/virt/firewall.py", line 440, in refresh_instance_security_rules
  self.do_refresh_instance_rules(instance)
File "/usr/lib/python2.6/site-packages/nova/virt/firewall.py", line 457, in do_refresh_instance_rules
  network_info = self.network_infos[instance['id']]
KeyError: 4168

It seems that self.network_infos is accessed in the wrong order for a security group that references itself. The stacktrace comes from 'do_refresh_instance_rules', which expects the network info to already be present for the instance being created. The reported KeyError value is the id of the newly created instance. The dictionary entry is added a few seconds later, while processing the same request.
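[Editorial note: a minimal sketch of the race described above. The class and method names mirror nova's firewall driver for readability, but the bodies are hypothetical, not nova's actual code. The point is that refresh can run before prepare has populated the per-instance cache, so a bare dict lookup raises KeyError; a .get() guard is one defensive variant.]

```python
class FirewallDriverSketch:
    """Illustrates the ordering bug: refresh before prepare -> KeyError."""

    def __init__(self):
        self.network_infos = {}  # instance id -> cached network_info

    def prepare_instance_filter(self, instance, network_info):
        # The cache entry only appears once filter setup completes.
        self.network_infos[instance['id']] = network_info

    def do_refresh_instance_rules(self, instance):
        # The Folsom-era code effectively did:
        #   network_info = self.network_infos[instance['id']]  # KeyError: 4168
        # A defensive variant skips instances that are not cached yet:
        network_info = self.network_infos.get(instance['id'])
        if network_info is None:
            return None  # not prepared yet; nothing to refresh
        return network_info
```

With a self-referencing group, the refresh triggered by the group membership change can arrive for the instance before its own prepare step has run, which is exactly the window the sketch models.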

Fortunately, this issue does not appear to have any negative impact aside from the stacktrace in the log.

Openstack version: Folsom 2012.2.4

Attaching verbose log from nova-compute.

Regards,

Brano Zarnovican

Revision history for this message
Brano Zarnovican (zarnovican) wrote :
Revision history for this message
Avinash Prasad (avinash-prasad) wrote :

I am not able to reproduce the stacktrace on stable/grizzly at all. It works correctly for me.
I ran the following,

euca-add-group test3 -d test3
GROUP test3 test3

euca-authorize test3 -P tcp -p 22 -s 0.0.0.0/0
GROUP test3
PERMISSION test3 ALLOWS tcp 22 22 FROM CIDR 0.0.0.0/0

euca-authorize test3 -P tcp -p 6666 -o test3
GROUP test3
PERMISSION test3 ALLOWS tcp 6666 6666 GRPNAME test3 FROM CIDR 0.0.0.0/0

euca-run-instances <image_id> -g test3
gives all the relevant information about the instance and its status.

Revision history for this message
Brano Zarnovican (zarnovican) wrote :

Unfortunately, I'm not able to test it on Grizzly. However, I can reproduce it consistently on the Folsom build from EPEL as well.

# rpm -qi openstack-nova
Name : openstack-nova Relocations: (not relocatable)
Version : 2012.2.3 Vendor: Fedora Project
Release : 1.el6 Build Date: Mon 04 Feb 2013 03:00:18 PM CET
Install Date: Thu 23 May 2013 09:23:27 AM CEST Build Host: ppc12.phx2.fedoraproject.org
Group : Applications/System Source RPM: openstack-nova-2012.2.3-1.el6.src.rpm
...

Changed in nova:
status: New → Incomplete
Revision history for this message
Adin Scannell (amscanne) wrote :

I am also seeing this bug (or something very similar) on Grizzly:

2013-05-28 07:58:33.969 ERROR nova.openstack.common.rpc.amqp [req-8d5edb8b-ba0c-4552-9e8b-e6ec47f2fa75 e5195d7000ab4f55a1b542d311a59506 5dc6d827aeef43c4b5165e3e045e42c2] Exception during message handling
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp Traceback (most recent call last):
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/amqp.py", line 430, in _process_data
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp rval = self.proxy.dispatch(ctxt, version, method, **args)
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/dispatcher.py", line 133, in dispatch
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp return getattr(proxyobj, method)(ctxt, **kwargs)
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 117, in wrapped
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp temp_level, payload)
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp self.gen.next()
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 94, in wrapped
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp return f(self, context, *args, **kw)
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 657, in refresh_instance_security_rules
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp return _sync_refresh()
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/openstack/common/lockutils.py", line 242, in inner
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp retval = f(*args, **kwargs)
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 656, in _sync_refresh
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp return self.driver.refresh_instance_security_rules(instance)
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2778, in refresh_instance_security_rules
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp self.firewall_driver.refresh_instance_security_rules(instance)
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/virt/firewall.py", line 453, in refresh_instance_security_rules
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp self.do_refresh_instance_rules(instance)
2013-05-28 07:58:33.969 27617 TRACE nova.openstack.common.rpc.amqp File "/usr/lib/...


Eric Brown (ericwb)
summary: - nova-compue: instance created in self-referencing secgroup produces
+ nova-compute: instance created in self-referencing secgroup produces
KeyError
Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
status: Incomplete → Confirmed
tags: added: compute network
Revision history for this message
Matt Riedemann (mriedem) wrote :

As pointed out in the bug description, this doesn't appear to cause issues. The logstash query supports that: the job still has a 94% success rate when this shows up.

no longer affects: nova/folsom
Matt Riedemann (mriedem)
tags: added: havana-backport-potential icehouse-backport-potential
Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
importance: Undecided → High
status: Confirmed → Triaged
status: Triaged → In Progress
assignee: nobody → Dan Smith (danms)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/104335

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/104325
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=72dd81343e73baf838bc58d413413f0d57018f15
Submitter: Jenkins
Branch: master

commit 72dd81343e73baf838bc58d413413f0d57018f15
Author: Dan Smith <email address hidden>
Date: Wed Jul 2 12:44:48 2014 -0700

    Avoid referencing stale instance/network_info dicts in firewall

    This makes the virt.firewall code cleaner in terms of referencing
    the cached instance and network_info code it stores. Before this
    patch, concurrent instance operations could modify these two dicts
    so that while we're iterating instances, the network_info dict
    is suddenly missing information we need.

    The right fix for this is to use instance objects and their
    associated info_cache objects, but that's a larger fix and one
    not as well-suited to backporting to previous releases which
    suffer from this as well.

    The approach taken here is that we store the instance and
    network_info cache together in the same dict that we can pop()
    from atomically (this is not really necessary, but helps to
    prevent introducing more of these cases). When we iterate over
    the contents, we iterate over a copy of the keys, being careful
    not to let a suddenly-missing key break us, and passing the
    details all the way down the stack instead of having deeper calls
    hit the cache dicts again.

    Change-Id: I33366f50024a82451842d045b830ab19b59879c3
    Closes-bug: #1182131
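[Editorial note: a rough illustration of the pattern the commit message describes. The names are hypothetical, not nova's actual attributes. The idea is to keep the instance and its network_info in a single dict entry so they can be pop()'d atomically, and to iterate over a snapshot of the keys so a concurrent delete cannot break the loop.]

```python
class CacheSketch:
    """Combined instance/network_info cache with race-tolerant iteration."""

    def __init__(self):
        self._infos = {}  # instance id -> (instance, network_info)

    def add(self, instance, network_info):
        # Both pieces of state land in the cache in one assignment.
        self._infos[instance['id']] = (instance, network_info)

    def remove(self, instance_id):
        # pop() with a default removes the entry atomically and never raises.
        return self._infos.pop(instance_id, None)

    def refresh_all(self, refresh_fn):
        # Iterate over a copy of the keys: entries may vanish mid-loop.
        for instance_id in list(self._infos.keys()):
            entry = self._infos.get(instance_id)
            if entry is None:
                continue  # deleted concurrently; skip it
            instance, network_info = entry
            # Pass the details down the stack instead of having deeper
            # calls hit the cache dict again.
            refresh_fn(instance, network_info)
```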

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
Sean Dague (sdague) wrote :

I think this bug is fixed; however, the underlying races in firewall manipulation still seem to be there.

Changed in nova:
status: Fix Committed → Confirmed
status: Confirmed → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/104581

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/104581
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6aa368b99249d01f8fd7183c15d11986ad6a6fb7
Submitter: Jenkins
Branch: master

commit 6aa368b99249d01f8fd7183c15d11986ad6a6fb7
Author: Dan Smith <email address hidden>
Date: Thu Jul 3 08:09:39 2014 -0700

    Avoid re-adding iptables rules for instances that have disappeared

    The remove_filters_for_instance() method fails silently if the
    instance's chain is gone (i.e. it's been deleted). If this
    happens while we're refreshing security group rules, we will
    not notice this case and re-add stale rules for an old instance,
    breaking our firewall for new instances.

    This adds a quick check after we've captured the lock to see if
    the associated chain exists, and bails if it doesn't.

    Change-Id: Ic75988939f82de49735d85fe99a9eecd4baf45c9
    Related-bug: #1182131
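[Editorial note: a hypothetical sketch of the guard this related fix describes; the names do not match nova's actual code. After taking the per-instance lock, the driver checks that the instance's iptables chain still exists before re-adding rules, and bails if the instance was deleted in the meantime.]

```python
import threading


class ChainGuardSketch:
    """Skips rule re-addition when the instance's chain is already gone."""

    def __init__(self):
        self._lock = threading.Lock()
        self.chains = set()  # names of existing per-instance chains
        self.rules = []      # flat list of (chain, rule) pairs

    def add_filters_for_instance(self, chain_name, instance_rules):
        with self._lock:
            # Quick check after capturing the lock: if the chain was
            # removed (instance deleted), do not re-add stale rules.
            if chain_name not in self.chains:
                return False
            self.rules.extend((chain_name, r) for r in instance_rules)
            return True
```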

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/icehouse)

Related fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/106792

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/icehouse)

Reviewed: https://review.openstack.org/104335
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9f5d2a6a65edb2a5862c7f07b29ac700cec61f84
Submitter: Jenkins
Branch: stable/icehouse

commit 9f5d2a6a65edb2a5862c7f07b29ac700cec61f84
Author: Dan Smith <email address hidden>
Date: Wed Jul 2 12:44:48 2014 -0700

    Avoid referencing stale instance/network_info dicts in firewall

    This makes the virt.firewall code cleaner in terms of referencing
    the cached instance and network_info code it stores. Before this
    patch, concurrent instance operations could modify these two dicts
    so that while we're iterating instances, the network_info dict
    is suddenly missing information we need.

    The right fix for this is to use instance objects and their
    associated info_cache objects, but that's a larger fix and one
    not as well-suited to backporting to previous releases which
    suffer from this as well.

    The approach taken here is that we store the instance and
    network_info cache together in the same dict that we can pop()
    from atomically (this is not really necessary, but helps to
    prevent introducing more of these cases). When we iterate over
    the contents, we iterate over a copy of the keys, being careful
    not to let a suddenly-missing key break us, and passing the
    details all the way down the stack instead of having deeper calls
    hit the cache dicts again.

    Conflicts:
     nova/virt/firewall.py

    Change-Id: Ib1cfd68fca7789ca4a3602c25e6b8a956303fff9
    Closes-bug: #1182131
    (cherry picked from commit 72dd81343e73baf838bc58d413413f0d57018f15)

tags: added: in-stable-icehouse
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/icehouse)

Reviewed: https://review.openstack.org/106792
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=132d5a2064cfdc579a2e51f07e10e8eafbdf6327
Submitter: Jenkins
Branch: stable/icehouse

commit 132d5a2064cfdc579a2e51f07e10e8eafbdf6327
Author: Dan Smith <email address hidden>
Date: Thu Jul 3 08:09:39 2014 -0700

    Avoid re-adding iptables rules for instances that have disappeared

    The remove_filters_for_instance() method fails silently if the
    instance's chain is gone (i.e. it's been deleted). If this
    happens while we're refreshing security group rules, we will
    not notice this case and re-add stale rules for an old instance,
    breaking our firewall for new instances.

    This adds a quick check after we've captured the lock to see if
    the associated chain exists, and bails if it doesn't.

    Change-Id: Ic75988939f82de49735d85fe99a9eecd4baf45c9
    Related-bug: #1182131
    (cherry picked from commit 6aa368b99249d01f8fd7183c15d11986ad6a6fb7)

Changed in nova:
milestone: none → juno-2
status: Fix Committed → Fix Released
Chuck Short (zulcss)
tags: removed: icehouse-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/115267

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/havana)

Related fix proposed to branch: stable/havana
Review: https://review.openstack.org/115268

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/havana)

Reviewed: https://review.openstack.org/115268
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d469733e59872a897737db6ed5fd92d57986aee4
Submitter: Jenkins
Branch: stable/havana

commit d469733e59872a897737db6ed5fd92d57986aee4
Author: Dan Smith <email address hidden>
Date: Thu Jul 3 08:09:39 2014 -0700

    Avoid re-adding iptables rules for instances that have disappeared

    The remove_filters_for_instance() method fails silently if the
    instance's chain is gone (i.e. it's been deleted). If this
    happens while we're refreshing security group rules, we will
    not notice this case and re-add stale rules for an old instance,
    breaking our firewall for new instances.

    This adds a quick check after we've captured the lock to see if
    the associated chain exists, and bails if it doesn't.

    Conflicts:
     nova/virt/firewall.py

    Required changes:
    - unit test was updated because in Havana, IptablesFirewallDriver has
      .instances instead of .instance_infos, and a separate .network_infos
      attribute.

    Change-Id: Ic75988939f82de49735d85fe99a9eecd4baf45c9
    Related-bug: #1182131
    (cherry picked from commit 6aa368b99249d01f8fd7183c15d11986ad6a6fb7)
    (cherry picked from commit 132d5a2064cfdc579a2e51f07e10e8eafbdf6327)

tags: added: in-stable-havana
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/havana)

Reviewed: https://review.openstack.org/115267
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8e2d6963cafcead0fe61673ece354f646de0f630
Submitter: Jenkins
Branch: stable/havana

commit 8e2d6963cafcead0fe61673ece354f646de0f630
Author: Dan Smith <email address hidden>
Date: Wed Jul 2 12:44:48 2014 -0700

    Avoid referencing stale instance/network_info dicts in firewall

    This makes the virt.firewall code cleaner in terms of referencing
    the cached instance and network_info code it stores. Before this
    patch, concurrent instance operations could modify these two dicts
    so that while we're iterating instances, the network_info dict
    is suddenly missing information we need.

    The right fix for this is to use instance objects and their
    associated info_cache objects, but that's a larger fix and one
    not as well-suited to backporting to previous releases which
    suffer from this as well.

    The approach taken here is that we store the instance and
    network_info cache together in the same dict that we can pop()
    from atomically (this is not really necessary, but helps to
    prevent introducing more of these cases). When we iterate over
    the contents, we iterate over a copy of the keys, being careful
    not to let a suddenly-missing key break us, and passing the
    details all the way down the stack instead of having deeper calls
    hit the cache dicts again.

    Conflicts:
     nova/virt/firewall.py

    Change-Id: Ib1cfd68fca7789ca4a3602c25e6b8a956303fff9
    Closes-bug: #1182131
    (cherry picked from commit 72dd81343e73baf838bc58d413413f0d57018f15)
    (cherry picked from commit 9f5d2a6a65edb2a5862c7f07b29ac700cec61f84)

Thierry Carrez (ttx)
Changed in nova:
milestone: juno-2 → 2014.2