infinite recursion when deleting an instance with no network interfaces

Bug #1251792 reported by Peter Feiner
Affects                   Status        Importance  Assigned to
OpenStack Compute (nova)  Fix Released  High        Peter Feiner
Havana                    Fix Released  High        Ihar Hrachyshka

Bug Description

In some situations when an instance has "no network information" (a phrase that I'm using loosely), deleting the instance results in infinite recursion. The stack looks like this:

2013-11-15 18:50:28.995 DEBUG nova.network.neutronv2.api [req-28f48294-0877-4f09-bcc1-7595dbd4c15a demo demo] File "/usr/lib/python2.7/dist-packages/eventlet/greenpool.py", line 80, in _spawn_n_impl
    func(*args, **kwargs)
  File "/opt/stack/nova/nova/openstack/common/rpc/amqp.py", line 461, in _process_data
    **args)
  File "/opt/stack/nova/nova/openstack/common/rpc/dispatcher.py", line 172, in dispatch
    result = getattr(proxyobj, method)(ctxt, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 354, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/exception.py", line 73, in wrapped
    return f(self, context, *args, **kw)
  File "/opt/stack/nova/nova/compute/manager.py", line 230, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 295, in decorated_function
    function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 259, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 1984, in terminate_instance
    do_terminate_instance(instance, bdms)
  File "/opt/stack/nova/nova/openstack/common/lockutils.py", line 248, in inner
    return f(*args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 1976, in do_terminate_instance
    reservations=reservations)
  File "/opt/stack/nova/nova/hooks.py", line 105, in inner
    rv = f(*args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 1919, in _delete_instance
    self._shutdown_instance(context, db_inst, bdms)
  File "/opt/stack/nova/nova/compute/manager.py", line 1829, in _shutdown_instance
    network_info = self._get_instance_nw_info(context, instance)
  File "/opt/stack/nova/nova/compute/manager.py", line 868, in _get_instance_nw_info
    instance)
  File "/opt/stack/nova/nova/network/neutronv2/api.py", line 449, in get_instance_nw_info
    result = self._get_instance_nw_info(context, instance, networks)
  File "/opt/stack/nova/nova/network/api.py", line 64, in wrapper
    nw_info=res)

RECURSION STARTS HERE

  File "/opt/stack/nova/nova/network/api.py", line 77, in update_instance_cache_with_nw_info
    nw_info = api._get_instance_nw_info(context, instance)
  File "/opt/stack/nova/nova/network/api.py", line 64, in wrapper
    nw_info=res)

... REPEATS AD NAUSEAM ...

  File "/opt/stack/nova/nova/network/api.py", line 77, in update_instance_cache_with_nw_info
    nw_info = api._get_instance_nw_info(context, instance)
  File "/opt/stack/nova/nova/network/api.py", line 64, in wrapper
    nw_info=res)
  File "/opt/stack/nova/nova/network/api.py", line 77, in update_instance_cache_with_nw_info
    nw_info = api._get_instance_nw_info(context, instance)
  File "/opt/stack/nova/nova/network/api.py", line 49, in wrapper
    res = f(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/network/neutronv2/api.py", line 459, in _get_instance_nw_info
    LOG.debug('%s', ''.join(traceback.format_stack()))

Here's a step-by-step explanation of how the infinite recursion arises:

1. somebody calls nova.network.neutronv2.api.API.get_instance_nw_info

2. in the above call, the network info is successfully retrieved as result = self._get_instance_nw_info(context, instance, networks)

3. however, since the instance has "no network information", result is the empty list (i.e., [])

4. the result is put in the cache by calling nova.network.api.update_instance_cache_with_nw_info

5. update_instance_cache_with_nw_info is supposed to add the result to the cache, but due to a bug in update_instance_cache_with_nw_info, it recursively calls api.get_instance_nw_info, which brings us back to step 1. The bug is the check before the recursive call:

    if not nw_info:
        nw_info = api._get_instance_nw_info(context, instance)

which erroneously treats [] (a valid result meaning "no interfaces") the same as None (meaning "no cached value"). Hence the check should be "if nw_info is None:"
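In isolation, the difference between the two checks looks like this (a minimal sketch, not the nova code):

```python
# The falsy check conflates two different states:
#   None -> "nothing cached yet, go fetch"   (refetching is correct)
#   []   -> "instance has no interfaces"     (refetching loops forever)
def needs_refresh_buggy(nw_info):
    return not nw_info        # True for both None and []

def needs_refresh_fixed(nw_info):
    return nw_info is None    # True only for None

assert needs_refresh_buggy(None) is True
assert needs_refresh_buggy([]) is True    # the bug: [] also triggers a refetch
assert needs_refresh_fixed(None) is True
assert needs_refresh_fixed([]) is False   # the fix: [] is a valid, final answer
```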

I should clarify that the instance _did_ have network information at some point (i.e., I booted it normally with a NIC), however, some time after I issued a "nova delete" request, the network information was gone (i.e., in nova list, the networks column was empty for the instance while it was in the deleting task state).

I came across this problem when doing performance testing with the latest openstack code (i.e., the master branches as of this morning of all of the github.com/openstack/* projects).

There's an outstanding max recursion issue (https://bugs.launchpad.net/nova/+bug/1251778) that could very well be caused by this bug in update_instance_cache_with_nw_info. Note that in the bug report you don't see repeated calls to update_instance_cache_with_nw_info because LOG.exception only shows the stack trace from the try: frame to the exceptional frame, whereas I used traceback.format_stack() which prints everything.

Although the fix is simple enough, I'm not going to fire off a review immediately because I haven't put much thought into how to test it.

Tags: api network
Matt Riedemann (mriedem)
tags: added: api network
Changed in nova:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Matt Riedemann (mriedem) wrote :

Peter, thanks for the analysis. This seems nasty enough that I don't think there would be a ton of nitpicking over unit tests. Seems like the [] vs None distinction could just be exercised in existing tests to verify the fix.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Actually this looks related: https://review.openstack.org/#/c/56504/

That merged this morning; did you have that patch in your setup when testing? If so and there are still issues, your bug is still valid.

Revision history for this message
Peter Feiner (pete5) wrote :

Matt, I didn't have that change in my tree. However, it wouldn't have made a difference since the patch doesn't actually eliminate the recursion. There were two calls to update_instance_cache_with_nw_info: the one in get_instance_nw_info that the patch removed and the one at the end of the @refresh_cache wrapper on _get_instance_nw_info.

I'll submit a patch shortly to deal with nw_info=[].
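The recursion through the @refresh_cache wrapper can be reproduced in miniature (a simplified, hypothetical sketch with made-up names, not the actual nova code):

```python
import functools

def update_cache(api, nw_info):
    # Buggy falsy check: [] triggers a refetch just like None would.
    if not nw_info:
        nw_info = api.get_nw_info()   # re-enters the decorated method below
    api.cache = nw_info

def refresh_cache(f):
    @functools.wraps(f)
    def wrapper(api):
        res = f(api)
        update_cache(api, res)        # recursion starts here when res == []
        return res
    return wrapper

class API:
    cache = None

    @refresh_cache
    def get_nw_info(self):
        return []                     # instance has no interfaces

try:
    API().get_nw_info()
    recursed = False
except RecursionError:
    recursed = True                   # the wrapper and update_cache ping-pong forever
```

With `if nw_info is None:` in update_cache, the [] result is cached and the call returns normally.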

Revision history for this message
Peter Feiner (pete5) wrote :

It turns out that the unit tests are pretty broken WRT the instance info cache.

_StringException: Traceback (most recent call last):
  File "/opt/stack/nova/nova/tests/network/test_neutronv2.py", line 477, in test_get_instance_nw_info_1
    self._get_instance_nw_info(1)
  File "/opt/stack/nova/nova/tests/network/test_neutronv2.py", line 457, in _get_instance_nw_info
    nw_inf = api.get_instance_nw_info(self.context, self.instance)
  File "/opt/stack/nova/nova/network/neutronv2/api.py", line 451, in get_instance_nw_info
    update_cells=False)
  File "/opt/stack/nova/nova/network/api.py", line 84, in update_instance_cache_with_nw_info
    ic.save(update_cells=update_cells)
  File "/opt/stack/nova/nova/objects/base.py", line 152, in wrapper
    return fn(self, ctxt, *args, **kwargs)
  File "/opt/stack/nova/nova/objects/instance_info_cache.py", line 91, in save
    {'network_info': nw_info_json})
  File "/usr/lib/python2.7/dist-packages/mox.py", line 765, in __call__
    return mock_method(*params, **named_params)
  File "/usr/lib/python2.7/dist-packages/mox.py", line 985, in __call__
    expected_method = self._VerifyMethodCall()
  File "/usr/lib/python2.7/dist-packages/mox.py", line 1032, in _VerifyMethodCall
    expected = self._PopNextMethod()
  File "/usr/lib/python2.7/dist-packages/mox.py", line 1018, in _PopNextMethod
    raise UnexpectedMethodCallError(self, None)
UnexpectedMethodCallError: Unexpected method call instance_info_cache_update.__call__(<nova.context.RequestContext object at 0x4d0b890>, 'e5591e4d-7f02-4c7f-839c-51e001f12e2d', {'network_info': '[{"ovs_interfaceid": null, "network": {"bridge": null, "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [{"meta": {}, "version": 4, "type": "floating", "address": "172.0.1.2"}], "address": "10.0.1.2"}], "version": 4, "meta": {}, "dns": [{"meta": {}, "version": 4, "type": "dns", "address": "8.8.1.1"}, {"meta": {}, "version": 4, "type": "dns", "address": "8.8.1.2"}], "routes": [], "cidr": "10.0.1.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.1.1"}}], "meta": {"injected": false, "tenant_id": "my_tenantid"}, "id": "my_netid1", "label": "my_netname1"}, "devname": "tapmy_portid1", "qbh_params": null, "meta": {}, "address": "my_mac1", "type": null, "id": "my_portid1", "qbg_params": null}]'}) -> None

The exception is being suppressed by the except Exception: block in nova.network.api.update_instance_cache_with_nw_info.

Revision history for this message
Peter Feiner (pete5) wrote :

Armando's patch gets rid of that suppressed exception in every test _except_ for TestNeutronv2.test_deallocate_port_for_instance_1, which illustrates the infinite recursion when nw_info is []:

FAILURES
TestNeutronv2.test_deallocate_port_for_instance_1

_StringException: Traceback (most recent call last):
  File "/opt/stack/nova/nova/tests/network/test_neutronv2.py", line 938, in test_deallocate_port_for_instance_1
    self._test_deallocate_port_for_instance(1)
  File "/opt/stack/nova/nova/tests/network/test_neutronv2.py", line 931, in _test_deallocate_port_for_instance
    port_data[0]['id'])
  File "/opt/stack/nova/nova/network/neutronv2/api.py", line 435, in deallocate_port_for_instance
    return self._get_instance_nw_info(context, instance)
  File "/opt/stack/nova/nova/network/api.py", line 64, in wrapper
    nw_info=res)
  File "/opt/stack/nova/nova/network/api.py", line 78, in update_instance_cache_with_nw_info
    nw_info = api._get_instance_nw_info(context, instance)
  File "/opt/stack/nova/nova/network/api.py", line 49, in wrapper
    res = f(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/network/neutronv2/api.py", line 460, in _get_instance_nw_info
    nw_info = self._build_network_info_model(context, instance, networks)
  File "/opt/stack/nova/nova/network/neutronv2/api.py", line 981, in _build_network_info_model
    client = neutronv2.get_client(context, admin=True)
  File "/usr/lib/python2.7/dist-packages/mox.py", line 765, in __call__
    return mock_method(*params, **named_params)
  File "/usr/lib/python2.7/dist-packages/mox.py", line 985, in __call__
    expected_method = self._VerifyMethodCall()
  File "/usr/lib/python2.7/dist-packages/mox.py", line 1037, in _VerifyMethodCall
    expected, method = expected.MethodCalled(self)
  File "/usr/lib/python2.7/dist-packages/mox.py", line 1812, in MethodCalled
    next_method = mock_method._PopNextMethod();
  File "/usr/lib/python2.7/dist-packages/mox.py", line 1018, in _PopNextMethod
    raise UnexpectedMethodCallError(self, None)
UnexpectedMethodCallError: Unexpected method call get_client.__call__(<nova.context.RequestContext object at 0x5da3d90>, admin=True) -> None

Revision history for this message
Peter Feiner (pete5) wrote :

Fixed the main problem in https://review.openstack.org/#/c/57042/.

I didn't fix the overly permissive except clause, so errors in those tests could creep back in. A kludge would be to conditionally re-raise the exception in update_instance_cache_with_nw_info when running under a unit test.

Note that self.mox.VerifyAll doesn't report the mox UnexpectedMethodCallError that the except clause absorbed. This unfortunate behaviour in mox has been documented (https://code.google.com/p/pymox/issues/detail?id=16) and fixed a long time ago (https://code.google.com/p/pymox/source/detail?r=72), but there hasn't been a release of mox since 2010. So there are probably other bugs like this lurking in these tests.
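The swallowing effect is easy to demonstrate with the standard library's unittest.mock instead of mox (a minimal illustration with made-up names, not the nova code): the mock raises inside the code under test, the broad except clause absorbs it, and the test framework never learns the call failed.

```python
from unittest import mock

def update_cache(api, context, instance):
    # Mirrors the overly permissive except clause: any failure during the
    # save, including a mock's verification error, is silently discarded.
    try:
        api.save(context, instance)
    except Exception:
        pass

api = mock.Mock()
api.save.side_effect = AssertionError("unexpected call")
update_cache(api, None, None)   # completes without raising anything
```

This is why a test can pass even though the mock's expectations were violated.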

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

The patch you mentioned was a partial attempt to address this very issue, which I saw in the context of bug 1235435.

I still believe the root cause is (as Peter pointed out) the fact that for some reason the cache is empty; I am just hoping that once the infinite loop is addressed, the actual error manifests itself.

Changed in nova:
assignee: nobody → Armando Migliaccio (armando-migliaccio)
Changed in nova:
assignee: Armando Migliaccio (armando-migliaccio) → nobody
assignee: nobody → Peter Feiner (pete5)
status: Triaged → In Progress
Revision history for this message
Peter Feiner (pete5) wrote :

Regarding the root cause, there's a simple way that nw_info=[] could arise. Suppose a request is made to delete the instance. Let's call this request req-A. If req-A raises an exception after nova.compute.manager.Manager._shutdown_instance has called self._try_deallocate_network, then the instance will still exist and subsequent queries for the instance's nw_info will return []. Another such query would arise from another request req-B that is issued to delete the instance again.
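That two-request sequence can be sketched as follows (a hypothetical, heavily simplified sketch; the names follow the comment above but this is not the actual nova code):

```python
# req-A deallocates the network and then fails; the instance row survives,
# so req-B's later query for the instance's network info returns [].
class Instance:
    def __init__(self):
        self.ports = ["port-1"]

def shutdown_instance_req_a(instance):
    instance.ports = []                               # _try_deallocate_network succeeds
    raise RuntimeError("failure after deallocation")  # req-A dies here

def get_nw_info(instance):
    return list(instance.ports)                       # [] once the ports are gone

inst = Instance()
try:
    shutdown_instance_req_a(inst)
except RuntimeError:
    pass                          # delete failed, but the instance still exists

assert get_nw_info(inst) == []    # req-B now sees nw_info == [], feeding the recursion
```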

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/57069
Committed: http://github.com/openstack/nova/commit/2e520496c369a8ef646820ec068cc4416ab50eca
Submitter: Jenkins
Branch: master

commit 2e520496c369a8ef646820ec068cc4416ab50eca
Author: armando-migliaccio <email address hidden>
Date: Mon Nov 18 16:10:48 2013 -0800

    Address infinite loop in nova compute when getting network info

    Move the refresh_cache decorator to get_instance_nw_info.
    In fact, _get_instance_nw_info is called by the decorator
    itself, potentially causing an infinite loop in case nw_info
    is not of the expected value. This is also inline with
    the method's purpose, as stated by the docstring. At the
    same time, ensure that delete_port_for_instance and
    allocate_for_instance continue to refresh the cache by
    using the decorated version.

    Also, add a couple of debug traces that are friendly to
    the developer.

    Partial-bug: 1235435
    Related-bug: 1251792

    Change-Id: I06f8634ea241d05ac8fbcc290adf0cb23829f3e4

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/havana)

Related fix proposed to branch: stable/havana
Review: https://review.openstack.org/58471

tags: added: havana-backport-potential
tags: removed: havana-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/havana)

Reviewed: https://review.openstack.org/58471
Committed: http://github.com/openstack/nova/commit/ef9aec1cb6ec0d5fae699f9bb1f0d042530de5de
Submitter: Jenkins
Branch: stable/havana

commit ef9aec1cb6ec0d5fae699f9bb1f0d042530de5de
Author: armando-migliaccio <email address hidden>
Date: Mon Nov 18 16:10:48 2013 -0800

    Address infinite loop in nova compute when getting network info

    Move the refresh_cache decorator to get_instance_nw_info.
    In fact, _get_instance_nw_info is called by the decorator
    itself, potentially causing an infinite loop in case nw_info
    is not of the expected value. This is also inline with
    the method's purpose, as stated by the docstring. At the
    same time, ensure that delete_port_for_instance and
    allocate_for_instance continue to refresh the cache by
    using the decorated version.

    Also, add a couple of debug traces that are friendly to
    the developer.

    Partial-bug: 1235435
    Related-bug: 1251792

    Change-Id: I06f8634ea241d05ac8fbcc290adf0cb23829f3e4
    (cherry picked from commit 2e520496c369a8ef646820ec068cc4416ab50eca)

tags: added: in-stable-havana
Changed in nova:
status: In Progress → Fix Committed
milestone: none → icehouse-1
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Alan Pevec (apevec)
tags: removed: in-stable-havana
Revision history for this message
Alan Pevec (apevec) wrote :

stable/havana was mistakenly marked as Fix Released: the merged patch 58471 only had Related-bug: 1251792.
https://review.openstack.org/57042 needs to be backported to fix it in Havana.

Revision history for this message
Alan Pevec (apevec) wrote :

stable/havana backport https://review.openstack.org/82448

Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-1 → 2014.1