'SubnetInUse: Unable to complete operation on subnet UUID. One or more ports have an IP allocation from this subnet.'

Bug #1235435 reported by Joe Gordon
20
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Medium
Armando Migliaccio
Havana
Medium
Yaguang Tang
neutron
Undecided
Unassigned
tempest
Undecided
Unassigned

Bug Description

Occasional tempest failure:

http://logs.openstack.org/86/49086/2/gate/gate-tempest-devstack-vm-neutron-isolated/ce14ceb/testr_results.html.gz

ft3.1: tearDownClass (tempest.scenario.test_network_basic_ops.TestNetworkBasicOps)_StringException: Traceback (most recent call last):
  File "tempest/scenario/manager.py", line 239, in tearDownClass
    thing.delete()
  File "tempest/api/network/common.py", line 71, in delete
    self.client.delete_subnet(self.id)
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 112, in with_params
    ret = self.function(instance, *args, **kwargs)
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 380, in delete_subnet
    return self.delete(self.subnet_path % (subnet))
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 1233, in delete
    headers=headers, params=params)
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 1222, in retry_request
    headers=headers, params=params)
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 1165, in do_request
    self._handle_fault_response(status_code, replybody)
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 1135, in _handle_fault_response
    exception_handler_v20(status_code, des_error_body)
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 97, in exception_handler_v20
    message=msg)
NeutronClientException: 409-{u'NeutronError': {u'message': u'Unable to complete operation on subnet 9e820b02-bfe2-47e3-b186-21c5644bc9cf. One or more ports have an IP allocation from this subnet.', u'type': u'SubnetInUse', u'detail': u''}}

logstash query:

@message:"One or more ports have an IP allocation from this subnet" AND @fields.filename:"logs/screen-q-svc.txt" and @message:" SubnetInUse: Unable to complete operation on subnet"

http://logstash.openstack.org/#eyJzZWFyY2giOiJAbWVzc2FnZTpcIk9uZSBvciBtb3JlIHBvcnRzIGhhdmUgYW4gSVAgYWxsb2NhdGlvbiBmcm9tIHRoaXMgc3VibmV0XCIgQU5EIEBmaWVsZHMuZmlsZW5hbWU6XCJsb2dzL3NjcmVlbi1xLXN2Yy50eHRcIiBhbmQgQG1lc3NhZ2U6XCIgU3VibmV0SW5Vc2U6IFVuYWJsZSB0byBjb21wbGV0ZSBvcGVyYXRpb24gb24gc3VibmV0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzODA5MTY1NDUxODcsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=

Revision history for this message
Zhikun Liu (zhikunliu) wrote : AUTO: Zhi Kun ZK Liu is on vacation

I am out of the office until 10/07/2013.

I will take vacation from 28th Sep to 7th Oct . If have any urgent, please
call 13910806810

Note: This is an automated response to your message "[Bug 1235435] [NEW]
'SubnetInUse: Unable to complete operation on subnet UUID. One or more
ports have an IP allocation from this subnet.'" sent on 10/05/2013
3:58:43.

This is the only notification you will receive while this person is away.

Revision history for this message
Anita Kuno (anteaya) wrote :

Last seen Nov. 1st, not seen since. (This is November 12th as I write.)

Let's keep an eye on this in case it returns.

Revision history for this message
Joe Gordon (jogo) wrote :
Changed in neutron:
status: New → Confirmed
assignee: nobody → Armando Migliaccio (armando-migliaccio)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

From what the logs show it looks like nova-compute goes in some sort of loop where it keeps getting network info: See nova-cpu and nova-cond logs:

http://logs.openstack.org/03/56303/2/check/check-tempest-devstack-vm-neutron-pg/23f03ba/logs/screen-n-cpu.txt.gz
http://logs.openstack.org/03/56303/2/check/check-tempest-devstack-vm-neutron-pg/23f03ba/logs/screen-n-cond.txt.gz

and search for request:

req-8c5d20d1-5191-4683-8fb8-25570e72bf97

Whether that's induced by a Neutron failure remains to be seen. I'll keep digging.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I think I found why this loops:

1) compute manager calls _shutdown_instance: this in turn calls _get_instance_nw_info which calls net_api.get_instance_nw_info
2) network_api.get_instance_nw_info calls _get_instance_nw_info that calls _build_network_info_model and shortly after update_instance_info_cache
3) update_instance_info_cache is set to network_api.update_instance_cache_with_nw_info in neutronv2/api.y
4) update_instance_cache_with_nw_info in turn calls _get_instance_nw_info again if nw_info is not of type NetworkInfo

This is where I am lost because as far as I can tell, nw_info should be of that type.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → nobody
status: Confirmed → New
Changed in nova:
importance: Undecided → Medium
assignee: nobody → Armando Migliaccio (armando-migliaccio)
status: New → In Progress
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

In the specific instance of the failure I examined Neutron complained correctly that one or more ports had an IP allocation from the subnet; this was because the user port associated to the server hadn't been deleted. Nova-compute was stuck in a loop trying to retrieve nw_info. I haven't been able to root-cause the issue, but I tried to address the infinite loop by removing what sounded to me an erroneous duplication of a refresh_cache action.

This in turn might cause the failure mode to change and hence reveal more about the underlying problem.

Changed in neutron:
status: New → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/56504
Committed: http://github.com/openstack/nova/commit/651fac3d5d250d42e640c3ac113084bf0d2fa3b4
Submitter: Jenkins
Branch: master

commit 651fac3d5d250d42e640c3ac113084bf0d2fa3b4
Author: armando-migliaccio <email address hidden>
Date: Thu Nov 14 18:59:20 2013 -0800

    Remove reduntant call to update_instance_info_cache

    get_instance_nw_info calls _get_instance_nw_info,
    which is decorated with @refresh_cache. This in
    turn calls update_instance_info_cache, again. This
    is both expensive and more importantly dangerous because
    the method update_instance_info_cache may call
    _get_instance_nw_info itself, thus causing an infinite
    loop.

    Change-Id: Id8f0241d1c6609cf4e064ee6855228ae7de250e2
    Partial-bug: 1235435

Revision history for this message
Matt Riedemann (mriedem) wrote :

Commit 1957339df302e2da75e0dbe78b5d566194ab2c08 added the decorator and it's not in stable/havana so we don't need to backport this.

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
Anita Kuno (anteaya) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/57069

Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/57069
Committed: http://github.com/openstack/nova/commit/2e520496c369a8ef646820ec068cc4416ab50eca
Submitter: Jenkins
Branch: master

commit 2e520496c369a8ef646820ec068cc4416ab50eca
Author: armando-migliaccio <email address hidden>
Date: Mon Nov 18 16:10:48 2013 -0800

    Address infinite loop in nova compute when getting network info

    Move the refresh_cache decorator to get_instance_nw_info.
    In fact, _get_instance_nw_info is called by the decorator
    itself, potentially causing an infinite loop in case nw_info
    is not of the expected value. This is also inline with
    the method's purpose, as stated by the docstring. At the
    same time, ensure that delete_port_for_instance and
    allocate_for_instance continue to refresh the cache by
    using the decorated version.

    Also, add a couple of debug traces that are friendly to
    the developer.

    Partial-bug: 1235435
    Related-bug: 1251792

    Change-Id: I06f8634ea241d05ac8fbcc290adf0cb23829f3e4

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/58471

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/havana)

Reviewed: https://review.openstack.org/58471
Committed: http://github.com/openstack/nova/commit/ef9aec1cb6ec0d5fae699f9bb1f0d042530de5de
Submitter: Jenkins
Branch: stable/havana

commit ef9aec1cb6ec0d5fae699f9bb1f0d042530de5de
Author: armando-migliaccio <email address hidden>
Date: Mon Nov 18 16:10:48 2013 -0800

    Address infinite loop in nova compute when getting network info

    Move the refresh_cache decorator to get_instance_nw_info.
    In fact, _get_instance_nw_info is called by the decorator
    itself, potentially causing an infinite loop in case nw_info
    is not of the expected value. This is also inline with
    the method's purpose, as stated by the docstring. At the
    same time, ensure that delete_port_for_instance and
    allocate_for_instance continue to refresh the cache by
    using the decorated version.

    Also, add a couple of debug traces that are friendly to
    the developer.

    Partial-bug: 1235435
    Related-bug: 1251792

    Change-Id: I06f8634ea241d05ac8fbcc290adf0cb23829f3e4
    (cherry picked from commit 2e520496c369a8ef646820ec068cc4416ab50eca)

tags: added: in-stable-havana
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This seems to have gone quiet since the fixes merged on the 26th. Marking committed for now.

Changed in neutron:
status: Incomplete → Invalid
Changed in nova:
status: In Progress → Fix Committed
Changed in tempest:
status: New → Invalid
Revision history for this message
Anita Kuno (anteaya) wrote :

There was one failure on November 29th, 2013 at 18:00:00.000 UTC which is an oddly exact timestamp for the failure.

Changed in nova:
milestone: none → icehouse-1
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Alan Pevec (apevec)
tags: removed: in-stable-havana
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/73421

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/havana)

Reviewed: https://review.openstack.org/73421
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b047eaa9d96dc7c8f7a2f4311277090ce9d21d89
Submitter: Jenkins
Branch: stable/havana

commit b047eaa9d96dc7c8f7a2f4311277090ce9d21d89
Author: armando-migliaccio <email address hidden>
Date: Thu Nov 14 18:59:20 2013 -0800

    Remove reduntant call to update_instance_info_cache

    get_instance_nw_info calls _get_instance_nw_info,
    which is decorated with @refresh_cache. This in
    turn calls update_instance_info_cache, again. This
    is both expensive and more importantly dangerous because
    the method update_instance_info_cache may call
    _get_instance_nw_info itself, thus causing an infinite
    loop.

    Change-Id: Id8f0241d1c6609cf4e064ee6855228ae7de250e2
    Partial-bug: 1235435
    (cherry picked from commit 651fac3d5d250d42e640c3ac113084bf0d2fa3b4)

tags: added: in-stable-havana
Alan Pevec (apevec)
tags: removed: in-stable-havana
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-1 → 2014.1
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers