'SubnetInUse: Unable to complete operation on subnet UUID. One or more ports have an IP allocation from this subnet.'

Bug #1235435 reported by Joe Gordon
This bug affects 4 people
Affects                    Status        Importance  Assigned to          Milestone
OpenStack Compute (nova)   Fix Released  Medium      Armando Migliaccio
  Havana                   Fix Released  Medium      Yaguang Tang
neutron                    Invalid       Undecided   Unassigned
tempest                    Invalid       Undecided   Unassigned

Bug Description

Occasional tempest failure:

http://logs.openstack.org/86/49086/2/gate/gate-tempest-devstack-vm-neutron-isolated/ce14ceb/testr_results.html.gz

ft3.1: tearDownClass (tempest.scenario.test_network_basic_ops.TestNetworkBasicOps)_StringException: Traceback (most recent call last):
  File "tempest/scenario/manager.py", line 239, in tearDownClass
    thing.delete()
  File "tempest/api/network/common.py", line 71, in delete
    self.client.delete_subnet(self.id)
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 112, in with_params
    ret = self.function(instance, *args, **kwargs)
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 380, in delete_subnet
    return self.delete(self.subnet_path % (subnet))
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 1233, in delete
    headers=headers, params=params)
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 1222, in retry_request
    headers=headers, params=params)
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 1165, in do_request
    self._handle_fault_response(status_code, replybody)
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 1135, in _handle_fault_response
    exception_handler_v20(status_code, des_error_body)
  File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 97, in exception_handler_v20
    message=msg)
NeutronClientException: 409-{u'NeutronError': {u'message': u'Unable to complete operation on subnet 9e820b02-bfe2-47e3-b186-21c5644bc9cf. One or more ports have an IP allocation from this subnet.', u'type': u'SubnetInUse', u'detail': u''}}

logstash query:

@message:"One or more ports have an IP allocation from this subnet" AND @fields.filename:"logs/screen-q-svc.txt" and @message:" SubnetInUse: Unable to complete operation on subnet"

http://logstash.openstack.org/#eyJzZWFyY2giOiJAbWVzc2FnZTpcIk9uZSBvciBtb3JlIHBvcnRzIGhhdmUgYW4gSVAgYWxsb2NhdGlvbiBmcm9tIHRoaXMgc3VibmV0XCIgQU5EIEBmaWVsZHMuZmlsZW5hbWU6XCJsb2dzL3NjcmVlbi1xLXN2Yy50eHRcIiBhbmQgQG1lc3NhZ2U6XCIgU3VibmV0SW5Vc2U6IFVuYWJsZSB0byBjb21wbGV0ZSBvcGVyYXRpb24gb24gc3VibmV0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzODA5MTY1NDUxODcsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=

Revision history for this message
Zhikun Liu (zhikunliu) wrote : AUTO: Zhi Kun ZK Liu is on vacation

I am out of the office until 10/07/2013.

I will take vacation from 28th Sep to 7th Oct . If have any urgent, please
call 13910806810

Note: This is an automated response to your message "[Bug 1235435] [NEW]
'SubnetInUse: Unable to complete operation on subnet UUID. One or more
ports have an IP allocation from this subnet.'" sent on 10/05/2013
3:58:43.

This is the only notification you will receive while this person is away.

Revision history for this message
Anita Kuno (anteaya) wrote :

Last seen Nov. 1st, not seen since. (This is November 12th as I write.)

Let's keep an eye on this in case it returns.

Revision history for this message
Joe Gordon (jogo) wrote :
Changed in neutron:
status: New → Confirmed
assignee: nobody → Armando Migliaccio (armando-migliaccio)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

From what the logs show, it looks like nova-compute goes into some sort of loop where it keeps re-fetching network info. See the nova-cpu and nova-cond logs:

http://logs.openstack.org/03/56303/2/check/check-tempest-devstack-vm-neutron-pg/23f03ba/logs/screen-n-cpu.txt.gz
http://logs.openstack.org/03/56303/2/check/check-tempest-devstack-vm-neutron-pg/23f03ba/logs/screen-n-cond.txt.gz

and search for request:

req-8c5d20d1-5191-4683-8fb8-25570e72bf97

Whether that's induced by a Neutron failure remains to be seen. I'll keep digging.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I think I found why this loops:

1) The compute manager calls _shutdown_instance; this in turn calls _get_instance_nw_info, which calls net_api.get_instance_nw_info.
2) network_api.get_instance_nw_info calls _get_instance_nw_info, which calls _build_network_info_model and, shortly after, update_instance_info_cache.
3) update_instance_info_cache is set to network_api.update_instance_cache_with_nw_info in neutronv2/api.py.
4) update_instance_cache_with_nw_info in turn calls _get_instance_nw_info again if nw_info is not of type NetworkInfo.

This is where I am lost because as far as I can tell, nw_info should be of that type.
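The cycle in steps 1)-4) can be sketched as follows. This is an illustration only, not nova's actual code: the names mirror nova's neutronv2 API, but the classes, the call counter, and the loop guard are invented for the example.

```python
# Hypothetical reconstruction of the re-entry cycle described above.
# Names mirror nova's neutronv2 API; the implementation is invented.
import functools

class NetworkInfo(list):
    """Stand-in for the expected nw_info type."""

def refresh_cache(fn):
    """Decorator that refreshes the instance's nw_info cache after fn runs."""
    @functools.wraps(fn)
    def wrapper(self, instance, *args, **kwargs):
        nw_info = fn(self, instance, *args, **kwargs)
        # Step 4: the cache helper re-fetches when nw_info has the wrong
        # type -- which re-enters the decorated method.
        self.update_instance_cache_with_nw_info(instance, nw_info)
        return nw_info
    return wrapper

class NeutronAPI:
    def __init__(self):
        self.calls = 0

    @refresh_cache
    def _get_instance_nw_info(self, instance):
        self.calls += 1
        if self.calls > 5:                 # guard so the sketch terminates
            raise RuntimeError("infinite loop detected")
        return [("port", "10.0.0.2")]      # a plain list, NOT NetworkInfo

    def update_instance_cache_with_nw_info(self, instance, nw_info=None):
        if not isinstance(nw_info, NetworkInfo):
            # Re-entry point of the cycle: calls the decorated method again.
            nw_info = self._get_instance_nw_info(instance)
```

Running `NeutronAPI()._get_instance_nw_info("vm-1")` hits the guard immediately, because the returned plain list never satisfies the isinstance check.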

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → nobody
status: Confirmed → New
Changed in nova:
importance: Undecided → Medium
assignee: nobody → Armando Migliaccio (armando-migliaccio)
status: New → In Progress
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

In the specific instance of the failure I examined, Neutron correctly complained that one or more ports had an IP allocation from the subnet; this was because the user port associated with the server hadn't been deleted. Nova-compute was stuck in a loop trying to retrieve nw_info. I haven't been able to root-cause the issue, but I tried to address the infinite loop by removing what looked to me like an erroneous duplication of a refresh_cache action.

This in turn might cause the failure mode to change and hence reveal more about the underlying problem.

Changed in neutron:
status: New → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/56504
Committed: http://github.com/openstack/nova/commit/651fac3d5d250d42e640c3ac113084bf0d2fa3b4
Submitter: Jenkins
Branch: master

commit 651fac3d5d250d42e640c3ac113084bf0d2fa3b4
Author: armando-migliaccio <email address hidden>
Date: Thu Nov 14 18:59:20 2013 -0800

    Remove reduntant call to update_instance_info_cache

    get_instance_nw_info calls _get_instance_nw_info,
    which is decorated with @refresh_cache. This in
    turn calls update_instance_info_cache, again. This
    is both expensive and more importantly dangerous because
    the method update_instance_info_cache may call
    _get_instance_nw_info itself, thus causing an infinite
    loop.

    Change-Id: Id8f0241d1c6609cf4e064ee6855228ae7de250e2
    Partial-bug: 1235435
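The shape of this first fix can be sketched as follows. This is an illustration, not the actual nova diff: the `refreshes` counter stands in for update_instance_info_cache, and both classes are invented for the example.

```python
# Illustrative sketch of the redundant-refresh removal (not nova's code).
import functools

def refresh_cache(fn):
    """Decorator that refreshes the nw_info cache after fn runs."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        nw_info = fn(self, *args, **kwargs)
        self.refreshes += 1    # stand-in for update_instance_info_cache
        return nw_info
    return wrapper

class Before:
    def __init__(self):
        self.refreshes = 0

    @refresh_cache
    def _get_instance_nw_info(self, instance):
        return ["nw_info"]

    def get_instance_nw_info(self, instance):
        result = self._get_instance_nw_info(instance)  # decorator refreshes
        self.refreshes += 1    # the redundant, now-removed second refresh
        return result

class After(Before):
    def get_instance_nw_info(self, instance):
        # After the fix, the decorated call alone keeps the cache current.
        return self._get_instance_nw_info(instance)
```

Before the change, one public call refreshed the cache twice; after it, once.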

Revision history for this message
Matt Riedemann (mriedem) wrote :

Commit 1957339df302e2da75e0dbe78b5d566194ab2c08 added the decorator, and it's not in stable/havana, so we don't need to backport this.

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/57069

Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/57069
Committed: http://github.com/openstack/nova/commit/2e520496c369a8ef646820ec068cc4416ab50eca
Submitter: Jenkins
Branch: master

commit 2e520496c369a8ef646820ec068cc4416ab50eca
Author: armando-migliaccio <email address hidden>
Date: Mon Nov 18 16:10:48 2013 -0800

    Address infinite loop in nova compute when getting network info

    Move the refresh_cache decorator to get_instance_nw_info.
    In fact, _get_instance_nw_info is called by the decorator
    itself, potentially causing an infinite loop in case nw_info
    is not of the expected value. This is also inline with
    the method's purpose, as stated by the docstring. At the
    same time, ensure that delete_port_for_instance and
    allocate_for_instance continue to refresh the cache by
    using the decorated version.

    Also, add a couple of debug traces that are friendly to
    the developer.

    Partial-bug: 1235435
    Related-bug: 1251792

    Change-Id: I06f8634ea241d05ac8fbcc290adf0cb23829f3e4
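The decorator move described in this commit can be sketched as follows. Again an illustration, not nova's code: the `fetches` counter and the simplified cache method are invented, but the structure matches the commit message, with @refresh_cache now on the public method and the private helper left undecorated.

```python
# Illustrative sketch of moving @refresh_cache to the public method
# (not the actual nova code).
import functools

def refresh_cache(fn):
    """Decorator that refreshes the nw_info cache after fn runs."""
    @functools.wraps(fn)
    def wrapper(self, instance, *args, **kwargs):
        nw_info = fn(self, instance, *args, **kwargs)
        self.update_instance_cache_with_nw_info(instance, nw_info)
        return nw_info
    return wrapper

class API:
    def __init__(self):
        self.fetches = 0

    @refresh_cache
    def get_instance_nw_info(self, instance):
        """Public entry point: refreshes the cache exactly once."""
        return self._get_instance_nw_info(instance)

    def _get_instance_nw_info(self, instance):
        # Undecorated: the cache helper can call this without re-entering
        # the decorator, so no loop is possible.
        self.fetches += 1
        return ["nw_info"]

    def update_instance_cache_with_nw_info(self, instance, nw_info=None):
        if nw_info is None:
            nw_info = self._get_instance_nw_info(instance)  # safe: no decorator
        # ... persist nw_info in the cache ...
```

Callers that still need the refresh side effect (per the commit message, delete_port_for_instance and allocate_for_instance) use the decorated public method; internal cache maintenance uses the plain helper.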

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/58471

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/havana)

Reviewed: https://review.openstack.org/58471
Committed: http://github.com/openstack/nova/commit/ef9aec1cb6ec0d5fae699f9bb1f0d042530de5de
Submitter: Jenkins
Branch: stable/havana

commit ef9aec1cb6ec0d5fae699f9bb1f0d042530de5de
Author: armando-migliaccio <email address hidden>
Date: Mon Nov 18 16:10:48 2013 -0800

    Address infinite loop in nova compute when getting network info

    Move the refresh_cache decorator to get_instance_nw_info.
    In fact, _get_instance_nw_info is called by the decorator
    itself, potentially causing an infinite loop in case nw_info
    is not of the expected value. This is also inline with
    the method's purpose, as stated by the docstring. At the
    same time, ensure that delete_port_for_instance and
    allocate_for_instance continue to refresh the cache by
    using the decorated version.

    Also, add a couple of debug traces that are friendly to
    the developer.

    Partial-bug: 1235435
    Related-bug: 1251792

    Change-Id: I06f8634ea241d05ac8fbcc290adf0cb23829f3e4
    (cherry picked from commit 2e520496c369a8ef646820ec068cc4416ab50eca)

tags: added: in-stable-havana
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This seems to have gone quiet since the fixes merged on the 26th. Marking committed for now.

Changed in neutron:
status: Incomplete → Invalid
Changed in nova:
status: In Progress → Fix Committed
Changed in tempest:
status: New → Invalid
Revision history for this message
Anita Kuno (anteaya) wrote :

There was one failure on November 29th, 2013 at 18:00:00.000 UTC which is an oddly exact timestamp for the failure.

Changed in nova:
milestone: none → icehouse-1
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Alan Pevec (apevec)
tags: removed: in-stable-havana
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/73421

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/havana)

Reviewed: https://review.openstack.org/73421
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b047eaa9d96dc7c8f7a2f4311277090ce9d21d89
Submitter: Jenkins
Branch: stable/havana

commit b047eaa9d96dc7c8f7a2f4311277090ce9d21d89
Author: armando-migliaccio <email address hidden>
Date: Thu Nov 14 18:59:20 2013 -0800

    Remove reduntant call to update_instance_info_cache

    get_instance_nw_info calls _get_instance_nw_info,
    which is decorated with @refresh_cache. This in
    turn calls update_instance_info_cache, again. This
    is both expensive and more importantly dangerous because
    the method update_instance_info_cache may call
    _get_instance_nw_info itself, thus causing an infinite
    loop.

    Change-Id: Id8f0241d1c6609cf4e064ee6855228ae7de250e2
    Partial-bug: 1235435
    (cherry picked from commit 651fac3d5d250d42e640c3ac113084bf0d2fa3b4)

tags: added: in-stable-havana
Alan Pevec (apevec)
tags: removed: in-stable-havana
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-1 → 2014.1