Race to delete shared subnet in Tempest neutron full jobs

Bug #1357055 reported by Matt Riedemann on 2014-08-14
60
This bug affects 11 people
Affects Status Importance Assigned to Milestone
neutron
High
Eugene Nikanorov
tempest
Undecided
Salvatore Orlando

Bug Description

This seems to show up in several different tests, basically anything using neutron. I noticed it here:

http://logs.openstack.org/89/112889/1/gate/check-tempest-dsvm-neutron-full/21fcf50/console.html#_2014-08-14_17_03_10_330

That's on a stable/icehouse change, but logstash shows this on master mostly.

I see this in the neutron server logs:

http://logs.openstack.org/89/112889/1/gate/check-tempest-dsvm-neutron-full/21fcf50/logs/screen-q-svc.txt.gz#_2014-08-14_16_45_02_101

This query shows 82 hits in 10 days:

message:"delete failed \(client error\)\: Unable to complete operation on subnet" AND message:"One or more ports have an IP allocation from this subnet" AND tags:"screen-q-svc.txt"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiZGVsZXRlIGZhaWxlZCBcXChjbGllbnQgZXJyb3JcXClcXDogVW5hYmxlIHRvIGNvbXBsZXRlIG9wZXJhdGlvbiBvbiBzdWJuZXRcIiBBTkQgbWVzc2FnZTpcIk9uZSBvciBtb3JlIHBvcnRzIGhhdmUgYW4gSVAgYWxsb2NhdGlvbiBmcm9tIHRoaXMgc3VibmV0XCIgQU5EIHRhZ3M6XCJzY3JlZW4tcS1zdmMudHh0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6ImN1c3RvbSIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJmcm9tIjoiMjAxNC0wNy0zMVQxOTo0Mzo0NSswMDowMCIsInRvIjoiMjAxNC0wOC0xNFQxOTo0Mzo0NSswMDowMCIsInVzZXJfaW50ZXJ2YWwiOiIwIn0sInN0YW1wIjoxNDA4MDQ1NDY1OTU2fQ==

Logstash doesn't show this in the gate queue but it does show up in the uncategorized bugs list which is in the gate queue.

Matt Riedemann (mriedem) wrote :

e-r query here: https://review.openstack.org/#/c/114359/

If this is mostly just neutron full jobs, this might make sense, i.e. those run with tenant isolation so maybe there are multiple tests running at the same time allocating ports on the same subnet and then when one test completes and tries to tear down it wants to delete the subnet which is still got one or more ports on it, so we fail.

Seems like we could just do a simple check in the nova neutronv2 API before trying to delete the subnet to see if there are any ports still assigned, or just handle the 409 and move on.

tags: added: network neutron
Sean Dague (sdague) on 2014-09-09
Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Joe Gordon (jogo) wrote :

11 hits in 24 hours marking as critical

Changed in nova:
importance: Low → Critical
milestone: none → juno-rc1
Joe Gordon (jogo) on 2014-09-11
Changed in neutron:
importance: Undecided → Critical

This looks like duplicate for bug #1283599

Neutron barfs at the fact that a subnet or network cannot be deleted because there are non-service ports in use on the logical segment. This failure mode may be the manifestation of multiple errors. In the case of bug 1283599, my triaging revealed that the error lied in the Nova caching logic.

I'll drill down into the recent failures, to see if I can provide further input

Joe Gordon (jogo) on 2014-09-18
Changed in neutron:
milestone: none → juno-rc1
Alex Xu (xuhj) wrote :
Download full text (4.3 KiB)

At compute manager, Instance 6ce21bd9-e6a6-41a6-81ec-72ce368cb38e is dispeared when building:

2014-09-20 16:56:09.439 DEBUG nova.compute.manager [req-d9ef4ec9-3904-4c19-8e38-f06c76a5f1ee InstanceRunTest-1019975326 InstanceRunTest-564222607] [instance: 6ce21bd9-e6a6-41a6-81ec-72ce368cb38e] Instance disappeared during build. do_build_and_run_instance /opt/stack/new/nova/nova/compute/manager.py:2033

Check the log in nova api, the instance is deleted at nova-api side, it because it deleted before compute manager assign the host to the intsance.

2014-09-20 16:56:08.963 INFO nova.ec2.wsgi.server [req-d9ef4ec9-3904-4c19-8e38-f06c76a5f1ee InstanceRunTest-1019975326 InstanceRunTest-564222607] 127.0.0.1 "POST /services/Cloud/ HTTP/1.1" status: 200 len: 1207 time: 2.3221920
2014-09-20 16:56:09.022 DEBUG nova.api.ec2 [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] action: TerminateInstances __call__ /opt/stack/new/nova/nova/api/ec2/__init__.py:325
2014-09-20 16:56:09.023 DEBUG nova.api.ec2 [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] arg: InstanceId.1 val: i-000000bd __call__ /opt/stack/new/nova/nova/api/ec2/__init__.py:328
2014-09-20 16:56:09.213 DEBUG nova.api.ec2.cloud [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] Going to start terminating instances terminate_instances /opt/stack/new/nova/nova/api/ec2/cloud.py:1462
2014-09-20 16:56:09.214 DEBUG nova.compute.api [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] [instance: 6ce21bd9-e6a6-41a6-81ec-72ce368cb38e] Going to try to terminate instance delete /opt/stack/new/nova/nova/compute/api.py:1791
2014-09-20 16:56:09.304 DEBUG nova.quota [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] Created reservations ['fdae1ad5-be89-4dd5-b54e-29f4d24c19dc', '86b41f79-7b8a-446d-b545-77719c3dd891', 'aef891e1-a066-4e2a-9978-27967759b0c2'] reserve /opt/stack/new/nova/nova/quota.py:1315
2014-09-20 16:56:09.393 DEBUG nova.quota [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] Committed reservations [u'fdae1ad5-be89-4dd5-b54e-29f4d24c19dc', u'86b41f79-7b8a-446d-b545-77719c3dd891', u'aef891e1-a066-4e2a-9978-27967759b0c2'] commit /opt/stack/new/nova/nova/quota.py:1340
2014-09-20 16:56:09.432 DEBUG nova.api.ec2.apirequest [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] <TerminateInstancesResponse xmlns="http://ec2.amazonaws.com/doc/2014-05-01/">

So compute manager try to cleanup the network: https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2044

But there is no port found by the cleanup function:
2014-09-20 16:56:09.459 DEBUG neutronclient.client [req-d9ef4ec9-3904-4c19-8e38-f06c76a5f1ee InstanceRunTest-1019975326 InstanceRunTest-564222607] RESP:200 CaseInsensitiveDict({'date': 'Sat, 20 Sep 2014 16:56:09 GMT', 'content-length': '13', 'content-type': 'application/json; charset=UTF-8', 'x-openstack-request-id': 'req-e677c6b1-9316-4680-abe5-c3896bd71ee9'}) {"ports": []...

Read more...

Alex Xu (xuhj) wrote :

Patch for based on this analyze: https://review.openstack.org/#/c/122986/

Alex Xu (xuhj) wrote :
Download full text (3.2 KiB)

Except the case fixed by https://review.openstack.org/#/c/122986/, there also a race in tempest test.

The analyze is based on:
http://logs.openstack.org/73/122873/1/gate/gate-tempest-dsvm-neutron-full/e5a2bf6/console.html

In this case subnet a404dc60-62ab-4c2c-bf24-2ab33df6f176 can't be deleted.
Port 375d726d-fccd-42e4-8b5f-60a200ed76af is in this subnet, and attach to instance c65d83f5-db85-4adf-b8ae-bba2b81ac4c6

At nova api side, the port attach at:
2014-09-21 05:27:06.453 AUDIT nova.api.openstack.compute.contrib.attach_interfaces [req-d547ba16-49ec-468c-a2d3-00075880f33b TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] [instance: c65d83f5-db85-4adf-b8ae-bba2b81ac4c6] Attach interface

and detach at:
2014-09-21 05:27:10.937 AUDIT nova.api.openstack.compute.contrib.attach_interfaces [req-035307f9-819f-4489-aea0-ce18f00ec2f5 TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] [instance: c65d83f5-db85-4adf-b8ae-bba2b81ac4c6] Detach interface 375d726d-fccd-42e4-8b5f-60a200ed76af

But the detach failed
2014-09-21 05:27:11.047 ERROR oslo.messaging.rpc.dispatcher [req-035307f9-819f-4489-aea0-ce18f00ec2f5 TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] Exception during message handling: Port 375d726d-fccd-42e4-8b5f-60a200ed76af is not attached
2014-09-21 05:27:11.047 4710 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
....

So the reason is detach happend before attach finished.

The failed test is:

    def test_hotplug_nic(self):
        """
        1. create a new network, with no gateway (to prevent overwriting VM's
            gateway)
        2. connect VM to new network
        3. set static ip and bring new nic up
        4. check VM can ping new network dhcp port

        """
        self._setup_network_and_servers()
        self._check_public_network_connectivity(should_connect=True)
        self._create_new_network()
        self._hotplug_server()
        self._check_network_internal_connectivity(network=self.new_net)

self._hotplug_server() attach the interface, but there won't waiting for the interface attachment finished.
and 'self._check_network_internal_connectivity(network=self.new_net)' only check the interface created when boot instance and dhcp interface.

So the cleanup may happened before attach interface finished.

The cleanup is setup at 'self._hotplug_server'

        self.addCleanup(self.delete_wrapper,
                        self.interface_client.delete_interface,
                        server['id'], interface['port_id'])

But this cleanup failed as the log:
2014-09-21 05:27:11.047 ERROR oslo.messaging.rpc.dispatcher [req-035307f9-819f-4489-aea0-ce18f00ec2f5 TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] Exception during message handling: Port 375d726d-fccd-42e4-8b5f-60a200ed76af is not attached

And another cleanup is waiting for the interface deleted,
        self.addCleanup(self.network_client.wait_for_resource_deletion,
                        'port',
                        interface['port_id'])

We also can found this cleanup timeout finally
http://logs.openstack.org/73/122873/1/gate/gate-tempest-dsvm-neutron-full/e5a2bf6/console.html#_2...

Read more...

Alex Xu (xuhj) wrote :

Emm....I missing something in the second case. There is code trying waiting for attach_interface finished.

Alex Xu (xuhj) wrote :
Download full text (3.7 KiB)

For the second case:

network info catch update and there is port info for 375d726d-fccd-42e4-8b5f-60a200ed76af:

2014-09-21 05:27:06.976 DEBUG nova.network.base_api [req-d547ba16-49ec-468c-a2d3-00075880f33b TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] Updating cache with info: [VIF({'profile': {}, 'ovs_interfaceid': u'3725c51f-b240-404f-b747-06da1460126e', 'network': Network({'bridge': 'br-int', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [IP({'meta': {}, 'version': 4, 'type': 'floating', 'address': u'172.24.4.125'})], 'address': u'10.100.0.2'})], 'version': 4, 'meta': {'dhcp_server': u'10.100.0.3'}, 'dns': [], 'routes': [], 'cidr': u'10.100.0.0/28', 'gateway': IP({'meta': {}, 'version': 4, 'type': 'gateway', 'address': u'10.100.0.1'})})], 'meta': {'injected': False, 'tenant_id': u'9a48bc2e80124a6d9d0abfa0c53c908b'}, 'id': u'614ab449-b511-4e99-9694-121e33067d4c', 'label': u'network-smoke--545348394'}), 'devname': u'tap3725c51f-b2', 'vnic_type': u'normal', 'qbh_params': None, 'meta': {}, 'details': {u'port_filter': True, u'ovs_hybrid_plug': True}, 'address': u'fa:16:3e:e1:30:0f', 'active': True, 'type': u'ovs', 'id': u'3725c51f-b240-404f-b747-06da1460126e', 'qbg_params': None}), VIF({'profile': {}, 'ovs_interfaceid': u'375d726d-fccd-42e4-8b5f-60a200ed76af', 'network': Network({'bridge': 'br-int', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [], 'address': u'10.100.0.17'})], 'version': 4, 'meta': {'dhcp_server': u'10.100.0.18'}, 'dns': [], 'routes': [], 'cidr': u'10.100.0.16/28', 'gateway': IP({'meta': {}, 'version': None, 'type': 'gateway', 'address': None})})], 'meta': {'injected': False, 'tenant_id': u'9a48bc2e80124a6d9d0abfa0c53c908b'}, 'id': u'0e123907-7f88-47aa-a6d7-bbfc61105bff', 'label': u'network-smoke--29266327'}), 'devname': u'tap375d726d-fc', 'vnic_type': u'normal', 'qbh_params': None, 'meta': {}, 'details': {u'port_filter': True, u'ovs_hybrid_plug': True}, 'address': u'fa:16:3e:85:44:97', 'active': False, 'type': u'ovs', 'id': u'375d726d-fccd-42e4-8b5f-60a200ed76af', 'qbg_params': None})] update_instance_cache_with_nw_info

And another network info catch update by _heal_instance_info_cache, but there isn't port info for 375d726d-fccd-42e4-8b5f-60a200ed76af:

2014-09-21 05:27:06.999 4710 DEBUG nova.network.base_api [-] Updating cache with info: [VIF({'profile': {}, 'ovs_interfaceid': u'3725c51f-b240-404f-b747-06da1460126e', 'network': Network({'bridge': 'br-int', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [IP({'meta': {}, 'version': 4, 'type': 'floating', 'address': u'172.24.4.125'})], 'address': u'10.100.0.2'})], 'version': 4, 'meta': {'dhcp_server': u'10.100.0.3'}, 'dns': [], 'routes': [], 'cidr': u'10.100.0.0/28', 'gateway': IP({'meta': {}, 'version': 4, 'type': 'gateway', 'address': u'10.100.0.1'})})], 'meta': {'injected': False, 'tenant_id': u'9a48bc2e80124a6d9d0abfa0c53c908b'}, 'id': u'614ab449-b511-4e99-9694-121e33067d4c', 'label': u'network-smoke--545348394'}), 'devname': u'tap3725c51f-b2', 'vnic_type': u'normal', 'qbh_params': None, 'meta': {}, 'detai...

Read more...

Matt Riedemann (mriedem) wrote :

Looks like bug 1326183 is related.

Michael Still (mikal) wrote :

Is anyone working the nova side of this bug?

Alex Xu (xuhj) wrote :

@Michael, I'm working on it. I fixed one case with patch https://review.openstack.org/#/c/122986/ , it begin to merge. And there is another case is under review https://review.openstack.org/#/c/97662/ that fixed by Praveen

I didn't find anymore cases can trigger this bug for now. Hope those two patches can fix this.

And there can find another error message in the log:
http://logs.openstack.org/73/122873/1/gate/gate-tempest-dsvm-neutron-full/e5a2bf6/logs/screen-n-cpu.txt.gz?level=ERROR#_2014-09-21_05_18_23_709

I think that is ec2 api bug, and that won't trigger this bug. We can file another bug for it.

Reviewed: https://review.openstack.org/122986
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4adc8376f722e48e1c89a05d1b4e922f8ca4c19b
Submitter: Jenkins
Branch: master

commit 4adc8376f722e48e1c89a05d1b4e922f8ca4c19b
Author: He Jie Xu <email address hidden>
Date: Sun Sep 21 14:07:29 2014 +0800

    Sync network_info if instance not found before _build_resources yield

    If instance is deleted before _build_resources yield, there won't
    waiting for network_info sync finish.

    So there is race case, when the exception catching for
    'InstanceNotFound' in do_build_and_run_instance invoke
    '_cleanup_allocated_networks', the '_allocate_network_async' didn't
    create port yet. After '_cleanup_allocated_networks' finished,
    '_allocate_network_async' create new port for the instance, but there
    isn't any cleanup for the new port anymore.

    Change-Id: I5027a1aaeb9a2106f7982fd0422f88c4caa20ffb
    Related-Bug: #1357055

Michael Still (mikal) wrote :

Alex thanks for those patches. The second one is now approved and awaiting merge.

Changed in nova:
assignee: nobody → Alex Xu (xuhj)
Sean Dague (sdague) wrote :

Both of the fixes provided in comment #15 have merged, so marking as fix committed. If we think there remains another outstanding fix needed, lets reopen.

Changed in nova:
status: Confirmed → Fix Committed
Kyle Mestery (mestery) wrote :

I'm assuming this bug can be closed on the Neutron side as well, but I'll wait to verify the nova patches have fixed things a bit before marking as Closed and removing from RC1.

Strangely enough the trace does not show up in logstash

Changed in neutron:
assignee: nobody → Mohammad Banikazemi (mb-s)
Mohammad Banikazemi (mb-s) wrote :

Have been looking at Neutron service logs posted above for comments #20 and #22. In both cases, the port that exists when subnet is being deleted (and therefore the subnet deletion fails) never gets deleted. That is Neutron never receives any request to delete the port. The port in question gets created, gets bound correctly, shows up in get ports, but never gets deleted. This is similar to the observation posted in comment #6 above.

Kyle Mestery (mestery) wrote :

Removing Mohammad from this. He has reached out to me and indicated he will not have time to work on this in the coming week or so.

Changed in neutron:
assignee: Mohammad Banikazemi (mb-s) → nobody
Kyle Mestery (mestery) on 2014-09-30
Changed in neutron:
milestone: juno-rc1 → kilo-1
Alex Xu (xuhj) wrote :

yea, one more case.

I didn't the root cause. But put my analyze at here.

The analyze base on this:
http://logs.openstack.org/71/125171/2/gate/gate-tempest-dsvm-neutron-full/62dd074/console.html

The subnet is deleted at: http://logs.openstack.org/71/125171/2/gate/gate-tempest-dsvm-neutron-full/62dd074/logs/tempest.txt.gz#_2014-09-30_22_04_05_218

ServerRescueNegativeTestXML create two servers for tests.
server id: b2c534bf-3655-4f16-b5a0-ed04fc8465ef
rescue server id: d281aee9-50ba-452b-a2f0-b495c8563d5e

The problem is there isn't any log to show those two servers were deleted. So the ports attach to those servers won't be delete.
But I don't know way those two servers won't be deleted.

Anyone has idea?

Alex Xu (xuhj) wrote :

I added tempest as affecting project. Because I think there should be log to show tempest try to clean up the servers that created by test.

I can't find those two servers were deleted:
server id: b2c534bf-3655-4f16-b5a0-ed04fc8465ef
rescue server id: d281aee9-50ba-452b-a2f0-b495c8563d5e

http://logs.openstack.org/71/125171/2/gate/gate-tempest-dsvm-neutron-full/62dd074/logs/tempest.txt.gz

Thierry Carrez (ttx) on 2014-10-01
Changed in nova:
status: Fix Committed → Fix Released
Changed in neutron:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
tags: added: juno-rc-potential
Changed in tempest:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
Download full text (5.0 KiB)

I came to the same conclusions as Alex: the servers are not deleted hence the error.

However, the logging which Alex is claiming for is already there.
Indeed here are the delete operations on teardown for a failing test:

salvatore@trustillo:~$ cat tempest.txt.gz | grep -i ServerRescueNegativeTestJSON.*tearDownClass.*DELETE
2014-10-07 17:49:04.444 25908 INFO tempest.common.rest_client [req-75c758b3-d8cb-48d6-9cb6-3670147aca41 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 202 DELETE http://127.0.0.1:8774/v2/829473406bb545c895a5cd0320624812/os-volumes/ffc44446-ba25-413d-8ff1-839d3643299d 0.135s
2014-10-07 17:49:04.444 25908 DEBUG tempest.common.rest_client [req-75c758b3-d8cb-48d6-9cb6-3670147aca41 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 202 DELETE http://127.0.0.1:8774/v2/829473406bb545c895a5cd0320624812/os-volumes/ffc44446-ba25-413d-8ff1-839d3643299d 0.135s
2014-10-07 17:52:21.452 25908 INFO tempest.common.rest_client [req-d0fa5615-9e64-4faa-bd8d-2ad1ac6afb53 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:9696/v2.0/routers/0ecd1539-70af-4500-aa4d-9e131fa1fffc 0.238s
2014-10-07 17:52:21.452 25908 DEBUG tempest.common.rest_client [req-d0fa5615-9e64-4faa-bd8d-2ad1ac6afb53 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:9696/v2.0/routers/0ecd1539-70af-4500-aa4d-9e131fa1fffc 0.238s
2014-10-07 17:52:21.513 25908 INFO tempest.common.rest_client [req-89562c45-7448-41bb-8e3e-0beec8460aab None] Request (ServerRescueNegativeTestJSON:tearDownClass): 409 DELETE http://127.0.0.1:9696/v2.0/subnets/9614d778-66b3-4b81-83fc-f7a47602ceb2 0.060s
2014-10-07 17:52:21.514 25908 DEBUG tempest.common.rest_client [req-89562c45-7448-41bb-8e3e-0beec8460aab None] Request (ServerRescueNegativeTestJSON:tearDownClass): 409 DELETE http://127.0.0.1:9696/v2.0/subnets/9614d778-66b3-4b81-83fc-f7a47602ceb2 0.060s

No DELETE server command is specified.
Instead for a successful test the two servers are deleted.

2014-09-26 11:48:05.532 7755 INFO tempest.common.rest_client [req-6d9072aa-dbcb-4398-b4c4-46aeb2140e4b None] Request (ServerRescueNegativeTestJSON:tearDownClass): 202 DELETE http://127.0.0.1:8774/v2/0cdef32ce1b746fa957a013a3638b3ec/os-volumes/a24f4dc2-bfb5-4a60-be05-9051f08cc447 0.086s
2014-09-26 11:50:06.733 7755 INFO tempest.common.rest_client [req-f3752c9f-8de5-4bde-98c9-879c5a37ff44 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:8774/v2/0cdef32ce1b746fa957a013a3638b3ec/servers/1d754b6e-128f-4b42-88ab-9dbefedd887f 0.155s
2014-09-26 11:50:06.882 7755 INFO tempest.common.rest_client [req-dcb05efc-229a-4f40-ac67-2e95a80373c0 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:8774/v2/0cdef32ce1b746fa957a013a3638b3ec/servers/d26d1144-58d2-4900-93af-bf7fecdd7a60 0.148s
2014-09-26 11:50:09.531 7755 INFO tempest.common.rest_client [req-73246abe-5e0e-4d16-88dc-75ca05593b2c None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:9696/v2.0/routers/755eeb0d-6eb8-4acd-b36d-91bd4787cf4e 0.180s
2014-09-26 11:50:09.583 7755 INFO tempest.common.rest_client [req-549...

Read more...

no longer affects: neutron
tags: removed: juno-rc-potential
Alex Xu (xuhj) wrote :

As I and Salvatore get same conclusions. This isn't bug neutron and nova. It should be tempest bug. I removed nova from affected projects.

no longer affects: nova

Reviewed: https://review.openstack.org/126908
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=1422a9ac6c40df71978412cddce90870d578989c
Submitter: Jenkins
Branch: master

commit 1422a9ac6c40df71978412cddce90870d578989c
Author: Salvatore <email address hidden>
Date: Wed Oct 8 15:53:25 2014 +0200

    Log identifiers of resources being cleaned up

    On teardown compute API tests clean up resources such as servers
    and images. This patch simply logs the identifiers of these
    resources in order to help debugging failures which might originate
    from problems during cleanup.

    Change-Id: Ied2924bba78ec8f4a3b51df1be69273b28362475
    Related-Bug: #1357055

Changed in tempest:
status: New → Confirmed
Silvan Kaiser (2-silvan) wrote :

This issue still affects ppl, looking at elastic recheck at 2014-12-12 13:14 UTC:
Bug 1357055 - Race to delete shared subnet in Tempest neutron full jobs
23 fails in 24 hrs / 79 fails in 10 days
Any new ideas on this one?

Reviewed: https://review.openstack.org/151415
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c17faaad2b1030b16c1464b4ca3325a19631833d
Submitter: Jenkins
Branch: master

commit c17faaad2b1030b16c1464b4ca3325a19631833d
Author: Eugene Nikanorov <email address hidden>
Date: Fri Jan 30 00:58:42 2015 +0300

    Log tenant ports if subnet could not be deleted

    Change-Id: Icd64cda9de2bb1c3ab090366a7a20cec894dca14
    Related-Bug: #1357055

Fix proposed to branch: master
Review: https://review.openstack.org/171662

Changed in neutron:
assignee: nobody → Eugene Nikanorov (enikanorov)
status: New → In Progress
Changed in neutron:
importance: Undecided → Wishlist
Eugene Nikanorov (enikanorov) wrote :

The issue appears to be a simple race between DHCP port creation and subnet deletion.
DHCP port gets after 'autodelete' ports was cleared, but before other allocations are checked.
It doesn't happen too often, but is annoying.

Changed in neutron:
importance: Wishlist → High

Fix proposed to branch: master
Review: https://review.openstack.org/171848

Reviewed: https://review.openstack.org/171662
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d5aa1831ac95c16fcee6ec0bb8f0bf07afbe384c
Submitter: Jenkins
Branch: master

commit d5aa1831ac95c16fcee6ec0bb8f0bf07afbe384c
Author: Eugene Nikanorov <email address hidden>
Date: Wed Apr 8 02:23:22 2015 +0400

    Add logging to dangling port to ml2 delete_subnet

    This is useful when trouble shooting test failures.
    Also, in db_base_plugin_v2, log only port_id instead of
    full allocation object.

    Change-Id: I3d77318aee70836de125687a7f6c0f495d545f21
    Related-Bug: #1357055

Reviewed: https://review.openstack.org/171848
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1642bca4d9c4fee15129f74d93300c1eab1afd29
Submitter: Jenkins
Branch: master

commit 1642bca4d9c4fee15129f74d93300c1eab1afd29
Author: Eugene Nikanorov <email address hidden>
Date: Thu Apr 9 01:16:18 2015 +0300

    Handle race condition on subnet-delete

    This fix targets quite rare case of race condition between
    port creation and subnet deletion. This usually happens
    during API tests that do things quickly.
    DHCP port is being created after delete_subnet checks for
    DHCP ports, but before it checks for IPAllocations on subnet.
    The solution is to apply retrying logic, which is really necessary
    as we can't fetch new IPAllocations with the same query and within
    the active transaction in mysql because of REPEATABLE READ
    transaction isolation.

    Change-Id: Ib9da018e654cdee3b64aa38de90f171c92ee28ee
    Closes-Bug: 1357055

Changed in neutron:
status: In Progress → Fix Committed
Eugene Nikanorov (enikanorov) wrote :

The fix was not correct in the essential query, so it produces exactly the same behavior with postgresql

Changed in neutron:
status: Fix Committed → In Progress

Reviewed: https://review.openstack.org/175125
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0109578a8ec07f743f7e2b654007e17f145ea20f
Submitter: Jenkins
Branch: master

commit 0109578a8ec07f743f7e2b654007e17f145ea20f
Author: Eugene Nikanorov <email address hidden>
Date: Sat Apr 18 15:31:44 2015 +0400

    Fix incorrect query for user ip allocations

    Previously the query was fetching an IPAllocation object incorrectly
    relying on the fact that it has port attribute that should be
    join-loaded when it really is not.

    Incorrect query produced by previous code:
    SELECT ipallocations.port_id AS ipallocations_port_id,
           ipallocations.ip_address AS ipallocations_ip_address,
           ipallocations.subnet_id AS ipallocations_subnet_id,
           ipallocations.network_id AS ipallocations_network_id
    FROM ipallocations, ports
    WHERE ipallocations.subnet_id = :subnet_id_1
          AND ports.device_owner NOT IN (:device_owner_1)

    The query then may have produced results that don't satisfy
    the condition intended by the code.

    Query produced by the fixed code:
    SELECT ipallocations.port_id AS ipallocations_port_id,
           ipallocations.ip_address AS ipallocations_ip_address,
           ipallocations.subnet_id AS ipallocations_subnet_id,
           ipallocations.network_id AS ipallocations_network_id
    FROM ipallocations JOIN ports ON ports.id = ipallocations.port_id
    WHERE ipallocations.subnet_id = :subnet_id_1
          AND ports.device_owner NOT IN (:device_owner_1)

    Change-Id: I34682df784e30e3ce49ee48c690f8b799ad58149
    Closes-Bug: #1357055

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2015-06-24
Changed in neutron:
milestone: none → liberty-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2015-10-15
Changed in neutron:
milestone: liberty-1 → 7.0.0
Ken'ichi Ohmichi (oomichi) wrote :

Alex has added Tempest to this bug report for adding more log for the investigation and that is done on https://review.openstack.org/#/c/126908/

So this bug report should be close now.

Changed in tempest:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers