Bug #1357055 “Race to delete shared subnet in Tempest neutron fu...” : Bugs : tempest

Revision history for this message

Matt Riedemann (mriedem) wrote on 2014-08-14:

#1

e-r query here: https://review.openstack.org/#/c/114359/

If this is mostly just neutron full jobs, this might make sense, i.e. those run with tenant isolation so maybe there are multiple tests running at the same time allocating ports on the same subnet and then when one test completes and tries to tear down it wants to delete the subnet which is still got one or more ports on it, so we fail.

Seems like we could just do a simple check in the nova neutronv2 API before trying to delete the subnet to see if there are any ports still assigned, or just handle the 409 and move on.

tags:

added: network neutron

Revision history for this message

Matt Riedemann (mriedem) wrote on 2014-08-14:

#2

Interesting, if I take out the 'Exception dispatching event' from the query, there are no hits, so maybe the stable/icehouse backport will clean this up for good?

http://logstash.openstack.org/#eyJzZWFyY2giOiJOT1QgbWVzc2FnZTpcIkV4Y2VwdGlvbiBkaXNwYXRjaGluZyBldmVudFwiIEFORCBtZXNzYWdlOlwiSW5mbyBjYWNoZSBmb3IgaW5zdGFuY2VcIiBBTkQgbWVzc2FnZTpcImNvdWxkIG5vdCBiZSBmb3VuZFwiIEFORCB0YWdzOlwic2NyZWVuLW4tY3B1LnR4dFwiIEFORCBidWlsZF9icmFuY2g6XCJtYXN0ZXJcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwODA1MjIwMDg3MCwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==

Sean Dague (sdague) on 2014-09-09

Changed in nova:
status:	New → Confirmed
importance:	Undecided → Low

Revision history for this message

Joe Gordon (jogo) wrote on 2014-09-11:

#3

11 hits in 24 hours marking as critical

Changed in nova:
importance:	Low → Critical
milestone:	none → juno-rc1

Joe Gordon (jogo) on 2014-09-11

Changed in neutron:
importance:	Undecided → Critical

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2014-09-15:

#4

This looks like duplicate for bug #1283599

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2014-09-15:

#5

Neutron barfs at the fact that a subnet or network cannot be deleted because there are non-service ports in use on the logical segment. This failure mode may be the manifestation of multiple errors. In the case of bug 1283599, my triaging revealed that the error lied in the Nova caching logic.

I'll drill down into the recent failures, to see if I can provide further input

Joe Gordon (jogo) on 2014-09-18

Changed in neutron:
milestone:	none → juno-rc1

Revision history for this message

Alex Xu (xuhj) wrote on 2014-09-21:

#6

Download full text (4.3 KiB)

At compute manager, Instance 6ce21bd9-e6a6-41a6-81ec-72ce368cb38e is dispeared when building:

2014-09-20 16:56:09.439 DEBUG nova.compute.manager [req-d9ef4ec9-3904-4c19-8e38-f06c76a5f1ee InstanceRunTest-1019975326 InstanceRunTest-564222607] [instance: 6ce21bd9-e6a6-41a6-81ec-72ce368cb38e] Instance disappeared during build. do_build_and_run_instance /opt/stack/new/nova/nova/compute/manager.py:2033

Check the log in nova api, the instance is deleted at nova-api side, it because it deleted before compute manager assign the host to the intsance.

2014-09-20 16:56:08.963 INFO nova.ec2.wsgi.server [req-d9ef4ec9-3904-4c19-8e38-f06c76a5f1ee InstanceRunTest-1019975326 InstanceRunTest-564222607] 127.0.0.1 "POST /services/Cloud/ HTTP/1.1" status: 200 len: 1207 time: 2.3221920
2014-09-20 16:56:09.022 DEBUG nova.api.ec2 [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] action: TerminateInstances __call__ /opt/stack/new/nova/nova/api/ec2/__init__.py:325
2014-09-20 16:56:09.023 DEBUG nova.api.ec2 [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] arg: InstanceId.1 val: i-000000bd __call__ /opt/stack/new/nova/nova/api/ec2/__init__.py:328
2014-09-20 16:56:09.213 DEBUG nova.api.ec2.cloud [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] Going to start terminating instances terminate_instances /opt/stack/new/nova/nova/api/ec2/cloud.py:1462
2014-09-20 16:56:09.214 DEBUG nova.compute.api [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] [instance: 6ce21bd9-e6a6-41a6-81ec-72ce368cb38e] Going to try to terminate instance delete /opt/stack/new/nova/nova/compute/api.py:1791
2014-09-20 16:56:09.304 DEBUG nova.quota [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] Created reservations ['fdae1ad5-be89-4dd5-b54e-29f4d24c19dc', '86b41f79-7b8a-446d-b545-77719c3dd891', 'aef891e1-a066-4e2a-9978-27967759b0c2'] reserve /opt/stack/new/nova/nova/quota.py:1315
2014-09-20 16:56:09.393 DEBUG nova.quota [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] Committed reservations [u'fdae1ad5-be89-4dd5-b54e-29f4d24c19dc', u'86b41f79-7b8a-446d-b545-77719c3dd891', u'aef891e1-a066-4e2a-9978-27967759b0c2'] commit /opt/stack/new/nova/nova/quota.py:1340
2014-09-20 16:56:09.432 DEBUG nova.api.ec2.apirequest [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] <TerminateInstancesResponse xmlns="http://ec2.amazonaws.com/doc/2014-05-01/">

So compute manager try to cleanup the network: https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2044

But there is no port found by the cleanup function:
2014-09-20 16:56:09.459 DEBUG neutronclient.client [req-d9ef4ec9-3904-4c19-8e38-f06c76a5f1ee InstanceRunTest-1019975326 InstanceRunTest-564222607] RESP:200 CaseInsensitiveDict({'date': 'Sat, 20 Sep 2014 16:56:09 GMT', 'content-length': '13', 'content-type': 'application/json; charset=UTF-8', 'x-openstack-request-id': 'req-e677c6b1-9316-4680-abe5-c3896bd71ee9'}) {"ports": []...

At compute manager, Instance 6ce21bd9-e6a6-41a6-81ec-72ce368cb38e is dispeared when building:

2014-09-20 16:56:09.439 DEBUG nova.compute.manager [req-d9ef4ec9-3904-4c19-8e38-f06c76a5f1ee InstanceRunTest-1019975326 InstanceRunTest-564222607] [instance: 6ce21bd9-e6a6-41a6-81ec-72ce368cb38e] Instance disappeared during build. do_build_and_run_instance /opt/stack/new/nova/nova/compute/manager.py:2033

Check the log in nova api, the instance is deleted at nova-api side, it because it deleted before compute manager assign the host to the intsance.

2014-09-20 16:56:08.963 INFO nova.ec2.wsgi.server [req-d9ef4ec9-3904-4c19-8e38-f06c76a5f1ee InstanceRunTest-1019975326 InstanceRunTest-564222607] 127.0.0.1 "POST /services/Cloud/ HTTP/1.1" status: 200 len: 1207 time: 2.3221920
2014-09-20 16:56:09.022 DEBUG nova.api.ec2 [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] action: TerminateInstances __call__ /opt/stack/new/nova/nova/api/ec2/__init__.py:325
2014-09-20 16:56:09.023 DEBUG nova.api.ec2 [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] arg: InstanceId.1		val: i-000000bd __call__ /opt/stack/new/nova/nova/api/ec2/__init__.py:328
2014-09-20 16:56:09.213 DEBUG nova.api.ec2.cloud [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] Going to start terminating instances terminate_instances /opt/stack/new/nova/nova/api/ec2/cloud.py:1462
2014-09-20 16:56:09.214 DEBUG nova.compute.api [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] [instance: 6ce21bd9-e6a6-41a6-81ec-72ce368cb38e] Going to try to terminate instance delete /opt/stack/new/nova/nova/compute/api.py:1791
2014-09-20 16:56:09.304 DEBUG nova.quota [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] Created reservations ['fdae1ad5-be89-4dd5-b54e-29f4d24c19dc', '86b41f79-7b8a-446d-b545-77719c3dd891', 'aef891e1-a066-4e2a-9978-27967759b0c2'] reserve /opt/stack/new/nova/nova/quota.py:1315
2014-09-20 16:56:09.393 DEBUG nova.quota [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] Committed reservations [u'fdae1ad5-be89-4dd5-b54e-29f4d24c19dc', u'86b41f79-7b8a-446d-b545-77719c3dd891', u'aef891e1-a066-4e2a-9978-27967759b0c2'] commit /opt/stack/new/nova/nova/quota.py:1340
2014-09-20 16:56:09.432 DEBUG nova.api.ec2.apirequest [req-4093d683-e99d-4816-b427-dd95f7c64987 InstanceRunTest-1019975326 InstanceRunTest-564222607] <TerminateInstancesResponse xmlns="http://ec2.amazonaws.com/doc/2014-05-01/">

So compute manager try to cleanup the network: https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2044

But there is no port found by the cleanup function:
2014-09-20 16:56:09.459 DEBUG neutronclient.client [req-d9ef4ec9-3904-4c19-8e38-f06c76a5f1ee InstanceRunTest-1019975326 InstanceRunTest-564222607] RESP:200 CaseInsensitiveDict({'date': 'Sat, 20 Sep 2014 16:56:09 GMT', 'content-length': '13', 'content-type': 'application/json; charset=UTF-8', 'x-openstack-request-id': 'req-e677c6b1-9316-4680-abe5-c3896bd71ee9'}) {"ports": []}

At same time the '_allocate_network_async' is running also.
After the cleanup function end, the '_allocate_network_async' create new port for instance 6ce21bd9-e6a6-41a6-81ec-72ce368cb38e

2014-09-20 16:56:09.865 1894 DEBUG nova.network.neutronv2.api [-] [instance: 6ce21bd9-e6a6-41a6-81ec-72ce368cb38e] Successfully created port: 89ff93ba-b020-43c7-98b3-7ba9afebcd1f _create_port /opt/stack/new/nova/nova/network/neutronv2/api.py:215

And check the log, there isn't any api call to neutron to delete this port 89ff93ba-b020-43c7-98b3-7ba9afebcd1f

And finally, the '_allocate_network_async' is stopped for instance not found also:
http://logs.openstack.org/85/122785/1/check/check-tempest-dsvm-neutron-pg-full/61429eb/logs/screen-n-cpu.txt.gz#_2014-09-20_16_56_10_014

So the instance is deleted before the 'yield resources' in '_build_resources'
https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2216

before  'yield resources', the exception catching for InstanceNotFound won't waiting for the network_info sync finish.

So the race is happened between https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2044 and '_allocate_network_async'

Revision history for this message

Alex Xu (xuhj) wrote on 2014-09-21:

#7

above analyz is based on this case http://logs.openstack.org/85/122785/1/check/check-tempest-dsvm-neutron-pg-full/61429eb

Revision history for this message

Alex Xu (xuhj) wrote on 2014-09-21:

#8

Patch for based on this analyze: https://review.openstack.org/#/c/122986/

Revision history for this message

Matt Riedemann (mriedem) wrote on 2014-09-21:

#9

Saw it again here:

http://logs.openstack.org/73/122873/1/gate/gate-tempest-dsvm-neutron-full/e5a2bf6/console.html

Revision history for this message

Alex Xu (xuhj) wrote on 2014-09-22:

#10

Download full text (3.2 KiB)

Except the case fixed by https://review.openstack.org/#/c/122986/, there also a race in tempest test.

The analyze is based on:
http://logs.openstack.org/73/122873/1/gate/gate-tempest-dsvm-neutron-full/e5a2bf6/console.html

In this case subnet a404dc60-62ab-4c2c-bf24-2ab33df6f176 can't be deleted.
Port 375d726d-fccd-42e4-8b5f-60a200ed76af is in this subnet, and attach to instance c65d83f5-db85-4adf-b8ae-bba2b81ac4c6

At nova api side, the port attach at:
2014-09-21 05:27:06.453 AUDIT nova.api.openstack.compute.contrib.attach_interfaces [req-d547ba16-49ec-468c-a2d3-00075880f33b TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] [instance: c65d83f5-db85-4adf-b8ae-bba2b81ac4c6] Attach interface

and detach at:
2014-09-21 05:27:10.937 AUDIT nova.api.openstack.compute.contrib.attach_interfaces [req-035307f9-819f-4489-aea0-ce18f00ec2f5 TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] [instance: c65d83f5-db85-4adf-b8ae-bba2b81ac4c6] Detach interface 375d726d-fccd-42e4-8b5f-60a200ed76af

But the detach failed
2014-09-21 05:27:11.047 ERROR oslo.messaging.rpc.dispatcher [req-035307f9-819f-4489-aea0-ce18f00ec2f5 TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] Exception during message handling: Port 375d726d-fccd-42e4-8b5f-60a200ed76af is not attached
2014-09-21 05:27:11.047 4710 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
....

So the reason is detach happend before attach finished.

The failed test is:

    def test_hotplug_nic(self):
        """
        1. create a new network, with no gateway (to prevent overwriting VM's
            gateway)
        2. connect VM to new network
        3. set static ip and bring new nic up
        4. check VM can ping new network dhcp port

        """
        self._setup_network_and_servers()
        self._check_public_network_connectivity(should_connect=True)
        self._create_new_network()
        self._hotplug_server()
        self._check_network_internal_connectivity(network=self.new_net)

self._hotplug_server() attach the interface, but there won't waiting for the interface attachment finished.
and 'self._check_network_internal_connectivity(network=self.new_net)' only check the interface created when boot instance and dhcp interface.

So the cleanup may happened before attach interface finished.

The cleanup is setup at 'self._hotplug_server'

        self.addCleanup(self.delete_wrapper,
                        self.interface_client.delete_interface,
                        server['id'], interface['port_id'])

But this cleanup failed as the log:
2014-09-21 05:27:11.047 ERROR oslo.messaging.rpc.dispatcher [req-035307f9-819f-4489-aea0-ce18f00ec2f5 TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] Exception during message handling: Port 375d726d-fccd-42e4-8b5f-60a200ed76af is not attached

And another cleanup is waiting for the interface deleted,
        self.addCleanup(self.network_client.wait_for_resource_deletion,
                        'port',
                        interface['port_id'])

We also can found this cleanup timeout finally
http://logs.openstack.org/73/122873/1/gate/gate-tempest-dsvm-neutron-full/e5a2bf6/console.html#_2...

Except the case fixed by https://review.openstack.org/#/c/122986/, there also a race in tempest test.

The analyze is based on:
http://logs.openstack.org/73/122873/1/gate/gate-tempest-dsvm-neutron-full/e5a2bf6/console.html

In this case subnet a404dc60-62ab-4c2c-bf24-2ab33df6f176 can't be deleted.
Port 375d726d-fccd-42e4-8b5f-60a200ed76af is in this subnet, and attach to instance c65d83f5-db85-4adf-b8ae-bba2b81ac4c6

At nova api side, the port attach at:
2014-09-21 05:27:06.453 AUDIT nova.api.openstack.compute.contrib.attach_interfaces [req-d547ba16-49ec-468c-a2d3-00075880f33b TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] [instance: c65d83f5-db85-4adf-b8ae-bba2b81ac4c6] Attach interface

and detach at:
2014-09-21 05:27:10.937 AUDIT nova.api.openstack.compute.contrib.attach_interfaces [req-035307f9-819f-4489-aea0-ce18f00ec2f5 TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] [instance: c65d83f5-db85-4adf-b8ae-bba2b81ac4c6] Detach interface 375d726d-fccd-42e4-8b5f-60a200ed76af

But the detach failed
2014-09-21 05:27:11.047 ERROR oslo.messaging.rpc.dispatcher [req-035307f9-819f-4489-aea0-ce18f00ec2f5 TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] Exception during message handling: Port 375d726d-fccd-42e4-8b5f-60a200ed76af is not attached
2014-09-21 05:27:11.047 4710 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
....

So the reason is detach happend before attach finished.

The failed test is:

def test_hotplug_nic(self):
        """
        1. create a new network, with no gateway (to prevent overwriting VM's
            gateway)
        2. connect VM to new network
        3. set static ip and bring new nic up
        4. check VM can ping new network dhcp port

"""
        self._setup_network_and_servers()
        self._check_public_network_connectivity(should_connect=True)
        self._create_new_network()
        self._hotplug_server()
        self._check_network_internal_connectivity(network=self.new_net)

self._hotplug_server() attach the interface, but there won't waiting for the interface attachment finished.
and 'self._check_network_internal_connectivity(network=self.new_net)' only check the interface created when boot instance and dhcp interface.

So the cleanup may happened before attach interface finished.

The cleanup is setup at 'self._hotplug_server'

self.addCleanup(self.delete_wrapper,
                        self.interface_client.delete_interface,
                        server['id'], interface['port_id'])

But this cleanup failed as the log:
2014-09-21 05:27:11.047 ERROR oslo.messaging.rpc.dispatcher [req-035307f9-819f-4489-aea0-ce18f00ec2f5 TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] Exception during message handling: Port 375d726d-fccd-42e4-8b5f-60a200ed76af is not attached

And another cleanup is waiting for the interface deleted,
        self.addCleanup(self.network_client.wait_for_resource_deletion,
                        'port',
                        interface['port_id'])

We also can found this cleanup timeout finally
http://logs.openstack.org/73/122873/1/gate/gate-tempest-dsvm-neutron-full/e5a2bf6/console.html#_2014-09-21_05_41_32_226

So for this case, we should fix the tempest test, to waiting for attach interface down.

Revision history for this message

Alex Xu (xuhj) wrote on 2014-09-22:

#11

Emm....I missing something in the second case. There is code trying waiting for attach_interface finished.

Revision history for this message

Alex Xu (xuhj) wrote on 2014-09-22:

#12

Download full text (3.7 KiB)

For the second case:

network info catch update and there is port info for 375d726d-fccd-42e4-8b5f-60a200ed76af:

2014-09-21 05:27:06.976 DEBUG nova.network.base_api [req-d547ba16-49ec-468c-a2d3-00075880f33b TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] Updating cache with info: [VIF({'profile': {}, 'ovs_interfaceid': u'3725c51f-b240-404f-b747-06da1460126e', 'network': Network({'bridge': 'br-int', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [IP({'meta': {}, 'version': 4, 'type': 'floating', 'address': u'172.24.4.125'})], 'address': u'10.100.0.2'})], 'version': 4, 'meta': {'dhcp_server': u'10.100.0.3'}, 'dns': [], 'routes': [], 'cidr': u'10.100.0.0/28', 'gateway': IP({'meta': {}, 'version': 4, 'type': 'gateway', 'address': u'10.100.0.1'})})], 'meta': {'injected': False, 'tenant_id': u'9a48bc2e80124a6d9d0abfa0c53c908b'}, 'id': u'614ab449-b511-4e99-9694-121e33067d4c', 'label': u'network-smoke--545348394'}), 'devname': u'tap3725c51f-b2', 'vnic_type': u'normal', 'qbh_params': None, 'meta': {}, 'details': {u'port_filter': True, u'ovs_hybrid_plug': True}, 'address': u'fa:16:3e:e1:30:0f', 'active': True, 'type': u'ovs', 'id': u'3725c51f-b240-404f-b747-06da1460126e', 'qbg_params': None}), VIF({'profile': {}, 'ovs_interfaceid': u'375d726d-fccd-42e4-8b5f-60a200ed76af', 'network': Network({'bridge': 'br-int', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [], 'address': u'10.100.0.17'})], 'version': 4, 'meta': {'dhcp_server': u'10.100.0.18'}, 'dns': [], 'routes': [], 'cidr': u'10.100.0.16/28', 'gateway': IP({'meta': {}, 'version': None, 'type': 'gateway', 'address': None})})], 'meta': {'injected': False, 'tenant_id': u'9a48bc2e80124a6d9d0abfa0c53c908b'}, 'id': u'0e123907-7f88-47aa-a6d7-bbfc61105bff', 'label': u'network-smoke--29266327'}), 'devname': u'tap375d726d-fc', 'vnic_type': u'normal', 'qbh_params': None, 'meta': {}, 'details': {u'port_filter': True, u'ovs_hybrid_plug': True}, 'address': u'fa:16:3e:85:44:97', 'active': False, 'type': u'ovs', 'id': u'375d726d-fccd-42e4-8b5f-60a200ed76af', 'qbg_params': None})] update_instance_cache_with_nw_info

And another network info catch update by _heal_instance_info_cache, but there isn't port info for 375d726d-fccd-42e4-8b5f-60a200ed76af:

2014-09-21 05:27:06.999 4710 DEBUG nova.network.base_api [-] Updating cache with info: [VIF({'profile': {}, 'ovs_interfaceid': u'3725c51f-b240-404f-b747-06da1460126e', 'network': Network({'bridge': 'br-int', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [IP({'meta': {}, 'version': 4, 'type': 'floating', 'address': u'172.24.4.125'})], 'address': u'10.100.0.2'})], 'version': 4, 'meta': {'dhcp_server': u'10.100.0.3'}, 'dns': [], 'routes': [], 'cidr': u'10.100.0.0/28', 'gateway': IP({'meta': {}, 'version': 4, 'type': 'gateway', 'address': u'10.100.0.1'})})], 'meta': {'injected': False, 'tenant_id': u'9a48bc2e80124a6d9d0abfa0c53c908b'}, 'id': u'614ab449-b511-4e99-9694-121e33067d4c', 'label': u'network-smoke--545348394'}), 'devname': u'tap3725c51f-b2', 'vnic_type': u'normal', 'qbh_params': None, 'meta': {}, 'detai...

For the second case:

network info catch update and there is port info for 375d726d-fccd-42e4-8b5f-60a200ed76af:

2014-09-21 05:27:06.976 DEBUG nova.network.base_api [req-d547ba16-49ec-468c-a2d3-00075880f33b TestNetworkBasicOps-1044489211 TestNetworkBasicOps-84164684] Updating cache with info: [VIF({'profile': {}, 'ovs_interfaceid': u'3725c51f-b240-404f-b747-06da1460126e', 'network': Network({'bridge': 'br-int', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [IP({'meta': {}, 'version': 4, 'type': 'floating', 'address': u'172.24.4.125'})], 'address': u'10.100.0.2'})], 'version': 4, 'meta': {'dhcp_server': u'10.100.0.3'}, 'dns': [], 'routes': [], 'cidr': u'10.100.0.0/28', 'gateway': IP({'meta': {}, 'version': 4, 'type': 'gateway', 'address': u'10.100.0.1'})})], 'meta': {'injected': False, 'tenant_id': u'9a48bc2e80124a6d9d0abfa0c53c908b'}, 'id': u'614ab449-b511-4e99-9694-121e33067d4c', 'label': u'network-smoke--545348394'}), 'devname': u'tap3725c51f-b2', 'vnic_type': u'normal', 'qbh_params': None, 'meta': {}, 'details': {u'port_filter': True, u'ovs_hybrid_plug': True}, 'address': u'fa:16:3e:e1:30:0f', 'active': True, 'type': u'ovs', 'id': u'3725c51f-b240-404f-b747-06da1460126e', 'qbg_params': None}), VIF({'profile': {}, 'ovs_interfaceid': u'375d726d-fccd-42e4-8b5f-60a200ed76af', 'network': Network({'bridge': 'br-int', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [], 'address': u'10.100.0.17'})], 'version': 4, 'meta': {'dhcp_server': u'10.100.0.18'}, 'dns': [], 'routes': [], 'cidr': u'10.100.0.16/28', 'gateway': IP({'meta': {}, 'version': None, 'type': 'gateway', 'address': None})})], 'meta': {'injected': False, 'tenant_id': u'9a48bc2e80124a6d9d0abfa0c53c908b'}, 'id': u'0e123907-7f88-47aa-a6d7-bbfc61105bff', 'label': u'network-smoke--29266327'}), 'devname': u'tap375d726d-fc', 'vnic_type': u'normal', 'qbh_params': None, 'meta': {}, 'details': {u'port_filter': True, u'ovs_hybrid_plug': True}, 'address': u'fa:16:3e:85:44:97', 'active': False, 'type': u'ovs', 'id': u'375d726d-fccd-42e4-8b5f-60a200ed76af', 'qbg_params': None})] update_instance_cache_with_nw_info

And another network info catch update by _heal_instance_info_cache, but there isn't port info for 375d726d-fccd-42e4-8b5f-60a200ed76af:

2014-09-21 05:27:06.999 4710 DEBUG nova.network.base_api [-] Updating cache with info: [VIF({'profile': {}, 'ovs_interfaceid': u'3725c51f-b240-404f-b747-06da1460126e', 'network': Network({'bridge': 'br-int', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [IP({'meta': {}, 'version': 4, 'type': 'floating', 'address': u'172.24.4.125'})], 'address': u'10.100.0.2'})], 'version': 4, 'meta': {'dhcp_server': u'10.100.0.3'}, 'dns': [], 'routes': [], 'cidr': u'10.100.0.0/28', 'gateway': IP({'meta': {}, 'version': 4, 'type': 'gateway', 'address': u'10.100.0.1'})})], 'meta': {'injected': False, 'tenant_id': u'9a48bc2e80124a6d9d0abfa0c53c908b'}, 'id': u'614ab449-b511-4e99-9694-121e33067d4c', 'label': u'network-smoke--545348394'}), 'devname': u'tap3725c51f-b2', 'vnic_type': u'normal', 'qbh_params': None, 'meta': {}, 'details': {u'port_filter': True, u'ovs_hybrid_plug': True}, 'address': u'fa:16:3e:e1:30:0f', 'active': True, 'type': u'ovs', 'id': u'3725c51f-b240-404f-b747-06da1460126e', 'qbg_params': None})] update_instance_cache_with_nw_info /opt/stack/new/nova/nova/network/base_api.py:39
2014-09-21 05:27:07.010 4710 DEBUG nova.compute.manager [-] [instance: c65d83f5-db85-4adf-b8ae-bba2b81ac4c6] Updated the network info_cache for instance _heal_instance_info_cache /opt/stack/new/nova/nova/compute/manager.py:5311

This patch looks like fix similar problem: https://review.openstack.org/#/c/97662/

Revision history for this message

Matt Riedemann (mriedem) wrote on 2014-09-23:

#13

Looks like bug 1326183 is related.

Revision history for this message

Michael Still (mikal) wrote on 2014-09-24:

#14

Is anyone working the nova side of this bug?

Revision history for this message

Alex Xu (xuhj) wrote on 2014-09-24:

#15

@Michael, I'm working on it. I fixed one case with patch https://review.openstack.org/#/c/122986/ , it begin to merge. And there is another case is under review https://review.openstack.org/#/c/97662/ that fixed by Praveen

I didn't find anymore cases can trigger this bug for now. Hope those two patches can fix this.

And there can find another error message in the log:
http://logs.openstack.org/73/122873/1/gate/gate-tempest-dsvm-neutron-full/e5a2bf6/logs/screen-n-cpu.txt.gz?level=ERROR#_2014-09-21_05_18_23_709

I think that is ec2 api bug, and that won't trigger this bug. We can file another bug for it.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-09-24: Related fix merged to nova (master)

#16

Reviewed: https://review.openstack.org/122986
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4adc8376f722e48e1c89a05d1b4e922f8ca4c19b
Submitter: Jenkins
Branch: master

commit 4adc8376f722e48e1c89a05d1b4e922f8ca4c19b
Author: He Jie Xu <email address hidden>
Date: Sun Sep 21 14:07:29 2014 +0800

Sync network_info if instance not found before _build_resources yield

If instance is deleted before _build_resources yield, there won't
waiting for network_info sync finish.

    So there is race case, when the exception catching for
    'InstanceNotFound' in do_build_and_run_instance invoke
    '_cleanup_allocated_networks', the '_allocate_network_async' didn't
    create port yet. After '_cleanup_allocated_networks' finished,
    '_allocate_network_async' create new port for the instance, but there
    isn't any cleanup for the new port anymore.

Change-Id: I5027a1aaeb9a2106f7982fd0422f88c4caa20ffb
Related-Bug: #1357055

Revision history for this message

Michael Still (mikal) wrote on 2014-09-25:

#17

Alex thanks for those patches. The second one is now approved and awaiting merge.

Changed in nova:
assignee:	nobody → Alex Xu (xuhj)

Revision history for this message

Sean Dague (sdague) wrote on 2014-09-25:

#18

Both of the fixes provided in comment #15 have merged, so marking as fix committed. If we think there remains another outstanding fix needed, lets reopen.

Changed in nova:
status:	Confirmed → Fix Committed

Revision history for this message

Kyle Mestery (mestery) wrote on 2014-09-25:

#19

I'm assuming this bug can be closed on the Neutron side as well, but I'll wait to verify the nova patches have fixed things a bit before marking as Closed and removing from RC1.

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2014-09-26:

#20

I just saw this here :(

http://logs.openstack.org/29/121729/6/check/check-tempest-dsvm-neutron-full/105091d/logs/screen-q-svc.txt.gz#_2014-09-26_00_11_56_311

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2014-09-26:

#21

Strangely enough the trace does not show up in logstash

Revision history for this message

Mohammad Banikazemi (mb-s) wrote on 2014-09-26:

#22

I think this is relevant too: http://logs.openstack.org/99/113999/26/check/check-tempest-dsvm-neutron-full-2/4cda5b0/logs/screen-q-svc.txt.gz#_2014-09-25_15_54_17_897

I did do a recheck bug 1357055 after noticing the failure.

Salvatore Orlando (salvatore-orlando) on 2014-09-26

Changed in neutron:
assignee:	nobody → Mohammad Banikazemi (mb-s)

Revision history for this message

Mohammad Banikazemi (mb-s) wrote on 2014-09-26:

#23

Have been looking at Neutron service logs posted above for comments #20 and #22. In both cases, the port that exists when subnet is being deleted (and therefore the subnet deletion fails) never gets deleted. That is Neutron never receives any request to delete the port. The port in question gets created, gets bound correctly, shows up in get ports, but never gets deleted. This is similar to the observation posted in comment #6 above.

Revision history for this message

Kyle Mestery (mestery) wrote on 2014-09-28:

#24

Removing Mohammad from this. He has reached out to me and indicated he will not have time to work on this in the coming week or so.

Changed in neutron:
assignee:	Mohammad Banikazemi (mb-s) → nobody

Kyle Mestery (mestery) on 2014-09-30

Changed in neutron:
milestone:	juno-rc1 → kilo-1

Revision history for this message

Alex Xu (xuhj) wrote on 2014-10-01:

#25

yea, one more case.

I didn't the root cause. But put my analyze at here.

The analyze base on this:
http://logs.openstack.org/71/125171/2/gate/gate-tempest-dsvm-neutron-full/62dd074/console.html

The subnet is deleted at: http://logs.openstack.org/71/125171/2/gate/gate-tempest-dsvm-neutron-full/62dd074/logs/tempest.txt.gz#_2014-09-30_22_04_05_218

ServerRescueNegativeTestXML create two servers for tests.
server id: b2c534bf-3655-4f16-b5a0-ed04fc8465ef
rescue server id: d281aee9-50ba-452b-a2f0-b495c8563d5e

The problem is there isn't any log to show those two servers were deleted. So the ports attach to those servers won't be delete.
But I don't know way those two servers won't be deleted.

Anyone has idea?

Revision history for this message

Alex Xu (xuhj) wrote on 2014-10-01:

#26

I added tempest as affecting project. Because I think there should be log to show tempest try to clean up the servers that created by test.

I can't find those two servers were deleted:
server id: b2c534bf-3655-4f16-b5a0-ed04fc8465ef
rescue server id: d281aee9-50ba-452b-a2f0-b495c8563d5e

http://logs.openstack.org/71/125171/2/gate/gate-tempest-dsvm-neutron-full/62dd074/logs/tempest.txt.gz

Thierry Carrez (ttx) on 2014-10-01

Changed in nova:
status:	Fix Committed → Fix Released

Salvatore Orlando (salvatore-orlando) on 2014-10-03

Changed in neutron:
assignee:	nobody → Salvatore Orlando (salvatore-orlando)

Salvatore Orlando (salvatore-orlando) on 2014-10-07

tags:

added: juno-rc-potential

Salvatore Orlando (salvatore-orlando) on 2014-10-08

Changed in tempest:
assignee:	nobody → Salvatore Orlando (salvatore-orlando)

Revision history for this message

Salvatore Orlando (salvatore-orlando) wrote on 2014-10-08:

#27

Download full text (5.0 KiB)

I came to the same conclusions as Alex: the servers are not deleted hence the error.

However, the logging which Alex is claiming for is already there.
Indeed here are the delete operations on teardown for a failing test:

salvatore@trustillo:~$ cat tempest.txt.gz | grep -i ServerRescueNegativeTestJSON.*tearDownClass.*DELETE
2014-10-07 17:49:04.444 25908 INFO tempest.common.rest_client [req-75c758b3-d8cb-48d6-9cb6-3670147aca41 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 202 DELETE http://127.0.0.1:8774/v2/829473406bb545c895a5cd0320624812/os-volumes/ffc44446-ba25-413d-8ff1-839d3643299d 0.135s
2014-10-07 17:49:04.444 25908 DEBUG tempest.common.rest_client [req-75c758b3-d8cb-48d6-9cb6-3670147aca41 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 202 DELETE http://127.0.0.1:8774/v2/829473406bb545c895a5cd0320624812/os-volumes/ffc44446-ba25-413d-8ff1-839d3643299d 0.135s
2014-10-07 17:52:21.452 25908 INFO tempest.common.rest_client [req-d0fa5615-9e64-4faa-bd8d-2ad1ac6afb53 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:9696/v2.0/routers/0ecd1539-70af-4500-aa4d-9e131fa1fffc 0.238s
2014-10-07 17:52:21.452 25908 DEBUG tempest.common.rest_client [req-d0fa5615-9e64-4faa-bd8d-2ad1ac6afb53 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:9696/v2.0/routers/0ecd1539-70af-4500-aa4d-9e131fa1fffc 0.238s
2014-10-07 17:52:21.513 25908 INFO tempest.common.rest_client [req-89562c45-7448-41bb-8e3e-0beec8460aab None] Request (ServerRescueNegativeTestJSON:tearDownClass): 409 DELETE http://127.0.0.1:9696/v2.0/subnets/9614d778-66b3-4b81-83fc-f7a47602ceb2 0.060s
2014-10-07 17:52:21.514 25908 DEBUG tempest.common.rest_client [req-89562c45-7448-41bb-8e3e-0beec8460aab None] Request (ServerRescueNegativeTestJSON:tearDownClass): 409 DELETE http://127.0.0.1:9696/v2.0/subnets/9614d778-66b3-4b81-83fc-f7a47602ceb2 0.060s

No DELETE server command is specified.
Instead for a successful test the two servers are deleted.

2014-09-26 11:48:05.532 7755 INFO tempest.common.rest_client [req-6d9072aa-dbcb-4398-b4c4-46aeb2140e4b None] Request (ServerRescueNegativeTestJSON:tearDownClass): 202 DELETE http://127.0.0.1:8774/v2/0cdef32ce1b746fa957a013a3638b3ec/os-volumes/a24f4dc2-bfb5-4a60-be05-9051f08cc447 0.086s
2014-09-26 11:50:06.733 7755 INFO tempest.common.rest_client [req-f3752c9f-8de5-4bde-98c9-879c5a37ff44 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:8774/v2/0cdef32ce1b746fa957a013a3638b3ec/servers/1d754b6e-128f-4b42-88ab-9dbefedd887f 0.155s
2014-09-26 11:50:06.882 7755 INFO tempest.common.rest_client [req-dcb05efc-229a-4f40-ac67-2e95a80373c0 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:8774/v2/0cdef32ce1b746fa957a013a3638b3ec/servers/d26d1144-58d2-4900-93af-bf7fecdd7a60 0.148s
2014-09-26 11:50:09.531 7755 INFO tempest.common.rest_client [req-73246abe-5e0e-4d16-88dc-75ca05593b2c None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:9696/v2.0/routers/755eeb0d-6eb8-4acd-b36d-91bd4787cf4e 0.180s
2014-09-26 11:50:09.583 7755 INFO tempest.common.rest_client [req-549...

I came to the same conclusions as Alex: the servers are not deleted hence the error.

However, the logging which Alex is claiming for is already there.
Indeed here are the delete operations on teardown for a failing test:

salvatore@trustillo:~$ cat tempest.txt.gz | grep -i ServerRescueNegativeTestJSON.*tearDownClass.*DELETE
2014-10-07 17:49:04.444 25908 INFO tempest.common.rest_client [req-75c758b3-d8cb-48d6-9cb6-3670147aca41 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 202 DELETE http://127.0.0.1:8774/v2/829473406bb545c895a5cd0320624812/os-volumes/ffc44446-ba25-413d-8ff1-839d3643299d 0.135s
2014-10-07 17:49:04.444 25908 DEBUG tempest.common.rest_client [req-75c758b3-d8cb-48d6-9cb6-3670147aca41 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 202 DELETE http://127.0.0.1:8774/v2/829473406bb545c895a5cd0320624812/os-volumes/ffc44446-ba25-413d-8ff1-839d3643299d 0.135s
2014-10-07 17:52:21.452 25908 INFO tempest.common.rest_client [req-d0fa5615-9e64-4faa-bd8d-2ad1ac6afb53 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:9696/v2.0/routers/0ecd1539-70af-4500-aa4d-9e131fa1fffc 0.238s
2014-10-07 17:52:21.452 25908 DEBUG tempest.common.rest_client [req-d0fa5615-9e64-4faa-bd8d-2ad1ac6afb53 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:9696/v2.0/routers/0ecd1539-70af-4500-aa4d-9e131fa1fffc 0.238s
2014-10-07 17:52:21.513 25908 INFO tempest.common.rest_client [req-89562c45-7448-41bb-8e3e-0beec8460aab None] Request (ServerRescueNegativeTestJSON:tearDownClass): 409 DELETE http://127.0.0.1:9696/v2.0/subnets/9614d778-66b3-4b81-83fc-f7a47602ceb2 0.060s
2014-10-07 17:52:21.514 25908 DEBUG tempest.common.rest_client [req-89562c45-7448-41bb-8e3e-0beec8460aab None] Request (ServerRescueNegativeTestJSON:tearDownClass): 409 DELETE http://127.0.0.1:9696/v2.0/subnets/9614d778-66b3-4b81-83fc-f7a47602ceb2 0.060s

No DELETE server command is specified.
Instead for a successful test the two servers are deleted.

2014-09-26 11:48:05.532 7755 INFO tempest.common.rest_client [req-6d9072aa-dbcb-4398-b4c4-46aeb2140e4b None] Request (ServerRescueNegativeTestJSON:tearDownClass): 202 DELETE http://127.0.0.1:8774/v2/0cdef32ce1b746fa957a013a3638b3ec/os-volumes/a24f4dc2-bfb5-4a60-be05-9051f08cc447 0.086s
2014-09-26 11:50:06.733 7755 INFO tempest.common.rest_client [req-f3752c9f-8de5-4bde-98c9-879c5a37ff44 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:8774/v2/0cdef32ce1b746fa957a013a3638b3ec/servers/1d754b6e-128f-4b42-88ab-9dbefedd887f 0.155s
2014-09-26 11:50:06.882 7755 INFO tempest.common.rest_client [req-dcb05efc-229a-4f40-ac67-2e95a80373c0 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:8774/v2/0cdef32ce1b746fa957a013a3638b3ec/servers/d26d1144-58d2-4900-93af-bf7fecdd7a60 0.148s
2014-09-26 11:50:09.531 7755 INFO tempest.common.rest_client [req-73246abe-5e0e-4d16-88dc-75ca05593b2c None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:9696/v2.0/routers/755eeb0d-6eb8-4acd-b36d-91bd4787cf4e 0.180s
2014-09-26 11:50:09.583 7755 INFO tempest.common.rest_client [req-549a5b98-dbb1-43ae-b4ce-f8182ebc10e2 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:9696/v2.0/subnets/999c8033-7684-4fc0-a09e-1ccb70196278 0.051s
2014-09-26 11:50:09.662 7755 INFO tempest.common.rest_client [req-c74637c0-f985-42c4-b5b0-f980bc431858 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:9696/v2.0/networks/ef773f01-7b53-4207-bc75-120563f36d7f 0.078s
2014-09-26 11:50:09.809 7755 INFO tempest.common.rest_client [-] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:35357/v2.0/users/fe26c8cb709644e1862d1b69c63b802b 0.146s
2014-09-26 11:50:09.877 7755 INFO tempest.common.rest_client [req-ea281fc3-70e2-487d-977b-7aa65db86722 None] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:9696/v2.0/security-groups/10b16895-5b18-48c5-88e4-9c384eb9d1b8 0.043s
2014-09-26 11:50:10.016 7755 INFO tempest.common.rest_client [-] Request (ServerRescueNegativeTestJSON:tearDownClass): 204 DELETE http://127.0.0.1:35357/v2.0/tenants/0cdef32ce1b746fa957a013a3638b3ec 0.137s

This happens consistently.
Also note that in the case of the failing tests the same events are logged both at DEBUG and INFO level. This might indicate that some concurrency problem among test runners is probably installing an additional log handler, but I have no idea whether this is even possible.
What is probably happening is that the servers class variable gets resetted, and therefore the servers are not removed on resource_cleanup.

However, this still has to be proved. Further logging might be added to this aim, which might be helpful to validate this hypothesis (I could not find any clue through static code and log analysis).

Finally, I reckon I have enough elements to declare this is NOT a neutron bug, and I'm removing neutron from affected projects. I'd do the same for nova, but I don't have the rights to do that.

no longer affects:

neutron

Salvatore Orlando (salvatore-orlando) on 2014-10-08

tags:

removed: juno-rc-potential

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-10-08: Related fix proposed to tempest (master)

#28

Related fix proposed to branch: master
Review: https://review.openstack.org/126908

Revision history for this message

Alex Xu (xuhj) wrote on 2014-10-15:

#29

As I and Salvatore get same conclusions. This isn't bug neutron and nova. It should be tempest bug. I removed nova from affected projects.

no longer affects:

nova

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-10-17: Related fix merged to tempest (master)

#30

Reviewed: https://review.openstack.org/126908
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=1422a9ac6c40df71978412cddce90870d578989c
Submitter: Jenkins
Branch: master

commit 1422a9ac6c40df71978412cddce90870d578989c
Author: Salvatore <email address hidden>
Date: Wed Oct 8 15:53:25 2014 +0200

Log identifiers of resources being cleaned up

    On teardown compute API tests clean up resources such as servers
    and images. This patch simply logs the identifiers of these
    resources in order to help debugging failures which might originate
    from problems during cleanup.

Change-Id: Ied2924bba78ec8f4a3b51df1be69273b28362475
Related-Bug: #1357055

Eugene Nikanorov (enikanorov) on 2014-10-30

Changed in tempest:
status:	New → Confirmed

Revision history for this message

Silvan Kaiser (2-silvan) wrote on 2014-12-12:

#31

This issue still affects ppl, looking at elastic recheck at 2014-12-12 13:14 UTC:
Bug 1357055 - Race to delete shared subnet in Tempest neutron full jobs
23 fails in 24 hrs / 79 fails in 10 days
Any new ideas on this one?

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-01-29: Related fix proposed to neutron (master)

#32

Related fix proposed to branch: master
Review: https://review.openstack.org/151415

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-03-03: Related fix merged to neutron (master)

#33

Reviewed: https://review.openstack.org/151415
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c17faaad2b1030b16c1464b4ca3325a19631833d
Submitter: Jenkins
Branch: master

commit c17faaad2b1030b16c1464b4ca3325a19631833d
Author: Eugene Nikanorov <email address hidden>
Date: Fri Jan 30 00:58:42 2015 +0300

Log tenant ports if subnet could not be deleted

Change-Id: Icd64cda9de2bb1c3ab090366a7a20cec894dca14
Related-Bug: #1357055

Revision history for this message

Matt Riedemann (mriedem) wrote on 2015-03-16:

#34

Since https://review.openstack.org/#/c/151415 merged to add logging on failure cases, we have these hits in logstash:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRm91bmQgSVAgYWxsb2NhdGlvblwiIEFORCBtZXNzYWdlOlwib24gc3VibmV0XCIgQU5EIG1lc3NhZ2U6XCJjYW5ub3QgZGVsZXRlXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6ImN1c3RvbSIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJmcm9tIjoiMjAxNS0wMy0wNlQxNDozNzoxNSswMDowMCIsInRvIjoiMjAxNS0wMy0xNlQxNDozNzoxNSswMDowMCIsInVzZXJfaW50ZXJ2YWwiOiIwIn0sInN0YW1wIjoxNDI2NTE2Njg1NjA2fQ==

190 hits in 10 days, check and gate queues, however it's not all failed jobs. When this shows up, only about 26% of the jobs are failures.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-04-08: Fix proposed to neutron (master)

#35

Fix proposed to branch: master
Review: https://review.openstack.org/171662

Changed in neutron:
assignee:	nobody → Eugene Nikanorov (enikanorov)
status:	New → In Progress

Eugene Nikanorov (enikanorov) on 2015-04-08

Changed in neutron:
importance:	Undecided → Wishlist

Revision history for this message

Eugene Nikanorov (enikanorov) wrote on 2015-04-08:

#36

The issue appears to be a simple race between DHCP port creation and subnet deletion.
DHCP port gets after 'autodelete' ports was cleared, but before other allocations are checked.
It doesn't happen too often, but is annoying.

Changed in neutron:
importance:	Wishlist → High

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-04-08:

#37

Fix proposed to branch: master
Review: https://review.openstack.org/171848

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-04-10: Related fix merged to neutron (master)

#38

Reviewed: https://review.openstack.org/171662
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d5aa1831ac95c16fcee6ec0bb8f0bf07afbe384c
Submitter: Jenkins
Branch: master

commit d5aa1831ac95c16fcee6ec0bb8f0bf07afbe384c
Author: Eugene Nikanorov <email address hidden>
Date: Wed Apr 8 02:23:22 2015 +0400

Add logging to dangling port to ml2 delete_subnet

    This is useful when trouble shooting test failures.
    Also, in db_base_plugin_v2, log only port_id instead of
    full allocation object.

Change-Id: I3d77318aee70836de125687a7f6c0f495d545f21
Related-Bug: #1357055

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-04-13: Fix merged to neutron (master)

#39

Reviewed: https://review.openstack.org/171848
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1642bca4d9c4fee15129f74d93300c1eab1afd29
Submitter: Jenkins
Branch: master

commit 1642bca4d9c4fee15129f74d93300c1eab1afd29
Author: Eugene Nikanorov <email address hidden>
Date: Thu Apr 9 01:16:18 2015 +0300

Handle race condition on subnet-delete

    This fix targets quite rare case of race condition between
    port creation and subnet deletion. This usually happens
    during API tests that do things quickly.
    DHCP port is being created after delete_subnet checks for
    DHCP ports, but before it checks for IPAllocations on subnet.
    The solution is to apply retrying logic, which is really necessary
    as we can't fetch new IPAllocations with the same query and within
    the active transaction in mysql because of REPEATABLE READ
    transaction isolation.

Change-Id: Ib9da018e654cdee3b64aa38de90f171c92ee28ee
Closes-Bug: 1357055

Changed in neutron:
status:	In Progress → Fix Committed

Revision history for this message

Eugene Nikanorov (enikanorov) wrote on 2015-04-18:

#40

The fix was not correct in the essential query, so it produces exactly the same behavior with postgresql

Changed in neutron:
status:	Fix Committed → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-04-18: Fix proposed to neutron (master)

#41

Fix proposed to branch: master
Review: https://review.openstack.org/175125

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-04-24: Fix merged to neutron (master)

#42

Reviewed: https://review.openstack.org/175125
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0109578a8ec07f743f7e2b654007e17f145ea20f
Submitter: Jenkins
Branch: master

commit 0109578a8ec07f743f7e2b654007e17f145ea20f
Author: Eugene Nikanorov <email address hidden>
Date: Sat Apr 18 15:31:44 2015 +0400

Fix incorrect query for user ip allocations

    Previously the query was fetching an IPAllocation object incorrectly
    relying on the fact that it has port attribute that should be
    join-loaded when it really is not.

    Incorrect query produced by previous code:
    SELECT ipallocations.port_id AS ipallocations_port_id,
           ipallocations.ip_address AS ipallocations_ip_address,
           ipallocations.subnet_id AS ipallocations_subnet_id,
           ipallocations.network_id AS ipallocations_network_id
    FROM ipallocations, ports
    WHERE ipallocations.subnet_id = :subnet_id_1
          AND ports.device_owner NOT IN (:device_owner_1)

The query then may have produced results that don't satisfy
the condition intended by the code.

    Query produced by the fixed code:
    SELECT ipallocations.port_id AS ipallocations_port_id,
           ipallocations.ip_address AS ipallocations_ip_address,
           ipallocations.subnet_id AS ipallocations_subnet_id,
           ipallocations.network_id AS ipallocations_network_id
    FROM ipallocations JOIN ports ON ports.id = ipallocations.port_id
    WHERE ipallocations.subnet_id = :subnet_id_1
          AND ports.device_owner NOT IN (:device_owner_1)

Change-Id: I34682df784e30e3ce49ee48c690f8b799ad58149
Closes-Bug: #1357055

Changed in neutron:
status:	In Progress → Fix Committed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-05-22: Fix proposed to neutron (neutron-pecan)

#43

Fix proposed to branch: neutron-pecan
Review: https://review.openstack.org/185072

Thierry Carrez (ttx) on 2015-06-24

Changed in neutron:
milestone:	none → liberty-1
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2015-10-15

Changed in neutron:
milestone:	liberty-1 → 7.0.0

Revision history for this message

Ken'ichi Ohmichi (oomichi) wrote on 2016-09-01:

#44

Alex has added Tempest to this bug report for adding more log for the investigation and that is done on https://review.openstack.org/#/c/126908/

So this bug report should be close now.

Changed in tempest:
status:	Confirmed → Fix Released

tempest

Race to delete shared subnet in Tempest neutron full jobs

Bug Description

Duplicates of this bug

Other bug subscribers

Remote bug watches

Affects		Status	Importance	Assigned to	Milestone
	neutron	Fix Released	High	Eugene Nikanorov	neutron 7.0.0 "liberty"
	tempest	Fix Released	Undecided	Salvatore Orlando