Lock wait timeout on delete port for DVR

Bug #1377241 reported by Swaminathan Vasudevan
This bug affects 3 people
Affects          Status        Importance  Assigned to    Milestone
neutron          Fix Released  High        Kevin Benton
neutron (Juno)   Fix Released  Undecided   Unassigned

Bug Description

We run a script to configure networks, VMs, and routers and to assign a floating IP to the VM.
After everything is created, we run a second script to clean up all ports, networks, routers, the gateway, and the floating IP.

The issue is seen when there are back-to-back calls to router-interface-delete and router-gateway-clear.

There are three calls to router-interface-delete and a fourth call to router-gateway-clear.

At this point a DB lock is held for the port delete, and when the other delete comes in, it times out.
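(The traceback below ends in a SELECT ... FOR UPDATE issued by get_locked_port_and_binding. As a minimal, hedged sketch of that row-lock pattern -- the model, DSN, and function name here are illustrative placeholders, not Neutron's actual code:

    from sqlalchemy import Column, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class Port(Base):
        __tablename__ = 'ports'
        id = Column(String(36), primary_key=True)

    engine = create_engine('mysql+pymysql://user:pass@localhost/neutron')  # placeholder DSN
    Session = sessionmaker(bind=engine)

    def get_locked_port(session, port_id):
        # Emits: SELECT ... FROM ports WHERE ports.id = %s FOR UPDATE.
        # The row lock is held until the enclosing transaction ends; if
        # another session already holds it and does not commit in time,
        # MySQL aborts the waiter with error 1205.
        return (session.query(Port)
                .filter_by(id=port_id)
                .with_for_update()  # newer spelling of with_lockmode('update')
                .one())

    session = Session()
    with session.begin():  # the lock is released only when this block ends
        port = get_locked_port(session, 'bec69266-227d-4482-a346-ef47dd3a7a78')

A second transaction issuing the same query blocks in InnoDB while the first holds the lock, and after innodb_lock_wait_timeout seconds (50 by default) MySQL aborts it with error 1205 -- exactly the OperationalError in the trace.)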

2014-10-03 09:28:39.587 DEBUG neutron.openstack.common.lockutils [req-a89ee05c-d8b2-438a-a707-699f450d3c41 admin d3bb4e1791814b809672385bc8252688] Got semaphore "db-access" from (pid=25888) lock /opt/stack/neutron/neutron/openstack/common/lockutils.py:168
2014-10-03 09:29:30.777 INFO neutron.wsgi [-] (25888) accepted ('192.168.15.144', 54899)
2014-10-03 09:29:30.778 INFO neutron.wsgi [-] (25888) accepted ('192.168.15.144', 54900)
2014-10-03 09:29:30.778 INFO neutron.wsgi [-] (25888) accepted ('192.168.15.144', 54901)
2014-10-03 09:29:30.778 INFO neutron.wsgi [-] (25888) accepted ('192.168.15.144', 54902)
2014-10-03 09:29:30.780 ERROR neutron.api.v2.resource [req-a89ee05c-d8b2-438a-a707-699f450d3c41 admin d3bb4e1791814b809672385bc8252688] remove_router_interface failed
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource Traceback (most recent call last):
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/opt/stack/neutron/neutron/api/v2/resource.py", line 87, in resource
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource result = method(request=request, **args)
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/opt/stack/neutron/neutron/api/v2/base.py", line 200, in _handle_action
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource return getattr(self._plugin, name)(*arg_list, **kwargs)
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/opt/stack/neutron/neutron/db/l3_dvr_db.py", line 247, in remove_router_interface
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource context.elevated(), router, subnet_id=subnet_id)
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/opt/stack/neutron/neutron/db/l3_dvr_db.py", line 557, in delete_csnat_router_interface_ports
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource l3_port_check=False)
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/opt/stack/neutron/neutron/plugins/ml2/plugin.py", line 983, in delete_port
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource port_db, binding = db.get_locked_port_and_binding(session, id)
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/opt/stack/neutron/neutron/plugins/ml2/db.py", line 135, in get_locked_port_and_binding
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource with_lockmode('update').
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2310, in one
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource ret = list(self)
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2353, in __iter__
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource return self._execute_and_instances(context)
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2368, in _execute_and_instances
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource result = conn.execute(querycontext.statement, self._params)
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 662, in execute
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource params)
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 761, in _execute_clauseelement
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource compiled_sql, distilled_params
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 874, in _execute_context
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource context)
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/usr/local/lib/python2.7/dist-packages/oslo/db/sqlalchemy/compat/handle_error.py", line 125, in _handle_dbapi_exception
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource six.reraise(type(newraise), newraise, sys.exc_info()[2])
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/usr/local/lib/python2.7/dist-packages/oslo/db/sqlalchemy/compat/handle_error.py", line 102, in _handle_dbapi_exception
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource per_fn = fn(ctx)
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/usr/local/lib/python2.7/dist-packages/oslo/db/sqlalchemy/exc_filters.py", line 323, in handler
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource context.is_disconnect)
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource File "/usr/local/lib/python2.7/dist-packages/oslo/db/sqlalchemy/exc_filters.py", line 254, in _raise_operational_errors_directly_filter
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource raise operational_error
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'SELECT ports.tenant_id AS ports_tenant_id, ports.id AS ports_id, ports.name AS ports_name, ports.network_id AS ports_network_id, ports.mac_address AS ports_mac_address, ports.admin_state_up AS ports_admin_state_up, ports.status AS ports_status, ports.device_id AS ports_device_id, ports.device_owner AS ports_device_owner \nFROM ports \nWHERE ports.id = %s FOR UPDATE' ('bec69266-227d-4482-a346-ef47dd3a7a78',)
2014-10-03 09:29:30.780 TRACE neutron.api.v2.resource

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Carl has something:

https://review.openstack.org/#/c/122880/

Can you verify that this might be related/fixes what you're seeing?

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Carl Baldwin (carl-baldwin)
Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Will check the patch and run the script.

summary: - OperationalError: Lock wait timeout exceeded seen occasionally when we
- run a script to clean all networks, ports and routers. This is seen with
- DVR
+ Lock wait timeout on delete port for DVR
Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

I tested the above patch with the original script that caused the "lock wait".
The problem still seems to be there.

2014-10-06 09:51:25.031 DEBUG neutron.openstack.common.lockutils [req-415c01b8-c3cb-4834-b9a4-04242dc9f618 admin bfc91f91c80e4f64908a035506df281e] Got semaphore "db-access" from (pid=17796) lock /opt/stack/neutron/neutron/openstack/common/lockutils.py:168
2014-10-06 09:51:25.092 DEBUG neutron.plugins.ml2.plugin [req-415c01b8-c3cb-4834-b9a4-04242dc9f618 admin bfc91f91c80e4f64908a035506df281e] Calling delete_port for 7bbcedcc-77ec-496f-9a54-d0ee1c2ec94e owned by network:router_interface_distributed from (pid=17796) delete_port /opt/stack/neutron/neutron/plugins/ml2/plugin.py:1045
2014-10-06 09:51:25.128 DEBUG neutron.db.l3_dvr_db [req-6a74e01a-0036-4ebd-9ae7-4e1d7c6aeb05 admin bfc91f91c80e4f64908a035506df281e] Subnet matches: ebfc8c6d-a4ed-4307-a80a-f4ec138255fd from (pid=17796) delete_csnat_router_interface_ports /opt/stack/neutron/neutron/db/l3_dvr_db.py:544
2014-10-06 09:51:25.128 DEBUG neutron.plugins.ml2.plugin [req-6a74e01a-0036-4ebd-9ae7-4e1d7c6aeb05 admin bfc91f91c80e4f64908a035506df281e] Deleting port e1ce4490-025a-4bee-9983-66bee9a5e7f4 from (pid=17796) delete_port /opt/stack/neutron/neutron/plugins/ml2/plugin.py:999
2014-10-06 09:51:25.129 DEBUG neutron.plugins.ml2.plugin [req-415c01b8-c3cb-4834-b9a4-04242dc9f618 admin bfc91f91c80e4f64908a035506df281e] update_port_arp for port 7bbcedcc-77ec-496f-9a54-d0ee1c2ec94e, action del from (pid=17796) update_port_arp /opt/stack/neutron/neutron/plugins/ml2/plugin.py:825
2014-10-06 09:51:25.129 DEBUG neutron.plugins.ml2.drivers.l2pop.rpc [req-415c01b8-c3cb-4834-b9a4-04242dc9f618 admin bfc91f91c80e4f64908a035506df281e] Fanout notify l2population agents at q-agent-notifier the message remove_fdb_entries with {u'c6b32d25-8ae2-49f9-bb3b-85a7441a13ee': {'segment_id': 1002L, 'ports': {u'192.168.15.144': []}, 'network_type': u'vxlan'}} from (pid=17796) _notification_fanout /opt/stack/neutron/neutron/plugins/ml2/drivers/l2pop/rpc.py:40
2014-10-06 09:51:25.129 DEBUG neutron.common.rpc [req-415c01b8-c3cb-4834-b9a4-04242dc9f618 admin bfc91f91c80e4f64908a035506df281e] neutron.plugins.ml2.drivers.l2pop.rpc.L2populationAgentNotifyAPI method fanout_cast called with arguments (<neutron.context.ContextBase object at 0x7f66c87a00d0>, {'args': {'fdb_entries': {u'c6b32d25-8ae2-49f9-bb3b-85a7441a13ee': {'ports': {u'192.168.15.144': []}, 'network_type': u'vxlan', 'segment_id': 1002L}}}, 'namespace': None, 'method': 'remove_fdb_entries'}) {'topic': 'q-agent-notifier-l2population-update'} from (pid=17796) wrapper /opt/stack/neutron/neutron/common/log.py:33
2014-10-06 09:51:25.131 DEBUG neutron.openstack.common.lockutils [req-f120505e-2a32-4ce5-993f-67fdaac75a29 admin bfc91f91c80e4f64908a035506df281e] Got semaphore "db-access" from (pid=17796) lock /opt/stack/neutron/neutron/openstack/common/lockutils.py:168
2014-10-06 09:51:25.210 DEBUG neutron.plugins.ml2.plugin [req-f120505e-2a32-4ce5-993f-67fdaac75a29 admin bfc91f91c80e4f64908a035506df281e] Calling delete_port for e1ce4490-025a-4bee-9983-66bee9a5e7f4 owned by network:router_centralized_snat from (pid=17796) delete_port /opt/stack/neutron/neutron/plugi...

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

When multiple back-to-back delete_port calls occur, "get_locked_port_and_binding" does not return in time and raises the "OperationalError" on the first query for the port with "with_lockmode('update')".

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

When router_interface_delete is issued, it is supposed to call "delete_csnat_router_interface" after it deletes the router interface ports, but that call is delayed.
The next call, gateway_clear, arrives first, and during the gateway_clear operation it also calls "delete_csnat_router_interface" and tries to delete the csnat ports.

At this point the original router_interface_delete finally calls "delete_csnat_router_interface", and we get this lock timeout.

There should be a way to let the first router_interface_delete operation complete its sequence before gateway_clear runs.

Changed in neutron:
assignee: Carl Baldwin (carl-baldwin) → Swaminathan Vasudevan (swaminathan-vasudevan)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Swaminathan Vasudevan (<email address hidden>) on branch: master
Review: https://review.openstack.org/124849

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/127129

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Taking a semaphore before calling "delete_port" to remove the csnat port fixes the DB lock wait timeout issue.

Right now the "lock" in "ml2/plugin.py - delete_port" is not sufficient to prevent this DB lock wait timeout issue.
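(As an illustration of the serialization being proposed, a hedged sketch using the same vendored lockutils module visible in the debug logs above ("Got semaphore "db-access""); the lock name and decorator placement here are assumptions, not the merged fix:

    from neutron.openstack.common import lockutils  # Juno-era vendored oslo path

    class L3DvrSketch(object):

        @lockutils.synchronized('csnat-port-delete', 'neutron-')
        def delete_csnat_router_interface_ports(self, context, router,
                                                subnet_id=None):
            # With both remove_router_interface and the gateway-clear path
            # funneled through one semaphore, the second caller waits in
            # Python instead of blocking on the MySQL row lock until the
            # 1205 timeout fires.
            pass
)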

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Some additional information from the MySQL InnoDB engine.
The attached file provides the "innodb status" output captured when the DB lock wait happens.

If there are any DB experts here, can you please take a look at it and explain what is happening?
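(For anyone reproducing this, a hedged snippet for pulling the same lock-wait information straight from MySQL; the connection parameters are placeholders, and the information_schema table shown exists on MySQL 5.x:

    import pymysql  # any MySQL client works; pymysql is just an example

    conn = pymysql.connect(host='localhost', user='root',
                           password='secret', database='neutron')  # placeholders
    with conn.cursor() as cur:
        # Full engine status; the TRANSACTIONS section shows which
        # transaction holds the ports row lock and which one is waiting.
        cur.execute('SHOW ENGINE INNODB STATUS')
        print(cur.fetchone()[2])
        # On MySQL 5.x, information_schema also exposes the waits directly.
        cur.execute('SELECT * FROM information_schema.INNODB_LOCK_WAITS')
        for row in cur.fetchall():
            print(row)
)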

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/127943

Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Kevin Benton (kevinbenton)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Wow, I am actually spoilt for choice :)

- https://review.openstack.org/#/c/127943
- https://review.openstack.org/#/c/122880
- https://review.openstack.org/#/c/127129

Kevin's change seems promising...

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

This patch seems to fix the "lockwait timeout" issue that we have been seeing with our automated scripts.

The other two patches out there for the same problem can be abandoned.

https://review.openstack.org/#/c/122880
There is nothing wrong with the above patch, but it does not fix the lockwait problem; Kevin's patch does.

https://review.openstack.org/#/c/127129

All my testing was with a single-node devstack and the automated scripts.
The only failures that I still see are the "KeyError in dhcp_rpc.py for the network_id" and the "DBDuplicate Error".

https://bugs.launchpad.net/bugs/1378508
https://bugs.launchpad.net/neutron/+bug/1378468

The DBDuplicateError will be handled by the patch https://review.openstack.org/#/c/126793/

If we claim that the DBDuplicateError and the KeyError in dhcp_rpc are the cause of this lockwait, then we still have not found the root cause.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/127129
Reason: Abandoned as per Swami's comment on:

https://review.openstack.org/#/c/127943

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Armando, it seems that this patch may not address the lockwait timeout issue completely. I still see this problem sometimes.
So, as I have been saying, the main issue is that gateway_clear and router_interface_delete both try to call "delete_csnat_router_interface" and delete the ports.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

The patch I proposed is still necessary for some lock waits, though, because the code it changed sent RPC messages which would cause a yield during a transaction.

Do you have some debug logs that show exactly what happened leading up to the lock wait stacktrace?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/127943
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4c2b42e21744be56cbf32aeac6f4b4f1c87de24e
Submitter: Jenkins
Branch: master

commit 4c2b42e21744be56cbf32aeac6f4b4f1c87de24e
Author: Kevin Benton <email address hidden>
Date: Sat Oct 11 03:42:47 2014 -0700

    Call DVR VMARP notify outside of transaction

    The dvr vmarp table update notification was being called inside
    of the delete_port transaction in ML2, which can cause a yield
    and lead to the glorious mysql/eventlet deadlock.

    This patch moves it outside the transaction and adjusts it to
    use an existing port dictionary rather than re-looking it up since
    the port is now gone from the DB by the time it is called.

    Closes-Bug: #1377241
    Change-Id: I0b4dac61e49b2a926353f8478e421cd1a70be038
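
(A hedged before/after sketch of the shape of this fix; every helper name below -- _get_and_lock_port, _make_port_dict, _notify_dvr_arp_update -- is hypothetical, invented for illustration. See the review above for the real change:

    class Ml2PluginSketch(object):

        def delete_port_before(self, context, port_id):
            # Problem shape: the RPC notification fired while the transaction
            # (and its SELECT ... FOR UPDATE row lock) was still open. The
            # cast yields the eventlet greenthread, so another thread can
            # start waiting on the same row and hit the 1205 timeout.
            with context.session.begin(subtransactions=True):
                port = self._get_and_lock_port(context, port_id)
                self._notify_dvr_arp_update(context, port)  # yields inside txn
                context.session.delete(port)

        def delete_port_after(self, context, port_id):
            # Fix shape: snapshot the port data, commit, then notify.
            with context.session.begin(subtransactions=True):
                port = self._get_and_lock_port(context, port_id)
                port_dict = self._make_port_dict(port)  # row is gone after commit
                context.session.delete(port)
            # Transaction committed, row locks released: safe to yield now.
            self._notify_dvr_arp_update(context, port_dict)
)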

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

The csnat port shown below is deleted first by the "router_gateway_clear" action, and then router_interface_delete comes in and tries to delete the same port.

"705a8dcc-2448-4d1d-ba61-9d050f9b0aa0"

I have included the trace in the attached file. "DBLockwaitEntry.Failure.txt"

Also see below the console output ("Internal Server Error") when this failure occurs.

stack@ubuntu:~/devstack$ ./dvs-fastClean
Found 4 subnets
Found 3 networks
Found 1 routers
Found 1 floating IPs
Found 12 ports
Found 2 security groups
Found 9 security rules
Found 3 nova VMs
Disassociated floating IP 714c30ac-41b5-4c74-9b7d-bea0672dfff2
neutron floatingip-delete 714c30ac-41b5-4c74-9b7d-bea0672dfff2
Deleted floatingip: 714c30ac-41b5-4c74-9b7d-bea0672dfff2
nova delete c0e64d5a-ec87-4305-837e-05a11aee87f4
nova delete 20c88121-87e7-4993-8d84-6f2511ab8b48
nova delete fbda9c1f-e3e5-4ee1-b12f-dc336615dd4d
Request to delete server fbda9c1f-e3e5-4ee1-b12f-dc336615dd4d has been accepted.
Request to delete server 20c88121-87e7-4993-8d84-6f2511ab8b48 has been accepted.
Request to delete server c0e64d5a-ec87-4305-837e-05a11aee87f4 has been accepted.
neutron router-port-list dvs-fip-test.rtr
delete inter sub a5f5b012-7756-436d-aa20-e638c4c25961
neutron router-interface-delete de75a152-4a73-4382-8642-c27d92c4e521 a5f5b012-7756-436d-aa20-e638c4c25961
delete inter sub 1312c18e-4fbd-4163-9137-6f96db9088cb
neutron router-interface-delete de75a152-4a73-4382-8642-c27d92c4e521 1312c18e-4fbd-4163-9137-6f96db9088cb
delete inter sub 72a54512-ff42-479f-85dc-6ea545964c46
neutron router-interface-delete de75a152-4a73-4382-8642-c27d92c4e521 72a54512-ff42-479f-85dc-6ea545964c46
neutron router-gateway-clear de75a152-4a73-4382-8642-c27d92c4e521
+--------------------------------------+------+-------------------+------------------------------------------------------------------------------------+
| id | name | mac_address | fixed_ips |
+--------------------------------------+------+-------------------+------------------------------------------------------------------------------------+
| 351908de-55ce-4841-ab46-862c4f975b1d | | fa:16:3e:f5:f6:d4 | {"subnet_id": "a5f5b012-7756-436d-aa20-e638c4c25961", "ip_address": "151.0.0.1"} |
| 621c0ed9-f7d4-43da-aca3-ba882a4ad740 | | fa:16:3e:76:13:ee | {"subnet_id": "1312c18e-4fbd-4163-9137-6f96db9088cb", "ip_address": "152.2.0.1"} |
| 705a8dcc-2448-4d1d-ba61-9d050f9b0aa0 | | fa:16:3e:29:6b:92 | {"subnet_id": "1312c18e-4fbd-4163-9137-6f96db9088cb", "ip_address": "152.2.0.11"} |
| 99004bf5-1619-468a-8887-13fe87504231 | | fa:16:3e:19:cc:52 | {"subnet_id": "a5f5b012-7756-436d-aa20-e638c4c25961", "ip_address": "151.0.0.2"} |
| a5bf0dd7-d942-45ac-a6e1-292d401b51a9 | | fa:16:3e:c0:0a:47 | {"subnet_id": "8dc49a29-8078-45bd-8906-2d1bef7d2d78", "ip_address": "153.1.1.100"} |
| ad8c8d32-bb23-4d22-abff-5a9b72e7289e | | fa:16:3e:60:94:05 | {"subnet_id": "72a54512-ff42-479f-85dc-6ea545964c46", "ip_address": "152.1.0.1"} |
| aefdc820-4861-42e9-956d-63c7f99762c9 | | fa:16:3e:60:de:34 | {"subnet_id": "72a54512-ff42-479f...


Revision history for this message
Kevin Benton (kevinbenton) wrote :

Thanks, is it possible for me to get a copy of that script and your local.conf file for devstack so I can try to replicate it? I've poked around quite a bit and don't see anything obviously yielding with a lock anymore.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Kevin, yes, here are the scripts that I use for creating and cleaning up.
The issue will be seen when using the cleanup script.
You need to patiently wait until the problem shows up.

./dvs-fip-test ( create script).
./dvs-fastClean-test

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Create script attached.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Kevin, I have posted the test scripts here for your reference.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Thanks, what's the devstack topology needed for this to happen? How many L3 agents do I need?

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

You just need a single node devstack and set Q_DVR_MODE to dvr_snat.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Then just OVS and l2pop plugins, right?

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Yes OVS and l2pop.
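
(Putting this exchange together, a rough devstack local.conf fragment for reproducing the setup; variable names are per Juno-era devstack, and the exact fragment is an assumption, not one taken from the attached scripts:

    [[local|localrc]]
    Q_DVR_MODE=dvr_snat
    Q_ML2_PLUGIN_MECHANISM_DRIVERS=openvswitch,l2population
)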

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/128855

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Hi Swami,

Can you see if you encounter the lockwait issues with this patch?
https://review.openstack.org/#/c/128855/

I ran your scripts in a loop for 30 minutes or so and didn't get any, so I think it addresses the problem.

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

@Swami Could you be a bit more specific in your descriptions? "gateway_clear" and "router_interface_delete" do not map directly to any methods that are searchable in the code. It takes time for readers to mentally map these vague references to something that can be pinpointed in the code. As far as I can tell, here is the mapping.

gateway_clear -> _delete_current_gw_port in neutron/db/l3_dvr_db.py
router_interface_delete -> remove_router_interface in the same file.
delete_csnat_router_interface -> delete_csnat_router_interface_ports (okay, maybe close enough)

Below are the ways that I see delete_csnat_router_interface_ports being called. Are you saying that a call to remove_router_interface and update_router are competing to delete the csnat router interface ports?

create_router
-> _update_router_gw_info
   -> _delete_current_gw_port
      -> delete_csnat_router_interface_ports

update_router
-> _update_router_db
   -> _update_router_gw_info
      -> _delete_current_gw_port
         -> delete_csnat_router_interface_ports

remove_router_interface
-> delete_csnat_router_interface_ports

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Hi Carl,

Yes your understanding is correct.
Sorry if my message was not too informative.

My comment #6 states the actual functions being called. But it was not as detailed as yours.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

I agree that "gateway_clear" does not correspond directly to any function in the code.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

To be clear, before my second patch I was able to reproduce this problem with Swami's scripts.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/lbaasv2)

Fix proposed to branch: feature/lbaasv2
Review: https://review.openstack.org/130864

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (feature/lbaasv2)

Reviewed: https://review.openstack.org/130864
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c089154a94e5872efc95eab33d3d0c9de8619fe4
Submitter: Jenkins
Branch: feature/lbaasv2

commit 62588957fbeccfb4f80eaa72bef2b86b6f08dcf8
Author: Kevin Benton <email address hidden>
Date: Wed Oct 22 13:04:03 2014 -0700

    Big Switch: Switch to TLSv1 in server manager

    Switch to TLSv1 for the connections to the backend
    controllers. The default SSLv3 is no longer considered
    secure.

    TLSv1 was chosen over .1 or .2 because the .1 and .2 weren't
    added until python 2.7.9 so TLSv1 is the only compatible option
    for py26.

    Closes-Bug: #1384487
    Change-Id: I68bd72fc4d90a102003d9ce48c47a4a6a3dd6e03

commit 17204e8f02fdad046dabdb8b31397289d72c877b
Author: OpenStack Proposal Bot <email address hidden>
Date: Wed Oct 22 06:20:15 2014 +0000

    Imported Translations from Transifex

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I58db0476c810aa901463b07c42182eef0adb5114

commit d712663b99520e6d26269b0ca193527603178742
Author: Carl Baldwin <email address hidden>
Date: Mon Oct 20 21:48:42 2014 +0000

    Move disabling of metadata and ipv6_ra to _destroy_router_namespace

    I noticed that disable_ipv6_ra is called from the wrong place and that
    in some cases it was called with a bogus router_id because the code
    made an incorrect assumption about the context. In other case, it was
    never called because _destroy_router_namespace was being called
    directly. This patch moves the disabling of metadata and ipv6_ra in
    to _destroy_router_namespace to ensure they get called correctly and
    avoid duplication.

    Change-Id: Ia76a5ff4200df072b60481f2ee49286b78ece6c4
    Closes-Bug: #1383495

commit f82a5117f6f484a649eadff4b0e6be9a5a4d18bb
Author: OpenStack Proposal Bot <email address hidden>
Date: Tue Oct 21 12:11:19 2014 +0000

    Updated from global requirements

    Change-Id: Idcbd730f5c781d21ea75e7bfb15959c8f517980f

commit be6bd82d43fbcb8d1512d8eb5b7a106332364c31
Author: Angus Lees <email address hidden>
Date: Mon Aug 25 12:14:29 2014 +1000

    Remove duplicate import of constants module

    .. and enable corresponding pylint check now the only offending instance
    is fixed.

    Change-Id: I35a12ace46c872446b8c87d0aacce45e94d71bae

commit 9902400039018d77aa3034147cfb24ca4b2353f6
Author: rajeev <email address hidden>
Date: Mon Oct 13 16:25:36 2014 -0400

    Fix race condition on processing DVR floating IPs

    Fip namespace and agent gateway port can be shared by multiple dvr routers.
    This change uses a set as the control variable for these shared resources
    and ensures that Test and Set operation on the control variable are
    performed atomically so that race conditions do not occur among
    multiple threads processing floating IPs.
    Limitation: The scope of this change is limited to addressing the race
    condition described in the bug report. It may not address other issues
    such as pre-existing issue wit...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/128855
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f23f2ecee68ba4abd12139bbb91b77ba9410f581
Submitter: Jenkins
Branch: master

commit f23f2ecee68ba4abd12139bbb91b77ba9410f581
Author: Kevin Benton <email address hidden>
Date: Thu Oct 16 01:49:19 2014 -0700

    _update_router_db: don't hold open transactions

    This patch prevents the L3 _update_router_db method from
    starting a transaction before calling the gateway interface
    removal functions. With these port changes now occuring
    outside of the L3 DB transaction, a failure to update the
    router DB information will not rollback the port deletion
    operation.

    The 'VPN in use' check had to be moved inside of the DB deletion
    transaction now that there isn't an enclosing transaction to undo
    the delete when an 'in use' error is raised.

    ===Details===

    The router update db method starts a transaction and calls
    the gateway update method with the transaction held open.
    This becomes a problem when the update results in an
    interface removal which uses a port table lock.

    Because the delete_port caller is still holding open a
    transaction, other sessions are blocked from getting an
    SQL lock on the same tables when delete_port starts
    performing RPC notifications, external controller calls,
    etc. During those external calls, eventlet will
    yield and another thread may try to get a lock on the
    port table, causing the infamous mysql/eventlet deadlock.

    This separation of L2/L3 transactions is similiar to change
    I3ae7bb269df9b9dcef94f48f13f1bde1e4106a80 in nature. Even
    though there is a loss in the atomic behavior of the interface
    removal operation, it was arguably incorrect to begin with.
    The restoration of port DB records during a rollback after some
    other failure doesn't undo the backend operations (e.g. REST calls)
    that happened during the original deletion. So, having a delete
    rollback without corresponding 'create_port' calls to the backend
    causes a loss in consistency.

    Closes-Bug: #1377241
    Change-Id: I5fdb6b24bf2fb80ac5e36a742aa7056db72c8c7d
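
(A hedged sketch of the transaction split described in this commit message; the method bodies below are simplified stand-ins for the real l3_db.py code, not the actual diff:

    class L3DbSketch(object):

        def _update_router_db_old(self, context, router_id, data, gw_info):
            # Old shape: gateway removal (and therefore delete_port, RPC
            # casts, backend REST calls, eventlet yields) ran inside this
            # open transaction, which kept row locks held across yields.
            with context.session.begin(subtransactions=True):
                router = self._get_router(context, router_id)
                self._update_router_gw_info(context, router_id, gw_info)
                router.update(data)
                return router

        def _update_router_db_new(self, context, router_id, data, gw_info):
            # New shape: do the port-affecting gateway work first, with no
            # enclosing transaction, so delete_port's own transaction can
            # commit and release its locks before any RPC yield occurs.
            self._update_router_gw_info(context, router_id, gw_info)
            with context.session.begin(subtransactions=True):
                router = self._get_router(context, router_id)
                router.update(data)
                return router
)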

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/131189

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/juno)

Reviewed: https://review.openstack.org/131189
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a575cb7cb6c30d843c3ac824e5f2f11a5055a9fb
Submitter: Jenkins
Branch: stable/juno

commit a575cb7cb6c30d843c3ac824e5f2f11a5055a9fb
Author: Kevin Benton <email address hidden>
Date: Thu Oct 16 01:49:19 2014 -0700

    _update_router_db: don't hold open transactions

    This patch prevents the L3 _update_router_db method from
    starting a transaction before calling the gateway interface
    removal functions. With these port changes now occuring
    outside of the L3 DB transaction, a failure to update the
    router DB information will not rollback the port deletion
    operation.

    The 'VPN in use' check had to be moved inside of the DB deletion
    transaction now that there isn't an enclosing transaction to undo
    the delete when an 'in use' error is raised.

    ===Details===

    The router update db method starts a transaction and calls
    the gateway update method with the transaction held open.
    This becomes a problem when the update results in an
    interface removal which uses a port table lock.

    Because the delete_port caller is still holding open a
    transaction, other sessions are blocked from getting an
    SQL lock on the same tables when delete_port starts
    performing RPC notifications, external controller calls,
    etc. During those external calls, eventlet will
    yield and another thread may try to get a lock on the
    port table, causing the infamous mysql/eventlet deadlock.

    This separation of L2/L3 transactions is similiar to change
    I3ae7bb269df9b9dcef94f48f13f1bde1e4106a80 in nature. Even
    though there is a loss in the atomic behavior of the interface
    removal operation, it was arguably incorrect to begin with.
    The restoration of port DB records during a rollback after some
    other failure doesn't undo the backend operations (e.g. REST calls)
    that happened during the original deletion. So, having a delete
    rollback without corresponding 'create_port' calls to the backend
    causes a loss in consistency.

    Conflicts:

     neutron/db/l3_db.py

    Closes-Bug: #1377241
    Change-Id: I5fdb6b24bf2fb80ac5e36a742aa7056db72c8c7d
    (cherry picked from commit f23f2ecee68ba4abd12139bbb91b77ba9410f581)

tags: added: in-stable-juno
Thierry Carrez (ttx)
Changed in neutron:
milestone: none → kilo-1
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kyle Mestery (<email address hidden>) on branch: master
Review: https://review.openstack.org/128134
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Thierry Carrez (ttx)
Changed in neutron:
milestone: kilo-1 → 2015.1.0