DVR: Fix race conditions when trying to add default gateway for fip gateway port.

Bug #1631513 reported by Swaminathan Vasudevan on 2016-10-07
This bug affects 2 people
Affects: neutron
Importance: Undecided
Assigned to: Brian Haley

Bug Description

There seems to be a race condition when adding the default gateway route in the fip namespace for the fip agent gateway port.

The race shows up in high-scale testing. While a router update for Router-A, which has a floating IP, is being processed, the fip namespace is created and the gateway port is plugged into the external bridge in the context of that namespace. If another router update for the same Router-A arrives before that creation finishes, it calls 'update-gateway-port', tries to set the default gateway, and fails.

The l3-agent log contains the message 'Failed to process compatible router' along with the following TRACE:
Traceback (most recent call last):
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 501, in _process_router_update
     self._process_router_if_compatible(router)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 440, in _process_router_if_compatible
     self._process_updated_router(router)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 454, in _process_updated_router
     ri.process(self)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 538, in process
     super(DvrLocalRouter, self).process(agent)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/l3/dvr_router_base.py", line 31, in process
     super(DvrRouterBase, self).process(agent)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/common/utils.py", line 396, in call
     self.logger(e)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
     self.force_reraise()
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
     six.reraise(self.type_, self.value, self.tb)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/common/utils.py", line 393, in call
     return func(*args, **kwargs)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/l3/router_info.py", line 989, in process
     self.process_external(agent)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 491, in process_external
     self.create_dvr_fip_interfaces(ex_gw_port)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/l3/dvr_local_router.py", line 522, in create_dvr_fip_interfaces
     self.fip_ns.update_gateway_port(fip_agent_port)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/l3/dvr_fip_ns.py", line 243, in update_gateway_port
     ipd.route.add_gateway(gw_ip)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 690, in add_gateway
     self._as_root([ip_version], tuple(args))
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 361, in _as_root
     use_root_namespace=use_root_namespace)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 94, in _as_root
     log_fail_as_error=self.log_fail_as_error)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 103, in _execute
     log_fail_as_error=log_fail_as_error)
   File "/opt/stack/venv/neutron-20160927T090820Z/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 140, in execute
     raise RuntimeError(msg)

summary: - Fix race conditions when trying to add default gateway for fip gateway
- port.
+ DVR: Fix race conditions when trying to add default gateway for fip
+ gateway port.
Assaf Muller (amuller) wrote :

Keep in mind that the L3 agent is designed so that updates for the same router are serialized. Is this not happening? Do you see operations on the same router interleaved in the log files?

Fix proposed to branch: master
Review: https://review.openstack.org/383941

Changed in neutron:
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)
status: New → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/385617

Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)

Change abandoned by Swaminathan Vasudevan (<email address hidden>) on branch: master
Review: https://review.openstack.org/383941
Reason: Abandoning this patch in favor of the alternate approach taken in this patch:

https://review.openstack.org/#/c/385617/

Changed in neutron:
assignee: Brian Haley (brian-haley) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Swaminathan Vasudevan (swaminathan-vasudevan)
Dongwon Cho (dongwoncho) wrote :

Same situation here.

cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.1 LTS"

dpkg -l | egrep 'nova|neutron'
ii neutron-common 2:8.3.0-0ubuntu1 all Neutron is a virtual network service for Openstack - common
ii neutron-l3-agent 2:8.3.0-0ubuntu1 all Neutron is a virtual network service for Openstack - l3 agent
ii neutron-metadata-agent 2:8.3.0-0ubuntu1 all Neutron is a virtual network service for Openstack - metadata agent
ii neutron-openvswitch-agent 2:8.3.0-0ubuntu1 all Neutron is a virtual network service for Openstack - Open vSwitch plugin agent
ii nova-common 2:13.1.2-0ubuntu2 all OpenStack Compute - common files
ii nova-compute 2:13.1.2-0ubuntu2 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:13.1.2-0ubuntu2 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:13.1.2-0ubuntu2 all OpenStack Compute - compute node libvirt support

2016-11-10 19:43:43.044 6664 INFO eventlet.wsgi.server [-] (6664) wsgi starting up on http:/var/lib/neutron/keepalived-state-change
2016-11-10 19:43:43.051 6664 INFO neutron.agent.l3.agent [-] L3 agent started
2016-11-10 19:43:43.054 6664 INFO neutron.agent.l3.agent [-] Agent has just been revived. Doing a full sync.
2016-11-10 19:43:56.262 6664 ERROR neutron.agent.linux.utils [-] Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot open network namespace "fip-5d2242e3-4ac0-4873-a752-3a504d17235d": No such file or directory

2016-11-10 19:43:56.274 6664 ERROR neutron.agent.l3.router_info [-] Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot open network namespace "fip-5d2242e3-4ac0-4873-a752-3a504d17235d": No such file or directory
2016-11-10 19:43:56.274 6664 ERROR neutron.agent.l3.router_info Traceback (most recent call last):
2016-11-10 19:43:56.274 6664 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/common/utils.py", line 382, in call
2016-11-10 19:43:56.274 6664 ERROR neutron.agent.l3.router_info return func(*args, **kwargs)
2016-11-10 19:43:56.274 6664 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/router_info.py", line 989, in process
2016-11-10 19:43:56.274 6664 ERROR neutron.agent.l3.router_info self.process_external(agent)
2016-11-10 19:43:56.274 6664 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/dvr_local_router.py", line 514, in process_external
2016-11-10 19:43:56.274 6664 ERROR neutron.agent.l3.router_info self.create_dvr_fip_interfaces(ex_gw_port)
2016-11-10 19:43:56.274 6664 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/dvr_local_router.py", line 545, in create_dvr_fip_interfaces
2016-11-10 19:43:56.274 6664 ERROR neutron.agent.l3.router_info ...

Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Brian Haley (brian-haley)

Reviewed: https://review.openstack.org/385617
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d40322c7d4aa1dd6d595dfe415278c9f252f4da2
Submitter: Jenkins
Branch: master

commit d40322c7d4aa1dd6d595dfe415278c9f252f4da2
Author: Swaminathan Vasudevan <email address hidden>
Date: Fri Oct 7 10:30:40 2016 -0700

    DVR: Fix race condition in creation of fip gateway

    In large-scale environments, we have seen a router update
    arrive for one tenant while we are still creating the
    router for a different tenant and initializing the shared
    floating IP gateway port. Sometimes these updates can
    get scheduled simultaneously, with the second running
    before we are done creating all the resources in the
    first, causing an exception when trying to set the
    default route since either the interface or IP address
    does not exist yet.

    Add a lock to better synchronize these functions so
    a create can finish before an update can be done.

    If it still fails, we will throw an exception so that
    the namespace will be cleaned-up and the update can be
    re-scheduled for the next iteration.

    Closes-Bug: #1631513
    Change-Id: Ia8c92cea2f8798582c39ad3450ab3b3c45a356f7
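
The locking idea described in the commit message above can be illustrated with a minimal, self-contained sketch. This is not the merged neutron code: the class FipNamespaceSketch, its methods, and the gateway address are hypothetical names chosen for illustration, and plain threading.Lock stands in for whatever locking helper the actual patch uses. It only shows why putting create and update under one lock keeps an update from running before the interface exists.

```python
import threading


class FipNamespaceSketch:
    """Hypothetical stand-in for the shared floating IP namespace."""

    def __init__(self):
        self._lock = threading.Lock()
        self.interface_ready = False
        self.default_route = None

    def _create_gateway_port(self):
        # Stands in for plugging the interface and assigning its IP address.
        self.interface_ready = True

    def _update_gateway_port(self, gw_ip):
        if not self.interface_ready:
            # Mirrors the failure in the report: the default route cannot
            # be set before the interface exists.
            raise RuntimeError("gateway interface not ready")
        self.default_route = gw_ip

    def create_or_update_gateway_port(self, gw_ip):
        # One lock covers both paths, so an update triggered by a second
        # router cannot run until a create in progress has finished.
        with self._lock:
            if not self.interface_ready:
                self._create_gateway_port()
            self._update_gateway_port(gw_ip)


def run_concurrent_updates(n_threads=8, gw_ip="203.0.113.1"):
    """Simulate several router updates hitting the shared namespace."""
    fip_ns = FipNamespaceSketch()
    errors = []

    def worker():
        try:
            fip_ns.create_or_update_gateway_port(gw_ip)
        except RuntimeError as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return fip_ns, errors


if __name__ == "__main__":
    ns, errs = run_concurrent_updates()
    print("errors:", len(errs), "default route:", ns.default_route)
```

Because per-router serialization does not cover a resource shared between routers, the lock in this sketch lives on the namespace object rather than on the router.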

Changed in neutron:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/413240
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=95d93495d743115808b511740899d649cab6b5ae
Submitter: Jenkins
Branch: stable/newton

commit 95d93495d743115808b511740899d649cab6b5ae
Author: Swaminathan Vasudevan <email address hidden>
Date: Fri Oct 7 10:30:40 2016 -0700

    DVR: Fix race condition in creation of fip gateway

    In large-scale environments, we have seen a router update
    arrive for one tenant while we are still creating the
    router for a different tenant and initializing the shared
    floating IP gateway port. Sometimes these updates can
    get scheduled simultaneously, with the second running
    before we are done creating all the resources in the
    first, causing an exception when trying to set the
    default route since either the interface or IP address
    does not exist yet.

    Add a lock to better synchronize these functions so
    a create can finish before an update can be done.

    If it still fails, we will throw an exception so that
    the namespace will be cleaned-up and the update can be
    re-scheduled for the next iteration.

    Closes-Bug: #1631513
    Change-Id: Ia8c92cea2f8798582c39ad3450ab3b3c45a356f7
    (cherry picked from commit d40322c7d4aa1dd6d595dfe415278c9f252f4da2)

tags: added: in-stable-newton

Reviewed: https://review.openstack.org/413263
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7cb6942b79e6fad990e5f5f2263e0b8228804069
Submitter: Jenkins
Branch: stable/mitaka

commit 7cb6942b79e6fad990e5f5f2263e0b8228804069
Author: Swaminathan Vasudevan <email address hidden>
Date: Fri Oct 7 10:30:40 2016 -0700

    DVR: Fix race condition in creation of fip gateway

    In large-scale environments, we have seen a router update
    arrive for one tenant while we are still creating the
    router for a different tenant and initializing the shared
    floating IP gateway port. Sometimes these updates can
    get scheduled simultaneously, with the second running
    before we are done creating all the resources in the
    first, causing an exception when trying to set the
    default route since either the interface or IP address
    does not exist yet.

    Add a lock to better synchronize these functions so
    a create can finish before an update can be done.

    If it still fails, we will throw an exception so that
    the namespace will be cleaned-up and the update can be
    re-scheduled for the next iteration.

    Closes-Bug: #1631513
    (cherry picked from commit d40322c7d4aa1dd6d595dfe415278c9f252f4da2)

    Conflicts:
     neutron/agent/l3/dvr_fip_ns.py
     neutron/tests/functional/agent/l3/test_dvr_router.py

    Change-Id: Ia8c92cea2f8798582c39ad3450ab3b3c45a356f7

tags: added: in-stable-mitaka
tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
tags: removed: mitaka-backport-potential newton-backport-potential

This issue was fixed in the openstack/neutron 10.0.0.0b3 development milestone.

This issue was fixed in the openstack/neutron 9.2.0 release.

This issue was fixed in the openstack/neutron 8.4.0 release.
