FIP Namespace add/delete race condition seen in DVR router log

Bug #1501873 reported by Swaminathan Vasudevan
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Undecided
Assigned to: Swaminathan Vasudevan
Milestone: (none listed)

Bug Description

A FIP namespace add/delete race condition is seen in the DVR router log. This might cause the FIP functionality to fail.
From the trace log, it seems that when this happens, a number of tests related to FIP functionality fail with an SSH timeout while waiting for a reply.

Here is the trace output that shows the race condition.

Exit code: 0
 execute /opt/stack/new/neutron/neutron/agent/linux/utils.py:156
2015-09-29 21:10:33.433 7884 DEBUG neutron.agent.l3.dvr_local_router [-] Removed last floatingip, so requesting the server to delete Floatingip Agent Gateway port:{u'allowed_address_pairs': [], u'extra_dhcp_opts': [], u'device_owner': u'network:floatingip_agent_gateway', u'port_security_enabled': False, u'binding:profile': {}, u'fixed_ips': [{u'subnet_id': u'362e9033-db93-4193-9413-1073215ab326', u'prefixlen': 24, u'ip_address': u'172.24.5.9'}, {u'subnet_id': u'feb3aa76-53b1-4d4e-b136-412c747ffd30', u'prefixlen': 64, u'ip_address': u'2001:db8::a'}], u'id': u'044a8e2f-00eb-4231-b526-13cb46dcc42f', u'security_groups': [], u'binding:vif_details': {u'port_filter': True, u'ovs_hybrid_plug': True}, u'binding:vif_type': u'ovs', u'mac_address': u'fa:16:3e:7a:a6:85', u'status': u'DOWN', u'subnets': [{u'ipv6_ra_mode': None, u'cidr': u'2001:db8::/64', u'gateway_ip': u'2001:db8::2', u'id': u'feb3aa76-53b1-4d4e-b136-412c747ffd30', u'subnetpool_id': None}, {u'ipv6_ra_mode': None, u'cidr': u'172.24.5.0/24', u'gateway_ip': u'172.24.5.1', u'id': u'362e9033-db93-4193-9413-1073215ab326', u'subnetpool_id': None}], u'binding:host_id': u'devstack-trusty-hpcloud-b5-5153724', u'dns_assignment': [{u'hostname': u'host-172-24-5-9', u'ip_address': u'172.24.5.9', u'fqdn': u'host-172-24-5-9.openstacklocal.'}, {u'hostname': u'host-2001-db8--a', u'ip_address': u'2001:db8::a', u'fqdn': u'host-2001-db8--a.openstacklocal.'}], u'device_id': u'646bb18b-da52-4ead-a635-012c72c1ccf1', u'name': u'', u'admin_state_up': True, u'network_id': u'31689320-95d7-44f9-932a-cc82c1bca2b4', u'dns_name': u'', u'binding:vnic_type': u'normal', u'tenant_id': u'', u'extra_subnets': []} floating_ip_removed_dist /opt/stack/new/neutron/neutron/agent/l3/dvr_local_router.py:148

2015-09-29 21:10:34.031 7884 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'delete', 'fip-31689320-95d7-44f9-932a-cc82c1bca2b4'] execute_rootwrap_daemon /opt/stack/new/neutron/neutron/agent/linux/utils.py:101

2015-09-29 21:10:34.043 DEBUG neutron.agent.l3.dvr_local_router [req-33413b07-784c-469e-8a35-0e20312a157e None None] FloatingIP agent gateway port received from the plugin: {u'allowed_address_pairs': [], u'extra_dhcp_opts': [], u'device_owner': u'network:floatingip_agent_gateway', u'port_security_enabled': False, u'binding:profile': {}, u'fixed_ips': [{u'subnet_id': u'362e9033-db93-4193-9413-1073215ab326', u'prefixlen': 24, u'ip_address': u'172.24.5.9'}, {u'subnet_id': u'feb3aa76-53b1-4d4e-b136-412c747ffd30', u'prefixlen': 64, u'ip_address': u'2001:db8::a'}], u'id': u'044a8e2f-00eb-4231-b526-13cb46dcc42f', u'security_groups': [], u'binding:vif_details': {u'port_filter': True, u'ovs_hybrid_plug': True}, u'binding:vif_type': u'ovs', u'mac_address': u'fa:16:3e:7a:a6:85', u'status': u'ACTIVE', u'subnets': [{u'ipv6_ra_mode': None, u'cidr': u'172.24.5.0/24', u'gateway_ip': u'172.24.5.1', u'id': u'362e9033-db93-4193-9413-1073215ab326', u'subnetpool_id': None}, {u'ipv6_ra_mode': None, u'cidr': u'2001:db8::/64', u'gateway_ip': u'2001:db8::2', u'id': u'feb3aa76-53b1-4d4e-b136-412c747ffd30', u'subnetpool_id': None}], u'binding:host_id': u'devstack-trusty-hpcloud-b5-5153724', u'dns_assignment': [{u'hostname': u'host-172-24-5-9', u'ip_address': u'172.24.5.9', u'fqdn': u'host-172-24-5-9.openstacklocal.'}, {u'hostname': u'host-2001-db8--a', u'ip_address': u'2001:db8::a', u'fqdn': u'host-2001-db8--a.openstacklocal.'}], u'device_id': u'646bb18b-da52-4ead-a635-012c72c1ccf1', u'name': u'', u'admin_state_up': True, u'network_id': u'31689320-95d7-44f9-932a-cc82c1bca2b4', u'dns_name': u'', u'binding:vnic_type': u'normal', u'tenant_id': u'', u'extra_subnets': []} create_dvr_fip_interfaces /opt/stack/new/neutron/neutron/agent/l3/dvr_local_router.py:427

2015-09-29 21:10:34.043 DEBUG neutron.agent.l3.dvr_fip_ns [req-33413b07-784c-469e-8a35-0e20312a157e None None] add fip-namespace(fip-31689320-95d7-44f9-932a-cc82c1bca2b4) create /opt/stack/new/neutron/neutron/agent/l3/dvr_fip_ns.py:133

Exit code: 0
 execute /opt/stack/new/neutron/neutron/agent/linux/utils.py:156
2015-09-29 21:10:34.053 DEBUG neutron.agent.linux.utils [req-33413b07-784c-469e-8a35-0e20312a157e None None] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'fip-31689320-95d7-44f9-932a-cc82c1bca2b4', 'sysctl', '-w', 'net.ipv4.ip_forward=1'] execute_rootwrap_daemon /opt/stack/new/neutron/neutron/agent/linux/utils.py:101

2015-09-29 21:10:34.084 ERROR neutron.agent.linux.utils [req-33413b07-784c-469e-8a35-0e20312a157e None None]
Command: ['ip', 'netns', 'exec', 'fip-31689320-95d7-44f9-932a-cc82c1bca2b4', 'sysctl', '-w', 'net.ipv4.ip_forward=1']
Exit code: 1
Stdin:
Stdout:
Stderr: seting the network namespace "fip-31689320-95d7-44f9-932a-cc82c1bca2b4" failed: Invalid argument

This leads to a series of failures.

This failure is seen only in the gate.

This can be reproduced by repeatedly adding and deleting a floating IP for a private IP, with multiple API workers running.
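The interleaving that the trace suggests can be replayed with a toy model. This is only a sketch: `FakeNetns` and `replay` are hypothetical names, not Neutron code, and the step order is an assumption drawn from the timestamps above (the delete appears to have still been in flight when the create path ran).

```python
FIP_NS = "fip-31689320-95d7-44f9-932a-cc82c1bca2b4"


class FakeNetns:
    """Toy stand-in for the host's namespace table (not Neutron code)."""

    def __init__(self):
        self._names = set()

    def ensure(self, name):
        # Idempotent create: a no-op if the namespace already exists.
        self._names.add(name)

    def delete(self, name):
        self._names.discard(name)

    def execute(self, name, cmd):
        # Running a command requires the namespace to still exist.
        if name not in self._names:
            raise RuntimeError("namespace %r is gone" % name)
        return 0


def replay(order):
    """Run the three agent steps in the given order.

    Returns the error message of the first failing step, or None.
    """
    netns = FakeNetns()
    netns.ensure(FIP_NS)  # the namespace exists before the last FIP is removed
    steps = {
        "delete": lambda: netns.delete(FIP_NS),
        "create": lambda: netns.ensure(FIP_NS),
        "sysctl": lambda: netns.execute(
            FIP_NS, ["sysctl", "-w", "net.ipv4.ip_forward=1"]),
    }
    try:
        for step in order:
            steps[step]()
    except RuntimeError as exc:
        return str(exc)
    return None


# Delete lands between the create and the first command in the namespace.
racy = replay(["create", "delete", "sysctl"])
# Delete completes before the create path starts.
safe = replay(["delete", "create", "sysctl"])
print("racy:", racy)
print("safe:", safe)
```

When the delete lands between the create and the sysctl call, the replay fails much as the agent did. When the delete finishes first, the same three steps succeed.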

For more information you can also look at the "logstash" output below.

http://logs.openstack.org/82/228582/8/check/gate-tempest-dsvm-neutron-dvr/9053337/logs/screen-q-l3.txt.gz?level=TRACE#_2015-09-29_21_10_34_084

Changed in neutron:
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)
status: New → In Progress
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Carl Baldwin (carl-baldwin)
Changed in neutron:
assignee: Carl Baldwin (carl-baldwin) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Carl Baldwin (carl-baldwin)
Changed in neutron:
assignee: Carl Baldwin (carl-baldwin) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Carl Baldwin (carl-baldwin)
Changed in neutron:
assignee: Carl Baldwin (carl-baldwin) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Carl Baldwin (carl-baldwin)
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/229561
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c874f6dadaa983ed4f338808907c8829a7f86031
Submitter: Jenkins
Branch: master

commit c874f6dadaa983ed4f338808907c8829a7f86031
Author: Swaminathan Vasudevan <email address hidden>
Date: Wed Sep 30 11:15:52 2015 -0700

    Split the FIP Namespace delete in L3 agent for DVR

    Right now we are seeing a race condition in the l3 agent
    for DVR routers when a floatingip is deleted and created.

    The agent tries to delete the floatingip namespace and
    while it tries to delete there is another call to add a
    namespace. There is a timing window in between these two
    calls where sometimes the call to create a namespace succeeds
    but, when tried to execute any commands in the namespace
    it fails, since the namespace was deleted concurrently.

    Since the fip namespace is associated with an external net
    and each node has only one fip namespace for an external net,
    we would like to only delete the fip namespace when the
    external net is deleted.

    The first step is to split the delete functionality into two.
    The call to fip_ns.cleanup will only remove the dependency that
    the fipnamespace has with the router namespace such as fpr and
    rfp veth pairs.

    The call to fip_ns.delete will actually delete the
    the fip namespace and the fg device.

    Partial-Bug: #1501873
    Change-Id: Ic94625d5a968f554af70c274b2b2c20ab64e2487
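The split described in this commit message can be pictured with a small model. This is a hedged sketch, not the code from the patch: `FipNamespaceSketch`, `connect_router`, and the device names are hypothetical, and only the `cleanup`/`delete` division of labor is taken from the message above.

```python
class FipNamespaceSketch:
    """Toy model of one FIP namespace on a node (not Neutron code)."""

    def __init__(self, ext_net_id):
        self.name = "fip-" + ext_net_id
        self.exists = True
        self.fg_device = "fg-device"  # placeholder name for the fg device
        self.veth_pairs = {}          # router_id -> (rfp name, fpr name)

    def connect_router(self, router_id):
        # Hypothetical helper: link a router namespace to the FIP namespace.
        self.veth_pairs[router_id] = ("rfp-" + router_id, "fpr-" + router_id)

    def cleanup(self, router_id):
        # Remove only what ties this router to the FIP namespace.
        self.veth_pairs.pop(router_id, None)

    def delete(self):
        # Tear everything down: veth pairs, the fg device, the namespace.
        self.veth_pairs.clear()
        self.fg_device = None
        self.exists = False


fip_ns = FipNamespaceSketch("ext-net-1")
fip_ns.connect_router("router-a")

fip_ns.cleanup("router-a")
after_cleanup = (fip_ns.exists, fip_ns.fg_device, dict(fip_ns.veth_pairs))

fip_ns.delete()
after_delete = (fip_ns.exists, fip_ns.fg_device)
```

In this model, `cleanup` leaves the namespace and its fg device in place, so a later floating IP on the same external network can reuse them. Only `delete` removes them.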

Changed in neutron:
assignee: Carl Baldwin (carl-baldwin) → Swaminathan Vasudevan (swaminathan-vasudevan)
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/230079
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cb465d40f59bfbc109204dc16259c0e4ce1c903a
Submitter: Jenkins
Branch: master

commit cb465d40f59bfbc109204dc16259c0e4ce1c903a
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Oct 1 11:48:55 2015 -0700

    Delete fipnamespace when external net removed on DVR

    The fipnamespace is associated with an external network
    on a given node. In the case of DVR there is just one
    single FIP namespace for a given node.

    We have seen some race conditions in the agent for creation
    and deletion of the fip namespace. See the bug report for
    details on the failure.

    So in order to address this race condition and make the
    code more stable, we will be cleaning up the fip namespace
    only when an external network is removed.

    The server will be sending a rpc notification message to
    the agent to cleanup the fip namespace when the external
    net is removed.

    This patch address the above mentioned issue by not constantly
    deleting and creating the fip namespace.

    Closes-Bug: #1501873
    Change-Id: I86869f66d4afffad7db09942578b1a456a9bd418
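The lifecycle change described in this commit message can be sketched as follows. The names `AgentSketch` and `on_external_net_removed` are hypothetical stand-ins for the agent and its RPC handler; the real handler name and signature are not given in this report.

```python
class AgentSketch:
    """Toy L3 agent that tracks FIP namespaces per external network."""

    def __init__(self):
        self.fip_namespaces = set()  # external network IDs with a namespace
        self.namespace_deletes = 0   # how many times a namespace was deleted

    def floating_ip_added(self, ext_net_id):
        # Creating is idempotent: reuse the namespace if it already exists.
        self.fip_namespaces.add(ext_net_id)

    def floating_ip_removed(self, ext_net_id):
        # Deliberately leaves the namespace alone: it outlives floating IPs.
        pass

    def on_external_net_removed(self, ext_net_id):
        # Hypothetical handler for the server's RPC notification.
        if ext_net_id in self.fip_namespaces:
            self.fip_namespaces.remove(ext_net_id)
            self.namespace_deletes += 1


agent = AgentSketch()
for _ in range(100):  # churn floating IPs on one external network
    agent.floating_ip_added("ext-net-1")
    agent.floating_ip_removed("ext-net-1")
deletes_during_churn = agent.namespace_deletes

agent.on_external_net_removed("ext-net-1")
deletes_after_removal = agent.namespace_deletes
```

Because floating IP churn never deletes the namespace in this model, there is no delete for a concurrent create to race against. The single delete happens only when the external network itself goes away.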

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b1

This issue was fixed in the openstack/neutron 8.0.0.0b1 development milestone.

Changed in neutron:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/273235

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/273236

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/liberty)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/273235
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/273236
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/273235
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=90fb36f187826f6249815c611021bece625d0a96
Submitter: Jenkins
Branch: stable/liberty

commit 90fb36f187826f6249815c611021bece625d0a96
Author: Swaminathan Vasudevan <email address hidden>
Date: Wed Sep 30 11:15:52 2015 -0700

    Split the FIP Namespace delete in L3 agent for DVR

    Right now we are seeing a race condition in the l3 agent
    for DVR routers when a floatingip is deleted and created.

    The agent tries to delete the floatingip namespace and
    while it tries to delete there is another call to add a
    namespace. There is a timing window in between these two
    calls where sometimes the call to create a namespace succeeds
    but, when tried to execute any commands in the namespace
    it fails, since the namespace was deleted concurrently.

    Since the fip namespace is associated with an external net
    and each node has only one fip namespace for an external net,
    we would like to only delete the fip namespace when the
    external net is deleted.

    The first step is to split the delete functionality into two.
    The call to fip_ns.cleanup will only remove the dependency that
    the fipnamespace has with the router namespace such as fpr and
    rfp veth pairs.

    The call to fip_ns.delete will actually delete the
    the fip namespace and the fg device.

    Partial-Bug: #1501873
    (cherry picked from commit c874f6dadaa983ed4f338808907c8829a7f86031)
    Change-Id: Ic94625d5a968f554af70c274b2b2c20ab64e2487

tags: added: in-stable-liberty
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/273236
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=05de3183a3a85a08d8ea4cfcd87cefa8b67ceb4b
Submitter: Jenkins
Branch: stable/liberty

commit 05de3183a3a85a08d8ea4cfcd87cefa8b67ceb4b
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Oct 1 11:48:55 2015 -0700

    Delete fipnamespace when external net removed on DVR

    The fipnamespace is associated with an external network
    on a given node. In the case of DVR there is just one
    single FIP namespace for a given node.

    We have seen some race conditions in the agent for creation
    and deletion of the fip namespace. See the bug report for
    details on the failure.

    So in order to address this race condition and make the
    code more stable, we will be cleaning up the fip namespace
    only when an external network is removed.

    The server will be sending a rpc notification message to
    the agent to cleanup the fip namespace when the external
    net is removed.

    This patch address the above mentioned issue by not constantly
    deleting and creating the fip namespace.

    Conflicts:
     neutron/tests/functional/agent/test_l3_agent.py

    Closes-Bug: #1501873
    (cherry picked from commit cb465d40f59bfbc109204dc16259c0e4ce1c903a)
    Change-Id: I86869f66d4afffad7db09942578b1a456a9bd418

Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 7.1.0

This issue was fixed in the openstack/neutron 7.1.0 release.
