SNAT namespace is not getting cleared after the manual move of SNAT with dead agent

Bug #1557909 reported by Hardik Italia
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
| neutron | Fix Released | Medium | Swaminathan Vasudevan | |

Bug Description

Latest patch (2016-06-10): https://review.openstack.org/#/c/326729/

A stale snat namespace remains on the controller after a dead l3 agent recovers.

Note: Observed only on the stable/liberty branch.

Setup:
Multiple controller (DVR_SNAT) setup.

Steps:
1) Create a tenant network, subnet and router.
2) Create an external network.
3) Attach the internal and external networks to the router.
4) Create a VM on the tenant network.
5) Make sure the VM can reach the outside using CSNAT.
6) Find the l3 agent hosting the router and stop it.
7) Manually move the router to another controller (dvr_snat mode). The SNAT namespace should be created on the new controller node.
8) Start the l3 agent on the original controller (the one stopped in step 6).
9) Notice that the snat namespace now exists on both controllers and is not deleted from the agent that no longer hosts it.
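
The report does not say which commands were used for the manual move in step 7. As a rough sketch, assuming the legacy `neutron` CLI from python-neutronclient was used, the move could look like the commands built below. The helper name `manual_move_commands` is hypothetical, and the IDs are the router and agent IDs shown in this report's example output. The snippet only prints the commands; it does not run them.

```python
def manual_move_commands(router_id, old_agent_id, new_agent_id):
    """Build (but do not run) CLI invocations for moving a router.

    Assumption: the legacy neutron CLI is available; the report itself
    does not state which tool was used.
    """
    return [
        # Show which L3 agent currently hosts the router.
        ["neutron", "l3-agent-list-hosting-router", router_id],
        # Unbind the router from the stopped agent.
        ["neutron", "l3-agent-router-remove", old_agent_id, router_id],
        # Bind the router to the agent on the other controller.
        ["neutron", "l3-agent-router-add", new_agent_id, router_id],
    ]


ROUTER = "0fb68420-9e69-41bb-8a88-8ab53b0faabb"
OLD_AGENT = "df4ca7c5-9bae-4cfb-bc83-216612b2b378"  # vm1-ctl1-936
NEW_AGENT = "cfa97c12-b975-4515-86c3-9710c9b88d76"  # vm2-ctl2-936

for cmd in manual_move_commands(ROUTER, OLD_AGENT, NEW_AGENT):
    print(" ".join(cmd))
```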

Example:
| cfa97c12-b975-4515-86c3-9710c9b88d76 | L3 agent | vm2-ctl2-936 | :-) | True | neutron-l3-agent |
| df4ca7c5-9bae-4cfb-bc83-216612b2b378 | L3 agent | vm1-ctl1-936 | :-) | True | neutron-l3-agent |

mysql> select * from csnat_l3_agent_bindings;
+--------------------------------------+--------------------------------------+---------+------------------+
| router_id | l3_agent_id | host_id | csnat_gw_port_id |
+--------------------------------------+--------------------------------------+---------+------------------+
| 0fb68420-9e69-41bb-8a88-8ab53b0faabb | cfa97c12-b975-4515-86c3-9710c9b88d76 | NULL | NULL |
+--------------------------------------+--------------------------------------+---------+------------------+

On vm1-ctl1-936

Stale SNAT namespace on the controller that initially hosted the router:

ubuntu@vm1-ctl1-936:~/devstack$ sudo ip netns
snat-0fb68420-9e69-41bb-8a88-8ab53b0faabb
qrouter-0fb68420-9e69-41bb-8a88-8ab53b0faabb

On vm2-ctl2-936 (2nd Controller)

ubuntu@vm2-ctl2-936:~$ ip netns
snat-0fb68420-9e69-41bb-8a88-8ab53b0faabb
qrouter-0fb68420-9e69-41bb-8a88-8ab53b0faabb
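
The outputs above can be cross-checked mechanically. The following is a minimal sketch, not neutron code: it takes the namespace names seen on a host and a router-to-agent mapping shaped like `csnat_l3_agent_bindings`, and reports `snat-` namespaces whose router is bound to a different agent. The function name `stale_snat_namespaces` and the dictionary layout are assumptions made for illustration.

```python
SNAT_PREFIX = "snat-"


def stale_snat_namespaces(local_namespaces, local_agent_id, csnat_bindings):
    """Return local snat namespaces whose router is bound to another agent.

    csnat_bindings maps router_id -> l3_agent_id (hypothetical layout that
    mirrors the csnat_l3_agent_bindings rows shown in this report).
    """
    stale = []
    for ns in local_namespaces:
        if not ns.startswith(SNAT_PREFIX):
            continue  # qrouter- namespaces are expected on every DVR node
        router_id = ns[len(SNAT_PREFIX):]
        if csnat_bindings.get(router_id) != local_agent_id:
            stale.append(ns)
    return stale


ROUTER = "0fb68420-9e69-41bb-8a88-8ab53b0faabb"
bindings = {ROUTER: "cfa97c12-b975-4515-86c3-9710c9b88d76"}
namespaces = ["snat-" + ROUTER, "qrouter-" + ROUTER]

# Agent on vm1-ctl1-936 (no longer bound to the router's SNAT).
print(stale_snat_namespaces(
    namespaces, "df4ca7c5-9bae-4cfb-bc83-216612b2b378", bindings))
# Agent on vm2-ctl2-936 (currently bound).
print(stale_snat_namespaces(
    namespaces, "cfa97c12-b975-4515-86c3-9710c9b88d76", bindings))
```

With the data from this report, only the namespace on vm1-ctl1-936 is flagged, which matches the behaviour described in step 9.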

tags: added: l3-dvr-backlog
Changed in neutron:
importance: Undecided → Medium
Changed in neutron:
status: New → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/302068

Changed in neutron:
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/306065

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Swaminathan Vasudevan (<email address hidden>) on branch: master
Review: https://review.openstack.org/302068
Reason: I have to abandon this for the alternate patch that solves the problem.

https://review.openstack.org/306065

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/306065
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9dc70ed77e055677a4bd3257a0e9e24239ed4cce
Submitter: Jenkins
Branch: master

commit 9dc70ed77e055677a4bd3257a0e9e24239ed4cce
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Apr 14 12:49:08 2016 -0700

    DVR: Clear SNAT namespace when agent restarts after router move

    When we manually move a router from one dvr_snat node to
    another dvr_snat node the snat_namespace should be removed in
    the originating node by the agent and will be re-created in the
    destination node by the destination agent.

    But when the agent dies, the router_update message reaches the
    agent after the agent restarts. At this time the agent should
    remove the snat_namespace since it is no more hosted by the
    current agent.

    Even though we do have logic in agent to take care of cleaning
    up the snat namespaces if the gw_port_host does not match with the
    existing agent host, in this particular use case the self.snat_namespace
    is always set to 'None' in the dvr_edge_router init call when agent
    restarts.

    This patch fixes the above issue by initializing the snat namespace
    object during the router_init. Since we do have a valid snat
    namespace object and if the gw_port_host mismatches, the agent
    should clean up the namespace.

    Change-Id: I30524dc77b743429ef70941479c9b6cccb21c23c
    Closes-Bug: #1557909
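
The failure mode described in this commit message can be illustrated with a simplified model. This is a sketch under stated assumptions, not the actual neutron classes: `FakeSnatNamespace`, `EdgeRouterBefore`, `EdgeRouterAfter` and `process_gateway` are hypothetical names, and the set `existing` stands in for the namespaces present on the host.

```python
class FakeSnatNamespace:
    """Stand-in for a snat namespace object (hypothetical)."""

    def __init__(self, router_id, existing):
        self.name = "snat-" + router_id
        self._existing = existing  # set of namespace names on this host

    def delete(self):
        self._existing.discard(self.name)


class EdgeRouterBefore:
    """Models the old behaviour: snat_namespace is None after a restart."""

    def __init__(self, router_id, host, existing):
        self.router_id = router_id
        self.host = host
        self.snat_namespace = None  # 'existing' is ignored here on purpose

    def process_gateway(self, gw_port_host):
        # Cleanup only runs when a namespace object is present.
        if gw_port_host != self.host and self.snat_namespace:
            self.snat_namespace.delete()
            self.snat_namespace = None


class EdgeRouterAfter(EdgeRouterBefore):
    """Models the fix: the namespace object is created during init."""

    def __init__(self, router_id, host, existing):
        super().__init__(router_id, host, existing)
        self.snat_namespace = FakeSnatNamespace(router_id, existing)


existing = {"snat-r1", "qrouter-r1"}
EdgeRouterBefore("r1", "vm1", existing).process_gateway("vm2")
print(sorted(existing))  # the snat namespace is still listed
EdgeRouterAfter("r1", "vm1", existing).process_gateway("vm2")
print(sorted(existing))  # the snat namespace has been removed
```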

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/313020

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/313021

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/313020
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b9103329855481b1415a07e6b31206ba35dabc7d
Submitter: Jenkins
Branch: stable/mitaka

commit b9103329855481b1415a07e6b31206ba35dabc7d
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Apr 14 12:49:08 2016 -0700

    DVR: Clear SNAT namespace when agent restarts after router move

    When we manually move a router from one dvr_snat node to
    another dvr_snat node the snat_namespace should be removed in
    the originating node by the agent and will be re-created in the
    destination node by the destination agent.

    But when the agent dies, the router_update message reaches the
    agent after the agent restarts. At this time the agent should
    remove the snat_namespace since it is no more hosted by the
    current agent.

    Even though we do have logic in agent to take care of cleaning
    up the snat namespaces if the gw_port_host does not match with the
    existing agent host, in this particular use case the self.snat_namespace
    is always set to 'None' in the dvr_edge_router init call when agent
    restarts.

    This patch fixes the above issue by initializing the snat namespace
    object during the router_init. Since we do have a valid snat
    namespace object and if the gw_port_host mismatches, the agent
    should clean up the namespace.

    Change-Id: I30524dc77b743429ef70941479c9b6cccb21c23c
    Closes-Bug: #1557909
    (cherry picked from commit 9dc70ed77e055677a4bd3257a0e9e24239ed4cce)

tags: added: in-stable-mitaka
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/313021
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d29e2f77d3933ea0279bf1863a31baf3a73a86e7
Submitter: Jenkins
Branch: stable/liberty

commit d29e2f77d3933ea0279bf1863a31baf3a73a86e7
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Apr 14 12:49:08 2016 -0700

    DVR: Clear SNAT namespace when agent restarts after router move

    When we manually move a router from one dvr_snat node to
    another dvr_snat node the snat_namespace should be removed in
    the originating node by the agent and will be re-created in the
    destination node by the destination agent.

    But when the agent dies, the router_update message reaches the
    agent after the agent restarts. At this time the agent should
    remove the snat_namespace since it is no more hosted by the
    current agent.

    Even though we do have logic in agent to take care of cleaning
    up the snat namespaces if the gw_port_host does not match with the
    existing agent host, in this particular use case the self.snat_namespace
    is always set to 'None' in the dvr_edge_router init call when agent
    restarts.

    This patch fixes the above issue by initializing the snat namespace
    object during the router_init. Since we do have a valid snat
    namespace object and if the gw_port_host mismatches, the agent
    should clean up the namespace.

    Change-Id: I30524dc77b743429ef70941479c9b6cccb21c23c
    Closes-Bug: #1557909
    (cherry picked from commit 9dc70ed77e055677a4bd3257a0e9e24239ed4cce)

tags: added: in-stable-liberty
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.1.1

This issue was fixed in the openstack/neutron 8.1.1 release.

Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 7.1.0

This issue was fixed in the openstack/neutron 7.1.0 release.

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 9.0.0.0b1

This issue was fixed in the openstack/neutron 9.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/327509

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Swaminathan Vasudevan (<email address hidden>) on branch: master
Review: https://review.openstack.org/327509
Reason: Will go ahead with the previous option.

Changed in neutron:
status: Fix Released → In Progress
description: updated
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/326729
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=acd04d668bd414cd21f2715adc6a35a0eaed59a3
Submitter: Jenkins
Branch: master

commit acd04d668bd414cd21f2715adc6a35a0eaed59a3
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Jun 7 13:31:56 2016 -0700

    DVR: Clean stale snat-ns by checking its existence when agent restarts

    At present there is no clear way to distinguish when the snat_namespace
    object is initialized and when the actual namespace is created.
    There is no way to check if the namespace already existed. The
    code was only checking at the snat_namespace object instead of its
    existence.

    This patch addresses the issue by adding in an exists method to the
    namespace object to identify the existence of the namespace in the
    given agent.

    This would allow us to check for the existence of the namespace,
    and also allow us to identify the stale snat namespace and
    delete the namespace when the gateway is cleared as the agent restarts.

    This also applies for conditions when the router is manually moved
    from one agent to another agent while the agent is dead. When the
    agent wakes up it would clean up the stale snat namespace.

    Change-Id: Icb00297208813436c2a9e9a003275462293ad643
    Closes-Bug: #1557909
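
The idea of checking the namespace itself rather than the Python object can be sketched as follows. This is an illustrative model only: the class `SnatNamespace`, the helper `cleanup_if_stale`, and the injected `list_namespaces` and `delete` callables are assumptions, not the names used in the merged patch.

```python
class SnatNamespace:
    """Namespace object with an exists() check (hypothetical model)."""

    def __init__(self, router_id, list_namespaces):
        self.name = "snat-" + router_id
        # Callable returning the namespace names present on this host,
        # injected so the sketch runs without root or real namespaces.
        self._list_namespaces = list_namespaces

    def exists(self):
        return self.name in self._list_namespaces()


def cleanup_if_stale(ns, agent_host, gw_port_host, delete):
    """Delete the namespace if it exists but the gateway lives elsewhere."""
    if gw_port_host != agent_host and ns.exists():
        delete(ns.name)
        return True
    return False


host_namespaces = {"snat-r1", "qrouter-r1"}
ns = SnatNamespace("r1", lambda: host_namespaces)

# The agent on vm1 restarts; the gateway port is now hosted on vm2.
print(cleanup_if_stale(ns, "vm1", "vm2", host_namespaces.discard))
print(sorted(host_namespaces))
```

Because the decision depends on `exists()`, it does not matter whether the object was created before or after the agent restarted.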

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/351923

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/351947

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/351947
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c334eb215aeeeb93e7c0ed6fce8b24e9ae8b2360
Submitter: Jenkins
Branch: stable/liberty

commit c334eb215aeeeb93e7c0ed6fce8b24e9ae8b2360
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Jun 7 13:31:56 2016 -0700

    DVR: Clean stale snat-ns by checking its existence when agent restarts

    At present there is no clear way to distinguish when the snat_namespace
    object is initialized and when the actual namespace is created.
    There is no way to check if the namespace already existed. The
    code was only checking at the snat_namespace object instead of its
    existence.

    This patch addresses the issue by adding in an exists method to the
    namespace object to identify the existence of the namespace in the
    given agent.

    This would allow us to check for the existence of the namespace,
    and also allow us to identify the stale snat namespace and
    delete the namespace when the gateway is cleared as the agent restarts.

    This also applies for conditions when the router is manually moved
    from one agent to another agent while the agent is dead. When the
    agent wakes up it would clean up the stale snat namespace.

    Closes-Bug: #1557909

    (cherry picked from commit acd04d668bd414cd21f2715adc6a35a0eaed59a3)

    Conflicts:
     neutron/agent/l3/agent.py
     neutron/agent/l3/dvr_edge_ha_router.py
     neutron/agent/l3/dvr_edge_router.py
     neutron/tests/functional/agent/l3/test_dvr_router.py

    Change-Id: Icb00297208813436c2a9e9a003275462293ad643

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/351923
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6e5892da1434a1a7674456aa4c897480fbb0208b
Submitter: Jenkins
Branch: stable/mitaka

commit 6e5892da1434a1a7674456aa4c897480fbb0208b
Author: Swaminathan Vasudevan <email address hidden>
Date: Tue Jun 7 13:31:56 2016 -0700

    DVR: Clean stale snat-ns by checking its existence when agent restarts

    At present there is no clear way to distinguish when the snat_namespace
    object is initialized and when the actual namespace is created.
    There is no way to check if the namespace already existed. The
    code was only checking at the snat_namespace object instead of its
    existence.

    This patch addresses the issue by adding in an exists method to the
    namespace object to identify the existence of the namespace in the
    given agent.

    This would allow us to check for the existence of the namespace,
    and also allow us to identify the stale snat namespace and
    delete the namespace when the gateway is cleared as the agent restarts.

    This also applies for conditions when the router is manually moved
    from one agent to another agent while the agent is dead. When the
    agent wakes up it would clean up the stale snat namespace.

    Change-Id: Icb00297208813436c2a9e9a003275462293ad643
    Closes-Bug: #1557909
    (cherry picked from commit acd04d668bd414cd21f2715adc6a35a0eaed59a3)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0b3

This issue was fixed in the openstack/neutron 9.0.0.0b3 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 7.2.0

This issue was fixed in the openstack/neutron 7.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 8.3.0

This issue was fixed in the openstack/neutron 8.3.0 release.
