Centralized SNAT failover does not recover until "systemctl restart neutron-l3-agent" on transferred node

Bug #1881995 reported by Machi Hoshino
Status: In Progress
Assigned to: Ann Taraday

Bug Description


OVSGTW DVR Mode: dvr_snat
CMP DVR Mode: dvr
No L3 HA

Use Case: Centralized FIPs (aka Floating IPs against unbound ports)

**How to reproduce**

1. Create a VM as usual

2. Create an allowed-address-pair port against the VM port

openstack port list --server <server_name> # Get port id
openstack port create --security-group <sec_group> --fixed-ip subnet=<subnet>,ip-address=<ip_address> --network <network name> <port name>
openstack port set --allowed-address ip-address=<ip_address> <server port>

3. Assign a floating IP to the port

openstack floating ip set --port <port_name> <floating_ip>

4. Inside the deployed VM, create an IP alias for the new IP address

ip addr add <ip_address>/24 dev ens3

5. Identify which gateway node is hosting the centralized FIP

neutron l3-agent-list-hosting-router <router>

6. Perform manual failover

neutron l3-agent-router-remove <hosting-l3-agent> <router>
neutron l3-agent-router-add <new-l3-agent> <router>

(Or) Perform automatic failover

shutdown -h now (on hosting gtw)

7. Confirm that the failover to the new node happened

neutron l3-agent-list-hosting-router <router>
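
To observe whether connectivity to the floating IP comes back after the failover, a small polling helper can be run from an external client. This is only a sketch: `wait_until` is a hypothetical helper name, not part of any OpenStack tooling, and the timeout and interval are whole seconds chosen by the caller.

```shell
# Run a probe command repeatedly until it succeeds or the timeout expires.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# INTERVAL_SECONDS must be an integer >= 1, otherwise the loop never times out.
wait_until() {
    timeout="$1"
    interval="$2"
    shift 2
    elapsed=0
    # Keep probing while the command fails; its output is discarded.
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}
```

For example, `wait_until 120 2 ping -c 1 -W 1 <floating_ip>` returns 0 as soon as one ping succeeds and 1 if none succeeds within roughly 120 seconds (the time spent in each probe is not counted).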

**Expected Result**

Connection to floating ip address recovers automatically

**Actual Result**

Connection does not recover. The issue reproduces 100% of the time.

**How to recover**

Restart "neutron-l3-agent" on the hosting node (after failover). Connectivity recovers within a few seconds.

systemctl restart neutron-l3-agent

**Additional information**

After failover, the SNAT namespace does not have the sysctl settings that should be applied upon namespace creation. We have also confirmed that setting them manually fixes the issue.
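
The manual fix mentioned above can be sketched roughly as follows. The target values (ip_forward=1, arp_ignore=1, arp_announce=2) are an assumption based on what Neutron's router namespace setup normally applies; please verify them against your release. `apply_snat_sysctls` is a hypothetical helper name, and `DRY_RUN=1` only prints the commands instead of running them.

```shell
# Re-apply the IPv4 sysctl settings expected in a router namespace.
# Usage: apply_snat_sysctls <namespace>
# With DRY_RUN=1 the commands are printed, not executed (no root needed).
apply_snat_sysctls() {
    ns="$1"
    # Assumed target values; check them against your Neutron release.
    for setting in \
        net.ipv4.ip_forward=1 \
        net.ipv4.conf.all.arp_ignore=1 \
        net.ipv4.conf.all.arp_announce=2
    do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ip netns exec $ns sysctl -w $setting"
        else
            ip netns exec "$ns" sysctl -w "$setting" || return 1
        fi
    done
}
```

Running `DRY_RUN=1 apply_snat_sysctls snat-<router_id>` first shows what would be changed; without `DRY_RUN` the function has to be run as root on the node that hosts the SNAT namespace.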


The sysctl values after failover are as follows:
root@gtw03:~# ip netns exec snat-8737216a-b561-434f-a023-1d9cae2ce04e sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 0
root@gtw03:~# ip netns exec snat-8737216a-b561-434f-a023-1d9cae2ce04e sysctl net.ipv4.conf.all.arp_ignore
net.ipv4.conf.all.arp_ignore = 0
root@gtw03:~# ip netns exec snat-8737216a-b561-434f-a023-1d9cae2ce04e sysctl net.ipv4.conf.all.arp_announce
net.ipv4.conf.all.arp_announce = 0
root@gtw03:~# ip netns exec snat-8737216a-b561-434f-a023-1d9cae2ce04e sysctl net.ipv6.conf.all.forwarding
net.ipv6.conf.all.forwarding = 1
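
A check like the one above can be automated by feeding the sysctl output into a small comparison function. This is a sketch under assumptions: `check_snat_sysctls` is a hypothetical name, and the expected IPv4 values (1, 1, 2) are what a correctly initialized namespace is assumed to have; other keys are ignored.

```shell
# Read "key = value" lines (as printed by sysctl) on stdin and report every
# IPv4 setting that differs from the assumed expected value.
# Returns 1 if at least one mismatch was found, 0 otherwise.
check_snat_sysctls() {
    bad=0
    while IFS= read -r line; do
        key=$(printf '%s' "$line" | cut -d= -f1 | tr -d ' ')
        val=$(printf '%s' "$line" | cut -d= -f2 | tr -d ' ')
        case "$key" in
            net.ipv4.ip_forward)            want=1 ;;
            net.ipv4.conf.all.arp_ignore)   want=1 ;;
            net.ipv4.conf.all.arp_announce) want=2 ;;
            *) continue ;;   # keys without an assumed value are skipped
        esac
        if [ "$val" != "$want" ]; then
            echo "MISMATCH $key: got $val, expected $want"
            bad=1
        fi
    done
    return "$bad"
}

# Possible usage on the gateway node (as root):
#   ip netns exec snat-<router_id> sysctl net.ipv4.ip_forward \
#       net.ipv4.conf.all.arp_ignore net.ipv4.conf.all.arp_announce \
#       | check_snat_sysctls
```

Fed with the values shown above, the function reports all three IPv4 settings as mismatches.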

We believe this is caused by the following commits, which only perform the initialization when neutron-l3-agent starts.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/734070

Changed in neutron:
assignee: nobody → Ann Taraday (akamyshnikova)
status: New → In Progress
Revision history for this message
Ann Taraday (akamyshnikova) wrote :

If the default host configuration disables IP forwarding (net.ipv4.ip_forward = 0), this issue persists because the snat namespace is not initialized properly. I proposed a change that adds validation.

Revision history for this message
LIU Yulong (dragon889) wrote :

It's not reproducible in stable/queens locally. The sysctl configuration in my test environment works fine.

How many routers or ports do your network nodes have? One issue I can imagine is that when your ovs-agent is busy, the rescheduled router ports on the new L3 agent may not be set up properly, or may take too long to configure. That could make it look as if the L3 agent is not working properly.

Changed in neutron:
importance: Undecided → Medium
tags: added: l3-dvr-backlog
Revision history for this message
Ann Taraday (akamyshnikova) wrote :

It is reproduced on my multinode devstack with stable/queens with one router and no VMs or other ports. Logs reproducing the issue: http://paste.openstack.org/show/794506/
As you can see, no ip_forward setting is applied.
Some debug logs of the issue: http://paste.openstack.org/show/794507/ - in these logs you can see that the snat namespace does not exist when _create_dvr_gateway starts; it was created later, but was not properly initialized.

The odd thing about the reproduction is that I also see a different behavior of the L3 agent (dvr_snat) when rescheduling the router: http://paste.openstack.org/show/794508/ - in this case the snat namespace was properly initialized at the very beginning. But after several reschedule attempts I hit the issue anyway.

Revision history for this message
LIU Yulong (dragon889) wrote :

Looks like there is a race condition between 2 different RPCs: "router_added_to_agent" and "routers updated notification" according to your paste log [1].

"router_added_to_agent" has request-id req-4746c297-5636-4ecd-bc8e-e6bc862bd2b1, while "routers updated notification" has req-700a760c-2f07-4744-9db1-761e3620e168.

The root cause could be that the router_info is still in the agent cache, so one RPC hits the update procedure, which does not run initialize.

[1] http://paste.openstack.org/show/794506/

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

The issue reproduces on master; attaching neutron-l3-agent logs.
(timestamp 2020-06-10 11:48:07.052)

Revision history for this message
LIU Yulong (dragon889) wrote :

According to the log in your last comment [1], we can see a snat namespace destroy message:

2020-06-10 11:47:55.161 4887 DEBUG neutron.agent.l3.dvr_snat_ns [-] DVR: destroy snat ns: snat-b81dfaa6-21ba-4736-ad2c-2f6de8a297c4 delete /opt/stack/neutron/neutron/agent/l3/dvr_snat_ns.py:60

and then there is a namespace list result which shows that the qrouter namespace is still there (but no snat namespace):

2020-06-10 11:47:55.185 4900 DEBUG oslo.privsep.daemon [-] privsep: reply[140642348147824]: (4, ['qdhcp-367c9eaf-7acb-48c8-96f7-bb7da6e92e9d', 'qdhcp-780abcc2-7dbc-43d8-b242-bb5b3b8b22f8', 'fip-85fc5fbc-dbb4-4bc8-9fc5-6ee87e031b1b', 'qrouter-b81dfaa6-21ba-4736-ad2c-2f6de8a297c4', 'qdhcp-0ec11f4e-9391-4a68-a28c-bc9a20267219']) _call_back /usr/local/lib/python3.6/dist-packages/oslo_privsep/daemon.py:475

Based on the code path [2], "destroy snat ns" is the last action of a router info deletion. But why is the qrouter namespace still there? This could indicate that the router is still in the router info cache, which would cause your next router_added_to_agent action to skip the initialize function.

[1] https://launchpadlibrarian.net/483658618/neutron-l3-agent.log
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/dvr_edge_router.py#L236
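
The same observation (qrouter namespace present, snat namespace missing) can be checked on a node from the output of `ip netns list`. A minimal sketch follows; `router_ns_state` is a hypothetical helper name, and it only looks at namespace names, not at the agent's router info cache.

```shell
# Read namespace names on stdin (one per line, as from `ip netns list`;
# anything after the first word, such as "(id: 3)", is ignored) and report
# whether the qrouter and snat namespaces of the given router are present.
# Usage: ip netns list | router_ns_state <router_id>
router_ns_state() {
    router_id="$1"
    has_qrouter=no
    has_snat=no
    while read -r name _rest; do
        case "$name" in
            "qrouter-$router_id") has_qrouter=yes ;;
            "snat-$router_id")    has_snat=yes ;;
        esac
    done
    echo "qrouter=$has_qrouter snat=$has_snat"
}
```

With the namespace list from the log line above and router ID b81dfaa6-21ba-4736-ad2c-2f6de8a297c4, this prints `qrouter=yes snat=no`.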

Revision history for this message
LIU Yulong (dragon889) wrote :

So based on my comment #7, please add some logging to the namespace deletion, router delete, and router_info cache delete code paths to dig up more information. Yes, IMO these logs could be accepted upstream, as they should be useful for troubleshooting.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/736269

Revision history for this message
LIU Yulong (dragon889) wrote :

@Ann, I'm adding some logs to the L3 agent; please test with that patch so we can dig into more details of this issue.

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

Thanks a lot for creating the log patch!
I will upload logs with a repro as soon as I get to it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/734070
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3754bba8068134b0c7be294efda02b8fa33341aa
Submitter: Zuul
Branch: master

commit 3754bba8068134b0c7be294efda02b8fa33341aa
Author: Ann Taraday <email address hidden>
Date: Mon Jun 8 16:45:05 2020 +0400

    Validate that snat namespace exits in _create_dvr_gateway

    During rescheduling dvr router snat namespace may not be
    created due to race between router added and router
    updated updated notifications.

    Verify that snat namespace exits or create one.

    Partial-bug: 1881995

    Change-Id: Ic28ce249d59264b0b882bd1cc3c9fb55854a6a47

Revision history for this message
Ann Taraday (akamyshnikova) wrote :

I applied the logging change and reproduced the issue.
It reproduced several times; one of the timestamps is 2020-07-08 09:23:51.736

Revision history for this message
Ann Taraday (akamyshnikova) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/736269
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/736269
Committed: https://opendev.org/openstack/neutron/commit/ac1597d00911fefdd5dec388052cb8fc76f30965
Submitter: "Zuul (22348)"
Branch: master

commit ac1597d00911fefdd5dec388052cb8fc76f30965
Author: LIU Yulong <email address hidden>
Date: Wed Jun 17 23:18:04 2020 +0800

    [L3] Add some logs for router processing

    In order to dig the real action of a ResourceUpdate, add logs for:
    1. add/update router
    2. delete router
    3. delete namespace
    4. agent extension router add/delete/update actions

    Change-Id: I5c0ff485cd0c966afe535f8063deca6e410e012d
    Related-bug: #1881995
