Network loop between physical networks with DVR

Bug #1887148 reported by Darragh O'Reilly on 2020-07-10
This bug affects 2 people
Affects: neutron
Importance: High
Assigned to: Rodolfo Alonso

Bug Description

Our CI experienced a network loop caused by https://review.opendev.org/#/c/733568/ . It happens when DVR is enabled, there is more than one physical bridge mapping, and the neutron server is not available when the OVS agents are started.
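A quick way to check whether a node's configuration meets the first two conditions is sketched below. This is only an illustration: the helper name loop_preconditions and the "exposed" key are made up here, and the sample text reuses the [ovs] and [agent] options from the reproduction steps. It says nothing about whether the installed agent version actually carries the regression.

```python
import configparser

# Sample taken from the ml2_conf.ini options used in the reproduction steps.
SAMPLE = """
[ovs]
bridge_mappings = public:br-ex,physnet1:br-physnet1,physnet2:br-physnet2

[agent]
enable_distributed_routing = True
"""


def loop_preconditions(conf_text):
    """Report whether DVR is on and more than one physical bridge is mapped.

    Hypothetical helper, not part of neutron.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)

    raw = parser.get("ovs", "bridge_mappings", fallback="")
    # Each entry has the form <physnet>:<bridge>.
    mappings = dict(
        item.strip().split(":", 1) for item in raw.split(",") if item.strip()
    )
    bridges = sorted(set(mappings.values()))
    dvr = parser.getboolean("agent", "enable_distributed_routing", fallback=False)

    return {
        "physical_bridges": bridges,
        "dvr_enabled": dvr,
        # Both conditions from the description hold.
        "exposed": dvr and len(bridges) > 1,
    }


if __name__ == "__main__":
    print(loop_preconditions(SAMPLE))
```

The third condition (server unreachable at agent start) is a runtime state and cannot be read from the config file.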

Steps
=====
# add more physical bridges
ovs-vsctl add-br br-physnet1
ip link set dev br-physnet1 up

ovs-vsctl add-br br-physnet2
ip link set dev br-physnet2 up

# set a broadcast going from one bridge
ip address add 1.1.1.1/31 dev br-physnet1
arping -b -I br-physnet1 1.1.1.1

# listen on the other
tcpdump -eni br-physnet2

# Update /etc/neutron/plugins/ml2/ml2_conf.ini
[ml2_type_vlan]
network_vlan_ranges = public,physnet1,physnet2

[ovs]
datapath_type = system
bridge_mappings = public:br-ex,physnet1:br-physnet1,physnet2:br-physnet2
tunnel_bridge = br-tun
local_ip = 127.0.0.1

[agent]
tunnel_types = vxlan
root_helper_daemon = sudo /usr/local/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
root_helper = sudo /usr/local/bin/neutron-rootwrap /etc/neutron/rootwrap.conf
enable_distributed_routing = True
l2_population = True

# stop server and agent
systemctl stop devstack@q-svc
systemctl stop devstack@q-agt

# clear all flows
for BR in $(sudo ovs-vsctl list-br); do echo $BR; sudo ovs-ofctl del-flows $BR; done

# start agent
systemctl start devstack@q-agt

$ sudo tcpdump -eni br-physnet2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-physnet2, link-type EN10MB (Ethernet), capture size 262144 bytes
09:46:56.577183 e2:ab:d4:16:46:4d > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.1.1.1 (ff:ff:ff:ff:ff:ff) tell 1.1.1.1, length 28
09:46:57.577568 e2:ab:d4:16:46:4d > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.1.1.1 (ff:ff:ff:ff:ff:ff) tell 1.1.1.1, length 28
...

If there is more than one node running the ovs agent in this state, then there will be a network loop, and packets can multiply quickly and overwhelm the network. We saw ~1 million packets/sec.
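The multiplication can be pictured with a toy model. This is a sketch under simplifying assumptions, not a measurement: every node bridges the same two physical networks, each node floods every broadcast it receives onto the other network, a node never receives its own copy, and link capacity and MAC learning are ignored. The function name copies_per_hop is invented for the illustration.

```python
def copies_per_hop(num_nodes, hops, drop_installed):
    """Count copies of one broadcast frame in flight after each hop.

    Toy model: a "hop" is one crossing from one physical network to the
    other through the nodes' bridges.
    """
    counts = []
    in_flight = 1  # the original broadcast, sent by a host outside the nodes
    for hop in range(hops):
        # The first frame reaches every node; later copies reach every node
        # except the one that forwarded them.
        receivers = num_nodes if hop == 0 else num_nodes - 1
        if drop_installed:
            # Frames arriving from a physical bridge are dropped, so nothing
            # is forwarded to the other network.
            in_flight = 0
        else:
            in_flight *= receivers
        counts.append(in_flight)
    return counts


if __name__ == "__main__":
    print(copies_per_hop(2, 4, drop_installed=False))
    print(copies_per_hop(3, 4, drop_installed=False))
    print(copies_per_hop(3, 4, drop_installed=True))
```

In this model two nodes keep the same copies circulating indefinitely, and with three or more nodes the count grows on every hop, which fits how quickly the rate climbed for us.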

I think that because the neutron server is not available, the get_dvr_mac_address rpc is blocked and the required drop flows are not installed:
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py#L138
https://github.com/openstack/neutron/blob/5999716cfc4a00ac426e016eabbb51247ba0b190/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py#L230-L234

summary: - Network loop between physical network with DVR
+ Network loop between physical networks with DVR
description: updated

The call to setup_dvr_flows() also installs the drop rule. However, it never gets to run because setup_rpc blocks on the has_alive_neutron_server rpc. Some stable branches don't have this rpc, and they block on get_dvr_mac_address instead.

https://github.com/openstack/neutron/blob/5999716cfc4a00ac426e016eabbb51247ba0b190/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L268-L286
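The ordering problem can be sketched as a small model. Everything here is hypothetical: startup_events and its event strings are not neutron code, and the sequence is a simplification of what the flow dumps suggest (bridges configured, then the agent waits on the server). The flag drop_at_bridge_setup stands for installing the drop flows when the physical bridges are configured rather than leaving them to the DVR setup.

```python
def startup_events(server_up, drop_at_bridge_setup):
    """Return the ordered events of a simplified agent startup.

    Hypothetical model, not the agent's real call sequence.
    """
    events = ["physical bridges configured"]
    if drop_at_bridge_setup:
        # Drop flows go in together with the bridge setup.
        events.append("drop flows installed")

    if not server_up:
        # The agent waits here; nothing after this point runs.
        events.append("blocked waiting for server rpc")
        return events

    events.append("server rpc answered")
    if not drop_at_bridge_setup:
        # Only the DVR setup installs the drop flows in this variant.
        events.append("drop flows installed")
    return events


if __name__ == "__main__":
    print(startup_events(server_up=False, drop_at_bridge_setup=False))
    print(startup_events(server_up=False, drop_at_bridge_setup=True))
```

With the server down, the first call never reaches "drop flows installed", while the second has them in place before the agent starts waiting.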

Fix proposed to branch: master
Review: https://review.opendev.org/740506

Changed in neutron:
assignee: nobody → Darragh O'Reilly (darragh-oreilly)
status: New → In Progress

# for BR in br-int br-physnet1 br-physnet2; do echo -e '\n' $BR; ovs-ofctl dump-flows $BR; done

 br-int
 cookie=0x45cd6ea336bf1713, duration=388.709s, table=0, n_packets=0, n_bytes=0, priority=65535,vlan_tci=0x0fff/0x1fff actions=drop
 cookie=0x45cd6ea336bf1713, duration=388.712s, table=0, n_packets=396, n_bytes=17144, priority=0 actions=resubmit(,60)
 cookie=0x45cd6ea336bf1713, duration=388.713s, table=23, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x45cd6ea336bf1713, duration=388.710s, table=24, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x45cd6ea336bf1713, duration=388.710s, table=60, n_packets=396, n_bytes=17144, priority=3 actions=NORMAL
 cookie=0x45cd6ea336bf1713, duration=388.708s, table=61, n_packets=0, n_bytes=0, priority=3 actions=NORMAL

 br-physnet1
 cookie=0x43cb683a62e439ee, duration=388.668s, table=0, n_packets=396, n_bytes=17144, priority=0 actions=NORMAL

 br-physnet2
 cookie=0xb9eab97d187e7034, duration=388.655s, table=0, n_packets=396, n_bytes=17144, priority=0 actions=NORMAL
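Checking a dump like this by eye is error-prone, so here is a rough sketch that scans `ovs-ofctl dump-flows br-int` output for a table 0, priority 2 drop flow matching on in_port. The helper name has_phys_drop_flow is made up, and the exact text of a healthy flow (priority=2,in_port=... actions=drop) is an assumption about the dump format, not copied from a real agent.

```python
import re

# br-int flows from the dump above (agent started while the server was down).
BR_INT_DUMP = """\
 cookie=0x45cd6ea336bf1713, duration=388.709s, table=0, n_packets=0, n_bytes=0, priority=65535,vlan_tci=0x0fff/0x1fff actions=drop
 cookie=0x45cd6ea336bf1713, duration=388.712s, table=0, n_packets=396, n_bytes=17144, priority=0 actions=resubmit(,60)
 cookie=0x45cd6ea336bf1713, duration=388.713s, table=23, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x45cd6ea336bf1713, duration=388.710s, table=24, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x45cd6ea336bf1713, duration=388.710s, table=60, n_packets=396, n_bytes=17144, priority=3 actions=NORMAL
 cookie=0x45cd6ea336bf1713, duration=388.708s, table=61, n_packets=0, n_bytes=0, priority=3 actions=NORMAL
"""


def has_phys_drop_flow(dump):
    """Return True if any flow is a table 0, priority 2, in_port drop."""
    for line in dump.splitlines():
        match_part, _, actions = line.strip().partition(" actions=")
        if actions != "drop":
            continue
        if (
            re.search(r"\btable=0,", match_part)
            and re.search(r"\bpriority=2,", match_part)
            and "in_port=" in match_part
        ):
            return True
    return False


if __name__ == "__main__":
    print(has_phys_drop_flow(BR_INT_DUMP))
```

For the dump above this reports that no such flow exists, which matches all 396 packets going through priority=0 to the NORMAL flow in table 60.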

Bogdan Dobrelya (bogdando) wrote :

Did I understand the situation (with all the provided background) correctly?

If rpc is set up first, the rabbit gets muted and rpc cannot be checked before the bridges are set up (in secure mode, no matching rules means no flow).

If the bridges are set up first, the agent blocks on rpc and no blocking rules get installed on the bridges, which causes an ARP storm of ~1 million packets/sec.

And with the blocking rule (the one removed in https://review.opendev.org/#/c/733568/), is there dataplane downtime each time the neutron ovs agent is restarted?

If so, the only solution I can think of (after reverting the latter patch) is to run the rpc check and the bridge setup truly in parallel, with strict timeouts and a single retry for each?

Change abandoned by Darragh O'Reilly (<email address hidden>) on branch: master
Review: https://review.opendev.org/740506
Reason: Will create another patchset that fixes both issues.

Changed in neutron:
importance: Undecided → High

Fix proposed to branch: master
Review: https://review.opendev.org/741444

Changed in neutron:
assignee: Darragh O'Reilly (darragh-oreilly) → Rodolfo Alonso (rodolfo-alonso-hernandez)

Reviewed: https://review.opendev.org/740724
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8
Submitter: Zuul
Branch: master

commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808

Changed in neutron:
status: In Progress → Fix Released

Change abandoned by Rodolfo Alonso Hernandez (<email address hidden>) on branch: master
Review: https://review.opendev.org/741444
Reason: Superseded by https://review.opendev.org/#/c/740724/. Nice to see a better option.

Reviewed: https://review.opendev.org/742364
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=47ec363f5faefd85dfa33223c0087f4444afb5b9
Submitter: Zuul
Branch: stable/rocky

commit 47ec363f5faefd85dfa33223c0087f4444afb5b9
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808
    (cherry picked from commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8)
    (cherry picked from commit 143fe8ff89ba776618ed6291af9d5e28e4662bdb)
    (cherry picked from commit 6a861b8c8c28e5675ec2208057298b811ba2b649)
    (cherry picked from commit 8181c5dbfe799ac6c832ab67b7eab3bcef4098b9)
    Conflicts:
            neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

tags: added: in-stable-rocky
tags: added: in-stable-pike

Reviewed: https://review.opendev.org/742366
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=00466f41d690ca7c7a918bfd861878ef620bbec9
Submitter: Zuul
Branch: stable/pike

commit 00466f41d690ca7c7a918bfd861878ef620bbec9
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808
    (cherry picked from commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8)
    (cherry picked from commit 143fe8ff89ba776618ed6291af9d5e28e4662bdb)
    (cherry picked from commit 6a861b8c8c28e5675ec2208057298b811ba2b649)
    (cherry picked from commit 8181c5dbfe799ac6c832ab67b7eab3bcef4098b9)
    Conflicts:
            neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py
    (cherry picked from commit 47ec363f5faefd85dfa33223c0087f4444afb5b9)
    Conflicts:
            neutron/tests/unit/plugins/ml2/drivers/openvswitch/agent/test_ovs_neutron_agent.py
    (cherry picked from commit 8a173ec29ac1819c3d28c191814cd1402d272bb9)

Reviewed: https://review.opendev.org/742365
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8a173ec29ac1819c3d28c191814cd1402d272bb9
Submitter: Zuul
Branch: stable/queens

commit 8a173ec29ac1819c3d28c191814cd1402d272bb9
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808
    (cherry picked from commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8)
    (cherry picked from commit 143fe8ff89ba776618ed6291af9d5e28e4662bdb)
    (cherry picked from commit 6a861b8c8c28e5675ec2208057298b811ba2b649)
    (cherry picked from commit 8181c5dbfe799ac6c832ab67b7eab3bcef4098b9)
    Conflicts:
            neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py
    (cherry picked from commit 47ec363f5faefd85dfa33223c0087f4444afb5b9)
    Conflicts:
            neutron/tests/unit/plugins/ml2/drivers/openvswitch/agent/test_ovs_neutron_agent.py

tags: added: in-stable-queens
tags: added: in-stable-stein

Reviewed: https://review.opendev.org/742361
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8181c5dbfe799ac6c832ab67b7eab3bcef4098b9
Submitter: Zuul
Branch: stable/stein

commit 8181c5dbfe799ac6c832ab67b7eab3bcef4098b9
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808
    (cherry picked from commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8)
    (cherry picked from commit 143fe8ff89ba776618ed6291af9d5e28e4662bdb)
    (cherry picked from commit 6a861b8c8c28e5675ec2208057298b811ba2b649)

Reviewed: https://review.opendev.org/742355
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=143fe8ff89ba776618ed6291af9d5e28e4662bdb
Submitter: Zuul
Branch: stable/ussuri

commit 143fe8ff89ba776618ed6291af9d5e28e4662bdb
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808
    (cherry picked from commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8)

tags: added: in-stable-ussuri

Reviewed: https://review.opendev.org/742359
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=17eded13595b18ab60af5256e0f63c57c3702296
Submitter: Zuul
Branch: stable/train

commit 17eded13595b18ab60af5256e0f63c57c3702296
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808
    (cherry picked from commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8)
    (cherry picked from commit 143fe8ff89ba776618ed6291af9d5e28e4662bdb)

tags: added: in-stable-train
tags: added: neutron-proactive-backport-potential