reboot neutron-ovs-agent introduces a short interrupt of vlan traffic

Bug #1869808 reported by norman shen on 2020-03-31
This bug affects 1 person
Affects: neutron
Importance: Low
Assigned to: norman shen

Bug Description

We are using OpenStack Neutron 13.0.6, deployed with OpenStack-Helm.

I pinged servers in the same VLAN while restarting neutron-ovs-agent. The result is shown below:

root@mgt01:~# openstack server list
+--------------------------------------+-----------------+--------+------------------------------------------+------------------------------+-----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+-----------------+--------+------------------------------------------+------------------------------+-----------+
| 22d55077-b1b5-452e-8eba-cbcd2d1514a8 | test-1-1 | ACTIVE | vlan105=172.31.10.4 | Cirros 0.4.0 64-bit | m1.tiny |
| 726bc888-7767-44bc-b68a-7a1f3a6babf1 | test-1-2 | ACTIVE | vlan105=172.31.10.18 | Cirros 0.4.0 64-bit | m1.tiny |

$ ping 172.31.10.4
PING 172.31.10.4 (172.31.10.4): 56 data bytes
......
64 bytes from 172.31.10.4: seq=59 ttl=64 time=0.465 ms
64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms <--------
64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms
64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
64 bytes from 172.31.10.4: seq=65 ttl=64 time=0.441 ms
64 bytes from 172.31.10.4: seq=66 ttl=64 time=0.376 ms
64 bytes from 172.31.10.4: seq=67 ttl=64 time=0.481 ms

As one can see, the packet with seq 62 is lost, I believe while the OVS agent was restarting.
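
Editor's sketch, not part of the original report: a gap like the one above can be found mechanically by scanning the ping output for missing sequence numbers. The function name `missing_sequences` is hypothetical, and the sample is copied from the output above; the regex accepts both the `seq=` form shown here and the `icmp_seq=` form that other ping implementations print.

```python
import re
from typing import List


def missing_sequences(ping_output: str) -> List[int]:
    """Return sequence numbers absent between the first and last reply seen."""
    seen = sorted({int(n) for n in re.findall(r"seq=(\d+)", ping_output)})
    if not seen:
        return []
    present = set(seen)
    return [n for n in range(seen[0], seen[-1] + 1) if n not in present]


sample = """\
64 bytes from 172.31.10.4: seq=59 ttl=64 time=0.465 ms
64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms
64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms
64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
64 bytes from 172.31.10.4: seq=65 ttl=64 time=0.441 ms
64 bytes from 172.31.10.4: seq=66 ttl=64 time=0.376 ms
64 bytes from 172.31.10.4: seq=67 ttl=64 time=0.481 ms
"""

print(missing_sequences(sample))
```

Running this over a longer capture taken across several agent restarts would show whether the loss is consistently one packet or varies with startup time.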

Right now, I suspect that this code refreshes the flow table rules even though it is not necessary: https://github.com/openstack/neutron/blob/6d619ea7c13e89ec575295f04c63ae316759c50a/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py#L229

When I dump the flows on the physical bridge, I can see the duration rewind to 0, which suggests the flow has been deleted and created again.

""" duration=secs
              The time, in seconds, that the entry has been in the table.
              secs includes as much precision as the switch provides, possibly
              to nanosecond resolution.
"""

root@compute01:~# ovs-ofctl dump-flows br-floating
...
 cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100, n_bytes=103409,
                            ^------ this value resets
priority=4,in_port="phy-br-floating",dl_vlan=2 actions=mod_vlan_vid:105,NORMAL
...
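
Editor's sketch, not part of the original report: the duration check described above can be automated by comparing two `ovs-ofctl dump-flows` snapshots taken before and after the agent restart. The helper names (`parse_flows`, `reset_flows`) are hypothetical. The "before" line is the flow quoted above joined onto one line; the "after" line, including its cookie and counters, is made up for illustration. Comparing "new duration < old duration" is only a heuristic that works when the flow had been installed for a while before the restart.

```python
import re
from typing import Dict, List

# Fields that change between dumps and must not be part of a flow's identity.
VOLATILE = re.compile(
    r"\b(?:cookie|duration|n_packets|n_bytes|idle_age|hard_age)=[^,\s]+,?\s*"
)
DURATION = re.compile(r"\bduration=([0-9.]+)s")


def parse_flows(dump: str) -> Dict[str, float]:
    """Map each flow (table, match and actions) to its duration in seconds."""
    flows = {}
    for line in dump.splitlines():
        found = DURATION.search(line)
        if not found:
            continue  # skip headers, "..." and blank lines
        key = VOLATILE.sub("", line).strip()
        flows[key] = float(found.group(1))
    return flows


def reset_flows(before: str, after: str) -> List[str]:
    """Return flows present in both dumps whose duration went backwards."""
    old, new = parse_flows(before), parse_flows(after)
    return [key for key in new if key in old and new[key] < old[key]]


before = (
    ' cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100, '
    'n_bytes=103409, priority=4,in_port="phy-br-floating",dl_vlan=2 '
    'actions=mod_vlan_vid:105,NORMAL'
)
# Hypothetical dump taken shortly after the restart (values invented).
after = (
    ' cookie=0x1f0e4c2a9b7d3355, duration=1.204s, table=2, n_packets=3, '
    'n_bytes=294, priority=4,in_port="phy-br-floating",dl_vlan=2 '
    'actions=mod_vlan_vid:105,NORMAL'
)

for flow in reset_flows(before, after):
    print("reinstalled:", flow)
```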

IMO, restarting the ovs-agent should not affect the data plane.

Brian Haley (brian-haley) wrote :

Hi Norman - are you able to reproduce this on a later version of neutron like 15.0.x?

Since this is just causing a small impact and the restart of the ovs-agent isn't a frequent event, I'm going to set the priority to Low.

tags: added: ovs
Changed in neutron:
importance: Undecided → Low
Slawek Kaplonski (slaweq) wrote :

After a restart of neutron-ovs-agent, it should install all OF rules with a new cookie_id and then delete the old rules, so the duration of all of them should be reset.
Is your problem only on VLAN networks, or does it happen on other network types too?
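
Editor's sketch, not part of the original thread: the comment above describes a refresh in which rules are reinstalled under a new cookie before the old-cookie rules are deleted. The toy model below is an assumption-laden illustration of that ordering, not Neutron's or Open vSwitch's actual code. `ToyFlowTable`, the cookie values and the second (stale) flow are all hypothetical. In this model, re-adding a flow with the same table, priority and match replaces the entry, so its age restarts even though the match is never absent from the table.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

FlowKey = Tuple[int, int, str]  # (table, priority, match)


@dataclass
class Flow:
    cookie: int
    actions: str
    installed_at: float


class ToyFlowTable:
    """Minimal in-memory stand-in for one bridge's flow table."""

    def __init__(self) -> None:
        self.flows: Dict[FlowKey, Flow] = {}

    def add(self, key: FlowKey, actions: str, cookie: int, now: float) -> None:
        # Same key replaces the old entry, which restarts its duration.
        self.flows[key] = Flow(cookie, actions, now)

    def delete_by_cookie(self, cookie: int) -> None:
        self.flows = {k: f for k, f in self.flows.items() if f.cookie != cookie}

    def duration(self, key: FlowKey, now: float) -> float:
        return now - self.flows[key].installed_at


OLD_COOKIE, NEW_COOKIE = 0xA, 0xB  # hypothetical cookie values
vlan_flow = (2, 4, "in_port=phy-br-floating,dl_vlan=2")
stale_flow = (2, 4, "in_port=phy-br-floating,dl_vlan=7")  # hypothetical

table = ToyFlowTable()
table.add(vlan_flow, "mod_vlan_vid:105,NORMAL", OLD_COOKIE, now=0.0)
table.add(stale_flow, "mod_vlan_vid:200,NORMAL", OLD_COOKIE, now=0.0)

# Agent restarts at t=300: reinstall what is still wanted, then clean up.
table.add(vlan_flow, "mod_vlan_vid:105,NORMAL", NEW_COOKIE, now=300.0)
present_during_refresh = vlan_flow in table.flows
table.delete_by_cookie(OLD_COOKIE)

print(present_during_refresh)
print(table.duration(vlan_flow, 300.0))
print(stale_flow in table.flows)
```

Under this model a reset duration alone does not prove that traffic was interrupted, which is consistent with the later finding in this thread that the loss came from a different step of agent startup.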

LIU Yulong (dragon889) wrote :

Hi,
Please take a look at this fix: https://review.opendev.org/#/c/681462/. If your deployment does not have this fix, try to backport it and see if it helps.

norman shen (jshen28) wrote :

Thank you very much for the reply. I've checked the code and we do indeed have this patch applied already.

norman shen (jshen28) wrote :

But it seems to be a different problem, because the short interruption still exists.

Changed in neutron:
assignee: nobody → norman shen (jshen28)
status: New → In Progress
norman shen (jshen28) wrote :

I would like to remove the drop_port calls inside setup_physical_bridges because they seem like overkill;
for example, such flows are quickly overridden by DVR flows a few steps later.

I would appreciate some background on these two calls to understand why they were put there. From our point of view, a control plane failure should do little or no harm to user plane data flows. But blocking connections can backfire badly and make all VLANs unusable.

Besides, it also interrupts data flows during startup, depending on how fast the RPC call is made. So I'd really like to remove these two calls if there are no other security issues.

tags: added: l3-dvr-backlog

Reviewed: https://review.opendev.org/733568
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=90212b12cdf62e92d811997ebba699cab431d696
Submitter: Zuul
Branch: master

commit 90212b12cdf62e92d811997ebba699cab431d696
Author: shenjiatong <email address hidden>
Date: Thu Jun 18 15:33:13 2020 +0800

    Do not block connection between br-int and br-phys on startup

    Block traffic between br-int and br-physical is over kill
    and will at least

    1. interrupt vlan flow during startup, and is particularly
    so if dvr enabled
    2. if let's rabbitmq is not stable, it is possible data plane
    will be affected and vlan will never work.

    Using openstack on k8s particularly amplifies the problem
    because pod could be killed pretty easily by liveness
    probes.

    Change-Id: I51050c600ba7090fea71213687d94340bac0674a
    Closes-Bug: #1869808

Changed in neutron:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/738503
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=88e70a520acaca37db645c3ef1124df8c7d778d5
Submitter: Zuul
Branch: stable/ussuri

commit 88e70a520acaca37db645c3ef1124df8c7d778d5
Author: shenjiatong <email address hidden>
Date: Thu Jun 18 15:33:13 2020 +0800

    Do not block connection between br-int and br-phys on startup

    Block traffic between br-int and br-physical is over kill
    and will at least

    1. interrupt vlan flow during startup, and is particularly
    so if dvr enabled
    2. if let's rabbitmq is not stable, it is possible data plane
    will be affected and vlan will never work.

    Using openstack on k8s particularly amplifies the problem
    because pod could be killed pretty easily by liveness
    probes.

    Change-Id: I51050c600ba7090fea71213687d94340bac0674a
    Closes-Bug: #1869808
    (cherry picked from commit 90212b12cdf62e92d811997ebba699cab431d696)

tags: added: in-stable-ussuri
tags: added: in-stable-train

Reviewed: https://review.opendev.org/738504
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4f501f405d1c44e00784df8450cbe83129da1ea7
Submitter: Zuul
Branch: stable/train

commit 4f501f405d1c44e00784df8450cbe83129da1ea7
Author: shenjiatong <email address hidden>
Date: Thu Jun 18 15:33:13 2020 +0800

    Do not block connection between br-int and br-phys on startup

    Block traffic between br-int and br-physical is over kill
    and will at least

    1. interrupt vlan flow during startup, and is particularly
    so if dvr enabled
    2. if let's rabbitmq is not stable, it is possible data plane
    will be affected and vlan will never work.

    Using openstack on k8s particularly amplifies the problem
    because pod could be killed pretty easily by liveness
    probes.

    Change-Id: I51050c600ba7090fea71213687d94340bac0674a
    Closes-Bug: #1869808
    (cherry picked from commit 90212b12cdf62e92d811997ebba699cab431d696)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/738505
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6dfc35680fcc885d9ad449ca2b39225fb1bca898
Submitter: Zuul
Branch: stable/stein

commit 6dfc35680fcc885d9ad449ca2b39225fb1bca898
Author: shenjiatong <email address hidden>
Date: Thu Jun 18 15:33:13 2020 +0800

    Do not block connection between br-int and br-phys on startup

    Block traffic between br-int and br-physical is over kill
    and will at least

    1. interrupt vlan flow during startup, and is particularly
    so if dvr enabled
    2. if let's rabbitmq is not stable, it is possible data plane
    will be affected and vlan will never work.

    Using openstack on k8s particularly amplifies the problem
    because pod could be killed pretty easily by liveness
    probes.

    Change-Id: I51050c600ba7090fea71213687d94340bac0674a
    Closes-Bug: #1869808
    (cherry picked from commit 90212b12cdf62e92d811997ebba699cab431d696)

Reviewed: https://review.opendev.org/738512
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=131bbc9a53411033cf27664d8f1fd7afc72c57bf
Submitter: Zuul
Branch: stable/queens

commit 131bbc9a53411033cf27664d8f1fd7afc72c57bf
Author: shenjiatong <email address hidden>
Date: Thu Jun 18 15:33:13 2020 +0800

    Do not block connection between br-int and br-phys on startup

    Block traffic between br-int and br-physical is over kill
    and will at least

    1. interrupt vlan flow during startup, and is particularly
    so if dvr enabled
    2. if let's rabbitmq is not stable, it is possible data plane
    will be affected and vlan will never work.

    Using openstack on k8s particularly amplifies the problem
    because pod could be killed pretty easily by liveness
    probes.

    Conflicts:
        neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

    Change-Id: I51050c600ba7090fea71213687d94340bac0674a
    Closes-Bug: #1869808
    (cherry picked from commit 90212b12cdf62e92d811997ebba699cab431d696)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/738511
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cc48edf85cf66277423b0eb52ae6353f8028d2a6
Submitter: Zuul
Branch: stable/rocky

commit cc48edf85cf66277423b0eb52ae6353f8028d2a6
Author: shenjiatong <email address hidden>
Date: Thu Jun 18 15:33:13 2020 +0800

    Do not block connection between br-int and br-phys on startup

    Block traffic between br-int and br-physical is over kill
    and will at least

    1. interrupt vlan flow during startup, and is particularly
    so if dvr enabled
    2. if let's rabbitmq is not stable, it is possible data plane
    will be affected and vlan will never work.

    Using openstack on k8s particularly amplifies the problem
    because pod could be killed pretty easily by liveness
    probes.

    Conflicts:
        neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

    Change-Id: I51050c600ba7090fea71213687d94340bac0674a
    Closes-Bug: #1869808
    (cherry picked from commit 90212b12cdf62e92d811997ebba699cab431d696)

tags: added: in-stable-rocky

Change abandoned by Darragh O'Reilly (<email address hidden>) on branch: master
Review: https://review.opendev.org/740506
Reason: Will create another patchset that fixes both issues.

Related fix proposed to branch: master
Review: https://review.opendev.org/741444

Reviewed: https://review.opendev.org/738523
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1f4f888ad34d54ec968d9c9f9f80c388f3ca0d12
Submitter: Zuul
Branch: stable/pike

commit 1f4f888ad34d54ec968d9c9f9f80c388f3ca0d12
Author: shenjiatong <email address hidden>
Date: Thu Jun 18 15:33:13 2020 +0800

    Do not block connection between br-int and br-phys on startup

    Block traffic between br-int and br-physical is over kill
    and will at least

    1. interrupt vlan flow during startup, and is particularly
    so if dvr enabled
    2. if let's rabbitmq is not stable, it is possible data plane
    will be affected and vlan will never work.

    Using openstack on k8s particularly amplifies the problem
    because pod could be killed pretty easily by liveness
    probes.

    Conflicts:
        neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

    Change-Id: I51050c600ba7090fea71213687d94340bac0674a
    Closes-Bug: #1869808
    (cherry picked from commit 90212b12cdf62e92d811997ebba699cab431d696)

tags: added: in-stable-pike

Reviewed: https://review.opendev.org/740724
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8
Submitter: Zuul
Branch: master

commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808
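
Editor's sketch, not part of the commit: the commit message above argues that without the priority 2 drop flows in table 0, packets arriving from a physical bridge fall through to the NORMAL flow in table 60. The toy lookup below illustrates that reasoning only. The port names (`int-br-ex1`, `int-br-ex2`, `tap-vm`), the priority 0 default that sends table 0 traffic to table 60, and the priority of the NORMAL flow are assumptions made for the example, not values taken from the source.

```python
from typing import List, NamedTuple, Optional


class Flow(NamedTuple):
    table: int
    priority: int
    in_port: Optional[str]  # None matches any port
    action: str             # "drop", "NORMAL" or "goto:<table>"


def lookup(flows: List[Flow], in_port: str, table: int = 0) -> str:
    """Return the final action for a packet entering on in_port."""
    candidates = [
        f for f in flows if f.table == table and f.in_port in (None, in_port)
    ]
    if not candidates:
        return "drop"  # table miss treated as drop in this toy model
    best = max(candidates, key=lambda f: f.priority)
    if best.action.startswith("goto:"):
        return lookup(flows, in_port, int(best.action.split(":")[1]))
    return best.action


# br-int after a host reboot: only default flows (assumed priorities).
base = [
    Flow(table=0, priority=0, in_port=None, action="goto:60"),
    Flow(table=60, priority=3, in_port=None, action="NORMAL"),
]

# Drop flows for the patch ports that lead to the physical bridges.
drops = [
    Flow(table=0, priority=2, in_port="int-br-ex1", action="drop"),
    Flow(table=0, priority=2, in_port="int-br-ex2", action="drop"),
]

print(lookup(base, "int-br-ex1"))          # without drop flows
print(lookup(base + drops, "int-br-ex1"))  # with drop flows
print(lookup(base + drops, "tap-vm"))      # other ports are unaffected
```

With only the base flows, a frame from one physical bridge gets NORMAL (learning switch) treatment and can be flooded out of the patch port to the other physical bridge, which is the layer 2 loop the commit message warns about.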

Change abandoned by Rodolfo Alonso Hernandez (<email address hidden>) on branch: master
Review: https://review.opendev.org/741444
Reason: Superseded by https://review.opendev.org/#/c/740724/. Nice to see a better option.

Reviewed: https://review.opendev.org/742364
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=47ec363f5faefd85dfa33223c0087f4444afb5b9
Submitter: Zuul
Branch: stable/rocky

commit 47ec363f5faefd85dfa33223c0087f4444afb5b9
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808
    (cherry picked from commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8)
    (cherry picked from commit 143fe8ff89ba776618ed6291af9d5e28e4662bdb)
    (cherry picked from commit 6a861b8c8c28e5675ec2208057298b811ba2b649)
    (cherry picked from commit 8181c5dbfe799ac6c832ab67b7eab3bcef4098b9)
    Conflicts:
            neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

Reviewed: https://review.opendev.org/742366
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=00466f41d690ca7c7a918bfd861878ef620bbec9
Submitter: Zuul
Branch: stable/pike

commit 00466f41d690ca7c7a918bfd861878ef620bbec9
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808
    (cherry picked from commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8)
    (cherry picked from commit 143fe8ff89ba776618ed6291af9d5e28e4662bdb)
    (cherry picked from commit 6a861b8c8c28e5675ec2208057298b811ba2b649)
    (cherry picked from commit 8181c5dbfe799ac6c832ab67b7eab3bcef4098b9)
    Conflicts:
            neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py
    (cherry picked from commit 47ec363f5faefd85dfa33223c0087f4444afb5b9)
    Conflicts:
            neutron/tests/unit/plugins/ml2/drivers/openvswitch/agent/test_ovs_neutron_agent.py
    (cherry picked from commit 8a173ec29ac1819c3d28c191814cd1402d272bb9)

Reviewed: https://review.opendev.org/742365
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8a173ec29ac1819c3d28c191814cd1402d272bb9
Submitter: Zuul
Branch: stable/queens

commit 8a173ec29ac1819c3d28c191814cd1402d272bb9
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808
    (cherry picked from commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8)
    (cherry picked from commit 143fe8ff89ba776618ed6291af9d5e28e4662bdb)
    (cherry picked from commit 6a861b8c8c28e5675ec2208057298b811ba2b649)
    (cherry picked from commit 8181c5dbfe799ac6c832ab67b7eab3bcef4098b9)
    Conflicts:
            neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py
    (cherry picked from commit 47ec363f5faefd85dfa33223c0087f4444afb5b9)
    Conflicts:
            neutron/tests/unit/plugins/ml2/drivers/openvswitch/agent/test_ovs_neutron_agent.py

Reviewed: https://review.opendev.org/742361
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8181c5dbfe799ac6c832ab67b7eab3bcef4098b9
Submitter: Zuul
Branch: stable/stein

commit 8181c5dbfe799ac6c832ab67b7eab3bcef4098b9
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808
    (cherry picked from commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8)
    (cherry picked from commit 143fe8ff89ba776618ed6291af9d5e28e4662bdb)
    (cherry picked from commit 6a861b8c8c28e5675ec2208057298b811ba2b649)

Reviewed: https://review.opendev.org/742355
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=143fe8ff89ba776618ed6291af9d5e28e4662bdb
Submitter: Zuul
Branch: stable/ussuri

commit 143fe8ff89ba776618ed6291af9d5e28e4662bdb
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808
    (cherry picked from commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8)

Reviewed: https://review.opendev.org/742359
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=17eded13595b18ab60af5256e0f63c57c3702296
Submitter: Zuul
Branch: stable/train

commit 17eded13595b18ab60af5256e0f63c57c3702296
Author: Darragh O'Reilly <email address hidden>
Date: Mon Jul 13 14:48:10 2020 +0000

    Ensure drop flows on br-int at agent startup for DVR too

    Commit 90212b12 changed the OVS agent so adding vital drop flows on
    br-int (table 0 priority 2) for packets from physical bridges was
    deferred until DVR initialization later on. But if br-int has no flows
    from a previous run (eg after host reboot), then these packets will hit
    the NORMAL flow in table 60. And if there is more than one physical
    bridge, then the physical interfaces from the different bridges are now
    essentially connected at layer 2 and a network loop is possible in the
    time before the flows are added by DVR. Also the DVR code won't add them
    until after RPC calls to the server, so a loop is more likely if the
    server is not available.

    This patch restores adding these flows to when the physical bridges are
    first configured. Also updated a comment that was no longer correct and
    updated the unit test.

    Change-Id: I42c33fefaae6a7bee134779c840f35632823472e
    Closes-Bug: #1887148
    Related-Bug: #1869808
    (cherry picked from commit c1a77ef8b74bb9b5abbc5cb03fb3201383122eb8)
    (cherry picked from commit 143fe8ff89ba776618ed6291af9d5e28e4662bdb)
