[L2][scale issue] ovs-agent meets unexpected flow lost

Bug #1813714 reported by LIU Yulong on 2019-01-29
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Undecided
LIU Yulong

Bug Description

Ovs-agent will lost some flows during restart, for instance, flows to DHCP or L3, tunnel flows. These lost flows can sometimes cause VM failed to boot or dataplane down.
When subnets or security group ports quantity reaches 2000+, this issue can be seen in high probability.

This is a subproblem of bug #1813703, for more information, please see the summary:
https://bugs.launchpad.net/neutron/+bug/1813703

We have seen such scenarios happening at our customer environment as well.
Timing issues and instability of the ovs-agent is the cause of the issue.

tags: added: l2-pop ovs
LIU Yulong (dragon889) on 2019-02-22
summary: - [L2][scale issue] ovs-agent meet unexpected flow lost
+ [L2][scale issue] ovs-agent meets unexpected flow lost

Fix proposed to branch: master
Review: https://review.openstack.org/640797

Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
status: New → In Progress
Changed in neutron:
assignee: LIU Yulong (dragon889) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → LIU Yulong (dragon889)

Reviewed: https://review.openstack.org/640797
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a5244d6d44d2b66de27dc77efa7830fa657260be
Submitter: Zuul
Branch: master

commit a5244d6d44d2b66de27dc77efa7830fa657260be
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961

Changed in neutron:
status: In Progress → Fix Released

This issue was fixed in the openstack/neutron 14.0.0.0rc1 release candidate.

Reviewed: https://review.openstack.org/645408
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=62fe7852bbd70a24174853997096c52ee015e269
Submitter: Zuul
Branch: stable/pike

commit 62fe7852bbd70a24174853997096c52ee015e269
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
     neutron/plugins/ml2/rpc.py

    Conflicts:
     neutron/plugins/ml2/drivers/l2pop/mech_driver.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)
    (cherry picked from commit cc49ab550179bdc76d79f48be67886681cb43d4e)
    (cherry picked from commit 5ffca4966877454c605442e9e429aa83ea7d7348)

tags: added: in-stable-pike

Reviewed: https://review.openstack.org/645405
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cc49ab550179bdc76d79f48be67886681cb43d4e
Submitter: Zuul
Branch: stable/rocky

commit cc49ab550179bdc76d79f48be67886681cb43d4e
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
     neutron/plugins/ml2/rpc.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)

tags: added: in-stable-rocky

This issue was fixed in the openstack/neutron 11.0.7 release.

This issue was fixed in the openstack/neutron 13.0.3 release.

This issue was fixed in the openstack/neutron 12.0.6 release.

Reviewed: https://review.opendev.org/649729
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9583dc0549da2b4529a59b5862ba42aebc5ae15f
Submitter: Zuul
Branch: stable/ocata

commit 9583dc0549da2b4529a59b5862ba42aebc5ae15f
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
     neutron/plugins/ml2/rpc.py

    Conflicts:
     neutron/plugins/ml2/drivers/l2pop/mech_driver.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)
    (cherry picked from commit cc49ab550179bdc76d79f48be67886681cb43d4e)
    (cherry picked from commit 5ffca4966877454c605442e9e429aa83ea7d7348)

tags: added: in-stable-ocata
tags: added: neutron-proactive-backport-potential

Reviewed: https://review.opendev.org/641866
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=76c028063582ced23cb5bc38c81f57b422b55900
Submitter: Zuul
Branch: master

commit 76c028063582ced23cb5bc38c81f57b422b55900
Author: LIU Yulong <email address hidden>
Date: Fri Mar 8 07:53:55 2019 +0800

    Remove the l2pop agent_boot_time config

    It was marked as deprecated, so let's do a quick
    removal.

    Related-Bug: #1813714
    Change-Id: Ibc039b34b826641811a7e08b0d1bff0fd21b9193

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers