[L2] [summary] ovs-agent issues at large scale

Bug #1813703 reported by LIU Yulong on 2019-01-29
This bug affects 7 people
Affects: neutron
Importance: High
Assigned to: LIU Yulong

Bug Description

We recently tested the ovs-agent with the openvswitch flow-based security group and ran into several issues at large scale. This bug provides a central location to track the following problems.

Problems:
(1) RPC timeout during ovs-agent restart
https://bugs.launchpad.net/neutron/+bug/1813704
(2) local connection to ovs-vswitchd was dropped or timed out
https://bugs.launchpad.net/neutron/+bug/1813705
(3) ovs-agent failed to restart
https://bugs.launchpad.net/neutron/+bug/1813706
(4) ovs-agent restart takes too long (15-40+ mins)
https://bugs.launchpad.net/neutron/+bug/1813707
(5) unexpected flow lost
https://bugs.launchpad.net/neutron/+bug/1813714
(6) unexpected tunnel lost
https://bugs.launchpad.net/neutron/+bug/1813715
(7) flows with multiple cookies (stale flows)
https://bugs.launchpad.net/neutron/+bug/1813712
(8) dump-flows takes a lot of time
https://bugs.launchpad.net/neutron/+bug/1813709
(9) troubleshooting is really hard when a VM loses its connection, because the flow tables are almost unreadable (30k+ flows).
https://bugs.launchpad.net/neutron/+bug/1813708

The problems can be seen in the following scenarios:
(1) 2000-3000 ports related to one single security group (or one remote security group)
(2) create 2000-3000 VMs in one single subnet (network)
(3) create 2000-3000 VMs under one single security group

Yes, scale is the main problem. When the VM count on one host approaches 150-200 (and, at the same time, the number of ports in one subnet or security group approaches 2000), ovs-agent restart gets worse.

Test ENV:
stable/queens

Deployment topology:
neutron-server, the database, and the message queue each have their own dedicated physical hosts, with at least 3 nodes per service.

Configurations:
The ovs-agent was set up with l2pop and the OVS flow-based security group; the config was basically as follows:
[agent]
enable_distributed_routing = True
l2_population = True
tunnel_types = vxlan
arp_responder = True
prevent_arp_spoofing = True
extensions = qos
report_interval = 60

[ovs]
bridge_mappings = tenant:br-vlan,external:br-ex
local_ip = 10.114.4.48

[securitygroup]
firewall_driver = openvswitch
enable_security_group = True

Some issue tracking:
(1) mostly caused by the large number of ports related to one security group or in one network
(2) unnecessary RPC calls during ovs-agent restart
(3) inefficient database query conditions
(4) full sync is redone again and again if any exception is raised in rpc_loop
(5) cleaning stale flows dumps all flows first (not once, but multiple times), which is really time-consuming

So this is a summary bug for all the scale issues we have met.

Some potential solutions:
Increasing config options such as rpc_response_timeout, of_connect_timeout, of_request_timeout, and ovsdb_timeout
does not help much, and these changes can make the restart take much longer. The issues can still be seen.

One workaround is to disable the openvswitch flow-based security group; the ovs-agent can then restart in less than 10 mins.

LIU Yulong (dragon889) on 2019-01-29
description: updated

It is good to collect all these bugs in a single location.
Thanks for the update.

tags: added: ovs
tags: added: l2-pop

Not sure whether each of the sub-bugs listed here can be fixed individually.
We have seen these problems at scale as well with our customers.
For the purpose of fixing things, as I mentioned in one of the bugs, there are a couple of items that we can separate from this discussion.
1. Make ovs-agent to ovs-vswitchd communication robust at scale, so that it does not get locked or disconnected.

2. Introduce some sort of throttle mechanism for syncing the port details when there is a sync.
   (Maybe suggest some config options for the rabbitmq configuration to get rid of timeouts and handle the RPC calls.)

3. On the server side, make sure it can handle 2000+ ports on a single subnet. The full sync might not happen from all nodes at the same time, but the issue here is a single subnet hosting 2000+ ports. There may be some tuning that we can do in the DB lookup for each port based on the subnet/network.
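
One simple form of the throttling suggested in item 2 is to spread full syncs over a time window, so that agents restarted together do not all hit the server at the same moment. The following is only a sketch under assumptions: the function name and both parameters are hypothetical and are not neutron options.

```python
import random


def full_sync_delay(base_delay, max_jitter, rng=random):
    """Return how long an agent waits before starting a full sync.

    base_delay and max_jitter are hypothetical tuning knobs, in seconds.
    """
    if base_delay < 0 or max_jitter < 0:
        raise ValueError("delays must be non-negative")
    # A uniform random offset spreads agents across the window
    # [base_delay, base_delay + max_jitter].
    return base_delay + rng.uniform(0, max_jitter)


# 200 agents restarted together pick different start times.
rng = random.Random(42)
delays = [full_sync_delay(5, 60, rng) for _ in range(200)]
```

Jitter does not reduce the total work on the server; it only flattens the peak, so it would complement rather than replace the DB and RPC tuning mentioned above.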

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
LIU Yulong (dragon889) wrote :

Yes, some of these sub-bugs may not be fixed in a short time. But at least we have this location where we can trace all the issues, and cloud users or developers can get some inspiration here.

Dongcan Ye (hellochosen) wrote :

Good job. Recovery after all nodes go down also seems to be a problem here.

LIU Yulong (dragon889) on 2019-02-20
Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
LIU Yulong (dragon889) wrote :

I will submit some fixes of these bugs, so let me paste the way to test them:
http://paste.openstack.org/show/745685/

Related fix proposed to branch: master
Review: https://review.openstack.org/638642

Related fix proposed to branch: master
Review: https://review.openstack.org/638643

Related fix proposed to branch: master
Review: https://review.openstack.org/638644

Related fix proposed to branch: master
Review: https://review.openstack.org/638645

Related fix proposed to branch: master
Review: https://review.openstack.org/638646

LIU Yulong (dragon889) wrote :

As you may have noticed, I've uploaded 7 small changes. Yes, we cannot conquer this giant beast once and for all. Each one may relate to one or a few small child issues. Again, anyone who wants to test these patch sets, please follow the guide here:
http://paste.openstack.org/show/745685/

LIU Yulong (dragon889) on 2019-02-26
Changed in neutron:
status: Confirmed → In Progress

Change abandoned by LIU Yulong (<email address hidden>) on branch: master
Review: https://review.openstack.org/638643
Reason: revisit if needed

Changed in neutron:
assignee: LIU Yulong (dragon889) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → LIU Yulong (dragon889)
LIU Yulong (dragon889) wrote :

Let me give some updates here:

I've tested the following patches many times with 400+ ports and 1000+ security group rules hosted on one single ovs-agent.
[1] https://review.openstack.org/#/c/638641/
[2] https://review.openstack.org/#/c/638642/
[3] https://review.openstack.org/#/c/638644/
[4] https://review.openstack.org/#/c/638645/
[5] https://review.openstack.org/#/c/638646/
[6] https://review.openstack.org/#/c/638647/
[7] https://review.openstack.org/#/c/640797/

The results show:
(1) ovs-agent can start successfully, with a success rate of almost 95%
Failures are mainly concentrated in the following:
a. neutron server is under heavy load
b. the ovs-agent cache takes a lot of memory, and sometimes a MemoryError is raised
c. the local ovs-vswitchd connection still sometimes times out or drops, but this may also be addressed by this bug and fix:
https://bugs.launchpad.net/neutron/+bug/1817022
https://review.openstack.org/#/c/641681/

(2) no flow loss (no dataplane down)
(3) no more RPC timeouts coming out of the RPC loop
(sometimes neutron server raises a timeout for report_state; IMO, in that situation you may need to add more neutron server state report workers, or restart ovs-agents in small sets, not all at once)
(4) restart time reduced to 15-20 min on average
(5) the dump and clean stale flows action has a very high success rate; at least I didn't observe a failure.
(6) no stale flows remain, based on (5)

Change abandoned by LIU Yulong (<email address hidden>) on branch: master
Review: https://review.openstack.org/638644
Reason: I'd prefer the alternative: https://review.openstack.org/#/c/644613/

Reviewed: https://review.openstack.org/640797
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a5244d6d44d2b66de27dc77efa7830fa657260be
Submitter: Zuul
Branch: master

commit a5244d6d44d2b66de27dc77efa7830fa657260be
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
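
The idea in this commit message, flagging the restart explicitly in the first rpc loop instead of inferring it from report timing, can be illustrated with a toy model. This is a sketch, not the actual patch: the class, function, and dictionary key names below are hypothetical.

```python
class AgentStateReporter:
    """Toy model of an agent that flags its restart in the first loop.

    Class, method, and key names are hypothetical.
    """

    def __init__(self):
        self.iter_num = 0

    def build_state(self):
        # Only the first iteration after start-up carries the restart flag.
        state = {"restarted": self.iter_num == 0}
        self.iter_num += 1
        return state


def needs_fdb_resync(state):
    """Server-side stand-in: decide from the explicit flag, not from timing."""
    return bool(state.get("restarted"))


reporter = AgentStateReporter()
first, second = reporter.build_state(), reporter.build_state()
```

Because the decision depends on an explicit flag, a slow first loop that overruns the report timeout would no longer hide the restart from the server side.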

Changed in neutron:
status: In Progress → Fix Released

This issue was fixed in the openstack/neutron 14.0.0.0rc1 release candidate.

Reviewed: https://review.openstack.org/638647
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f898ffd71fba4f9b8fd9f4cb851fc3976d72396a
Submitter: Zuul
Branch: master

commit f898ffd71fba4f9b8fd9f4cb851fc3976d72396a
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 19:46:53 2019 +0800

    Divide-and-conquer local bridge flows beasts

    The dump-flows action will get a very large sets of flow information
    if there are enormous ports or openflow security group rules. For now
    we can meet some known exception during such action, for instance,
    memory issue, timeout issue.
    So after this patch, the cleanup action of the bridge stale flows
    will be done one table by one table. But note, this only supports
    for 'native' OpenFlow interface driver.

    Related-Bug: #1813703
    Related-Bug: #1813712
    Related-Bug: #1813709
    Related-Bug: #1813708

    Change-Id: Ie06d1bebe83ffeaf7130dcbb8ca21e5e59a220fb
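
The table-by-table cleanup described above can be sketched as follows. This is an illustration only: it assumes stale flows are the ones whose cookie does not belong to the current agent run, and `dump_table` and `delete_flow` are hypothetical callables standing in for the real OpenFlow driver.

```python
def cleanup_stale_flows(tables, dump_table, delete_flow, current_cookies):
    """Remove flows whose cookie is not in current_cookies, table by table.

    dump_table(table_id) returns the flows of one table as dicts with a
    "cookie" key; delete_flow(table_id, flow) removes one flow.
    Both callables are hypothetical stand-ins.
    """
    removed = 0
    for table_id in tables:
        # Only one table's flows are held in memory at a time.
        for flow in list(dump_table(table_id)):
            if flow["cookie"] not in current_cookies:
                delete_flow(table_id, flow)
                removed += 1
    return removed


# In-memory stand-in for a bridge: table id -> list of flows.
bridge = {
    0: [{"cookie": 0xA, "match": "in_port=1"},
        {"cookie": 0xB, "match": "in_port=2"}],
    60: [{"cookie": 0xB, "match": "tcp"}],
}

removed = cleanup_stale_flows(
    tables=sorted(bridge),
    dump_table=lambda t: bridge[t],
    delete_flow=lambda t, f: bridge[t].remove(f),
    current_cookies={0xA},
)
```

Dumping per table bounds the size of each reply, which is the property the commit relies on to avoid the memory and timeout exceptions.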

Reviewed: https://review.openstack.org/638642
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6ac420df7eb3ed324669472c61fec41b3d9cb35b
Submitter: Zuul
Branch: master

commit 6ac420df7eb3ed324669472c61fec41b3d9cb35b
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800

    Divide-and-conquer security group beasts

    In one specific compute node, the security group rules
    can be enormous quantity. This patch adds a step-by-step
    processing method to deal with the large number of the
    security group rules. And also changes or adds some LOG.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813707

    Change-Id: I57bf27ec75cf848271c5a28b22beee12b8bd5faa

Reviewed: https://review.openstack.org/638646
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8408af4f173a0ffde354599e26c49bf9e17e8bef
Submitter: Zuul
Branch: master

commit 8408af4f173a0ffde354599e26c49bf9e17e8bef
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
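
A minimal sketch of the "default step" idea: report port status in fixed-size batches and merge the replies, so that no single RPC call carries thousands of devices. The function name, the `call_server` callable, the reply keys, and the step value of 100 are assumptions for illustration, not the values used by the patch.

```python
def update_device_list_in_steps(call_server, devices_up, devices_down, step=100):
    """Report port status in batches and merge the replies.

    call_server(up_batch, down_batch) returns a dict of lists; it is a
    hypothetical stand-in for the RPC call. step is an assumed default.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    merged = {}
    total = max(len(devices_up), len(devices_down))
    for i in range(0, total, step):
        reply = call_server(devices_up[i:i + step], devices_down[i:i + step])
        # Concatenate the per-batch lists under the same keys.
        for key, values in reply.items():
            merged.setdefault(key, []).extend(values)
    return merged


calls = []


def fake_server(up, down):
    # Records batch sizes and echoes the devices back.
    calls.append((len(up), len(down)))
    return {"devices_up": list(up), "devices_down": list(down)}


result = update_device_list_in_steps(
    fake_server, [f"port-{n}" for n in range(250)], ["port-x"], step=100)
```

With 250 ports up and a step of 100, the stand-in server receives three smaller calls instead of one large one; a timeout then affects one batch rather than forcing the whole set to be resent.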

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/649701

Reviewed: https://review.openstack.org/638645
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c
Submitter: Zuul
Branch: master

commit 64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800

    Change default local ovs connection timeout

    Large number of flows can cause local ovs connection
    timeout. Ultimately getting succeed will be better
    than a retry or fullsync.

    Related-Bug: #1813703
    Related-Bug: #1813705
    Related-Bug: #1813707
    Related-Bug: #1813709

    Change-Id: Ifa0608a7e131df3cad2f7727426720afce641a58

Reviewed: https://review.openstack.org/645408
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=62fe7852bbd70a24174853997096c52ee015e269
Submitter: Zuul
Branch: stable/pike

commit 62fe7852bbd70a24174853997096c52ee015e269
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
     neutron/plugins/ml2/rpc.py

    Conflicts:
     neutron/plugins/ml2/drivers/l2pop/mech_driver.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)
    (cherry picked from commit cc49ab550179bdc76d79f48be67886681cb43d4e)
    (cherry picked from commit 5ffca4966877454c605442e9e429aa83ea7d7348)

tags: added: in-stable-pike

Reviewed: https://review.openstack.org/645405
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cc49ab550179bdc76d79f48be67886681cb43d4e
Submitter: Zuul
Branch: stable/rocky

commit cc49ab550179bdc76d79f48be67886681cb43d4e
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
     neutron/plugins/ml2/rpc.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)

tags: added: in-stable-rocky

Reviewed: https://review.openstack.org/649343
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=98139553424375a2a0ec18fb6b07b4bf30fe88d0
Submitter: Zuul
Branch: stable/stein

commit 98139553424375a2a0ec18fb6b07b4bf30fe88d0
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800

    Divide-and-conquer security group beasts

    In one specific compute node, the security group rules
    can be enormous quantity. This patch adds a step-by-step
    processing method to deal with the large number of the
    security group rules. And also changes or adds some LOG.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813707

    Change-Id: I57bf27ec75cf848271c5a28b22beee12b8bd5faa
    (cherry picked from commit 6ac420df7eb3ed324669472c61fec41b3d9cb35b)

tags: added: in-stable-stein
tags: added: in-stable-queens

Reviewed: https://review.openstack.org/649366
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=195c1378317719d548dfd149ecc0ec9b01d53eef
Submitter: Zuul
Branch: stable/queens

commit 195c1378317719d548dfd149ecc0ec9b01d53eef
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800

    Divide-and-conquer security group beasts

    In one specific compute node, the security group rules
    can be enormous quantity. This patch adds a step-by-step
    processing method to deal with the large number of the
    security group rules. And also changes or adds some LOG.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813707

    Conflicts:
     neutron/common/constants.py

    Conflicts:
     neutron/agent/securitygroups_rpc.py
     neutron/common/constants.py
     neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

    Change-Id: I57bf27ec75cf848271c5a28b22beee12b8bd5faa
    (cherry picked from commit 6ac420df7eb3ed324669472c61fec41b3d9cb35b)
    (cherry picked from commit f5d110e15f60753d056da942414ca6ecd6b78d8a)

Reviewed: https://review.openstack.org/649369
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=51a766653395c11985b7dd5d3e3549224ae4ca88
Submitter: Zuul
Branch: stable/pike

commit 51a766653395c11985b7dd5d3e3549224ae4ca88
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800

    Divide-and-conquer security group beasts

    In one specific compute node, the security group rules
    can be enormous quantity. This patch adds a step-by-step
    processing method to deal with the large number of the
    security group rules. And also changes or adds some LOG.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813707

    Conflicts:
     neutron/common/constants.py

    Conflicts:
     neutron/agent/securitygroups_rpc.py
     neutron/common/constants.py
     neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

    Conflicts:
     neutron/agent/common/ovs_lib.py
     neutron/common/constants.py

    Change-Id: I57bf27ec75cf848271c5a28b22beee12b8bd5faa
    (cherry picked from commit 6ac420df7eb3ed324669472c61fec41b3d9cb35b)
    (cherry picked from commit f5d110e15f60753d056da942414ca6ecd6b78d8a)
    (cherry picked from commit 5424b9a68cb3ac1fcc04ed8ae603c421bde2dee3)

Reviewed: https://review.openstack.org/648219
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e4bfc7d50ee94502bead86078a123676bc9c24f9
Submitter: Zuul
Branch: stable/queens

commit e4bfc7d50ee94502bead86078a123676bc9c24f9
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 19:46:53 2019 +0800

    Divide-and-conquer local bridge flows beasts

    The dump-flows action will get a very large sets of flow information
    if there are enormous ports or openflow security group rules. For now
    we can meet some known exception during such action, for instance,
    memory issue, timeout issue.
    So after this patch, the cleanup action of the bridge stale flows
    will be done one table by one table. But note, this only supports
    for 'native' OpenFlow interface driver.

    Related-Bug: #1813703
    Related-Bug: #1813712
    Related-Bug: #1813709
    Related-Bug: #1813708

    Change-Id: Ie06d1bebe83ffeaf7130dcbb8ca21e5e59a220fb
    (cherry picked from commit f898ffd71fba4f9b8fd9f4cb851fc3976d72396a)

Reviewed: https://review.openstack.org/648220
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fb84771d1364d9be6fa7d0bce1bc89b2e3541271
Submitter: Zuul
Branch: stable/pike

commit fb84771d1364d9be6fa7d0bce1bc89b2e3541271
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 19:46:53 2019 +0800

    Divide-and-conquer local bridge flows beasts

    The dump-flows action will get a very large sets of flow information
    if there are enormous ports or openflow security group rules. For now
    we can meet some known exception during such action, for instance,
    memory issue, timeout issue.
    So after this patch, the cleanup action of the bridge stale flows
    will be done one table by one table. But note, this only supports
    for 'native' OpenFlow interface driver.

    Related-Bug: #1813703
    Related-Bug: #1813712
    Related-Bug: #1813709
    Related-Bug: #1813708

    Change-Id: Ie06d1bebe83ffeaf7130dcbb8ca21e5e59a220fb
    (cherry picked from commit f898ffd71fba4f9b8fd9f4cb851fc3976d72396a)

Reviewed: https://review.openstack.org/649365
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6494fcc2e44d9d9310e3ebaa92582f4f78d08b75
Submitter: Zuul
Branch: stable/rocky

commit 6494fcc2e44d9d9310e3ebaa92582f4f78d08b75
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800

    Divide-and-conquer security group beasts

    In one specific compute node, the security group rules
    can be enormous quantity. This patch adds a step-by-step
    processing method to deal with the large number of the
    security group rules. And also changes or adds some LOG.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813707

    Conflicts:
     neutron/common/constants.py

    Change-Id: I57bf27ec75cf848271c5a28b22beee12b8bd5faa
    (cherry picked from commit 6ac420df7eb3ed324669472c61fec41b3d9cb35b)

Reviewed: https://review.openstack.org/648217
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=af67d516a5b39b883fa6fb2fca4673fb7602b292
Submitter: Zuul
Branch: stable/rocky

commit af67d516a5b39b883fa6fb2fca4673fb7602b292
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 19:46:53 2019 +0800

    Divide-and-conquer local bridge flows beasts

    The dump-flows action will get a very large sets of flow information
    if there are enormous ports or openflow security group rules. For now
    we can meet some known exception during such action, for instance,
    memory issue, timeout issue.
    So after this patch, the cleanup action of the bridge stale flows
    will be done one table by one table. But note, this only supports
    for 'native' OpenFlow interface driver.

    Related-Bug: #1813703
    Related-Bug: #1813712
    Related-Bug: #1813709
    Related-Bug: #1813708

    Change-Id: Ie06d1bebe83ffeaf7130dcbb8ca21e5e59a220fb
    (cherry picked from commit f898ffd71fba4f9b8fd9f4cb851fc3976d72396a)

Reviewed: https://review.openstack.org/648207
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7865264aaba615f6e52f5806d844531696186d56
Submitter: Zuul
Branch: stable/stein

commit 7865264aaba615f6e52f5806d844531696186d56
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 19:46:53 2019 +0800

    Divide-and-conquer local bridge flows beasts

    The dump-flows action will get a very large sets of flow information
    if there are enormous ports or openflow security group rules. For now
    we can meet some known exception during such action, for instance,
    memory issue, timeout issue.
    So after this patch, the cleanup action of the bridge stale flows
    will be done one table by one table. But note, this only supports
    for 'native' OpenFlow interface driver.

    Related-Bug: #1813703
    Related-Bug: #1813712
    Related-Bug: #1813709
    Related-Bug: #1813708

    Change-Id: Ie06d1bebe83ffeaf7130dcbb8ca21e5e59a220fb
    (cherry picked from commit f898ffd71fba4f9b8fd9f4cb851fc3976d72396a)

Reviewed: https://review.openstack.org/649682
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d7d30ea950844f11348fa2827908622e3a8c7dfb
Submitter: Zuul
Branch: stable/stein

commit d7d30ea950844f11348fa2827908622e3a8c7dfb
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)

Reviewed: https://review.openstack.org/649414
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ea3d844c75541cc2be17865bad6336cd1b8385c4
Submitter: Zuul
Branch: stable/ocata

commit ea3d844c75541cc2be17865bad6336cd1b8385c4
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 19:46:53 2019 +0800

    Divide-and-conquer local bridge flows beasts

    The dump-flows action will get a very large sets of flow information
    if there are enormous ports or openflow security group rules. For now
    we can meet some known exception during such action, for instance,
    memory issue, timeout issue.
    So after this patch, the cleanup action of the bridge stale flows
    will be done one table by one table. But note, this only supports
    for 'native' OpenFlow interface driver.

    Related-Bug: #1813703
    Related-Bug: #1813712
    Related-Bug: #1813709
    Related-Bug: #1813708

    Change-Id: Ie06d1bebe83ffeaf7130dcbb8ca21e5e59a220fb
    (cherry picked from commit f898ffd71fba4f9b8fd9f4cb851fc3976d72396a)

tags: added: in-stable-ocata

This issue was fixed in the openstack/neutron 11.0.7 release.

This issue was fixed in the openstack/neutron 13.0.3 release.

This issue was fixed in the openstack/neutron 12.0.6 release.

Reviewed: https://review.openstack.org/650389
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d7764064d0455634b18cc0931bcc44343913a1c6
Submitter: Zuul
Branch: stable/stein

commit d7764064d0455634b18cc0931bcc44343913a1c6
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800

    Change default local ovs connection timeout

    Large number of flows can cause local ovs connection
    timeout. Ultimately getting succeed will be better
    than a retry or fullsync.

    Related-Bug: #1813703
    Related-Bug: #1813705
    Related-Bug: #1813707
    Related-Bug: #1813709

    Change-Id: Ifa0608a7e131df3cad2f7727426720afce641a58
    (cherry picked from commit 64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c)

Reviewed: https://review.openstack.org/650390
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=26a9765afb917901ca40e3117ff092774823ada2
Submitter: Zuul
Branch: stable/rocky

commit 26a9765afb917901ca40e3117ff092774823ada2
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800

    Change default local ovs connection timeout

    Large number of flows can cause local ovs connection
    timeout. Ultimately getting succeed will be better
    than a retry or fullsync.

    Related-Bug: #1813703
    Related-Bug: #1813705
    Related-Bug: #1813707
    Related-Bug: #1813709

    Change-Id: Ifa0608a7e131df3cad2f7727426720afce641a58
    (cherry picked from commit 64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c)

Reviewed: https://review.openstack.org/650392
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=df4e0a5394dff4cc176096abc64079d2c43fa9e7
Submitter: Zuul
Branch: stable/queens

commit df4e0a5394dff4cc176096abc64079d2c43fa9e7
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800

    Change default local ovs connection timeout

    Large number of flows can cause local ovs connection
    timeout. Ultimately getting succeed will be better
    than a retry or fullsync.

    Related-Bug: #1813703
    Related-Bug: #1813705
    Related-Bug: #1813707
    Related-Bug: #1813709

    Change-Id: Ifa0608a7e131df3cad2f7727426720afce641a58
    (cherry picked from commit 64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c)

Reviewed: https://review.openstack.org/650394
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7a4bc6e43fb7274b940fb88f13011821e283b3bb
Submitter: Zuul
Branch: stable/ocata

commit 7a4bc6e43fb7274b940fb88f13011821e283b3bb
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800

    Change default local ovs connection timeout

    Large number of flows can cause local ovs connection
    timeout. Ultimately getting succeed will be better
    than a retry or fullsync.

    Related-Bug: #1813703
    Related-Bug: #1813705
    Related-Bug: #1813707
    Related-Bug: #1813709

    Change-Id: Ifa0608a7e131df3cad2f7727426720afce641a58
    (cherry picked from commit 64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c)

Reviewed: https://review.openstack.org/650393
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bb508f9e6051ccb27b6dfda05b5c52b961f7370a
Submitter: Zuul
Branch: stable/pike

commit bb508f9e6051ccb27b6dfda05b5c52b961f7370a
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800

    Change default local ovs connection timeout

    Large number of flows can cause local ovs connection
    timeout. Ultimately getting succeed will be better
    than a retry or fullsync.

    Related-Bug: #1813703
    Related-Bug: #1813705
    Related-Bug: #1813707
    Related-Bug: #1813709

    Change-Id: Ifa0608a7e131df3cad2f7727426720afce641a58
    (cherry picked from commit 64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c)

Reviewed: https://review.openstack.org/649688
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=39afe0a129b6b979d0b56ec59048a4e16bedf7a9
Submitter: Zuul
Branch: stable/queens

commit 39afe0a129b6b979d0b56ec59048a4e16bedf7a9
Author: LIU Yulong <email address hidden>
Date: Fri Apr 12 18:47:24 2019 +0300

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Conflicts:
     neutron/tests/unit/plugins/ml2/test_rpc.py

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)
    (cherry picked from commit d7d30ea950844f11348fa2827908622e3a8c7dfb)
    (cherry picked from commit 5d705468de1e495639f8b87266ccfc9391ce6135)

Reviewed: https://review.openstack.org/649683
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5d705468de1e495639f8b87266ccfc9391ce6135
Submitter: Zuul
Branch: stable/rocky

commit 5d705468de1e495639f8b87266ccfc9391ce6135
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)
    (cherry picked from commit d7d30ea950844f11348fa2827908622e3a8c7dfb)

Reviewed: https://review.opendev.org/649691
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fa16540d2dd80f836c8fa2a424717899ac64af60
Submitter: Zuul
Branch: stable/pike

commit fa16540d2dd80f836c8fa2a424717899ac64af60
Author: LIU Yulong <email address hidden>
Date: Fri Apr 12 18:47:24 2019 +0300

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Conflicts:
     neutron/tests/unit/plugins/ml2/test_rpc.py

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)
    (cherry picked from commit d7d30ea950844f11348fa2827908622e3a8c7dfb)
    (cherry picked from commit 5d705468de1e495639f8b87266ccfc9391ce6135)

Reviewed: https://review.opendev.org/649701
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=49df07c7039206c537c17f40140d290e1b28a3f4
Submitter: Zuul
Branch: stable/ocata

commit 49df07c7039206c537c17f40140d290e1b28a3f4
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800

    Divide-and-conquer security group beasts

    On one specific compute node, the number of security group
    rules can be enormous. This patch adds a step-by-step
    processing method to deal with the large number of
    security group rules. It also changes or adds some log messages.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813707

    Conflicts:
     neutron/common/constants.py
     neutron/agent/common/ovs_lib.py

    Conflicts:
     neutron/agent/securitygroups_rpc.py
     neutron/common/constants.py
     neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

    Conflicts:
     neutron/agent/common/ovs_lib.py
     neutron/common/constants.py

    Change-Id: I57bf27ec75cf848271c5a28b22beee12b8bd5faa
    (cherry picked from commit 6ac420df7eb3ed324669472c61fec41b3d9cb35b)
    (cherry picked from commit f5d110e15f60753d056da942414ca6ecd6b78d8a)
    (cherry picked from commit 5424b9a68cb3ac1fcc04ed8ae603c421bde2dee3)

Reviewed: https://review.opendev.org/649729
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9583dc0549da2b4529a59b5862ba42aebc5ae15f
Submitter: Zuul
Branch: stable/ocata

commit 9583dc0549da2b4529a59b5862ba42aebc5ae15f
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flow update operations
    will not be triggered. This results in flow loss during agent
    restart, especially for host-to-host vxlan tunnel flows.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
     neutron/plugins/ml2/rpc.py

    Conflicts:
     neutron/plugins/ml2/drivers/l2pop/mech_driver.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)
    (cherry picked from commit cc49ab550179bdc76d79f48be67886681cb43d4e)
    (cherry picked from commit 5ffca4966877454c605442e9e429aa83ea7d7348)
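
The difference between a time-based restart check and the explicit flag described above can be sketched as follows. This is an illustration under stated assumptions, not the l2pop driver code: both function names, the 180-second window and the 900-second first loop are hypothetical values.

```python
def restarted_by_uptime(uptime_seconds, boot_window_seconds):
    """Time-based heuristic: the agent counts as restarted only while its
    reported uptime is still inside the boot window."""
    return uptime_seconds < boot_window_seconds


def restarted_by_flag(iteration):
    """Explicit state: the agent reports itself as restarted during its
    first rpc loop (iteration 0), however long that loop takes."""
    return iteration == 0


# Hypothetical numbers: the first rpc loop on a busy host takes 900 seconds,
# while the boot window is 180 seconds.
uptime_at_end_of_first_loop = 900
boot_window = 180

# The heuristic no longer sees a restart, so fdb entries would not be resent.
heuristic_result = restarted_by_uptime(uptime_at_end_of_first_loop, boot_window)

# The explicit flag still marks the first loop as a restart.
flag_result = restarted_by_flag(iteration=0)
```

Here `heuristic_result` is `False` and `flag_result` is `True`, which is the case the commit message describes: slow port processing pushes the agent past the timeout, and only an explicit flag still triggers the fdb update.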

tags: added: neutron-proactive-backport-potential

Reviewed: https://review.opendev.org/644613
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=eaf3ff57863a7af2a33ab189910666f6c3450019
Submitter: Zuul
Branch: master

commit eaf3ff57863a7af2a33ab189910666f6c3450019
Author: LIU Yulong <email address hidden>
Date: Tue Mar 19 21:21:45 2019 +0800

    Ignore first local port update notification

    Ovs-agent will scan and process the ports during the
    first rpc_loop, and a local port update notification
    will be sent out. This will cause these ports to
    be processed again in the ovs-agent next (second)
    rpc_loop.
    This patch passes the restart flag (iteration num 0)
    to the local port_update call trace. After this patch,
    the local port_update notification will be ignored in
    the first RPC loop.

    Related-Bug: #1813703
    Change-Id: Ic5bf718cfd056f805741892a91a8d45f7a6e0db3
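
A minimal sketch of the idea in this commit message, assuming a simplified agent that keeps a set of ports to reprocess in the next loop. The class `PortUpdateTracker` and its methods are hypothetical and do not mirror the real ovs-agent classes.

```python
class PortUpdateTracker:
    """Tracks ports that must be reprocessed in the next rpc loop."""

    def __init__(self):
        self.iteration = 0          # 0 means the first loop after a restart
        self.updated_ports = set()

    def notify_port_update(self, port_id, local=False):
        """Queue a port for reprocessing; return True if it was queued.

        Local notifications raised during iteration 0 are dropped, because
        the first loop already scans and processes every port.
        """
        if local and self.iteration == 0:
            return False
        self.updated_ports.add(port_id)
        return True

    def finish_iteration(self):
        """End the current loop and return the ports queued for the next."""
        pending = self.updated_ports
        self.updated_ports = set()
        self.iteration += 1
        return pending
```

In this sketch, remote notifications are always queued, so only the redundant local ones from the first loop are skipped.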

Reviewed: https://review.opendev.org/670147
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e1b84c9a701a67880547cc01ad73a608bb39aaf4
Submitter: Zuul
Branch: stable/stein

commit e1b84c9a701a67880547cc01ad73a608bb39aaf4
Author: LIU Yulong <email address hidden>
Date: Tue Mar 19 21:21:45 2019 +0800

    Ignore first local port update notification

    Ovs-agent will scan and process the ports during the
    first rpc_loop, and a local port update notification
    will be sent out. This will cause these ports to
    be processed again in the ovs-agent next (second)
    rpc_loop.
    This patch passes the restart flag (iteration num 0)
    to the local port_update call trace. After this patch,
    the local port_update notification will be ignored in
    the first RPC loop.

    Related-Bug: #1813703
    Change-Id: Ic5bf718cfd056f805741892a91a8d45f7a6e0db3
    (cherry picked from commit eaf3ff57863a7af2a33ab189910666f6c3450019)

Reviewed: https://review.opendev.org/670148
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=56c070c5a37f06515c9330274ae12d87e7468421
Submitter: Zuul
Branch: stable/rocky

commit 56c070c5a37f06515c9330274ae12d87e7468421
Author: LIU Yulong <email address hidden>
Date: Tue Mar 19 21:21:45 2019 +0800

    Ignore first local port update notification

    Ovs-agent will scan and process the ports during the
    first rpc_loop, and a local port update notification
    will be sent out. This will cause these ports to
    be processed again in the ovs-agent next (second)
    rpc_loop.
    This patch passes the restart flag (iteration num 0)
    to the local port_update call trace. After this patch,
    the local port_update notification will be ignored in
    the first RPC loop.

    Related-Bug: #1813703
    Change-Id: Ic5bf718cfd056f805741892a91a8d45f7a6e0db3
    (cherry picked from commit eaf3ff57863a7af2a33ab189910666f6c3450019)

Reviewed: https://review.opendev.org/649693
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bb2734b0d524aef348b69ae02988449f9dd63c56
Submitter: Zuul
Branch: stable/ocata

commit bb2734b0d524aef348b69ae02988449f9dd63c56
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Conflicts:
     neutron/common/constants.py
     neutron/agent/rpc.py
     neutron/tests/unit/plugins/ml2/test_rpc.py

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)

Reviewed: https://review.opendev.org/638641
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8e73de8bc42067c0a6796df3cca9938d25ae754e
Submitter: Zuul
Branch: master

commit 8e73de8bc42067c0a6796df3cca9938d25ae754e
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 14:01:08 2019 +0800

    Change ovs-agent iteration log level to INFO

    Operators may want to see how long the port processing
    procedure takes, since DEBUG logging is usually not
    enabled in production environments.

    Related-Bug: #1813703
    Related-Bug: #1813707
    Related-Bug: #1813706
    Related-Bug: #1813709

    Change-Id: I43733546abf5421d0e3f4cd5a959d279e1b89d1e
