[L2] [summary] ovs-agent issues at large scale
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
neutron | Fix Released | High | LIU Yulong | |
Bug Description
We recently tested the ovs-agent with the openvswitch flow-based security group and hit several issues at large scale. This bug provides a central place to track the following problems.
Problems:
(1) RPC timeout during ovs-agent restart
https:/
(2) local connection to ovs-vswitchd is dropped or times out
https:/
(3) ovs-agent failed to restart
https:/
(4) ovs-agent restart takes too long (15-40+ minutes)
https:/
(5) unexpected flow loss
https:/
(6) unexpected tunnel loss
https:/
(7) flows with multiple cookies (stale flows)
https:/
(8) dump-flows takes a lot of time
https:/
(9) troubleshooting is really hard when one VM loses its connection; the flow tables are almost unreadable (they reach 30k+ flows).
https:/
The problems can be seen in the following scenarios:
(1) 2000-3000 ports related to one single security group (or one remote security group)
(2) create 2000-3000 VMs in one single subnet (network)
(3) create 2000-3000 VMs under one single security group
Yes, scale is the main problem: when one host's VM count approaches 150-200 (and at the same time the number of ports in one subnet or security group approaches 2000), the ovs-agent restart gets worse.
Test ENV:
stable/queens
Deployment topology:
neutron-server, the database, and the message queue each have their own dedicated physical hosts, with at least 3 nodes per service.
Configurations:
The ovs-agent was set up with l2pop and the OVS flow-based security group; the config was basically as follows:
[agent]
enable_
l2_population = True
tunnel_types = vxlan
arp_responder = True
prevent_
extensions = qos
report_interval = 60
[ovs]
bridge_mappings = tenant:
local_ip = 10.114.4.48
[securitygroup]
firewall_driver = openvswitch
enable_
Some issue tracking:
(1) mostly caused by the large number of ports related to one security group or in one network
(2) unnecessary RPC calls during ovs-agent restart
(3) inefficient database query conditions
(4) the full sync is redone again and again if any exception is raised in rpc_loop
(5) cleaning stale flows dumps all flows first (not once, but multiple times), which is really time-consuming
So this is a summary bug for all the scale issues we have met.
Some potential solutions:
Increasing some config values like rpc_response_
does not help much, and these changes can make the restart take much longer. Those issues can still be seen.
One workaround is to disable the openvswitch flow-based security group; the ovs-agent can then restart in less than 10 minutes.
description: | updated |
Swaminathan Vasudevan (swaminathan-vasudevan) wrote : | #1 |
tags: | added: ovs |
tags: | added: l2-pop |
Swaminathan Vasudevan (swaminathan-vasudevan) wrote : | #2 |
Not sure whether each of the sub-bugs listed here can be fixed individually.
We have seen these problems at scale as well with our customers.
For the purpose of fixing things, as I mentioned in one of the bugs, there are a couple of items that we can separate from this discussion.
1. Make ovs-agent to ovs-vswitchd communication robust at scale. Don't get locked or disconnected.
2. Introduce some sort of throttle mechanism for syncing the port details when there is a sync.
(Maybe suggest some rabbitmq config options for getting rid of timeouts and handling the RPC calls.)
3. On the server side, make sure it can handle 2000+ ports on a single subnet. The full sync might not happen from all nodes at the same time, but the issue here is a single subnet hosting more than 2000 ports. There may be some tuning that we can do in the DB lookup for each port based on the subnet/network.
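A minimal sketch of the throttle idea from item 2, assuming the agent sends port details in fixed-size batches and waits a minimum interval between batches. `SyncThrottle`, `throttled_sync` and `send_batch` are hypothetical names for illustration, not neutron APIs.

```python
import time


class SyncThrottle:
    """Allow at most one batch per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None

    def wait(self):
        """Block until the next batch may be sent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


def throttled_sync(port_ids, batch_size, send_batch, throttle):
    """Pass port_ids to send_batch() in fixed-size, rate-limited batches."""
    sent = 0
    for start in range(0, len(port_ids), batch_size):
        throttle.wait()
        send_batch(port_ids[start:start + batch_size])
        sent += 1
    return sent
```

The clock and sleep functions are injected so the pacing can be exercised without real delays; a production throttle would more likely live next to the RPC client and be driven by configuration.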
Changed in neutron: | |
status: | New → Confirmed |
importance: | Undecided → High |
LIU Yulong (dragon889) wrote : | #3 |
Yes, some of these sub-bugs may not be fixed in a short time. But at least we have this location where we can track all the issues, and cloud users or developers can get some inspiration here.
Dongcan Ye (hellochosen) wrote : | #4 |
Good job. Recovery after all nodes go down also seems to be a problem here.
Changed in neutron: | |
assignee: | nobody → LIU Yulong (dragon889) |
LIU Yulong (dragon889) wrote : | #5 |
I will submit some fixes for these bugs, so let me paste the way to test them:
http://
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master) | #6 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #7 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #8 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #9 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #10 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #11 |
Related fix proposed to branch: master
Review: https:/
LIU Yulong (dragon889) wrote : | #12 |
As you may have noticed, I've uploaded 7 small changes. Yes, we cannot conquer this giant beast once and for all. Each one may relate to one issue or a few small child issues. Again, anyone who wants to test these patch sets, please follow the guide here:
http://
Changed in neutron: | |
status: | Confirmed → In Progress |
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master) | #13 |
Change abandoned by LIU Yulong (<email address hidden>) on branch: master
Review: https:/
Reason: revisit if needed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master) | #14 |
Fix proposed to branch: master
Review: https:/
Changed in neutron: | |
assignee: | LIU Yulong (dragon889) → Brian Haley (brian-haley) |
Changed in neutron: | |
assignee: | Brian Haley (brian-haley) → LIU Yulong (dragon889) |
LIU Yulong (dragon889) wrote : | #15 |
Let me give some updates here:
I've tested the following patches many times with 400+ ports and 1000+ security group rules hosted on one single ovs-agent.
[1] https:/
[2] https:/
[3] https:/
[4] https:/
[5] https:/
[6] https:/
[7] https:/
The results show:
(1) ovs-agent starts successfully with an almost 95% success rate
Failures are mainly concentrated on the following:
a. neutron server is under heavy load
b. the ovs-agent cache takes a lot of memory, and sometimes a MemoryError is raised
c. the local ovs-vswitchd connection still sometimes times out or drops, but this may also be addressed by this bug and fix:
https:/
https:/
(2) no flow loss (no dataplane down)
(3) no more RPC timeouts coming out of the RPC loop
(sometimes neutron server raises a timeout for report_state; IMO, in that situation you may need to add more neutron server state report workers, or restart ovs-agents in small sets, not all at once)
(4) restart time reduced to 15-20 minutes on average
(5) the dump and clean stale flows action has a very high success rate; at least I didn't observe any failure.
(6) no stale flows remain, based on (5)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master) | #16 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master) | #17 |
Change abandoned by LIU Yulong (<email address hidden>) on branch: master
Review: https:/
Reason: I'd prefer the alternative: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky) | #18 |
Fix proposed to branch: stable/rocky
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens) | #19 |
Fix proposed to branch: stable/queens
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike) | #20 |
Fix proposed to branch: stable/pike
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master) | #21 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit a5244d6d44d2b66
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800
More accurate agent restart state transfer
Ovs-agent can be very time-consuming in handling a large number
of ports. At this point, the ovs-agent status report may have
exceeded the set timeout value. Some flows updating operations
will not be triggerred. This results in flows loss during agent
restart, especially for hosts to hosts of vxlan tunnel flow.
This fix will let the ovs-agent explicitly, in the first rpc loop,
indicate that the status is restarted. Then l2pop will be required
to update fdb entries.
Closes-Bug: #1813703
Closes-Bug: #1813714
Closes-Bug: #1813715
Closes-Bug: #1794991
Closes-Bug: #1799178
Change-Id: I8edc2deb509216
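To illustrate the commit message above: a rough sketch, assuming the server previously guessed "this agent just restarted" from a time window, and that the fix lets the agent state it explicitly during its first rpc loop. `needs_full_fdb`, `boot_window` and `AgentLoop` are hypothetical names, not the actual neutron code.

```python
def needs_full_fdb(agent_uptime, boot_window, agent_restarted=None):
    """Decide whether the server should resend all FDB entries to an agent.

    agent_restarted=None models the time-window heuristic: the agent only
    counts as "just restarted" while its uptime is inside boot_window.
    True/False models an explicit flag sent by the agent.
    """
    if agent_restarted is not None:
        return agent_restarted
    return agent_uptime < boot_window


class AgentLoop:
    """Tracks rpc loop iterations so that the first one can be flagged."""

    def __init__(self):
        self.iter_num = 0

    def run_iteration(self):
        """Return True only for the first iteration after a (re)start."""
        restarted = self.iter_num == 0
        self.iter_num += 1
        return restarted
```

With the heuristic alone, an agent that needs 900 seconds to process its ports falls outside a 180-second window and gets no FDB update; with the explicit flag the duration of the restart no longer matters.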
Changed in neutron: | |
status: | In Progress → Fix Released |
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.0.0rc1 | #22 |
This issue was fixed in the openstack/neutron 14.0.0.0rc1 release candidate.
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #23 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit f898ffd71fba4f9
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 19:46:53 2019 +0800
Divide-
The dump-flows action will get a very large sets of flow information
if there are enormous ports or openflow security group rules. For now
we can meet some known exception during such action, for instance,
memory issue, timeout issue.
So after this patch, the cleanup action of the bridge stale flows
will be done one table by one table. But note, this only supports
for 'native' OpenFlow interface driver.
Related-Bug: #1813703
Related-Bug: #1813712
Related-Bug: #1813709
Related-Bug: #1813708
Change-Id: Ie06d1bebe83ffe
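The table-by-table cleanup described in this commit message can be sketched as follows. This is only an illustration against an in-memory stand-in: `FakeBridge`, `dump_flows`, `delete_flows` and `cleanup_stale_flows` are hypothetical names, and flows are modelled as `(table_id, cookie, match)` tuples instead of real OpenFlow messages.

```python
class FakeBridge:
    """In-memory stand-in for an OpenFlow bridge (hypothetical interface)."""

    def __init__(self, flows):
        # Each flow is a (table_id, cookie, match) tuple.
        self.flows = list(flows)
        self.max_dumped = 0  # largest single dump seen so far

    def dump_flows(self, table_id):
        dumped = [flow for flow in self.flows if flow[0] == table_id]
        self.max_dumped = max(self.max_dumped, len(dumped))
        return dumped

    def delete_flows(self, table_id, cookie):
        self.flows = [
            flow for flow in self.flows
            if not (flow[0] == table_id and flow[1] == cookie)
        ]


def cleanup_stale_flows(bridge, table_ids, current_cookie):
    """Delete flows whose cookie is not current_cookie, one table at a time."""
    removed = 0
    for table_id in table_ids:
        stale_cookies = {
            cookie
            for _, cookie, _ in bridge.dump_flows(table_id)
            if cookie != current_cookie
        }
        for cookie in stale_cookies:
            before = len(bridge.flows)
            bridge.delete_flows(table_id, cookie)
            removed += before - len(bridge.flows)
    return removed
```

The point of the per-table loop is that each dump holds only one table's flows in memory, so the peak size of any single dump is bounded by the largest table instead of the whole bridge.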
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein) | #24 |
Related fix proposed to branch: stable/stein
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky) | #25 |
Related fix proposed to branch: stable/rocky
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens) | #26 |
Related fix proposed to branch: stable/queens
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike) | #27 |
Related fix proposed to branch: stable/pike
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #28 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 6ac420df7eb3ed3
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Change-Id: I57bf27ec75cf84
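A minimal sketch of the step-by-step processing this commit message describes, assuming the rules arrive as any iterable and are installed a fixed number at a time. `iter_batches`, `apply_rules_in_steps` and `install` are hypothetical names, not the functions used in the patch.

```python
from itertools import islice


def iter_batches(items, step):
    """Yield lists of at most `step` items from any iterable."""
    if step < 1:
        raise ValueError("step must be >= 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, step))
        if not batch:
            return
        yield batch


def apply_rules_in_steps(rules, step, install):
    """Call install(batch) once per batch; return the number of batches."""
    count = 0
    for batch in iter_batches(rules, step):
        install(batch)
        count += 1
    return count
```

Because `iter_batches` consumes an iterator lazily, the full rule set never has to be materialised at once, which matters when one compute node holds a very large number of rules.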
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein) | #29 |
Related fix proposed to branch: stable/stein
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky) | #30 |
Related fix proposed to branch: stable/rocky
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens) | #31 |
Related fix proposed to branch: stable/queens
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike) | #32 |
Related fix proposed to branch: stable/pike
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata) | #33 |
Related fix proposed to branch: stable/ocata
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #34 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 8408af4f173a0ff
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Change-Id: Ie37f4a4869969e
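The "default step" idea in this commit message can be sketched like this: report device status in slices and merge the per-slice replies, so no single RPC call carries thousands of ports. `update_devices_in_steps` and `rpc_call` are hypothetical, and the four reply keys are an assumption about the reply shape made for illustration.

```python
def update_devices_in_steps(devices_up, devices_down, step, rpc_call):
    """Report device status in slices of `step` and merge the replies.

    rpc_call(up_slice, down_slice) is assumed to return a dict that may
    contain any of the four result lists initialised below.
    """
    merged = {
        "devices_up": [],
        "failed_devices_up": [],
        "devices_down": [],
        "failed_devices_down": [],
    }
    total = max(len(devices_up), len(devices_down))
    for start in range(0, total, step):
        reply = rpc_call(
            devices_up[start:start + step],
            devices_down[start:start + step],
        )
        for key in merged:
            merged[key].extend(reply.get(key, []))
    return merged
```

A smaller step means more round trips but a shorter server-side transaction per call, which is the trade-off the commit message makes to reduce the probability of an RPC timeout.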
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein) | #35 |
Related fix proposed to branch: stable/stein
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky) | #36 |
Related fix proposed to branch: stable/rocky
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens) | #37 |
Related fix proposed to branch: stable/queens
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike) | #38 |
Related fix proposed to branch: stable/pike
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata) | #39 |
Related fix proposed to branch: stable/ocata
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #40 |
Related fix proposed to branch: stable/ocata
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata) | #41 |
Fix proposed to branch: stable/ocata
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #42 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 64ea642359e8f8a
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800
Change default local ovs connection timeout
Large number of flows can cause local ovs connection
timeout. Ultimately getting succeed will be better
than a retry or fullsync.
Related-Bug: #1813703
Related-Bug: #1813705
Related-Bug: #1813707
Related-Bug: #1813709
Change-Id: Ifa0608a7e131df
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein) | #43 |
Related fix proposed to branch: stable/stein
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky) | #44 |
Related fix proposed to branch: stable/rocky
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens) | #45 |
Related fix proposed to branch: stable/queens
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike) | #46 |
Related fix proposed to branch: stable/pike
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata) | #47 |
Related fix proposed to branch: stable/ocata
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike) | #48 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit 62fe7852bbd70a2
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800
More accurate agent restart state transfer
Ovs-agent can be very time-consuming in handling a large number
of ports. At this point, the ovs-agent status report may have
exceeded the set timeout value. Some flows updating operations
will not be triggerred. This results in flows loss during agent
restart, especially for hosts to hosts of vxlan tunnel flow.
This fix will let the ovs-agent explicitly, in the first rpc loop,
indicate that the status is restarted. Then l2pop will be required
to update fdb entries.
Conflicts:
neutron/
Conflicts:
neutron/
Closes-Bug: #1813703
Closes-Bug: #1813714
Closes-Bug: #1813715
Closes-Bug: #1794991
Closes-Bug: #1799178
Change-Id: I8edc2deb509216
(cherry picked from commit a5244d6d44d2b66
(cherry picked from commit cc49ab550179bdc
(cherry picked from commit 5ffca4966877454
tags: | added: in-stable-pike |
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky) | #49 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit cc49ab550179bdc
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800
More accurate agent restart state transfer
Ovs-agent can be very time-consuming in handling a large number
of ports. At this point, the ovs-agent status report may have
exceeded the set timeout value. Some flows updating operations
will not be triggerred. This results in flows loss during agent
restart, especially for hosts to hosts of vxlan tunnel flow.
This fix will let the ovs-agent explicitly, in the first rpc loop,
indicate that the status is restarted. Then l2pop will be required
to update fdb entries.
Conflicts:
neutron/
Closes-Bug: #1813703
Closes-Bug: #1813714
Closes-Bug: #1813715
Closes-Bug: #1794991
Closes-Bug: #1799178
Change-Id: I8edc2deb509216
(cherry picked from commit a5244d6d44d2b66
tags: | added: in-stable-rocky |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein) | #50 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit 98139553424375a
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
tags: | added: in-stable-stein |
tags: | added: in-stable-queens |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens) | #51 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit 195c1378317719d
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Conflicts:
neutron/
Conflicts:
neutron/
neutron/
neutron/
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
(cherry picked from commit f5d110e15f60753
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike) | #52 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit 51a766653395c11
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Conflicts:
neutron/
Conflicts:
neutron/
neutron/
neutron/
Conflicts:
neutron/
neutron/
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
(cherry picked from commit f5d110e15f60753
(cherry picked from commit 5424b9a68cb3ac1
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens) | #53 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit e4bfc7d50ee9450
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 19:46:53 2019 +0800
Divide-
The dump-flows action will get a very large sets of flow information
if there are enormous ports or openflow security group rules. For now
we can meet some known exception during such action, for instance,
memory issue, timeout issue.
So after this patch, the cleanup action of the bridge stale flows
will be done one table by one table. But note, this only supports
for 'native' OpenFlow interface driver.
Related-Bug: #1813703
Related-Bug: #1813712
Related-Bug: #1813709
Related-Bug: #1813708
Change-Id: Ie06d1bebe83ffe
(cherry picked from commit f898ffd71fba4f9
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike) | #54 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit fb84771d1364d9b
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 19:46:53 2019 +0800
Divide-
The dump-flows action will get a very large sets of flow information
if there are enormous ports or openflow security group rules. For now
we can meet some known exception during such action, for instance,
memory issue, timeout issue.
So after this patch, the cleanup action of the bridge stale flows
will be done one table by one table. But note, this only supports
for 'native' OpenFlow interface driver.
Related-Bug: #1813703
Related-Bug: #1813712
Related-Bug: #1813709
Related-Bug: #1813708
Change-Id: Ie06d1bebe83ffe
(cherry picked from commit f898ffd71fba4f9
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky) | #55 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit 6494fcc2e44d9d9
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Conflicts:
neutron/
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
OpenStack Infra (hudson-openstack) wrote : | #56 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit af67d516a5b39b8
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 19:46:53 2019 +0800
Divide-
The dump-flows action will get a very large sets of flow information
if there are enormous ports or openflow security group rules. For now
we can meet some known exception during such action, for instance,
memory issue, timeout issue.
So after this patch, the cleanup action of the bridge stale flows
will be done one table by one table. But note, this only supports
for 'native' OpenFlow interface driver.
Related-Bug: #1813703
Related-Bug: #1813712
Related-Bug: #1813709
Related-Bug: #1813708
Change-Id: Ie06d1bebe83ffe
(cherry picked from commit f898ffd71fba4f9
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein) | #57 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit 7865264aaba615f
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 19:46:53 2019 +0800
Divide-
The dump-flows action will get a very large sets of flow information
if there are enormous ports or openflow security group rules. For now
we can meet some known exception during such action, for instance,
memory issue, timeout issue.
So after this patch, the cleanup action of the bridge stale flows
will be done one table by one table. But note, this only supports
for 'native' OpenFlow interface driver.
Related-Bug: #1813703
Related-Bug: #1813712
Related-Bug: #1813709
Related-Bug: #1813708
Change-Id: Ie06d1bebe83ffe
(cherry picked from commit f898ffd71fba4f9
OpenStack Infra (hudson-openstack) wrote : | #58 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit d7d30ea950844f1
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata) | #59 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit ea3d844c75541cc
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 19:46:53 2019 +0800
Divide-
The dump-flows action will get a very large sets of flow information
if there are enormous ports or openflow security group rules. For now
we can meet some known exception during such action, for instance,
memory issue, timeout issue.
So after this patch, the cleanup action of the bridge stale flows
will be done one table by one table. But note, this only supports
for 'native' OpenFlow interface driver.
Related-Bug: #1813703
Related-Bug: #1813712
Related-Bug: #1813709
Related-Bug: #1813708
Change-Id: Ie06d1bebe83ffe
(cherry picked from commit f898ffd71fba4f9
tags: | added: in-stable-ocata |
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.7 | #60 |
This issue was fixed in the openstack/neutron 11.0.7 release.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.3 | #61 |
This issue was fixed in the openstack/neutron 13.0.3 release.
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.6 | #62 |
This issue was fixed in the openstack/neutron 12.0.6 release.
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein) | #63 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit d7764064d045563
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800
Change default local ovs connection timeout
Large number of flows can cause local ovs connection
timeout. Ultimately getting succeed will be better
than a retry or fullsync.
Related-Bug: #1813703
Related-Bug: #1813705
Related-Bug: #1813707
Related-Bug: #1813709
Change-Id: Ifa0608a7e131df
(cherry picked from commit 64ea642359e8f8a
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky) | #64 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit 26a9765afb91790
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800
Change default local ovs connection timeout
Large number of flows can cause local ovs connection
timeout. Ultimately getting succeed will be better
than a retry or fullsync.
Related-Bug: #1813703
Related-Bug: #1813705
Related-Bug: #1813707
Related-Bug: #1813709
Change-Id: Ifa0608a7e131df
(cherry picked from commit 64ea642359e8f8a
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens) | #65 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit df4e0a5394dff4c
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800
Change default local ovs connection timeout
Large number of flows can cause local ovs connection
timeout. Ultimately getting succeed will be better
than a retry or fullsync.
Related-Bug: #1813703
Related-Bug: #1813705
Related-Bug: #1813707
Related-Bug: #1813709
Change-Id: Ifa0608a7e131df
(cherry picked from commit 64ea642359e8f8a
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata) | #66 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit 7a4bc6e43fb7274
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800
Change default local ovs connection timeout
Large number of flows can cause local ovs connection
timeout. Ultimately getting succeed will be better
than a retry or fullsync.
Related-Bug: #1813703
Related-Bug: #1813705
Related-Bug: #1813707
Related-Bug: #1813709
Change-Id: Ifa0608a7e131df
(cherry picked from commit 64ea642359e8f8a
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike) | #67 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit bb508f9e6051ccb
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800
Change default local ovs connection timeout
Large number of flows can cause local ovs connection
timeout. Ultimately getting succeed will be better
than a retry or fullsync.
Related-Bug: #1813703
Related-Bug: #1813705
Related-Bug: #1813707
Related-Bug: #1813709
Change-Id: Ifa0608a7e131df
(cherry picked from commit 64ea642359e8f8a
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens) | #68 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit 39afe0a129b6b97
Author: LIU Yulong <email address hidden>
Date: Fri Apr 12 18:47:24 2019 +0300
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Conflicts:
neutron/
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff
(cherry picked from commit d7d30ea950844f1
(cherry picked from commit 5d705468de1e495
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky) | #69 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit 5d705468de1e495
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff
(cherry picked from commit d7d30ea950844f1
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike) | #70 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit fa16540d2dd80f8
Author: LIU Yulong <email address hidden>
Date: Fri Apr 12 18:47:24 2019 +0300
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Conflicts:
neutron/
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff
(cherry picked from commit d7d30ea950844f1
(cherry picked from commit 5d705468de1e495
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata) | #71 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit 49df07c7039206c
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
On a single compute node, the number of security group rules
can be enormous. This patch adds a step-by-step
processing method to deal with the large number of
security group rules. It also changes or adds some log messages.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Conflicts:
neutron/
Conflicts:
neutron/
neutron/
neutron/
Conflicts:
neutron/
neutron/
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
(cherry picked from commit f5d110e15f60753
(cherry picked from commit 5424b9a68cb3ac1
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata) | #72 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit 9583dc0549da2b4
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800
More accurate agent restart state transfer
The ovs-agent can take a very long time to handle a large number
of ports. By that point, the ovs-agent status report may have
exceeded the configured timeout value, so some flow update operations
will not be triggered. This results in flow loss during agent
restart, especially for the host-to-host vxlan tunnel flows.
This fix lets the ovs-agent explicitly indicate, in the first rpc loop,
that its status is restarted. l2pop is then required
to update the fdb entries.
Conflicts:
neutron/
Conflicts:
neutron/
Closes-Bug: #1813703
Closes-Bug: #1813714
Closes-Bug: #1813715
Closes-Bug: #1794991
Closes-Bug: #1799178
Change-Id: I8edc2deb509216
(cherry picked from commit a5244d6d44d2b66
(cherry picked from commit cc49ab550179bdc
(cherry picked from commit 5ffca4966877454
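The difference between inferring a restart from timing and stating it explicitly can be illustrated with a toy model. Everything here is an assumption made for the example: `ToyServer`, `boot_window`, `port_up` and `first_loops` are hypothetical names, the 180 second window is an arbitrary value, and the time-window heuristic is only one plausible reading of "exceeded the set timeout value"; none of this is neutron's actual l2pop code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyServer:
    """Toy stand-in for the server side; NOT neutron's l2pop driver."""

    boot_window: float = 180.0  # hypothetical timeout, in seconds
    fdb_resyncs: int = 0

    def port_up(self, agent_uptime: float,
                agent_restarted: Optional[bool] = None) -> bool:
        if agent_restarted is None:
            # Assumed heuristic: only a recently started agent is
            # treated as restarted.
            agent_restarted = agent_uptime < self.boot_window
        if agent_restarted:
            # A restarted agent gets its fdb entries sent again.
            self.fdb_resyncs += 1
        return agent_restarted


def first_loops(server: ToyServer, loop_durations: List[float],
                explicit_flag: bool) -> List[bool]:
    """Simulate the first rpc loops of a freshly restarted agent."""
    uptime = 0.0
    seen = []
    for iteration, duration in enumerate(loop_durations):
        uptime += duration
        # With the explicit flag, iteration 0 always reports "restarted".
        flag = (iteration == 0) if explicit_flag else None
        seen.append(server.port_up(uptime, flag))
    return seen
```

If the first loop takes 1200 seconds, the timing heuristic in this model never sees the agent as restarted and no fdb resync happens, whereas the explicit flag triggers exactly one resync in the first loop regardless of how long that loop took.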
tags: | added: neutron-proactive-backport-potential |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #73 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit eaf3ff57863a7af
Author: LIU Yulong <email address hidden>
Date: Tue Mar 19 21:21:45 2019 +0800
Ignore first local port update notification
The ovs-agent will scan and process the ports during the
first rpc_loop, and a local port update notification
will be sent out. This will cause these ports to
be processed again in the ovs-agent's next (second)
rpc_loop.
This patch passes the restart flag (iteration num 0)
to the local port_update call trace. After this patch,
the local port_update notification will be ignored in
the first RPC loop.
Related-Bug: #1813703
Change-Id: Ic5bf718cfd056f
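The effect described in the commit message above can be shown with a small simulation. This is a sketch under stated assumptions rather than the agent's real code: `ToyAgent`, its attributes and the `local` and `restart_flag` parameters are hypothetical, and the model simply assumes that every port handled from the scan emits one local port update notification.

```python
from typing import List, Sequence, Set, Tuple


class ToyAgent:
    """Toy model of the rpc loop; NOT the real ovs-agent."""

    def __init__(self, ignore_first_local_update: bool) -> None:
        self.ignore_first_local_update = ignore_first_local_update
        self.iter_num = 0
        self.updated_ports: Set[str] = set()
        self.processed: List[Tuple[int, str]] = []

    def port_update(self, port_id: str, local: bool = False,
                    restart_flag: bool = False) -> None:
        # The patched behaviour: drop local notifications that were
        # produced during the first loop (iteration 0).
        if local and restart_flag and self.ignore_first_local_update:
            return
        self.updated_ports.add(port_id)

    def rpc_loop_once(self, scanned_ports: Sequence[str] = ()) -> None:
        restart = self.iter_num == 0
        pending, self.updated_ports = self.updated_ports, set()
        for port in sorted(set(scanned_ports) | pending):
            self.processed.append((self.iter_num, port))
            if port in scanned_ports:
                # Assumption: handling a scanned port emits a local
                # port update notification.
                self.port_update(port, local=True, restart_flag=restart)
        self.iter_num += 1


def run_two_loops(ignore: bool) -> List[Tuple[int, str]]:
    agent = ToyAgent(ignore_first_local_update=ignore)
    agent.rpc_loop_once(["p1", "p2"])  # first loop: full scan
    agent.rpc_loop_once()              # second loop: only updates
    return agent.processed
```

In this model, `run_two_loops(False)` handles each port twice (once per loop), while `run_two_loops(True)` handles each port only in the first loop.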
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein) | #74 |
Related fix proposed to branch: stable/stein
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky) | #75 |
Related fix proposed to branch: stable/rocky
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein) | #76 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit e1b84c9a701a678
Author: LIU Yulong <email address hidden>
Date: Tue Mar 19 21:21:45 2019 +0800
Ignore first local port update notification
The ovs-agent will scan and process the ports during the
first rpc_loop, and a local port update notification
will be sent out. This will cause these ports to
be processed again in the ovs-agent's next (second)
rpc_loop.
This patch passes the restart flag (iteration num 0)
to the local port_update call trace. After this patch,
the local port_update notification will be ignored in
the first RPC loop.
Related-Bug: #1813703
Change-Id: Ic5bf718cfd056f
(cherry picked from commit eaf3ff57863a7af
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky) | #77 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit 56c070c5a37f065
Author: LIU Yulong <email address hidden>
Date: Tue Mar 19 21:21:45 2019 +0800
Ignore first local port update notification
The ovs-agent will scan and process the ports during the
first rpc_loop, and a local port update notification
will be sent out. This will cause these ports to
be processed again in the ovs-agent's next (second)
rpc_loop.
This patch passes the restart flag (iteration num 0)
to the local port_update call trace. After this patch,
the local port_update notification will be ignored in
the first RPC loop.
Related-Bug: #1813703
Change-Id: Ic5bf718cfd056f
(cherry picked from commit eaf3ff57863a7af
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata) | #78 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit bb2734b0d524aef
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800
Do not call update_device_list in large sets
The ovs-agent can process ports in large sets, and every one
of these ports then has to update its DB status or attributes.
But the neutron server is centralized: it may be busy with
other work, or the database processing itself can be
time-consuming. Because of this, the call sometimes fails with
an RPC timeout exception on the ovs-agent side, and a full sync
is triggered in the next rpc loop. The restart time then
becomes longer and longer.
This change adds a default step size for port updates to reduce
the probability of an RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Conflicts:
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #79 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 8e73de8bc42067c
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 14:01:08 2019 +0800
Change ovs-agent iteration log level to INFO
Operators may want to see how long the port processing
procedure takes, since DEBUG logging is generally not
enabled in production environments.
Related-Bug: #1813703
Related-Bug: #1813707
Related-Bug: #1813706
Related-Bug: #1813709
Change-Id: I43733546abf542
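The reasoning behind the log level change can be sketched with the standard `logging` module. This is only an illustration: the logger name `ovs_agent_sketch`, the `run_iteration` helper, the message wording and the injectable `clock` parameter are assumptions made for the example, not the agent's actual log lines.

```python
import logging
import time
from typing import Callable

LOG = logging.getLogger("ovs_agent_sketch")  # hypothetical logger name


def run_iteration(iter_num: int,
                  process_ports: Callable[[], int],
                  clock: Callable[[], float] = time.monotonic) -> float:
    """Run one loop iteration and log its duration at INFO level.

    `process_ports` is a stand-in for the real port processing and
    returns the number of ports it handled.
    """
    start = clock()
    # Detail that is only useful while debugging stays at DEBUG.
    LOG.debug("iteration %d started", iter_num)
    processed = process_ports()
    elapsed = clock() - start
    # The timing summary is logged at INFO so it is still visible when
    # DEBUG output is switched off.
    LOG.info("iteration %d completed: %d ports processed in %.3f s",
             iter_num, processed, elapsed)
    return elapsed
```

With the logger level set to `logging.INFO`, as is typical in production, only the timing summary is emitted and the DEBUG line is suppressed.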
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein) | #80 |
Related fix proposed to branch: stable/stein
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky) | #81 |
Related fix proposed to branch: stable/rocky
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens) | #82 |
Related fix proposed to branch: stable/queens
Review: https:/
tags: | removed: neutron-proactive-backport-potential |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein) | #83 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit a10413eb3fa52de
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 14:01:08 2019 +0800
Change ovs-agent iteration log level to INFO
Operators may want to see how long the port processing
procedure takes, since DEBUG logging is generally not
enabled in production environments.
Related-Bug: #1813703
Related-Bug: #1813707
Related-Bug: #1813706
Related-Bug: #1813709
Conflicts:
Change-Id: I43733546abf542
(cherry picked from commit 8e73de8bc42067c
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky) | #84 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit 41fe9ff147244eb
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 14:01:08 2019 +0800
Change ovs-agent iteration log level to INFO
Operators may want to see how long the port processing
procedure takes, since DEBUG logging is generally not
enabled in production environments.
Related-Bug: #1813703
Related-Bug: #1813707
Related-Bug: #1813706
Related-Bug: #1813709
Conflicts:
Change-Id: I43733546abf542
(cherry picked from commit 8e73de8bc42067c
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens) | #85 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit 713ad71c6f4e389
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 14:01:08 2019 +0800
Change ovs-agent iteration log level to INFO
Operators may want to see how long the port processing
procedure takes, since DEBUG logging is generally not
enabled in production environments.
Related-Bug: #1813703
Related-Bug: #1813707
Related-Bug: #1813706
Related-Bug: #1813709
Conflicts:
Change-Id: I43733546abf542
(cherry picked from commit 8e73de8bc42067c
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol | #86 |
This issue was fixed in the openstack/neutron ocata-eol release.
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master) | #87 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master) | #88 |
Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https:/
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
It is good to collect all these bugs in a single location.
Thanks for the update.