[L2][scale issue] ovs-agent restart takes too long
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Fix Released | Undecided | Unassigned |
Bug Description
When the number of subnets or security group ports on a host reaches 2000+, the ovs-agent takes 15-40+ minutes to restart.
During the restart, the ovs-agent does not process any ports, so VMs booting on this host will not get their L2 flows established.
This is a subproblem of bug #1813703. For more information, please see the summary:
https:/

Swaminathan Vasudevan (swaminathan-vasudevan) wrote : | #1 |
tags: | added: ovs |

LIU Yulong (dragon889) wrote : | #2 |
Again, the function setup_port_filters takes up most of the time:
https:/
https:/
But the retry and the fullsync it triggers can also waste time.
https:/

Dirk Mueller (dmllr) wrote : | #3 |
Is there a debug log available for that restart? It would be good to see the timings of
and
in that setup.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master) | #4 |
Related fix proposed to branch: master
Review: https:/

OpenStack Infra (hudson-openstack) wrote : | #5 |
Related fix proposed to branch: master
Review: https:/

OpenStack Infra (hudson-openstack) wrote : | #6 |
Related fix proposed to branch: master
Review: https:/

OpenStack Infra (hudson-openstack) wrote : | #7 |
Related fix proposed to branch: master
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #8 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 6ac420df7eb3ed3
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Change-Id: I57bf27ec75cf84
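The step-by-step processing described in this commit message can be illustrated with a minimal sketch. This is not the code from the patch: `in_steps`, `apply_filters_in_steps`, the `apply_batch` callback and the step size of 100 are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def in_steps(items: Iterable[str], step: int) -> Iterator[List[str]]:
    """Yield successive batches holding at most `step` items each."""
    if step <= 0:
        raise ValueError("step must be a positive integer")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == step:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch


def apply_filters_in_steps(
    device_ids: Sequence[str],
    apply_batch: Callable[[List[str]], None],
    step: int = 100,  # assumed value, not the default used by the patch
) -> int:
    """Apply filters batch by batch; return the number of batches processed."""
    count = 0
    for batch in in_steps(device_ids, step):
        apply_batch(batch)  # hypothetical per-batch filter setup
        count += 1
    return count
```

Working in bounded batches keeps each unit of work small, so a failure or a slow step affects one batch rather than the whole set of ports on the node.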

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein) | #9 |
Related fix proposed to branch: stable/stein
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky) | #10 |
Related fix proposed to branch: stable/rocky
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens) | #11 |
Related fix proposed to branch: stable/queens
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike) | #12 |
Related fix proposed to branch: stable/pike
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #13 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 8408af4f173a0ff
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Change-Id: Ie37f4a4869969e
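A rough sketch of the idea in this commit message: report port status to the server in small chunks, so that a single slow call does not time out for the entire set. The names `chunks`, `update_devices_in_steps` and `rpc_update`, the step of 256, and the shape of the returned dict are assumptions made for this example, not the agent's actual interface.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def chunks(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of at most `size` items until `items` is exhausted."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def update_devices_in_steps(
    devices_up: Iterable[str],
    rpc_update: Callable[[List[str]], Dict[str, List[str]]],
    step: int = 256,  # assumed value for illustration only
) -> List[str]:
    """Send devices in chunks and return the devices that failed to update."""
    failed: List[str] = []
    for chunk in chunks(devices_up, step):
        try:
            result = rpc_update(chunk)  # hypothetical RPC wrapper
        except TimeoutError:
            # Only this chunk needs to be retried later, not the full set.
            failed.extend(chunk)
            continue
        failed.extend(result.get("failed_devices_up", []))
    return failed
```

With this shape, a timeout costs at most one chunk of work, which is the property the commit message relies on to avoid an ever-longer restart.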

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein) | #14 |
Related fix proposed to branch: stable/stein
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky) | #15 |
Related fix proposed to branch: stable/rocky
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens) | #16 |
Related fix proposed to branch: stable/queens
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike) | #17 |
Related fix proposed to branch: stable/pike
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata) | #18 |
Related fix proposed to branch: stable/ocata
Review: https:/

OpenStack Infra (hudson-openstack) wrote : | #19 |
Related fix proposed to branch: stable/ocata
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #20 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 64ea642359e8f8a
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800
Change default local ovs connection timeout
Large number of flows can cause local ovs connection
timeout. Ultimately getting succeed will be better
than a retry or fullsync.
Related-Bug: #1813703
Related-Bug: #1813705
Related-Bug: #1813707
Related-Bug: #1813709
Change-Id: Ifa0608a7e131df

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein) | #21 |
Related fix proposed to branch: stable/stein
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky) | #22 |
Related fix proposed to branch: stable/rocky
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens) | #23 |
Related fix proposed to branch: stable/queens
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike) | #24 |
Related fix proposed to branch: stable/pike
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata) | #25 |
Related fix proposed to branch: stable/ocata
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein) | #26 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit 98139553424375a
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
tags: | added: in-stable-stein |
tags: | added: in-stable-queens |

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens) | #27 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit 195c1378317719d
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Conflicts:
neutron/
Conflicts:
neutron/
neutron/
neutron/
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
(cherry picked from commit f5d110e15f60753
tags: | added: in-stable-pike |

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike) | #28 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit 51a766653395c11
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Conflicts:
neutron/
Conflicts:
neutron/
neutron/
neutron/
Conflicts:
neutron/
neutron/
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
(cherry picked from commit f5d110e15f60753
(cherry picked from commit 5424b9a68cb3ac1

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky) | #29 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit 6494fcc2e44d9d9
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Conflicts:
neutron/
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
tags: | added: in-stable-rocky |

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein) | #30 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit d7d30ea950844f1
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff

OpenStack Infra (hudson-openstack) wrote : | #31 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit d7764064d045563
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800
Change default local ovs connection timeout
Large number of flows can cause local ovs connection
timeout. Ultimately getting succeed will be better
than a retry or fullsync.
Related-Bug: #1813703
Related-Bug: #1813705
Related-Bug: #1813707
Related-Bug: #1813709
Change-Id: Ifa0608a7e131df
(cherry picked from commit 64ea642359e8f8a

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky) | #32 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit 26a9765afb91790
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800
Change default local ovs connection timeout
Large number of flows can cause local ovs connection
timeout. Ultimately getting succeed will be better
than a retry or fullsync.
Related-Bug: #1813703
Related-Bug: #1813705
Related-Bug: #1813707
Related-Bug: #1813709
Change-Id: Ifa0608a7e131df
(cherry picked from commit 64ea642359e8f8a

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens) | #33 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit df4e0a5394dff4c
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800
Change default local ovs connection timeout
Large number of flows can cause local ovs connection
timeout. Ultimately getting succeed will be better
than a retry or fullsync.
Related-Bug: #1813703
Related-Bug: #1813705
Related-Bug: #1813707
Related-Bug: #1813709
Change-Id: Ifa0608a7e131df
(cherry picked from commit 64ea642359e8f8a

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata) | #34 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit 7a4bc6e43fb7274
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800
Change default local ovs connection timeout
Large number of flows can cause local ovs connection
timeout. Ultimately getting succeed will be better
than a retry or fullsync.
Related-Bug: #1813703
Related-Bug: #1813705
Related-Bug: #1813707
Related-Bug: #1813709
Change-Id: Ifa0608a7e131df
(cherry picked from commit 64ea642359e8f8a
tags: | added: in-stable-ocata |

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike) | #35 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit bb508f9e6051ccb
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800
Change default local ovs connection timeout
Large number of flows can cause local ovs connection
timeout. Ultimately getting succeed will be better
than a retry or fullsync.
Related-Bug: #1813703
Related-Bug: #1813705
Related-Bug: #1813707
Related-Bug: #1813709
Change-Id: Ifa0608a7e131df
(cherry picked from commit 64ea642359e8f8a

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens) | #36 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit 39afe0a129b6b97
Author: LIU Yulong <email address hidden>
Date: Fri Apr 12 18:47:24 2019 +0300
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Conflicts:
neutron/
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff
(cherry picked from commit d7d30ea950844f1
(cherry picked from commit 5d705468de1e495

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky) | #37 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit 5d705468de1e495
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff
(cherry picked from commit d7d30ea950844f1

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike) | #38 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit fa16540d2dd80f8
Author: LIU Yulong <email address hidden>
Date: Fri Apr 12 18:47:24 2019 +0300
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Conflicts:
neutron/
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff
(cherry picked from commit d7d30ea950844f1
(cherry picked from commit 5d705468de1e495

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata) | #39 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit 49df07c7039206c
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Conflicts:
neutron/
Conflicts:
neutron/
neutron/
neutron/
Conflicts:
neutron/
neutron/
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
(cherry picked from commit f5d110e15f60753
(cherry picked from commit 5424b9a68cb3ac1
tags: | added: neutron-proactive-backport-potential |

OpenStack Infra (hudson-openstack) wrote : | #40 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit bb2734b0d524aef
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Conflicts:
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #41 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 8e73de8bc42067c
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 14:01:08 2019 +0800
Change ovs-agent iteration log level to INFO
Operators may want to see how long it takes in the port
processing procedure since DEBUG log does not enable
basically in the production environment.
Related-Bug: #1813703
Related-Bug: #1813707
Related-Bug: #1813706
Related-Bug: #1813709
Change-Id: I43733546abf542
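As an illustration of why the log level matters here, the sketch below times one loop iteration and reports it at INFO, so the duration is visible even when DEBUG logging is disabled. The logger name, the `timed_iteration` helper and the message text are hypothetical and do not reproduce the agent's actual log line.

```python
import logging
import time
from typing import Callable

LOG = logging.getLogger("ovs_agent_sketch")  # hypothetical logger name


def timed_iteration(iteration: int, work: Callable[[], None]) -> float:
    """Run one iteration of `work` and log its duration at INFO level."""
    start = time.monotonic()
    work()
    elapsed = time.monotonic() - start
    # INFO rather than DEBUG, so operators see it with default settings.
    LOG.info("iteration %d completed, elapsed: %.3f s", iteration, elapsed)
    return elapsed


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    timed_iteration(1, lambda: time.sleep(0.01))
```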

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein) | #42 |
Related fix proposed to branch: stable/stein
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky) | #43 |
Related fix proposed to branch: stable/rocky
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens) | #44 |
Related fix proposed to branch: stable/queens
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein) | #45 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit a10413eb3fa52de
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 14:01:08 2019 +0800
Change ovs-agent iteration log level to INFO
Operators may want to see how long it takes in the port
processing procedure since DEBUG log does not enable
basically in the production environment.
Related-Bug: #1813703
Related-Bug: #1813707
Related-Bug: #1813706
Related-Bug: #1813709
Conflicts:
Change-Id: I43733546abf542
(cherry picked from commit 8e73de8bc42067c

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky) | #46 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit 41fe9ff147244eb
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 14:01:08 2019 +0800
Change ovs-agent iteration log level to INFO
Operators may want to see how long it takes in the port
processing procedure since DEBUG log does not enable
basically in the production environment.
Related-Bug: #1813703
Related-Bug: #1813707
Related-Bug: #1813706
Related-Bug: #1813709
Conflicts:
Change-Id: I43733546abf542
(cherry picked from commit 8e73de8bc42067c

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens) | #47 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit 713ad71c6f4e389
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 14:01:08 2019 +0800
Change ovs-agent iteration log level to INFO
Operators may want to see how long it takes in the port
processing procedure since DEBUG log does not enable
basically in the production environment.
Related-Bug: #1813703
Related-Bug: #1813707
Related-Bug: #1813706
Related-Bug: #1813709
Conflicts:
Change-Id: I43733546abf542
(cherry picked from commit 8e73de8bc42067c
Changed in neutron: | |
status: | New → Fix Released |

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master) | #48 |
Related fix proposed to branch: master
Review: https:/

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master) | #49 |
Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https:/
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
So do you know where the ovs-agent spends those 15-40 minutes? Is it just spending time in the polling loop, or is it blocked on some resource?