[L2][scale issue] RPC timeout during ovs-agent restart
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
neutron | Fix Released | Medium | Unassigned | |
Bug Description
When the number of ports under one subnet or security group reaches 2000+, the ovs-agent always hits an RPC timeout during restart.
This is a subproblem of bug #1813703. For more information, please see the summary:
https:/
Swaminathan Vasudevan (swaminathan-vasudevan) wrote : | #1 |
tags: | added: ovs |
LIU Yulong (dragon889) wrote : | #2 |
The RPC timeout was only seen on the ovs-agent side. With high probability, the timeout was hit in the function setup_port_filters:
https:/
This is most likely caused by heavy queries on the neutron server side, although we did not see any exception there. Other possibilities are that the neutron server does not finish sending the data response, or that the query involves large table joins or insufficiently selective conditions.
So for this issue, IMO, we should reduce the size of the RPC query request for this method.
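The suggestion above to reduce the RPC request size can be illustrated with a minimal sketch. It is not the neutron implementation: `chunked`, `fetch_in_steps`, `fake_rpc`, and the step size of 100 are all hypothetical names and values chosen for illustration. The idea is only that each request carries a bounded number of devices, so no single server-side query has to cover all 2000+ ports.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], step: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `step` items."""
    if step <= 0:
        raise ValueError("step must be positive")
    for start in range(0, len(items), step):
        yield items[start:start + step]


def fetch_in_steps(
    device_ids: Sequence[str],
    rpc_call: Callable[[Sequence[str]], Dict[str, dict]],
    step: int = 100,  # hypothetical default, not a neutron setting
) -> Dict[str, dict]:
    """Issue one RPC per chunk and merge the per-device results."""
    merged: Dict[str, dict] = {}
    for chunk in chunked(device_ids, step):
        merged.update(rpc_call(chunk))
    return merged


# Hypothetical stand-in for the server-side query; it records how many
# devices each request carried instead of touching a database.
request_sizes: List[int] = []


def fake_rpc(ids: Sequence[str]) -> Dict[str, dict]:
    request_sizes.append(len(ids))
    return {device_id: {"rules": []} for device_id in ids}


ports = [f"port-{i}" for i in range(2050)]
result = fetch_in_steps(ports, fake_rpc, step=100)
```

With 2050 ports and a step of 100, the agent side sends 21 smaller requests instead of one large one. The trade-off is more round trips in exchange for a bounded amount of work per request.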
Changed in neutron: | |
status: | New → Confirmed |
importance: | Undecided → Medium |
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master) | #3 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #4 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #5 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 6ac420df7eb3ed3
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Change-Id: I57bf27ec75cf84
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein) | #6 |
Related fix proposed to branch: stable/stein
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky) | #7 |
Related fix proposed to branch: stable/rocky
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens) | #8 |
Related fix proposed to branch: stable/queens
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike) | #9 |
Related fix proposed to branch: stable/pike
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master) | #10 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 8408af4f173a0ff
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Change-Id: Ie37f4a4869969e
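The commit message above describes updating ports in steps rather than in one large set. A minimal sketch of that idea follows; it is not the merged patch. The function name `update_device_list_in_steps`, the `update_rpc` callable, the step size, and the four result keys are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence

# Assumed result keys for the per-batch response (illustrative only).
RESULT_KEYS = ("devices_up", "failed_devices_up",
               "devices_down", "failed_devices_down")


def update_device_list_in_steps(
    update_rpc: Callable[[Sequence[str], Sequence[str]], Dict[str, List[str]]],
    devices_up: Sequence[str],
    devices_down: Sequence[str],
    step: int = 20,  # hypothetical default step
) -> Dict[str, List[str]]:
    """Call `update_rpc` once per slice and merge the per-slice results."""
    if step <= 0:
        raise ValueError("step must be positive")
    merged: Dict[str, List[str]] = {key: [] for key in RESULT_KEYS}
    longest = max(len(devices_up), len(devices_down))
    for start in range(0, longest, step):
        part = update_rpc(devices_up[start:start + step],
                          devices_down[start:start + step])
        for key in RESULT_KEYS:
            merged[key].extend(part.get(key, []))
    return merged


# Hypothetical stand-in for the server: every device update succeeds.
calls: List[tuple] = []


def fake_update(up: Sequence[str], down: Sequence[str]) -> Dict[str, List[str]]:
    calls.append((len(up), len(down)))
    return {"devices_up": list(up), "failed_devices_up": [],
            "devices_down": list(down), "failed_devices_down": []}


up = [f"up-{i}" for i in range(45)]
down = [f"down-{i}" for i in range(5)]
summary = update_device_list_in_steps(fake_update, up, down, step=20)
```

Each call now asks the server to update at most `step` devices per list, which keeps the work per request bounded and should make a single request less likely to exceed the RPC timeout.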
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein) | #11 |
Related fix proposed to branch: stable/stein
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky) | #12 |
Related fix proposed to branch: stable/rocky
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens) | #13 |
Related fix proposed to branch: stable/queens
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike) | #14 |
Related fix proposed to branch: stable/pike
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata) | #15 |
Related fix proposed to branch: stable/ocata
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #16 |
Related fix proposed to branch: stable/ocata
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein) | #17 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit 98139553424375a
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
tags: | added: in-stable-stein |
tags: | added: in-stable-queens |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens) | #18 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit 195c1378317719d
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Conflicts:
neutron/
Conflicts:
neutron/
neutron/
neutron/
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
(cherry picked from commit f5d110e15f60753
tags: | added: in-stable-pike |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike) | #19 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit 51a766653395c11
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Conflicts:
neutron/
Conflicts:
neutron/
neutron/
neutron/
Conflicts:
neutron/
neutron/
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
(cherry picked from commit f5d110e15f60753
(cherry picked from commit 5424b9a68cb3ac1
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky) | #20 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit 6494fcc2e44d9d9
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Conflicts:
neutron/
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
tags: | added: in-stable-rocky |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein) | #21 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/stein
commit d7d30ea950844f1
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens) | #22 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit 39afe0a129b6b97
Author: LIU Yulong <email address hidden>
Date: Fri Apr 12 18:47:24 2019 +0300
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Conflicts:
neutron/
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff
(cherry picked from commit d7d30ea950844f1
(cherry picked from commit 5d705468de1e495
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky) | #23 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/rocky
commit 5d705468de1e495
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff
(cherry picked from commit d7d30ea950844f1
Arjun Baindur (abaindur) wrote : | #24 |
Does anyone have a recommended rpc_response_
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike) | #25 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit fa16540d2dd80f8
Author: LIU Yulong <email address hidden>
Date: Fri Apr 12 18:47:24 2019 +0300
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Conflicts:
neutron/
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff
(cherry picked from commit d7d30ea950844f1
(cherry picked from commit 5d705468de1e495
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata) | #26 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit 49df07c7039206c
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800
Divide-
In one specific compute node, the security group rules
can be enormous quantity. This patch adds a step-by-step
processing method to deal with the large number of the
security group rules. And also changes or adds some LOG.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813707
Conflicts:
neutron/
Conflicts:
neutron/
neutron/
neutron/
Conflicts:
neutron/
neutron/
Change-Id: I57bf27ec75cf84
(cherry picked from commit 6ac420df7eb3ed3
(cherry picked from commit f5d110e15f60753
(cherry picked from commit 5424b9a68cb3ac1
tags: | added: in-stable-ocata |
OpenStack Infra (hudson-openstack) wrote : | #27 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit bb2734b0d524aef
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800
Do not call update_device_list in large sets
Ovs-agent can process the ports in large sets, then all
of these ports will have to update DB status or attributes.
But neutron server is centralized. It may have to do
something else, or the database processing can be also
time-consuming. Because of these, it sometimes returns
the RPC timeout exception to ovs-agent. And a fullsync
will be triggered in next rpc loop. The restart time is
becoming longer and longer.
Adds a default step to update the port to reduce
the probability of RPC timeout.
Related-Bug: #1813703
Related-Bug: #1813704
Related-Bug: #1813706
Related-Bug: #1813707
Conflicts:
Change-Id: Ie37f4a4869969e
(cherry picked from commit 8408af4f173a0ff
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master) | #28 |
Related fix proposed to branch: master
Review: https:/
Changed in neutron: | |
status: | Confirmed → Fix Released |
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master) | #29 |
Change abandoned by "liuyulong <email address hidden>" on branch: master
Review: https:/
Reason: reopen if we want this
Does tuning the RabbitMQ config options help the server accept more requests?
With a heavy load of 2000+ ports, is RabbitMQ also unable to keep up on the server side, or is this only an ovs-agent issue?