[L2][scale issue] RPC timeout during ovs-agent restart

Bug #1813704 reported by LIU Yulong
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

When the number of ports under one subnet or security group reaches 2000+, the ovs-agent always hits an RPC timeout during restart.
This is a subproblem of bug #1813703; for more information, please see the summary:
https://bugs.launchpad.net/neutron/+bug/1813703

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Does tuning the RabbitMQ config options so that it accepts more requests help?

Under a heavy load of 2000+ ports, is RabbitMQ on the server side also unable to keep up, or is this just an ovs-agent issue?

tags: added: ovs
Revision history for this message
LIU Yulong (dragon889) wrote :

The RPC timeout was only seen on the ovs-agent side. It was hit with high probability in the function setup_port_filters:
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1791-L1792
Most likely this is caused by heavy queries on the neutron server side, although we did not see any exception there. Other possibilities are that neutron server does not finish sending the response data, or that the queries involve large table joins or insufficient filter conditions.
So for this issue, IMO, we should reduce the RPC query request size for this method.
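
One way to reduce the request size, sketched here under assumptions: split the device IDs into fixed-size slices and issue one call per slice. The names `iter_in_steps` and `refresh_filters_in_steps` and the step size of 100 are hypothetical and chosen only for illustration; they are not neutron's API, and `refresh` stands in for whatever per-batch call the agent makes.

```python
from typing import Callable, Iterable, Iterator, List


def iter_in_steps(items: Iterable[str], step: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `step` items."""
    if step < 1:
        raise ValueError("step must be >= 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == step:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, slice.
        yield batch


def refresh_filters_in_steps(
    device_ids: Iterable[str],
    refresh: Callable[[List[str]], None],
    step: int = 100,  # hypothetical default, not a neutron constant
) -> int:
    """Call `refresh` once per slice and return the number of calls made."""
    calls = 0
    for batch in iter_in_steps(device_ids, step):
        refresh(batch)
        calls += 1
    return calls
```

With 2050 ports and a step of 100, this makes 21 small calls instead of one large one, so each request asks the server for far less data at a time.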

Changed in neutron:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/638642

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/638646

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/638642
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6ac420df7eb3ed324669472c61fec41b3d9cb35b
Submitter: Zuul
Branch: master

commit 6ac420df7eb3ed324669472c61fec41b3d9cb35b
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800

    Divide-and-conquer security group beasts

    In one specific compute node, the security group rules
    can be enormous quantity. This patch adds a step-by-step
    processing method to deal with the large number of the
    security group rules. And also changes or adds some LOG.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813707

    Change-Id: I57bf27ec75cf848271c5a28b22beee12b8bd5faa

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.openstack.org/649343

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/649365

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/649366

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/649369

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/638646
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8408af4f173a0ffde354599e26c49bf9e17e8bef
Submitter: Zuul
Branch: master

commit 8408af4f173a0ffde354599e26c49bf9e17e8bef
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
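
The "default step" idea in the commit message above can be sketched roughly as follows. This is an illustration under assumptions, not the merged patch: `rpc_call` is a hypothetical callable standing in for the real RPC method, the step of 20 is an assumed value, and the four result keys are assumed to be lists in the server's reply.

```python
from typing import Callable, Dict, List, Sequence

# Assumed shape of each RPC reply: a dict with these list-valued keys.
RESULT_KEYS = ("devices_up", "failed_devices_up",
               "devices_down", "failed_devices_down")


def update_device_list_in_steps(
    rpc_call: Callable[[Sequence[str], Sequence[str]], Dict[str, List[str]]],
    devices_up: Sequence[str],
    devices_down: Sequence[str],
    step: int = 20,  # hypothetical step size
) -> Dict[str, List[str]]:
    """Send the up/down device lists in slices and merge the replies."""
    merged: Dict[str, List[str]] = {key: [] for key in RESULT_KEYS}
    longest = max(len(devices_up), len(devices_down))
    for start in range(0, longest, step):
        # Each call carries at most `step` devices from each list.
        result = rpc_call(devices_up[start:start + step],
                          devices_down[start:start + step])
        for key in RESULT_KEYS:
            merged[key].extend(result.get(key, []))
    return merged
```

Because the replies are merged into one dict, the caller sees the same result shape as it would from a single large call, while each individual request stays small.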

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.openstack.org/649682

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/649683

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/649688

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/649691

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/649693

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/649701

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein)

Reviewed: https://review.openstack.org/649343
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=98139553424375a2a0ec18fb6b07b4bf30fe88d0
Submitter: Zuul
Branch: stable/stein

commit 98139553424375a2a0ec18fb6b07b4bf30fe88d0
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800

    Divide-and-conquer security group beasts

    In one specific compute node, the security group rules
    can be enormous quantity. This patch adds a step-by-step
    processing method to deal with the large number of the
    security group rules. And also changes or adds some LOG.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813707

    Change-Id: I57bf27ec75cf848271c5a28b22beee12b8bd5faa
    (cherry picked from commit 6ac420df7eb3ed324669472c61fec41b3d9cb35b)

tags: added: in-stable-stein
tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/649366
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=195c1378317719d548dfd149ecc0ec9b01d53eef
Submitter: Zuul
Branch: stable/queens

commit 195c1378317719d548dfd149ecc0ec9b01d53eef
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800

    Divide-and-conquer security group beasts

    In one specific compute node, the security group rules
    can be enormous quantity. This patch adds a step-by-step
    processing method to deal with the large number of the
    security group rules. And also changes or adds some LOG.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813707

    Conflicts:
     neutron/common/constants.py

    Conflicts:
     neutron/agent/securitygroups_rpc.py
     neutron/common/constants.py
     neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

    Change-Id: I57bf27ec75cf848271c5a28b22beee12b8bd5faa
    (cherry picked from commit 6ac420df7eb3ed324669472c61fec41b3d9cb35b)
    (cherry picked from commit f5d110e15f60753d056da942414ca6ecd6b78d8a)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/649369
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=51a766653395c11985b7dd5d3e3549224ae4ca88
Submitter: Zuul
Branch: stable/pike

commit 51a766653395c11985b7dd5d3e3549224ae4ca88
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800

    Divide-and-conquer security group beasts

    In one specific compute node, the security group rules
    can be enormous quantity. This patch adds a step-by-step
    processing method to deal with the large number of the
    security group rules. And also changes or adds some LOG.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813707

    Conflicts:
     neutron/common/constants.py

    Conflicts:
     neutron/agent/securitygroups_rpc.py
     neutron/common/constants.py
     neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

    Conflicts:
     neutron/agent/common/ovs_lib.py
     neutron/common/constants.py

    Change-Id: I57bf27ec75cf848271c5a28b22beee12b8bd5faa
    (cherry picked from commit 6ac420df7eb3ed324669472c61fec41b3d9cb35b)
    (cherry picked from commit f5d110e15f60753d056da942414ca6ecd6b78d8a)
    (cherry picked from commit 5424b9a68cb3ac1fcc04ed8ae603c421bde2dee3)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky)

Reviewed: https://review.openstack.org/649365
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6494fcc2e44d9d9310e3ebaa92582f4f78d08b75
Submitter: Zuul
Branch: stable/rocky

commit 6494fcc2e44d9d9310e3ebaa92582f4f78d08b75
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800

    Divide-and-conquer security group beasts

    In one specific compute node, the security group rules
    can be enormous quantity. This patch adds a step-by-step
    processing method to deal with the large number of the
    security group rules. And also changes or adds some LOG.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813707

    Conflicts:
     neutron/common/constants.py

    Change-Id: I57bf27ec75cf848271c5a28b22beee12b8bd5faa
    (cherry picked from commit 6ac420df7eb3ed324669472c61fec41b3d9cb35b)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein)

Reviewed: https://review.openstack.org/649682
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d7d30ea950844f11348fa2827908622e3a8c7dfb
Submitter: Zuul
Branch: stable/stein

commit d7d30ea950844f11348fa2827908622e3a8c7dfb
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/649688
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=39afe0a129b6b979d0b56ec59048a4e16bedf7a9
Submitter: Zuul
Branch: stable/queens

commit 39afe0a129b6b979d0b56ec59048a4e16bedf7a9
Author: LIU Yulong <email address hidden>
Date: Fri Apr 12 18:47:24 2019 +0300

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Conflicts:
     neutron/tests/unit/plugins/ml2/test_rpc.py

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)
    (cherry picked from commit d7d30ea950844f11348fa2827908622e3a8c7dfb)
    (cherry picked from commit 5d705468de1e495639f8b87266ccfc9391ce6135)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky)

Reviewed: https://review.openstack.org/649683
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5d705468de1e495639f8b87266ccfc9391ce6135
Submitter: Zuul
Branch: stable/rocky

commit 5d705468de1e495639f8b87266ccfc9391ce6135
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)
    (cherry picked from commit d7d30ea950844f11348fa2827908622e3a8c7dfb)

Revision history for this message
Arjun Baindur (abaindur) wrote :

Does anyone have a recommended rpc_response_timeout to set? The default of 60 is far too low, IMO, even with these fixes. Is there a downside to setting it as high as 480 or even 600? If the network is down, it won't matter. If neutron-server is just slow, it is better to wait than to retry the request and make neutron-server do double the work.
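
For reference, rpc_response_timeout is an oslo.messaging option that lives in the [DEFAULT] section of the service configuration (typically neutron.conf, though the exact file depends on the deployment). A fragment using the 600 figure mentioned above, purely as an illustration and not a tested recommendation:

```ini
[DEFAULT]
# Seconds to wait for a response from an RPC call (default: 60).
# 600 is the example value from this comment, not a verified setting.
rpc_response_timeout = 600
```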

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike)

Reviewed: https://review.opendev.org/649691
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fa16540d2dd80f836c8fa2a424717899ac64af60
Submitter: Zuul
Branch: stable/pike

commit fa16540d2dd80f836c8fa2a424717899ac64af60
Author: LIU Yulong <email address hidden>
Date: Fri Apr 12 18:47:24 2019 +0300

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Conflicts:
     neutron/tests/unit/plugins/ml2/test_rpc.py

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)
    (cherry picked from commit d7d30ea950844f11348fa2827908622e3a8c7dfb)
    (cherry picked from commit 5d705468de1e495639f8b87266ccfc9391ce6135)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata)

Reviewed: https://review.opendev.org/649701
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=49df07c7039206c537c17f40140d290e1b28a3f4
Submitter: Zuul
Branch: stable/ocata

commit 49df07c7039206c537c17f40140d290e1b28a3f4
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 16:47:42 2019 +0800

    Divide-and-conquer security group beasts

    In one specific compute node, the security group rules
    can be enormous quantity. This patch adds a step-by-step
    processing method to deal with the large number of the
    security group rules. And also changes or adds some LOG.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813707

    Conflicts:
     neutron/common/constants.py
            neutron/agent/common/ovs_lib.py
    Conflicts:
     neutron/agent/securitygroups_rpc.py
     neutron/common/constants.py
     neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

    Conflicts:
     neutron/agent/common/ovs_lib.py
     neutron/common/constants.py

    Change-Id: I57bf27ec75cf848271c5a28b22beee12b8bd5faa
    (cherry picked from commit 6ac420df7eb3ed324669472c61fec41b3d9cb35b)
    (cherry picked from commit f5d110e15f60753d056da942414ca6ecd6b78d8a)
    (cherry picked from commit 5424b9a68cb3ac1fcc04ed8ae603c421bde2dee3)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/649693
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bb2734b0d524aef348b69ae02988449f9dd63c56
Submitter: Zuul
Branch: stable/ocata

commit bb2734b0d524aef348b69ae02988449f9dd63c56
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Conflicts:
            neutron/common/constants.py
            neutron/agent/rpc.py
            neutron/tests/unit/plugins/ml2/test_rpc.py

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/806329

Changed in neutron:
status: Confirmed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "liuyulong <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/806329
Reason: reopen if we want this
