[L2][scale issue] ovs-agent failed to restart

Bug #1813706 reported by LIU Yulong
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

When the number of subnets or security group ports reaches 2000+, the ovs-agent fails to restart and performs a fullsync indefinitely.
This is a subproblem of bug #1813703. For more information, see the summary:
https://bugs.launchpad.net/neutron/+bug/1813703

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Again, I have the same question here. Do you have any logs from the ovs-agent that show what it is doing or where it is held up?
Is it having trouble contacting ovs-vswitchd, and is that causing any problems?
Is there any information to track?

tags: added: ovs
LIU Yulong (dragon889) wrote :

Any of the following bugs can cause this infinite fullsync:
(1) RPC timeout during ovs-agent restart
https://bugs.launchpad.net/neutron/+bug/1813704
(2) local connection to ovs-vswitchd was dropped or timed out
https://bugs.launchpad.net/neutron/+bug/1813705
(7) multiple cookies flows (stale flows) (failed to clean stale flows)
https://bugs.launchpad.net/neutron/+bug/1813712
(8) dump-flows takes a lot of time (failed to dump flows)
https://bugs.launchpad.net/neutron/+bug/1813709

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/638646

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/638646
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8408af4f173a0ffde354599e26c49bf9e17e8bef
Submitter: Zuul
Branch: master

commit 8408af4f173a0ffde354599e26c49bf9e17e8bef
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
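
The idea in this commit message, sending port updates in fixed-size steps instead of one large call, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the actual change: `update_in_steps`, `send_update`, `fake_server`, the `STEP` value, and the shape of the reply dictionary are all hypothetical stand-ins for the real `update_device_list` RPC.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical step size; the default chosen by the real fix is not stated here.
STEP = 20

Result = Dict[str, List[str]]


def update_in_steps(
    send_update: Callable[[List[str], List[str]], Result],
    devices_up: Iterable[str],
    devices_down: Iterable[str],
    step: int = STEP,
) -> Result:
    """Call send_update on slices of at most `step` devices and merge the replies."""
    up, down = list(devices_up), list(devices_down)
    merged: Result = {}
    # Walk both lists in parallel so each call carries a bounded amount of work.
    for i in range(0, max(len(up), len(down)), step):
        reply = send_update(up[i:i + step], down[i:i + step])
        for key, values in reply.items():
            merged.setdefault(key, []).extend(values)
    return merged


def fake_server(up: List[str], down: List[str]) -> Result:
    # Stand-in for the RPC call (assumption); a real server may time out
    # when a single request carries thousands of devices.
    return {"devices_up": list(up), "devices_down": list(down)}


if __name__ == "__main__":
    ports = [f"port-{n}" for n in range(2000)]
    result = update_in_steps(fake_server, ports, [])
    print(len(result["devices_up"]))
```

Each call handles at most `step` devices, so a slow server or database is less likely to exceed the RPC timeout on any single request, at the cost of more round trips.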

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.openstack.org/649682

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/649683

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/649688

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/649691

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/649693

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein)

Reviewed: https://review.openstack.org/649682
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d7d30ea950844f11348fa2827908622e3a8c7dfb
Submitter: Zuul
Branch: stable/stein

commit d7d30ea950844f11348fa2827908622e3a8c7dfb
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/649688
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=39afe0a129b6b979d0b56ec59048a4e16bedf7a9
Submitter: Zuul
Branch: stable/queens

commit 39afe0a129b6b979d0b56ec59048a4e16bedf7a9
Author: LIU Yulong <email address hidden>
Date: Fri Apr 12 18:47:24 2019 +0300

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Conflicts:
     neutron/tests/unit/plugins/ml2/test_rpc.py

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)
    (cherry picked from commit d7d30ea950844f11348fa2827908622e3a8c7dfb)
    (cherry picked from commit 5d705468de1e495639f8b87266ccfc9391ce6135)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky)

Reviewed: https://review.openstack.org/649683
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5d705468de1e495639f8b87266ccfc9391ce6135
Submitter: Zuul
Branch: stable/rocky

commit 5d705468de1e495639f8b87266ccfc9391ce6135
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)
    (cherry picked from commit d7d30ea950844f11348fa2827908622e3a8c7dfb)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike)

Reviewed: https://review.opendev.org/649691
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fa16540d2dd80f836c8fa2a424717899ac64af60
Submitter: Zuul
Branch: stable/pike

commit fa16540d2dd80f836c8fa2a424717899ac64af60
Author: LIU Yulong <email address hidden>
Date: Fri Apr 12 18:47:24 2019 +0300

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Conflicts:
     neutron/tests/unit/plugins/ml2/test_rpc.py

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)
    (cherry picked from commit d7d30ea950844f11348fa2827908622e3a8c7dfb)
    (cherry picked from commit 5d705468de1e495639f8b87266ccfc9391ce6135)

tags: added: in-stable-pike
tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata)

Reviewed: https://review.opendev.org/649693
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bb2734b0d524aef348b69ae02988449f9dd63c56
Submitter: Zuul
Branch: stable/ocata

commit bb2734b0d524aef348b69ae02988449f9dd63c56
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:39:50 2019 +0800

    Do not call update_device_list in large sets

    Ovs-agent can process the ports in large sets, then all
    of these ports will have to update DB status or attributes.
    But neutron server is centralized. It may have to do
    something else, or the database processing can be also
    time-consuming. Because of these, it sometimes returns
    the RPC timeout exception to ovs-agent. And a fullsync
    will be triggered in next rpc loop. The restart time is
    becoming longer and longer.

    Adds a default step to update the port to reduce
    the probability of RPC timeout.

    Related-Bug: #1813703
    Related-Bug: #1813704
    Related-Bug: #1813706
    Related-Bug: #1813707

    Conflicts:
            neutron/common/constants.py
            neutron/agent/rpc.py
            neutron/tests/unit/plugins/ml2/test_rpc.py

    Change-Id: Ie37f4a4869969e235ce16b73cdfcbdc98626823e
    (cherry picked from commit 8408af4f173a0ffde354599e26c49bf9e17e8bef)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/638641
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8e73de8bc42067c0a6796df3cca9938d25ae754e
Submitter: Zuul
Branch: master

commit 8e73de8bc42067c0a6796df3cca9938d25ae754e
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 14:01:08 2019 +0800

    Change ovs-agent iteration log level to INFO

    Operators may want to see how long it takes in the port
    processing procedure since DEBUG log does not enable
    basically in the production environment.

    Related-Bug: #1813703
    Related-Bug: #1813707
    Related-Bug: #1813706
    Related-Bug: #1813709

    Change-Id: I43733546abf5421d0e3f4cd5a959d279e1b89d1e
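
As a rough illustration of what this change is after, the sketch below times one loop iteration and reports it at INFO level so the duration shows up without enabling DEBUG. It is not the agent's actual code: `run_iteration`, the logger name, and the message text are hypothetical.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger("ovs_agent_sketch")  # hypothetical logger name


def run_iteration(iter_num: int, work) -> float:
    """Run one loop iteration and report its duration at INFO level."""
    start = time.monotonic()
    work()
    elapsed = time.monotonic() - start
    # INFO rather than DEBUG, so the timing is visible with production log settings.
    LOG.info("Agent loop iteration %d completed in %.3f s", iter_num, elapsed)
    return elapsed


if __name__ == "__main__":
    run_iteration(0, lambda: time.sleep(0.01))
```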

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/721239

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/721240

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/721242

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/721239
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a10413eb3fa52de2f330b165ac81c9dd47aeda57
Submitter: Zuul
Branch: stable/stein

commit a10413eb3fa52de2f330b165ac81c9dd47aeda57
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 14:01:08 2019 +0800

    Change ovs-agent iteration log level to INFO

    Operators may want to see how long it takes in the port
    processing procedure since DEBUG log does not enable
    basically in the production environment.

    Related-Bug: #1813703
    Related-Bug: #1813707
    Related-Bug: #1813706
    Related-Bug: #1813709

    Conflicts:
        neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

    Change-Id: I43733546abf5421d0e3f4cd5a959d279e1b89d1e
    (cherry picked from commit 8e73de8bc42067c0a6796df3cca9938d25ae754e)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky)

Reviewed: https://review.opendev.org/721240
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=41fe9ff147244eb6d7573235fa12f45ed56be9b3
Submitter: Zuul
Branch: stable/rocky

commit 41fe9ff147244eb6d7573235fa12f45ed56be9b3
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 14:01:08 2019 +0800

    Change ovs-agent iteration log level to INFO

    Operators may want to see how long it takes in the port
    processing procedure since DEBUG log does not enable
    basically in the production environment.

    Related-Bug: #1813703
    Related-Bug: #1813707
    Related-Bug: #1813706
    Related-Bug: #1813709

    Conflicts:
        neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

    Change-Id: I43733546abf5421d0e3f4cd5a959d279e1b89d1e
    (cherry picked from commit 8e73de8bc42067c0a6796df3cca9938d25ae754e)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens)

Reviewed: https://review.opendev.org/721242
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=713ad71c6f4e389f3a224ce2898b09977b5045bb
Submitter: Zuul
Branch: stable/queens

commit 713ad71c6f4e389f3a224ce2898b09977b5045bb
Author: LIU Yulong <email address hidden>
Date: Wed Feb 20 14:01:08 2019 +0800

    Change ovs-agent iteration log level to INFO

    Operators may want to see how long it takes in the port
    processing procedure since DEBUG log does not enable
    basically in the production environment.

    Related-Bug: #1813703
    Related-Bug: #1813707
    Related-Bug: #1813706
    Related-Bug: #1813709

    Conflicts:
        neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py

    Change-Id: I43733546abf5421d0e3f4cd5a959d279e1b89d1e
    (cherry picked from commit 8e73de8bc42067c0a6796df3cca9938d25ae754e)

Changed in neutron:
status: New → Fix Released