[L2][scale issue] local connection to ovs-vswitchd was dropped or timed out

Bug #1813705 reported by LIU Yulong
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
neutron   Fix Released   Undecided    Unassigned

Bug Description

When the number of subnets or security group ports reaches 2000+, the ovs-agent connection to ovs-vswitchd may be lost, dropped, or time out during restart.
This is a sub-problem of bug #1813703; for more information, please see the summary:
https://bugs.launchpad.net/neutron/+bug/1813703

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Are we running out of some OS resource, and is that why the ovs-agent connection to ovs-vswitchd gets dropped?
If so, could we throttle the number of ports handled in the initial sync when the agent comes up?
We added a similar throttling mechanism for the number of routers handled by the l3-agents; a similar approach that throttles the number of ports on a particular node may be a solution.

Also, if the logs are not telling us anything about the issue, we should probably update them to report the exact problem.
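
The throttling idea above could look roughly like the following. This is only a minimal sketch of batching the initial sync; `iter_batches`, `sync_ports_throttled`, `process_batch`, the batch size and the pause are all hypothetical names and values chosen for illustration, and none of this is neutron's actual implementation.

```python
import time
from itertools import islice


def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def sync_ports_throttled(port_ids, process_batch, batch_size=100, pause=0.0):
    """Process ports in bounded batches instead of all at once.

    process_batch is a hypothetical callback standing in for whatever
    per-port work the agent does during its initial sync.
    """
    processed = 0
    for batch in iter_batches(port_ids, batch_size):
        process_batch(batch)
        processed += len(batch)
        if pause:
            # Leave room for other work (for example keepalives) between batches.
            time.sleep(pause)
    return processed
```

With 2050 ports and `batch_size=500`, the callback would be invoked five times, the last time with the remaining 50 ports.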

tags: added: ovs
Revision history for this message
LIU Yulong (dragon889) wrote :

We have not found the root cause of this issue, but we can see the dropped or broken connections in the ovs-agent log.
IMO, this issue is related to coroutine switching. When the ovs-agent restarts, it does heavy I/O over the message queue to perform the sync. If a large amount of data is being transferred, the connection owned by another coroutine (greenlet) in the ovs-agent can hang for a long time, so the server side (ovs-vswitchd) may close the connection because it has not seen a heartbeat for too long.
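
The hypothesis above (one coroutine starving another that should be sending keepalives) can be illustrated with a small sketch. It uses the standard-library `asyncio` as a stand-in for the agent's greenlets, which is an assumption made only to keep the example self-contained; the function names and timings are hypothetical, and this demonstrates the general effect rather than the confirmed root cause.

```python
import asyncio
import time


async def heartbeat(interval, stamps, stop):
    """Record a timestamp every `interval` seconds until asked to stop."""
    while not stop.is_set():
        stamps.append(time.monotonic())
        await asyncio.sleep(interval)


async def heavy_sync(block_seconds):
    """Stand-in for a long sync step that never yields to other coroutines."""
    await asyncio.sleep(0.05)   # let a few heartbeats go out first
    time.sleep(block_seconds)   # blocking call: nothing else can run meanwhile


async def run_demo(interval, block_seconds):
    stamps = []
    stop = asyncio.Event()
    hb_task = asyncio.create_task(heartbeat(interval, stamps, stop))
    await heavy_sync(block_seconds)
    await asyncio.sleep(interval * 3)  # give the heartbeat a chance to resume
    stop.set()
    await hb_task
    # Longest silence between two consecutive heartbeats.
    return max(b - a for a, b in zip(stamps, stamps[1:]))


def longest_heartbeat_gap(interval=0.01, block_seconds=0.2):
    return asyncio.run(run_demo(interval, block_seconds))
```

Even though the heartbeat is scheduled every 10 ms, the longest gap comes out at roughly the length of the blocking step. A peer that enforces an inactivity timeout shorter than that gap would see the connection as dead.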

Revision history for this message
LIU Yulong (dragon889) wrote :

Paste some logs:
2019-02-20 11:00:41.944 2268474 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Broken pipe
2019-02-20 11:00:41.968 2268474 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Broken pipe)
2019-02-20 11:01:46.027 2268474 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: None
2019-02-20 11:01:47.331 2268474 ERROR neutron.agent.linux.async_process [-] Process [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json] dies due to the error: None

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/638645

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/638645
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c
Submitter: Zuul
Branch: master

commit 64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800

    Change default local ovs connection timeout

    Large number of flows can cause local ovs connection
    timeout. Ultimately getting succeed will be better
    than a retry or fullsync.

    Related-Bug: #1813703
    Related-Bug: #1813705
    Related-Bug: #1813707
    Related-Bug: #1813709

    Change-Id: Ifa0608a7e131df3cad2f7727426720afce641a58
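
The commit message above does not name the options it changes. As a hedged illustration, the sketch below assumes the relevant settings are `of_connect_timeout` and `of_request_timeout` in the `[ovs]` section of the OVS agent configuration; the value 300 and the fallback of 30 are illustrative only and are not taken from this thread. It simply shows how an operator-side override in an INI fragment would be read.

```python
import configparser

# Illustrative fragment; option names and values are assumptions, see above.
SAMPLE_AGENT_CONF = """
[ovs]
of_connect_timeout = 300
of_request_timeout = 300
"""

TIMEOUT_OPTIONS = ("of_connect_timeout", "of_request_timeout")


def read_ovs_timeouts(conf_text, fallback=30):
    """Return the timeout options from the [ovs] section as integers.

    Options (or the whole section) that are missing fall back to `fallback`.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return {
        name: parser.getint("ovs", name, fallback=fallback)
        for name in TIMEOUT_OPTIONS
    }
```

A longer timeout trades slower failure detection for fewer retries and full resyncs, which is the trade-off the commit message argues for on nodes with many flows.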

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.openstack.org/650389

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/650390

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/650392

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/650393

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/650394

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/stein)

Reviewed: https://review.openstack.org/650389
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d7764064d0455634b18cc0931bcc44343913a1c6
Submitter: Zuul
Branch: stable/stein

commit d7764064d0455634b18cc0931bcc44343913a1c6
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800

    Change default local ovs connection timeout

    Large number of flows can cause local ovs connection
    timeout. Ultimately getting succeed will be better
    than a retry or fullsync.

    Related-Bug: #1813703
    Related-Bug: #1813705
    Related-Bug: #1813707
    Related-Bug: #1813709

    Change-Id: Ifa0608a7e131df3cad2f7727426720afce641a58
    (cherry picked from commit 64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c)

tags: added: in-stable-stein
tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/rocky)

Reviewed: https://review.openstack.org/650390
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=26a9765afb917901ca40e3117ff092774823ada2
Submitter: Zuul
Branch: stable/rocky

commit 26a9765afb917901ca40e3117ff092774823ada2
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800

    Change default local ovs connection timeout

    Large number of flows can cause local ovs connection
    timeout. Ultimately getting succeed will be better
    than a retry or fullsync.

    Related-Bug: #1813703
    Related-Bug: #1813705
    Related-Bug: #1813707
    Related-Bug: #1813709

    Change-Id: Ifa0608a7e131df3cad2f7727426720afce641a58
    (cherry picked from commit 64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/650392
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=df4e0a5394dff4cc176096abc64079d2c43fa9e7
Submitter: Zuul
Branch: stable/queens

commit df4e0a5394dff4cc176096abc64079d2c43fa9e7
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800

    Change default local ovs connection timeout

    Large number of flows can cause local ovs connection
    timeout. Ultimately getting succeed will be better
    than a retry or fullsync.

    Related-Bug: #1813703
    Related-Bug: #1813705
    Related-Bug: #1813707
    Related-Bug: #1813709

    Change-Id: Ifa0608a7e131df3cad2f7727426720afce641a58
    (cherry picked from commit 64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/650394
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7a4bc6e43fb7274b940fb88f13011821e283b3bb
Submitter: Zuul
Branch: stable/ocata

commit 7a4bc6e43fb7274b940fb88f13011821e283b3bb
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800

    Change default local ovs connection timeout

    Large number of flows can cause local ovs connection
    timeout. Ultimately getting succeed will be better
    than a retry or fullsync.

    Related-Bug: #1813703
    Related-Bug: #1813705
    Related-Bug: #1813707
    Related-Bug: #1813709

    Change-Id: Ifa0608a7e131df3cad2f7727426720afce641a58
    (cherry picked from commit 64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/650393
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bb508f9e6051ccb27b6dfda05b5c52b961f7370a
Submitter: Zuul
Branch: stable/pike

commit bb508f9e6051ccb27b6dfda05b5c52b961f7370a
Author: LIU Yulong <email address hidden>
Date: Thu Feb 21 16:34:40 2019 +0800

    Change default local ovs connection timeout

    Large number of flows can cause local ovs connection
    timeout. Ultimately getting succeed will be better
    than a retry or fullsync.

    Related-Bug: #1813703
    Related-Bug: #1813705
    Related-Bug: #1813707
    Related-Bug: #1813709

    Change-Id: Ifa0608a7e131df3cad2f7727426720afce641a58
    (cherry picked from commit 64ea642359e8f8aee2ebe494e037ecdfe8cf1b2c)

tags: added: in-stable-pike
tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
Changed in neutron:
status: New → Fix Released