OVS L2 agent polling is too cpu intensive

Bug #1177973 reported by Maru Newby
This bug affects 11 people
Affects   Status         Importance   Assigned to    Milestone
neutron   Fix Released   Medium       Maru Newby
Havana    Fix Released   Medium       Terry Wilson

Bug Description

On a Devstack-deployed, single-node install, the ovs l2 agent is using an order of magnitude more cpu than any other service. On a nested-virt VM running on a 2.5GHz host, with no VMs provisioned:

 - With L2 agent running, 10-100% cpu usage recorded, averaging ~40%
 - With L2 agent stopped, 0-10% cpu usage recorded

Casual inspection with something like top or htop will show the excessive cpu usage, but won't give a good indication of the culprit. Enabling the CUTIME and CSTIME stats is necessary to highlight the problem, as all the time taken by the agent is in subprocess-invoked polling whose execution time is not directly included in the parent's cpu usage.
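
The accounting gap described above can be reproduced with a minimal Python sketch. It is not taken from the agent: the busy-loop child command is only a stand-in for the shell commands the agent invokes, and `cpu_split_after_child` is a hypothetical name. On Unix, `os.times()` reports the CPU time of waited-for children separately from the process's own time, which is the same split that CUTIME/CSTIME expose.

```python
import os
import subprocess
import sys


def cpu_split_after_child(work_iterations=3_000_000):
    """Run a CPU-bound child and return (parent_delta, child_delta) in seconds."""
    before = os.times()
    # Stand-in for an expensive shell command invoked from a polling loop.
    subprocess.run(
        [sys.executable, "-c", f"sum(i * i for i in range({work_iterations}))"],
        check=True,
    )
    after = os.times()
    # CPU time charged to this process itself (what top shows for the parent).
    parent = (after.user - before.user) + (after.system - before.system)
    # CPU time of children that have exited and been waited for.
    child = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return parent, child


if __name__ == "__main__":
    parent, child = cpu_split_after_child()
    print(f"parent cpu: {parent:.3f}s, reaped-child cpu: {child:.3f}s")
```

The work shows up in the child figure rather than the parent figure, which is why a per-process view without the cumulative columns can hide the cost.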

Tags: ovs
Maru Newby (maru)
description: updated
Revision history for this message
Maru Newby (maru) wrote :

The current polling-based checks for ovs bridge changes would ideally be replaced by something event-based.

Changed in quantum:
status: New → Confirmed
Maru Newby (maru)
Changed in quantum:
assignee: nobody → Maru Newby (maru)
tags: added: ovs
Changed in quantum:
importance: Undecided → Medium
Revision history for this message
Édouard Thuleau (ethuleau) wrote :

We encounter the same problem with the Grizzly release and the OVS plugin.
I set up a node which runs the l2 agent, l3 agent and DHCP agent.
I didn't modify the intervals 'agent_down_time' (5sec) and 'report_interval' (4sec).

When the l2 agent has about 25 interfaces to handle, the polling-based check takes more time than the 'report_interval' interval:

2013-06-03 15:44:21 WARNING [quantum.openstack.common.loopingcall] task run outlasted interval by 0.80157 sec

And the Quantum server thinks that the agent is down because the state report takes longer than the 'agent_down_time' interval.

It's possible to increase the interval values to work around the problem, but as the number of interfaces grows these intervals are likely to be exceeded again.
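
The warning quoted above comes from a fixed-interval loop whose task took longer than its interval. A minimal sketch of that check follows. It is not the `loopingcall` implementation, and `run_fixed_interval` and its parameters are hypothetical names chosen for illustration.

```python
import time


def run_fixed_interval(task, interval, iterations):
    """Call task() once per `interval` seconds; return the overruns in seconds.

    An overrun is recorded whenever a single run takes longer than the
    interval, which is the situation the quoted log line reports.
    """
    overruns = []
    for _ in range(iterations):
        start = time.monotonic()
        task()
        elapsed = time.monotonic() - start
        delay = interval - elapsed
        if delay < 0:
            # The task used up the whole interval and more: no time to sleep.
            overruns.append(-delay)
            print(f"task run outlasted interval by {-delay:.5f} sec")
        else:
            time.sleep(delay)
    return overruns
```

When every run overruns, reports are sent less often than `report_interval` intends, which is how a busy agent can end up looking dead to the server.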

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to quantum (master)

Fix proposed to branch: master
Review: https://review.openstack.org/34762

Changed in neutron:
assignee: Maru Newby (maru) → Francois Eleouet (fanchon)
status: Confirmed → In Progress
Changed in neutron:
assignee: Francois Eleouet (fanchon) → nobody
Revision history for this message
Julian Sternberg (jules-i) wrote :

Any solution for this yet? Was the patch abandoned?

Maru Newby (maru)
Changed in neutron:
assignee: nobody → Maru Newby (maru)
milestone: none → havana-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/43256

Revision history for this message
Maru Newby (maru) wrote :

My intention is to provide a new event-based approach that will not be turned on by default so as to minimize the potential for breakage leading up to the Havana release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/43256
Committed: http://github.com/openstack/neutron/commit/1f9b4e77d93b165a0aeeaae0de610389debd5198
Submitter: Jenkins
Branch: master

commit 1f9b4e77d93b165a0aeeaae0de610389debd5198
Author: Maru Newby <email address hidden>
Date: Thu Aug 22 07:57:00 2013 +0000

    Minimize ovs l2 agent calls to get_vif_port_set()

    The ovs l2 agent was previously calling get_vif_port_set() on the
    integration bridge once per rpc_loop() iteration and then again in
    the periodic _report_state() call that returns the current device
    count to the neutron service. Since get_vif_port_set() is an
    expensive call (relying on shell commands) and since there
    is minimal risk associated with reporting stats that are a few
    seconds old, this patch caches the device count for reuse by
    _report_state().

    Partial-Bug: 1177973

    Change-Id: Ice73384ed1ba1e97120028cd0a9bff94a62a41a4
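
A minimal sketch of the caching idea in this commit message, assuming a heavily simplified agent. `SketchAgent`, `port_source` and `int_br_device_count` are illustrative names, and the code is not a copy of the merged change.

```python
class SketchAgent:
    """Toy agent: the rpc loop scans ports, state reports reuse the count."""

    def __init__(self, port_source):
        # port_source stands in for an expensive get_vif_port_set() call.
        self._port_source = port_source
        self.int_br_device_count = 0

    def rpc_loop_iteration(self):
        ports = self._port_source()
        # Cache the count so _report_state() need not rescan the bridge.
        self.int_br_device_count = len(ports)
        return ports

    def _report_state(self):
        # The value may be a few seconds stale, which the commit accepts.
        return {"devices": self.int_br_device_count}
```

With this shape, the expensive scan runs once per loop iteration no matter how often state is reported.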

Changed in neutron:
milestone: havana-3 → havana-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/45676

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/45677

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/45678

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I am the one who moved this target to RC1, but I now wonder if the 4 new patches which have been pushed are still targeting Havana, or are meant to be merged in Icehouse.

I am not entirely sure these patches fulfill the criteria needed for acceptance in RC phase; comments welcome.

tags: added: havana-rc-potential
Changed in neutron:
milestone: havana-rc1 → none
Thierry Carrez (ttx)
tags: added: havana-backport-potential
removed: havana-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/45676
Committed: http://github.com/openstack/neutron/commit/acf0209b28e21eed60158967fab77468eb195e7c
Submitter: Jenkins
Branch: master

commit acf0209b28e21eed60158967fab77468eb195e7c
Author: Maru Newby <email address hidden>
Date: Mon Sep 9 01:29:54 2013 -0700

    Add support for managing async processes

    Interacting with a long-running asynchronous process requires the
    use of non-blocking io. This change adds a helper class that can
    launch a long-running process and read stdout and stderr in a
    non-blocking fashion via eventlet.

    This functionality is intended to support monitoring ovsdb via
    a long-running and root-privileged invocation of ovsdb-client.

    The complexity of the system interaction in this patch suggested
    the addition of a functional test that validated actual behaviour.
    The test was added under the neutron/tests/functional path which
    is now included in the testr search path.

    Partial-Bug: #1177973

    Change-Id: I9969e556acecf7a9e77d873371cc2ec2647be011
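
A rough sketch of the idea, using the standard library's threads and queues rather than eventlet (a deliberate substitution to keep the example dependency-free). `AsyncLineReader` is a hypothetical name and not the helper class added by this change, and the sketch reads stdout only.

```python
import queue
import subprocess
import sys
import threading


class AsyncLineReader:
    """Launch a process and collect its stdout lines without blocking the caller."""

    def __init__(self, cmd):
        self._lines = queue.Queue()
        self._proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True
        )
        self._thread = threading.Thread(target=self._pump, daemon=True)
        self._thread.start()

    def _pump(self):
        # Runs in a background thread; it blocks on the pipe so callers need not.
        for line in self._proc.stdout:
            self._lines.put(line.rstrip("\n"))

    def iter_available(self):
        """Yield whatever lines have arrived so far, then return immediately."""
        while True:
            try:
                yield self._lines.get_nowait()
            except queue.Empty:
                return

    def wait(self, timeout=None):
        """Wait for the process to exit and for the reader thread to drain the pipe."""
        code = self._proc.wait(timeout=timeout)
        self._thread.join(timeout=timeout)
        return code
```

A caller would typically invoke `iter_available()` from its own loop and act only when lines are present, instead of blocking on a read.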

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/45677
Committed: http://github.com/openstack/neutron/commit/010bd1f392e67a6fcd276593b8c79acfe41d1cc7
Submitter: Jenkins
Branch: master

commit 010bd1f392e67a6fcd276593b8c79acfe41d1cc7
Author: Maru Newby <email address hidden>
Date: Mon Sep 9 09:58:12 2013 +0000

    Improve ovs_lib bridge management

    Add the ability to add and remove bridges, check for bridge
    existence, and lookup the bridge associated with a port.

    This change is in support of functional testing for an ovsdb
    monitor.

    Partial-Bug: #1177973

    Change-Id: I419923d8d77983997cd347fcf063b0bc367c0bbc
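
For orientation, the operations listed in this commit message map onto `ovs-vsctl` subcommands. The sketch below only builds argument lists and executes nothing. The function names are hypothetical and do not mirror the `ovs_lib` API.

```python
def add_bridge_cmd(name):
    # --may-exist turns the command into a no-op if the bridge already exists.
    return ["ovs-vsctl", "--may-exist", "add-br", name]


def delete_bridge_cmd(name):
    # --if-exists avoids an error when the bridge is already gone.
    return ["ovs-vsctl", "--if-exists", "del-br", name]


def bridge_exists_cmd(name):
    # br-exists reports its answer through the exit status, not through output.
    return ["ovs-vsctl", "br-exists", name]


def bridge_for_port_cmd(port):
    # port-to-br prints the name of the bridge that contains the port.
    return ["ovs-vsctl", "port-to-br", port]
```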

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/45678
Committed: http://github.com/openstack/neutron/commit/cb0df591a9508e863ad5d5d71190eca349dc551f
Submitter: Jenkins
Branch: master

commit cb0df591a9508e863ad5d5d71190eca349dc551f
Author: Maru Newby <email address hidden>
Date: Mon Sep 9 10:06:49 2013 +0000

    Add the option to minimize ovs l2 polling

    This change adds the ability to monitor the local ovsdb for
    interface changes so that the l2 agent can avoid unnecessary
    polling. Minimal changes are made to the agent so the risk
    of breakage should be low. Future efforts to make the agent
    entirely event-based may be able to use OvsdbMonitor as a
    starting point.

    By default polling minimization is not done, and can only be
    enabled by setting 'minimize_polling = True' in the ovs
    section of the l2 agent's config file.

    Closes-Bug: #1177973

    Change-Id: I26c035b48a74df2148696869c5a9affae5ab3d27
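
The gating idea can be sketched as follows, under the assumption that a monitor sets a flag whenever ovsdb reports an interface change. `PollingGate` and its methods are hypothetical, and this is not the `OvsdbMonitor` code.

```python
import threading


class PollingGate:
    """Decide whether an rpc loop iteration needs to poll the bridge."""

    def __init__(self, minimize_polling=False):
        self._minimize = minimize_polling
        # Start in the "changed" state so the first iteration always polls.
        self._changed = threading.Event()
        self._changed.set()

    def notify_change(self):
        """Called by the monitor when ovsdb reports an interface change."""
        self._changed.set()

    def should_poll(self):
        if not self._minimize:
            return True  # default behaviour: poll on every iteration
        if self._changed.is_set():
            self._changed.clear()
            return True
        return False
```

With `minimize_polling` left at its default, every iteration polls as before, so the only behavioural change is opt-in.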

Aaron Rosen (arosen)
tags: removed: havana-backport-potential
Changed in neutron:
milestone: none → icehouse-1
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Revision history for this message
Peter Feiner (pete5) wrote :

I wonder why this isn't considered havana backport potential anymore? Using minimize_polling has a significant effect on performance, as the bug report describes. Moreover, when too much polling happens, the openvswitch agent starts to thrash, making the host useless for starting new instances. See http://lists.openstack.org/pipermail/openstack-dev/2013-December/021323.html.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I said I would support the backport, but I forgot this patch adds a dependency (psutil I think)

I hope we can find a way to work around this issue.

Regarding the comments on the OVS agent: we've observed improvements, evidenced by a significant reduction in "timeout" failures on the gate since enabling it. However, the agent under load can still be quite a mess. For more info see: https://bugs.launchpad.net/neutron/+bug/1253993

Aaron Rosen (arosen)
tags: added: havana-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/65808

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/65809

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/65810

Alan Pevec (apevec)
tags: removed: havana-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/havana)

Reviewed: https://review.openstack.org/65808
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2e9f5a7a3156ad91c901099f7577179f0c588a9c
Submitter: Jenkins
Branch: stable/havana

commit 2e9f5a7a3156ad91c901099f7577179f0c588a9c
Author: Maru Newby <email address hidden>
Date: Mon Sep 9 01:29:54 2013 -0700

    Add support for managing async processes

    Interacting with a long-running asynchronous process requires the
    use of non-blocking io. This change adds a helper class that can
    launch a long-running process and read stdout and stderr in a
    non-blocking fashion via eventlet.

    This functionality is intended to support monitoring ovsdb via
    a long-running and root-privileged invocation of ovsdb-client.

    The complexity of the system interaction in this patch suggested
    the addition of a functional test that validated actual behaviour.
    The test was added under the neutron/tests/functional path which
    is now included in the testr search path.

    Partial-Bug: #1177973

    Change-Id: I9969e556acecf7a9e77d873371cc2ec2647be011
    (cherry picked from commit acf0209b28e21eed60158967fab77468eb195e7c)

tags: added: in-stable-havana
Revision history for this message
Alan Pevec (apevec) wrote :
tags: removed: in-stable-havana
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/65809
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6dd647ae73a404724f11b422f5fc1bb04e48aecf
Submitter: Jenkins
Branch: stable/havana

commit 6dd647ae73a404724f11b422f5fc1bb04e48aecf
Author: Maru Newby <email address hidden>
Date: Mon Sep 9 09:58:12 2013 +0000

    Improve ovs_lib bridge management

    Add the ability to add and remove bridges, check for bridge
    existence, and lookup the bridge associated with a port.

    This change is in support of functional testing for an ovsdb
    monitor.

    Partial-Bug: #1177973

    Change-Id: I419923d8d77983997cd347fcf063b0bc367c0bbc
    (cherry picked from commit 010bd1f392e67a6fcd276593b8c79acfe41d1cc7)

tags: added: in-stable-havana
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/65810
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=39ef5187b5d0bd9484f8e5773e90ba112fcf06e1
Submitter: Jenkins
Branch: stable/havana

commit 39ef5187b5d0bd9484f8e5773e90ba112fcf06e1
Author: Maru Newby <email address hidden>
Date: Mon Sep 9 10:06:49 2013 +0000

    Add the option to minimize ovs l2 polling

    This change adds the ability to monitor the local ovsdb for
    interface changes so that the l2 agent can avoid unnecessary
    polling. Minimal changes are made to the agent so the risk
    of breakage should be low. Future efforts to make the agent
    entirely event-based may be able to use OvsdbMonitor as a
    starting point.

    By default polling minimization is not done, and can only be
    enabled by setting 'minimize_polling = True' in the ovs
    section of the l2 agent's config file.

    Closes-Bug: #1177973

    Change-Id: I26c035b48a74df2148696869c5a9affae5ab3d27
    (cherry picked from commit cb0df591a9508e863ad5d5d71190eca349dc551f)

Alan Pevec (apevec)
tags: removed: in-stable-havana
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-1 → 2014.1