openvswitch agent plugin should execute batched ovs-vsctl CLI statements on __init__()

Bug #1264608 reported by Jay Pipes on 2013-12-28
52
This bug affects 7 people
Affects Status Importance Assigned to Milestone
neutron
Medium
Unassigned

Bug Description

When restarting neutron-plugin-openvswitch-agent, the plugin goes through a series of initialization calls, many of which call out to ovs-vsctl to set up things like the integration bridge, tun bridge and all the flows.

I've seen this __init__() process take upwards of 2 hours (!) to complete on a small (<100 compute node) zone. Most of the time is spent spawning rootwrap'd calls to ovs-vsctl to set up the various tunnels, flows and bridges.

ovs-vsctl can take a series of action arguments in the same CLI statement, and will perform those actions as an atomic unit. At the very least, the plugin's __init__() function's setup_xxx() subroutines should take advantages of this instead of looping over possibly dozens or hundreds of port and flows, each with a separate subprocess call out to utils.execute('ovs.vsctl...').

Jay Pipes (jaypipes) on 2013-12-28
summary: - openvswitch agent plugin should execute a single ovs-vsctl CLI statement
+ openvswitch agent plugin should execute batched ovs-vsctl CLI statements
on __init__()
Amir Sadoughi (amir-sadoughi) wrote :

This sounds more like a blueprint than a bug. Also, it sounds related to blueprint neutron-agent-soft-restart <https://blueprints.launchpad.net/neutron/+spec/neutron-agent-soft-restart>.

Changed in neutron:
assignee: nobody → Manish Godara (manishatyhoo)
Kyle Mestery (mestery) on 2014-04-29
Changed in neutron:
importance: Undecided → Medium
status: New → Triaged
Wei Wang (damon-devops) wrote :

Hi @manishatyhoo,

The soft-restart blueprint surely will improve VMs' traffic and restart's time, but batched start'll also improve the hard restart's speed.

OVSNeutronAgent use OVSBridge to run "ovs-vsctl" and these slow speed.

Damon

Manish Godara (manishatyhoo) wrote :

@damon-devops: looking at the bp + digging into this bug.

Manish Godara (manishatyhoo) wrote :

it is recommended that we wait for the modular l2 agent architecture to be solidified before working on this. Will talk to bainx about details.

Changed in neutron:
assignee: Manish Godara (manishatyhoo) → nobody
Changed in neutron:
assignee: nobody → meenakshi kaushik (meenakshi-kaushik)

 Greetings, I picked this bug to work on during Upstream Training Vancouver. I was recommended to pick another bug, hence removing my name from the assignee list.

Changed in neutron:
assignee: meenakshi kaushik (meenakshi-kaushik) → nobody
Prakashan Korambath (ppk-o) wrote :

Hello,
I will pick up this assignment during UpStream Training. Thanks.

Prakashan

Changed in neutron:
assignee: nobody → Prakashan Korambath (ppk-o)
Sreekumar S (sreesiv) on 2015-09-23
Changed in neutron:
assignee: Prakashan Korambath (ppk-o) → Sreekumar S (sreesiv)
Sreekumar S (sreesiv) wrote :

I've looked at the neutron-openvswitch-agent and see some scope for improvement, although I am not sure how much perf improvement it may bring in. Some of the vsctl calls cannot be clubbed due to their inter dependencies for data/logic as it is structured in the current agent code.
 Also there is open flow add/mod calls to ofctl which also could be optimized if they can be clubbed. I couldn't find a way to provide multiple commands to ofctl, although I saw an option called 'bundle' || '--bundle' in some of the man pages. Not sure if it serves the same purpose as intended _or_ whether it may introduce some version dependency with OVS tools (_if_ 'bundle'ing can be done).

Request to provide suggestions/pointers on what can be done so that I can investigate more, and may be provide a patch for an early look, and boot up more detailed discussions.

Sreekumar S (sreesiv) wrote :

 I have made few modifications to group some ovs-vsctl ops as transactions.

- ofctl flow adds/mods are still not batched/bundled
- getportofport() calls are still not batched, although this could bring in additional perf improvements. Will do that as next step.
- Does not include test case / unit test code modifications.

This patch is to get some early feedback to see if I am headed in the right direction, and if it is bringing in some perf improvements in 100 (many) node environments. If the results are positive, I can continue working on this to further optimize, and possibly find a way to bundle flows.

I've verified the changes in devstack single node environment only.

Please provide you suggestions & comments.

Ryan Moats (rmoats) on 2015-09-30
tags: added: loadimpact
removed: performance
Ryan Moats (rmoats) wrote :

@sreessiv: rather than attaching a patch in LP, it would be better to submit the patch to gerrit - that way it will get more attention/review.

Sreekumar S (sreesiv) wrote :

Thanks Ryan, I am doing that currently; have merged to master and tested functionally. But some py27 tests have failed, so working to modify those tests and submit as patch on master via gerrit. I'll tag it as a 'Partial-Bug:'.

Fix proposed to branch: master
Review: https://review.openstack.org/239721

Changed in neutron:
status: Triaged → In Progress

Change abandoned by Doug Wiegley (<email address hidden>) on branch: master
Review: https://review.openstack.org/239721
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Needs new owner.

Changed in neutron:
status: In Progress → Incomplete
assignee: Sreekumar S (sreesiv) → nobody
Sreekumar S (sreesiv) wrote :

I am working on resurrecting the old patch which was a WIP, with some new modifications.

Changed in neutron:
assignee: nobody → Sreekumar S (sreesiv)
Eugene Nikanorov (enikanorov) wrote :

This might be of a less importance since we have pythonovs library that does things much more quickly

Sreekumar S (sreesiv) wrote :

Does it do batch execution? I did a master sync yesterday, and was not seeing batching (vsctl or ofctl, and the OpenFlow version is still at 13 and 10). How does the new pythonovs lib help here? Is it already fully integrated?

I believe the benefit of batching will be more obvious when we have many node environments (like 100+ as mentioned in bug), where there is a cross tunneling setup while restarting the agent.

Please let know the reasons why you think this will be less important/relevant!

Eugene Nikanorov (enikanorov) wrote :

batching is a benefit when you call external program. Calling a program (especially rootwrap) is very time consuming.
python-ovs connects to ovs directly.

Sreekumar S (sreesiv) wrote :

I am sorry, I didn't understand your point! May be I've totally lost it, but Can you please point out an example of how it is being used connecting directly?

In my case, I am talking about https://github.com/openstack/neutron/blob/fe55ae8a21232b2215a694253f70ef510599c8cb/neutron/agent/ovsdb/impl_vsctl.py#L45-L71

and batching tunnel creations based on this logic.

Sreekumar S (sreesiv) wrote :

Are you suggesting to use some new library? From what I've read, either we can call the vsctl and ofctl commands and batch it. ofctl bundling only works with OpenFlow v1.4; _or_ we can use ovsdb protocols directly!
I didn't see any reference in the code, which uses ovsdb protocol, and so I wanted to stick to the first approach of batching existing commands and flows.

So are you suggesting to use a new library which implements the ovsdb protocol?

Eugene Nikanorov (enikanorov) wrote :

https://github.com/openstack/neutron/tree/fe55ae8a21232b2215a694253f70ef510599c8cb/neutron/agent/ovsdb/native

While the library itself is python-ovs

OVS agent can be configured to use native library rather than issue ovs-vsctl commands.

Hello Eugene, is that documented somewhere ?

Sreekumar S (sreesiv) wrote :

Thanks Eugene, now I understand where the library is integrated.
But why do you think, batching will not be effective here?

Don't you think
https://github.com/openstack/neutron/blob/0363ec23e744a5cae0f85318031e3acaa9e45d15/neutron/agent/ovsdb/impl_idl.py#L79-L92
this will be much faster if the operations are batched in a txn?

Savings here will not be as high as the rootwrap/vsctl case, but vsctl is still the default
https://github.com/openstack/neutron/blob/0363ec23e744a5cae0f85318031e3acaa9e45d15/etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini#L43-L46

I've read twilson's (bp patch author) blog on this http://blog.otherwiseguy.com/replacing-ovs-vsctl-calls-with-native-ovsdb-in-neutron/

I think, it will definitely be faster in both cases if we batch the operations, especially tunnel creations when there are many nodes.

Sreekumar S (sreesiv) wrote :

Added twilson,

Hi Terry,

I am trying to change the flow of adding bridges, ports, tunnels etc by the agent during startup so that we could batch them. You can see my abandoned patch https://review.openstack.org/#/c/239721/ which I am trying to clean-up & resurrect now with some modifications.
I think this will be beneficial in many node environments.

https://review.openstack.org/#/c/239721/4/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py,unified
Please see the `def tunnel_sync(self):` changes...

Can you please see the comments above and suggest if it is worthwhile doing such a thing as I intend?

Fix proposed to branch: master
Review: https://review.openstack.org/286580

Changed in neutron:
status: Incomplete → In Progress
tags: added: scale
Ryan Moats (rmoats) wrote :
tags: removed: scale
Sreekumar S (sreesiv) wrote :

Some perf number with the latest patch (PS #21) compared to the master code, with and without the native plugin.
I've taken these numbers on a 16 node DevStack setup (1 controller/nw node + 15 compute nodes) all running VMs.

Contrary to what we expect, the gains are a bit higher with the native plugin!

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/286580
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Might be worth resume work on this.

Sreekumar S (sreesiv) wrote :

Please see if the PS https://review.openstack.org/286580 is the right way to mitigate this issue.
I've also posted some perf numbers there, please have a look at them and let me know if it worth continuing to work on this. It's been a while since those numbers were posted, so I am not so sure if they are still valid, but in any case that will give a relative idea about the gains, and also `gain vs complexity` factor.

Change abandoned by Sreekumar S (<email address hidden>) on branch: master
Review: https://review.openstack.org/286580
Reason: Abandoning, as there is not much interest in squeezing perf out of vsctl batching. Native ovsdb interface provides almost comparable performance and also the additional complexity is outweighing the benefits.

Sreekumar S (sreesiv) wrote :

As per Kevin, "With hitless restart and the native interfaces, it's more than fast enough to not bother tenants."
I recommend marking this is as a no fix.

Changed in neutron:
assignee: Sreekumar S (sreesiv) → nobody
Sreekumar S (sreesiv) wrote :

"Won't fix" to be precise.

RIP

Changed in neutron:
status: In Progress → Won't Fix
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers