Network creation and Router Creation times degrade with large number of instances

Bug #1497396 reported by Uday
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
neutron   Expired   Medium       Unassigned

Bug Description

We are trying to analyze why the creation of routers and networks degrades when a large number of them already exist. Running cProfile on the L3 agent indicated that the ensure_namespace function slows down when a large number of namespaces are present.

Looking through the code, all the namespaces are listed with the "ip netns list" command and then compared against the one that is of interest. This scales badly, since the number of comparisons grows with the number of instances.

An alternative way to achieve the same result could be to check for the desired namespace directly ("ls /var/run/netns/qrouter-<id>"), or to run a command inside the namespace (maybe the date command?) and check the response.

Either method described above would have a constant time for execution, rather than the linear time as seen presently.

Thanks,
-Uday
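
For illustration, the two approaches described in this report can be sketched side by side. This is a minimal sketch, not neutron's actual code: the function names and the NETNS_DIR constant are hypothetical, and the second function assumes that each namespace appears as an entry under /var/run/netns, as the report suggests.

```python
import os

# Path taken from the report; assumed to be where namespaces are exposed.
NETNS_DIR = "/var/run/netns"


def exists_by_scan(ip_netns_list_output, name):
    """List-and-compare: check the name against every listed namespace."""
    for line in ip_netns_list_output.split("\n"):
        if name == line.strip():
            return True
    return False


def exists_by_path(name, netns_dir=NETNS_DIR):
    """Direct lookup: test only the entry of interest."""
    return os.path.exists(os.path.join(netns_dir, name))
```

The first function does work proportional to the number of namespaces in the listing, while the second leaves the lookup of a single entry to the filesystem.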

Revision history for this message
Uday (nagaraj) wrote :

Can someone comment on whether this modification can be done? Presently we need to create thousands of router and network namespace instances, and ensure_namespace degrades badly since it has to perform a very large number of comparisons.

Thanks,
-Uday

Revision history for this message
John Kasperski (jckasper) wrote :

I'll take a look at this bug.

Changed in neutron:
assignee: nobody → John Kasperski (jckasper)
Changed in neutron:
status: New → In Progress
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Please don't set the status to "In progress". This will be done automatically by gerrit when you post a patch on review.

Changed in neutron:
importance: Undecided → Medium
status: In Progress → Confirmed
Revision history for this message
John Kasperski (jckasper) wrote :

The performance of "ip netns list" with 5 namespaces vs. 3000 namespaces is basically the same. The "for" loop logic in neutron.agent.linux.ip_lib.IpNetnsCommand.exists method can be changed from:

        for line in output.split('\n'):
            if name == line.strip():
                return True
to
        if name in output:
            return True

That will help some.
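
One caveat with the substring form is that it is not equivalent to the per-line comparison: it also matches a name that is only part of a longer listed name. The sketch below uses made-up namespace names and hypothetical function names to show the difference. In practice the IDs are fixed-length, so such collisions may be unlikely.

```python
def exists_exact(output, name):
    """Per-line comparison, as in the original loop."""
    return any(name == line.strip() for line in output.split("\n"))


def exists_substring(output, name):
    """Substring test, as in the proposed shortcut."""
    return name in output


# Made-up listing; "qrouter-abc" is only a prefix of a listed name.
output = "qrouter-abc123\nqdhcp-def456\n"
print(exists_exact(output, "qrouter-abc"))      # False
print(exists_substring(output, "qrouter-abc"))  # True
```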

Revision history for this message
John Kasperski (jckasper) wrote :

The suggestion to run a command in the IP namespace to verify that the namespace exists looks like it would have a negative impact on performance. Running a command in a namespace takes roughly 10x longer than grabbing the list of all the namespaces.

Revision history for this message
John Kasperski (jckasper) wrote :

The best-performing solution for this bug is to simply check for the existence of /var/run/netns/NAMESPACE. It appears that /var/run/netns is standard across Linux implementations AND that it can be checked without having to switch to root.

Revision history for this message
Ryan Moats (rmoats) wrote :

jckasper: are you spinning a patch to do this?

Revision history for this message
John Kasperski (jckasper) wrote :

Patch for comment #6 (and updated test cases) will be checked in shortly.

Revision history for this message
Uday (nagaraj) wrote :

Hi Jack,

I agree that the listing of a particular namespace would be the most efficient.

I just wanted to point out some numbers on how ensure_namespace degrades. On a controller+network node setup that has 12 cores, here are the times:

1st Router creation: 0.067ms
41st Router creation: 0.363ms

These numbers are from cProfile. Also, cProfile was run while we created subnets and routers, so ensure_namespace would likely have been called for the dhcp namespace too.

Thanks,
-Uday
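
For anyone who wants to collect comparable numbers, here is a rough sketch of isolating the list-and-compare step under cProfile. It is an assumption-laden stand-in rather than the L3 agent: the namespace names are synthetic, exists_by_scan and profile_scan are hypothetical names, and it measures only the Python-side comparison, not the cost of running "ip netns list" itself.

```python
import cProfile
import io
import pstats


def exists_by_scan(output, name):
    """Compare the name against every line of a namespace listing."""
    for line in output.split("\n"):
        if name == line.strip():
            return True
    return False


def profile_scan(num_namespaces, repeats=1000):
    """Profile repeated scans of a synthetic listing and return the report."""
    names = ["qrouter-%06d" % i for i in range(num_namespaces)]
    output = "\n".join(names)
    target = names[-1]  # worst case: the match is on the last line

    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeats):
        exists_by_scan(output, target)
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Restrict the printed report to the function of interest.
    stats.sort_stats("cumulative").print_stats("exists_by_scan")
    return stream.getvalue()


if __name__ == "__main__":
    for count in (5, 3000):
        print(profile_scan(count))
```

Comparing the cumulative time for the two counts gives a feel for how much of the growth comes from the comparison loop alone.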

Revision history for this message
Uday (nagaraj) wrote :

Sorry to call you Jack, I meant John

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/227589

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
John Kasperski (jckasper) wrote :

Uday,

The numbers you stated in #9 are for how much of the code path? Ryan Moats found the following:

   http://ibin.co/2GlgipnbjsSo

The focus of this specific bug is just on ensure_namespace, which does not seem to grow that much as the number of routers is increased. Still, an optimization in this code path will help some.

Revision history for this message
Uday (nagaraj) wrote :

Hi John,

This was for the creation of about 100 router instances.

Thanks,
-Uday

Revision history for this message
Kevin Benton (kevinbenton) wrote :

These kinds of performance bugs are unacceptable without actual numbers showing how much impact they have. How much time is being spent on listing namespaces with 100 routers? Is this multiple seconds?

Also, the reason this is run as root is that the permissions of the namespace directory can be restricted to keep regular users from seeing the namespaces on a node.

Changed in neutron:
status: In Progress → Incomplete
Changed in neutron:
status: Incomplete → In Progress
Revision history for this message
John Kasperski (jckasper) wrote :

@Kevin,
The patch has been re-spun to handle the permission case that you mentioned.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Please provide steps to reproduce. I want to know that this method is a significant factor in router creation time, and it's impossible to tell from the bug report.

Changed in neutron:
status: In Progress → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/230343

Changed in neutron:
status: Incomplete → In Progress
tags: added: l3-ipam-dhcp loadimpact
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/227589
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=81823e86328e62850a89aef9f0b609bfc0a6dacd
Submitter: Jenkins
Branch: master

commit 81823e86328e62850a89aef9f0b609bfc0a6dacd
Author: John Kasperski <email address hidden>
Date: Thu Sep 24 18:16:18 2015 -0500

    Improve performance of ensure_namespace

    The ensure_namespace method calls IpNetnsCommand.exists to
    determine if the specified namespace exists or not. This is
    accomplished by listing all namespaces with "ip netns list"
    and then looping through the output to determine if the specified
    namespace was included in the output.

    Research of various Linux operating systems has indicated that
    namespaces are represented as files in /var/run/netns and root
    authority is "typically" not required in order to look at the
    files in this subdirectory.

    The existing configuration option "use_helper_for_ns_read"
    will be used to determine if the root-helper should be used
    to retrieve the list of namespaces. If this configuration option
    is set to False, the native python os.listdir(/var/run/netns)
    will be used.

    Related-Bug: #1311804
    Closes-Bug: #1497396
    Change-Id: I9da627d07d6cbb6e5ef1a921a5f22963317a04e2
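
The switch described in the commit message might look roughly like the sketch below. This is not the merged neutron code: list_namespaces, namespace_exists, and the helper_cmd parameter are hypothetical, "sudo" stands in for neutron's root helper, and the parsing assumes one namespace name per line (some iproute2 versions may append extra fields after the name).

```python
import os
import subprocess

# Path named in the commit message; assumed default location.
NETNS_DIR = "/var/run/netns"


def list_namespaces(use_helper_for_ns_read, netns_dir=NETNS_DIR,
                    helper_cmd=("sudo",)):
    """Return namespace names, via a privileged command or a directory read."""
    if use_helper_for_ns_read:
        # Privileged path: needed when the directory is not world-readable.
        output = subprocess.check_output(
            list(helper_cmd) + ["ip", "netns", "list"],
            universal_newlines=True)
        # Keep only the first token of each non-empty line.
        return [line.split()[0]
                for line in output.splitlines() if line.strip()]
    try:
        # Unprivileged path: read the directory entries directly.
        return os.listdir(netns_dir)
    except FileNotFoundError:
        # The directory may not exist before any namespace is created.
        return []


def namespace_exists(name, use_helper_for_ns_read, netns_dir=NETNS_DIR):
    """Check whether the named namespace is present."""
    return name in list_namespaces(use_helper_for_ns_read, netns_dir)
```

Note that a restricted directory would surface as a PermissionError on the unprivileged path, which is the case the helper option is meant to cover.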

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
Kevin Benton (kevinbenton) wrote :

It's not clear that the patch addresses the title of the bug. Uday, can you confirm that there is no longer a degradation?

Changed in neutron:
status: Fix Committed → Confirmed
status: Confirmed → Incomplete
Revision history for this message
Uday (nagaraj) wrote :

Hi Kevin,

I'll check this patch in the next few days. My setups are a bit tied up at present, and I would like to collect this info on the same setup, running the same tests with and without the patch.

That said, one comment I'd like to make is that ensure_namespace is not the only function that causes router creation time to scale badly. There are plenty of other heavy hitters out there. But ensure_namespace seemed ripe for optimization looking at the way it was implemented.

Thanks,
-Uday

Revision history for this message
Kevin Benton (kevinbenton) wrote :

After talking with Ryan Moats, it seems that the ensure_namespace patch was an optimization on something that already took only ~1% of router creation time. For now I will leave this as unassigned and incomplete until we can get more info about the issue.

Changed in neutron:
assignee: John Kasperski (jckasper) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/230343
Reason: ensure_namespace was taking a trivial time compared to other operations. will revisit it if it turns out to be a bottleneck later

Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b1

This issue was fixed in the openstack/neutron 8.0.0.0b1 development milestone.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired