API response time degradation

Bug #1513782 reported by Andrey Pavlov
24
This bug affects 3 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Undecided
Ihar Hrachyshka

Bug Description

We found response time degradation in list operations for network objects.

Such degradation we found in our rally jobs.
This job works in devstack with 'fake_virt' libvirt driver.
One part of this job creates 200 servers with floating ip and lists them and their objects.

I saw that 18.08.2015 response times was good [1] but 21.08.2015 they became bad [2].
Right now response times still bad [3]...

I suggest that this is a neutron problem because we have several tests that measure different aspects.
For example, listing of regions and images have same times as in the past.
But degradation of addresses` listing is in ~ten times:
was (for 200 servers): 0.719 seconds
now (for 100 servers): 5.039 seconds
subnets: 1.358 vs 5.606
network_interfaces: 1.292 vs 10.298

Also I've asked in mailing list [4] but there was no sensible answer...

[1] http://logs.openstack.org/13/211613/6/experimental/ec2-api-rally-dsvm-fakevirt/fac263e/
[2] http://logs.openstack.org/74/213074/7/experimental/ec2-api-rally-dsvm-fakevirt/91d0675/
[3] http://logs.openstack.org/74/213074/7/experimental/ec2-api-rally-dsvm-fakevirt/91d0675/rally-plot/detailed.txt.gz
[4] http://lists.openstack.org/pipermail/openstack-dev/2015-September/073577.html

description: updated
Revision history for this message
Kyle Mestery (mestery) wrote :

I think we want to fix this one. It's a usability issue when you get to a high number of tenant resources.

Changed in neutron:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Kevin Benton (kevinbenton) wrote :

Can you confirm that it's still degraded? We had a big performance regression with interface listings that was fixed after that time.

Changed in neutron:
status: Triaged → Incomplete
Revision history for this message
Kevin Benton (kevinbenton) wrote :

Here is where it was fixed: https://review.openstack.org/#/c/226581/

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

This patchset was merged more than month ago but slow behavior still happens...

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

@andrey, any hint of where the degraded performance could be coming from?.

Could you help us use rally to bisect and find the commit which introduced this?.

Changed in neutron:
status: Incomplete → New
tags: added: needs-attention
tags: added: liberty-backport-potential
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Added liberty-backport-potential, which will depend on the nature of the performance regression itself, and the final fix.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

I can help find the cause, I just need to know what operations to look at on the Neutron side.

In the report, you have "subnets: 1.358 vs 5.606". Is that just a regular list subnets call? If so, how many subnets?

For "network_interfaces: 1.292 vs 10.298", Neutron doesn't have a 'network_interfaces' endpoint. It does have 'ports' though. Can you confirm that listing ports is slower so we know Nova didn't impact it?

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

here is info from logs:

For list of network interfaces:
2015-11-09 21:23:19.218 3701 DEBUG ec2api.api action: DescribeNetworkInterfaces
2015-11-09 21:23:19.227 3701 DEBUG keystoneclient.session REQ: curl -g -i -X GET http://127.0.0.1:9696/v2.0/ports.json?tenant_id=f0ec4b1aee0249dfa2538595dac6efb8
2015-11-09 21:23:24.947 3701 DEBUG keystoneclient.session RESP: [200] Date: Mon, 09 Nov 2015 21:23:24 GMT Connection: keep-alive Content-Type: application/json; charset=UTF-8 Content-Length: 111881
2015-11-09 21:23:24.970 3701 DEBUG keystoneclient.session REQ: curl -g -i -X GET http://127.0.0.1:9696/v2.0/floatingips.json?tenant_id=f0ec4b1aee0249dfa2538595dac6efb8
2015-11-09 21:23:25.043 3701 DEBUG keystoneclient.session RESP: [200] Date: Mon, 09 Nov 2015 21:23:25 GMT Connection: keep-alive Content-Type: application/json; charset=UTF-8 Content-Length: 35450
2015-11-09 21:23:25.048 3701 DEBUG keystoneclient.session REQ: curl -g -i -X GET http://127.0.0.1:9696/v2.0/security-groups.json?tenant_id=f0ec4b1aee0249dfa2538595dac6efb8
2015-11-09 21:23:25.119 3701 DEBUG keystoneclient.session RESP: [200] Date: Mon, 09 Nov 2015 21:23:25 GMT Connection: keep-alive Content-Type: application/json; charset=UTF-8 Content-Length: 22300
2015-11-09 21:23:25.124 3701 DEBUG keystoneclient.session REQ: curl -g -i -X GET http://127.0.0.1:9696/v2.0/ports.json?tenant_id=f0ec4b1aee0249dfa2538595dac6efb8
2015-11-09 21:23:30.415 3701 DEBUG keystoneclient.session RESP: [200] Date: Mon, 09 Nov 2015 21:23:30 GMT Connection: keep-alive Content-Type: application/json; charset=UTF-8 Content-Length: 111881
2015-11-09 21:23:30.441 3701 DEBUG ec2api.api.apirequest <DescribeNetworkInterfacesResponse ...
...

For list of subnets:
2015-11-09 21:23:12.244 3701 DEBUG ec2api.api action: DescribeSubnets
2015-11-09 21:23:12.254 3701 DEBUG keystoneclient.session REQ: curl -g -i -X GET http://127.0.0.1:9696/v2.0/networks.json?tenant_id=f0ec4b1aee0249dfa2538595dac6efb8
2015-11-09 21:23:12.537 3701 DEBUG keystoneclient.session RESP: [200] Date: Mon, 09 Nov 2015 21:23:12 GMT Connection: keep-alive Content-Type: application/json; charset=UTF-8 Content-Length: 7955
2015-11-09 21:23:12.538 3701 DEBUG keystoneclient.session REQ: curl -g -i -X GET http://127.0.0.1:9696/v2.0/ports.json?tenant_id=f0ec4b1aee0249dfa2538595dac6efb8 -H "User-Agent: python-neutronclient"
2015-11-09 21:23:18.060 3701 DEBUG keystoneclient.session RESP: [200] Date: Mon, 09 Nov 2015 21:23:18 GMT Connection: keep-alive Content-Type: application/json; charset=UTF-8 Content-Length: 111881
2015-11-09 21:23:18.065 3701 DEBUG keystoneclient.session REQ: curl -g -i -X GET http://127.0.0.1:9696/v2.0/subnets.json?tenant_id=f0ec4b1aee0249dfa2538595dac6efb8
2015-11-09 21:23:18.751 3701 DEBUG keystoneclient.session RESP: [200] Date: Mon, 09 Nov 2015 21:23:18 GMT Connection: keep-alive Content-Type: application/json; charset=UTF-8 Content-Length: 14319
2015-11-09 21:23:18.762 3701 DEBUG ec2api.api.apirequest <DescribeSubnetsResponse ...
...

So I confirm that only ports` request now very long.
In this test:
subnets count = 25
ports count = 100

tags: added: loadimpact
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Could it be related to RBAC shared field on networks?

Revision history for this message
Assaf Muller (amuller) wrote :

Some initial info... Trying to reproduce on master:
I created 200 networks, each with 1 subnet, then 200 ports, 1 in each network, all in the same tenant.

neutron port-list, neutron net-list and neutron subnet-list, all returned in constant time (0.5 seconds), regardless if I filter by tenant or not.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

The logs go back to August, the bug report was filed in November. As much as I'd love to kill this performance regression I don't think we can afford to chase the rabbit down the hole.

If the submitter can provide us exact steps to reproduce the issue and environmental condition, that would be highly appreciated. As of now, and as specified in [1], this is Incomplete.

[1] http://docs.openstack.org/developer/neutron/policies/bugs.html#bug-screening-best-practices

Changed in neutron:
status: New → Incomplete
importance: High → Undecided
Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

steps to reproduce:

1) run devstack with fake virt driver

2) create isolated tenant/user. update quotas fro the project.

3) run instances:
for i in range(0, 25):
  # create network
  # from https://github.com/openstack/ec2-api/blob/master/ec2api/api/vpc.py#L44
  create_router
  create_security_group
  # from https://github.com/openstack/ec2-api/blob/master/ec2api/api/subnet.py#L42
  create_network
  create_subnet
  add_interface_router
  add_gateway_router
  update host_routes and gateway_ip for subnet

  # run instances
  run_instances(count=4, network)

for instance in instances:
  assign floating ip to instance

---------------
4) list ports

Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
status: Incomplete → In Progress
Revision history for this message
Kevin Benton (kevinbenton) wrote :

Hi Andrey,

Sorry about the delay getting to this. Can you see if applying this patch helps? https://review.openstack.org/#/c/256609/

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

Hi Kevin,

Our case still shows same behaviour.

I've made my review depending on yours and runs rally tests.
review: https://review.openstack.org/#/c/255829/
times without your fix http://logs.openstack.org/29/255829/2/experimental/ec2-api-rally-dsvm-fakevirt/b4d3eaa/rally-plot/detailed.txt.gz
times with your fix: http://logs.openstack.org/29/255829/7/experimental/ec2-api-rally-dsvm-fakevirt/e7422c8/rally-plot/detailed.txt.gz

I see on describe_subnets and describe_network_interfaces in last table.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/257222

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Ok. Well there are a couple of other patches that deal with performance in master. I will mark them all as partial bugs of this and then when they all merge we can re-evaluate.

I don't know what describe_subnets_ec2api and describe_network_interfaces_ec2api actually do, but it sounds like something that calls Nova. I would much prefer specific rally Neutron tests than Nova tests to rule out regressions in Nova, but we can worry about that if it's still degraded after everything is merged.

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

describe_subnets_ec2api and describe_network_interfaces_ec2api calls ec2-api.
ec2-api emulates Amazon EC2 API and it is a proxy for other services like nova/glance/cinder/neutron.
but these functions call neutron for data.
I've posted in comment #8 which calls they do.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/256609
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f5b950dffecc049df11981dcfb5574121be3eb69
Submitter: Jenkins
Branch: master

commit f5b950dffecc049df11981dcfb5574121be3eb69
Author: Kevin Benton <email address hidden>
Date: Fri Dec 11 09:56:01 2015 -0800

    Separate rbac calculation from _make_network_dict

    When a subnet dict was being created, it was calling
    _make_network_dict to get the 'shared' flag for the
    subnet. The issue with this is that the _make_network_dict
    function would iterate over the subnets on the passed in
    network object, which would trigger a database lookup
    of all of the subnets.

    This patch just separates the 'shared' flag calculation out
    into a separate function that both calls can leverage.

    Change-Id: I2cb766ce1fd8ddcc75209f9e92221a3b77015ea2
    Closes-Bug: #1525295
    Partial-Bug: #1513782

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/256846
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=806e67538fbea49530aa0eedb75694cd8bad439d
Submitter: Jenkins
Branch: master

commit 806e67538fbea49530aa0eedb75694cd8bad439d
Author: Kevin Benton <email address hidden>
Date: Fri Dec 11 10:55:38 2015 -0800

    Batch db segment retrieval

    A net-list operation was calling extend_network_dict_provider for
    each network individually which would result in a database call for
    each network.

    This adds a new call in the manager to extend multiple networks at
    once and then it adds a bulk version of get_network_segments that
    it calls.

    Now 1 net list of any number of networks will only result in 1
    segment DB call.

    Change-Id: I2543b3bdbb178ee4bb8d1288e9a27af1c5c8c8b4
    Closes-Bug: #1525423
    Partial-Bug: #1513782

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/258343

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/258346

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/258346
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=31f184733946b7915f1a1927076deb19e4cbe6f2
Submitter: Jenkins
Branch: stable/liberty

commit 31f184733946b7915f1a1927076deb19e4cbe6f2
Author: Kevin Benton <email address hidden>
Date: Fri Dec 11 09:56:01 2015 -0800

    Separate rbac calculation from _make_network_dict

    When a subnet dict was being created, it was calling
    _make_network_dict to get the 'shared' flag for the
    subnet. The issue with this is that the _make_network_dict
    function would iterate over the subnets on the passed in
    network object, which would trigger a database lookup
    of all of the subnets.

    This patch just separates the 'shared' flag calculation out
    into a separate function that both calls can leverage.

    Change-Id: I2cb766ce1fd8ddcc75209f9e92221a3b77015ea2
    Closes-Bug: #1525295
    Partial-Bug: #1513782
    (cherry picked from commit f5b950dffecc049df11981dcfb5574121be3eb69)

tags: added: in-stable-liberty
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/258343
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7c1fb08045f86e36b4971acdc61f9712d56f0345
Submitter: Jenkins
Branch: stable/liberty

commit 7c1fb08045f86e36b4971acdc61f9712d56f0345
Author: Kevin Benton <email address hidden>
Date: Fri Dec 11 10:55:38 2015 -0800

    Batch db segment retrieval

    A net-list operation was calling extend_network_dict_provider for
    each network individually which would result in a database call for
    each network.

    This adds a new call in the manager to extend multiple networks at
    once and then it adds a bulk version of get_network_segments that
    it calls.

    Now 1 net list of any number of networks will only result in 1
    segment DB call.

    Change-Id: I2543b3bdbb178ee4bb8d1288e9a27af1c5c8c8b4
    Closes-Bug: #1525423
    Partial-Bug: #1513782
    (cherry picked from commit 806e67538fbea49530aa0eedb75694cd8bad439d)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/257222
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4885ef4885de072b0321858a2ae334b708da4bea
Submitter: Jenkins
Branch: master

commit 4885ef4885de072b0321858a2ae334b708da4bea
Author: Kevin Benton <email address hidden>
Date: Mon Dec 14 00:44:16 2015 -0800

    Add tests that constrain db query count

    This patch adds unit tests to ML2 and L3 that ensure that the
    number of DB calls during list operations for ports, networks,
    subnets, routers, and floating IPs remains constant regardless
    of the number of ports.

    These will prevent changes from slipping in that result in
    a separate DB query for each object in a list operation
    (for changes to the extensions used by ML2 and the DVR plugin).

    Related-Bug: #1525295
    Related-Bug: #1513782
    Related-Bug: #1525423
    Related-Bug: #1525740
    Related-Bug: #1526644

    Change-Id: I1958fc7c318bbf73238a3ad5be133fa7800c8290

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

Hi Kevin,

Current run showed that response times still the same - http://logs.openstack.org/41/261541/8/experimental/ec2-api-rally-dsvm-fakevirt/deb3dda/rally-plot/detailed.txt.gz

But this run has direct calls to neutron for resources and it shows that port's listing is the most time-consuming operation.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/267467

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Ihar Hrachyshka (ihar-hrachyshka)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/269892

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/271830

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/267467
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0d5d7c72feaa0a25b799632cefcd873933d2a28b
Submitter: Jenkins
Branch: stable/liberty

commit 0d5d7c72feaa0a25b799632cefcd873933d2a28b
Author: Kevin Benton <email address hidden>
Date: Mon Dec 14 00:44:16 2015 -0800

    Add tests that constrain db query count

    This patch adds unit tests to ML2 and L3 that ensure that the
    number of DB calls during list operations for ports, networks,
    subnets, routers, and floating IPs remains constant regardless
    of the number of ports.

    These will prevent changes from slipping in that result in
    a separate DB query for each object in a list operation
    (for changes to the extensions used by ML2 and the DVR plugin).

    Related-Bug: #1525295
    Related-Bug: #1513782
    Related-Bug: #1525423
    Related-Bug: #1525740
    Related-Bug: #1526644

    Conflicts:
     doc/source/devref/effective_neutron.rst
     neutron/tests/unit/extensions/test_l3.py

    Change-Id: I1958fc7c318bbf73238a3ad5be133fa7800c8290
    (cherry picked from commit 4885ef4885de072b0321858a2ae334b708da4bea)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/269892
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4398f14a9ab177c162d3267b1d2b0c7c50bb82a5
Submitter: Jenkins
Branch: master

commit 4398f14a9ab177c162d3267b1d2b0c7c50bb82a5
Author: Ihar Hrachyshka <email address hidden>
Date: Tue Jan 19 23:10:25 2016 +0100

    Postpone heavy policy check for ports to later

    When a port is validated, we check for the user to be the owner of
    corresponding network, among other things. Sadly, this check requires a
    plugin call to fetch the network, which goes straight into the database.
    Now, if there are multiple ports to validate with current policy, and
    the user is not admin, we fetch the network for each port, f.e. making
    list operation on ports to scale badly.

    To avoid that, we should postpone OwnerCheck (tenant_id) based
    validations that rely on foreign keys, tenant_id:%(network:...)s, to as
    late as possible. It will make policy checks avoid hitting database in
    some cases, like when a port is owned by current user.

    Also, added some unit tests to avoid later regressions:

    DbOperationBoundMixin now passes user context into API calls. It allows
    us to trigger policy engine checks when executing listing operations.

    Change-Id: I99e0c4280b06d8ebab0aa8adc497662c995133ad
    Closes-Bug: #1513782

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/277095

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

bug has fixed by last review.

rally job - [1]

ports listing by neitron was about 5sec
now it 0.203sec

[1] - http://logs.openstack.org/49/276849/2/experimental/ec2-api-rally-dsvm-fakevirt/9360c79/rally-plot/detailed.txt.gz

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/277095
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9cb659199786b95581553318cfdda93649c21c28
Submitter: Jenkins
Branch: stable/liberty

commit 9cb659199786b95581553318cfdda93649c21c28
Author: Ihar Hrachyshka <email address hidden>
Date: Tue Jan 19 23:10:25 2016 +0100

    Postpone heavy policy check for ports to later

    When a port is validated, we check for the user to be the owner of
    corresponding network, among other things. Sadly, this check requires a
    plugin call to fetch the network, which goes straight into the database.
    Now, if there are multiple ports to validate with current policy, and
    the user is not admin, we fetch the network for each port, f.e. making
    list operation on ports to scale badly.

    To avoid that, we should postpone OwnerCheck (tenant_id) based
    validations that rely on foreign keys, tenant_id:%(network:...)s, to as
    late as possible. It will make policy checks avoid hitting database in
    some cases, like when a port is owned by current user.

    Note: the patch excludes regression unit tests to avoid breaking
    external plugin repos that rely on base db test classes private methods
    like _create_network or _create_router.

    See Ic323e148f7d6d333194aa339774afa953b241aa7 for details about possible
    third party breakage.

    Change-Id: I99e0c4280b06d8ebab0aa8adc497662c995133ad
    Closes-Bug: #1513782

Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b3

This issue was fixed in the openstack/neutron 8.0.0.0b3 development milestone.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 7.0.4

This issue was fixed in the openstack/neutron 7.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/271830
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

tags: removed: liberty-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/271830
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3d6d9393c858ee40395295571f9008844127255f
Submitter: Jenkins
Branch: master

commit 3d6d9393c858ee40395295571f9008844127255f
Author: Ihar Hrachyshka <email address hidden>
Date: Sun Jan 24 21:41:14 2016 +0100

    policy: cache extracted parent fields for OwnerCheck

    Parent foreign key extraction requires another database fetch per object
    validates, which does not scale well. To mitigate the issue, cache
    extracted parent key values in a global policy cache dictionary.

    Use oslo.cache to maintain cache. Hardcode expiration timeout for
    policy lookups to 5 secs.

    Change-Id: I6b3d7c96b7487c9bef6d39a28c76fea0721c3098
    Related-Bug: #1513782

tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.