Security groups RBAC cause a major performance degradation

Bug #1830679 reported by Adit Sarfaty on 2019-05-28
22
This bug affects 4 people
Affects Status Importance Assigned to Milestone
neutron
High
Miguel Lavalle

Bug Description

On stable/Stein & Train, a setup with about 6000 security groups of different tenants.
Using admin user, getting all security groups with GET /v2.0/security-groups HTTP/1.1 takes about 70 seconds.
Using the credentials of one of the tenants, who has only 1 security groups takes about 800 seconds.

Looking at the mysql DB logs reveals lots of RBAC related queries during thoee 800 seconds.
Tried to revert the RBAC PATCH https://review.opendev.org/#/c/635311/ that is a partial fix of https://bugs.launchpad.net/neutron/+bug/1817119 , and it solved the issue completely.
Now it takes less than a seconds to get security groups of a tenant.

tags: added: api loadimpact
Akihiro Motoki (amotoki) on 2019-05-29
Changed in neutron:
importance: Undecided → High
Akihiro Motoki (amotoki) wrote :

RBAC queries for security groups are joined as "subquery" and I guess this leads to a lot of queries when listing security groups for a tenant. We need to check how we can improve the RBAC performance.

Boden R (boden) wrote :

Slawek / Akihiro - what is the plan for this one?
Are we open to reverting the RBAC patch while we figure out the proper approach for this code?
As noted in the description of this bug, the impacts are significant.
Please feel free to ping me if there's something I can do to help this one along.

Slawek Kaplonski (slaweq) wrote :

Patch to revert is quite big, it provides implementation for new RFE. Also in addition to this patch there is API change in neutron-lib https://review.opendev.org/#/c/635313/ and neutron-tempest-plugin change: https://review.opendev.org/#/c/635312/ which we should probably also revert (at least this neutron-tempest-plugin change).

Do You have cycles to check how to improve this to maybe not revert this patch? If not, I guess we will have to revert it but IMHO this should be done only if we will not have any other solution.

Boden R (boden) wrote :

Not sure I have the bandwidth to redesign the implementation considering I have little knowledge of it at this point, but if there are smaller things I can help with I can spare a few (not many) cycles. It appears the original author of the work is still active; has anyone talked to Doug about this?

I regards to a revert;
- I don't see any of the neutron-lib changes being impactful if we did revert. Worst thing is that the API ref will still have the info for this extension when it's in fact reverted.
- Yeah I guess the tempest patch may also need reverting, unless we can just disable those tests for now or something.

Miguel Lavalle (minsel) on 2019-06-10
Changed in neutron:
assignee: nobody → Miguel Lavalle (minsel)
Boden R (boden) wrote :

Miguel - are you thinking we can revert this code until we can come up with a more performant solution??
If so I don't mind working on the revert patches; just let me know.

Changed in neutron:
assignee: Miguel Lavalle (minsel) → Boden R (boden)
status: New → In Progress
Boden R (boden) wrote :

I proposed a revert in neutron since it doesn't seem like there's going to be a rework/solution anytime soon.
IMO reverting is a last resort, but in this case we are talking 800x perf degradation for listing tenant SGs (as per bug description) so I think it's warranted.

Change abandoned by Miguel Lavalle (<email address hidden>) on branch: master
Review: https://review.opendev.org/665515
Reason: Test objective was achieved

Miguel Lavalle (minsel) wrote :
Download full text (5.4 KiB)

I am able to replicate this bug with a development system setup with master branch. Executing the following python script, I created 6000+ security groups in my development system (showing only relevant part):

secgroups_ids = []
projects_ids = []
secgroups_in_project = 0
created_secgroups_count = 0
created_projects = 0
try:
    while created_secgroups_count < 6000:
        if secgroups_in_project <= 0:
            project = keystone.projects.create(
                name='test-project-%s' % created_projects,
                domain=login_data['user_domain_id'])
            secgroups_in_project = random.randint(1, 7)
            projects_ids.append(project.id)
            new_project = True
            created_projects += 1
        body = {'security_group': {
                    'name': 'test-secgroup-%s' % created_secgroups_count,
                    'description': ('Test security group %s' %
                                    created_secgroups_count),
                    'project_id': project.id}}
        secgroup = neutron.create_security_group(body=body)
        secgroups_in_project -= 1
        secgroups_ids.append(secgroup['security_group']['id'])
        created_secgroups_count += 1
        if new_project:
            new_project = False
            default_secgroup = neutron.list_security_groups(
                name='default', project_id=project.id)['security_groups'][0]
            secgroups_ids.append(default_secgroup['id'])
            created_secgroups_count += 1
        print(created_secgroups_count)
    import pdb
    pdb.set_trace()
finally:
    for secgroup_id in secgroups_ids:
        neutron.delete_security_group(secgroup_id)
    for project_id in projects_ids:
        keystone.projects.delete(project_id)

After creating the security groups (and leaving above script stopped at the pdb.set_trace() line), I execute the following bash script:

source ~/demo-openrc
start_time="$(date -u +%s)"
openstack security group list
end_time="$(date -u +%s)"
elapsed="$(($end_time-$start_time))"
echo "Total of $elapsed seconds elapsed for process"

One of the executions produces the following output (all runs showed about the same elapsed time):

+--------------------------------------+---------+------------------------+----------------------------------+------+
| ID | Name | Description | Project | Tags |
+--------------------------------------+---------+------------------------+----------------------------------+------+
| 58d172f3-b1ce-4592-a954-7151f9e7708e | default | Default security group | 29b2f9ab6dc04e0ab8ab7e48c8baaa8d | [] |
+--------------------------------------+---------+------------------------+----------------------------------+------+
Total of 525 seconds elapsed for process

However, executing the following python script:

import time

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from neutron.db.models import securitygroup

engine = create_engine(
    'mysql+pymysql://root:devstack@127.0.0.1/neutron?charset=utf8',
    echo=False)
session = sessionmaker()
session.configure(bind=engine)
s = session()
start = time.t...

Read more...

Fix proposed to branch: master
Review: https://review.opendev.org/665566

Changed in neutron:
assignee: Boden R (boden) → Miguel Lavalle (minsel)
Miguel Lavalle (minsel) wrote :

After deploying this change https://review.opendev.org/#/c/665566 in my development system, I executed the bash script mentioned in note #9 and this is the result:

+--------------------------------------+---------+------------------------+----------------------------------+------+
| ID | Name | Description | Project | Tags |
+--------------------------------------+---------+------------------------+----------------------------------+------+
| 58d172f3-b1ce-4592-a954-7151f9e7708e | default | Default security group | 29b2f9ab6dc04e0ab8ab7e48c8baaa8d | [] |
+--------------------------------------+---------+------------------------+----------------------------------+------+
Total of 50 seconds elapsed for process

We haven't returned to the performance before https://review.opendev.org/#/c/635311, but we are moving in the right direction. More optimization work forthcoming

Miguel Lavalle (minsel) wrote :

Added 3 LOG calls to https://review.opendev.org/#/c/665566, 2 right before and after https://github.com/openstack/neutron/blob/90c975292bdc1b9db97cf6ae40a6e6da9dbfc323/neutron/objects/rbac_db.py#L110-L112 and one right before returning in L118. If I execute with a non admin user with only one security group:

openstack security group list

the elapsed time is 49 seconds. Looking at the neutron server log file, 47 seconds were spent in the query in L110-L112, while the loop in L114-L117 took 2 seconds:

Jun 17 14:45:56 allinone neutron-server[17621]: DEBUG neutron.objects.rbac_db [None req-5494f56d-07fd-4753-aeaa-8f96acbefcfb demo demo] mlavalle {{(pid=17631) get_objects /opt/stack/neutron/neutron/objects/rbac_db.py:116}}
Jun 17 14:46:43 allinone neutron-server[17621]: DEBUG neutron.objects.rbac_db [None req-5494f56d-07fd-4753-aeaa-8f96acbefcfb demo demo] mlavalle {{(pid=17631) get_objects /opt/stack/neutron/neutron/objects/rbac_db.py:121}}
Jun 17 14:46:43 allinone neutron-server[17621]: DEBUG neutron.objects.rbac_db [None req-5494f56d-07fd-4753-aeaa-8f96acbefcfb demo demo] mlavalle {{(pid=17631) get_objects /opt/stack/neutron/neutron/objects/rbac_db.py:126}}

And we know from note #9 above that querying all the security groups directly with SQLAlchemy takes less than a second. This point to OVO code being the next bottleneck to tackle.

Change abandoned by boden (<email address hidden>) on branch: master
Review: https://review.opendev.org/665447
Reason: There's work towards a fix in https://review.opendev.org/#/c/665566

Miguel Lavalle (minsel) wrote :

Going one level deeper and adding LOG debug statements to https://github.com/openstack/neutron/blob/b7b399f16e351a708e9e387dfcb32b2845a6d37b/neutron/objects/base.py#L617-L643, we can see that the DB query itself (lines 639-640) takes less than a second, while looping over the data retrieved (lines 642-643) to build the OVOs takes ~ 50 seconds.

Miguel Lavalle (minsel) wrote :

Added cProfile calls to profile the query in these lines: https://github.com/openstack/neutron/blob/master/neutron/objects/rbac_db.py#L110-L112. The 10 calls that with the highest cumulative CPU time are:

         53354917 function calls (51922841 primitive calls) in 71.184 seconds

   Ordered by: cumulative time
   List reduced from 2111 to 10 due to restriction <10>

   ncalls tottime percall cumtime percall filename:lineno(function)
20499/6007 0.068 0.000 72.137 0.012 /opt/stack/neutron/neutron/objects/base.py:539(_load_object)
     6007 0.066 0.000 71.713 0.012 /opt/stack/neutron/neutron/objects/securitygroup.py:79(from_db_object)
20499/6009 0.280 0.000 63.881 0.011 /opt/stack/neutron/neutron/objects/base.py:460(from_db_object)
20499/6009 0.425 0.000 60.429 0.010 /opt/stack/neutron/neutron/objects/base.py:800(load_synthetic_db_fields)
      7/1 0.021 0.003 59.218 59.218 /opt/stack/neutron/neutron/objects/base.py:616(get_objects)
12159/11952 0.057 0.000 48.715 0.004 /usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py:3318(__iter__)
     6007 0.024 0.000 38.071 0.006 /usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/dynamic.py:291(__iter__)
157996/103316 0.052 0.000 37.716 0.000 {iter}
18166/17858 0.101 0.000 32.537 0.002 /usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py:3343(_execute_and_instances)
18214/17861 0.064 0.000 31.850 0.002 /usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py:922(execute)

As shown here, /opt/stack/neutron/neutron/objects/base.py:800(load_synthetic_db_fields) has a cumulative time of 60.429. The only synthetic field added to the security group object when enabling RBAC was "shared" https://review.opendev.org/#/c/635311/22/neutron/objects/securitygroup.py@48. So a big portion of the time spent in load_synthetic_db_fields must be due to this field.

The "shared" field is wasn't added as an attribute of security groups in the API when RBACs were enabled: https://github.com/openstack/neutron-lib/blob/master/neutron_lib/api/definitions/rbac_security_groups.py. So perhaps we can remove the "shared" field from the security groups OVO and optimize even further the response time of GET /v2.0/security-groups. Testing that is the next step

Miguel Lavalle (minsel) wrote :

Maybe this is better view of the cProfile stats: http://paste.openstack.org/show/753283/

Reviewed: https://review.opendev.org/665566
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a240c68022d96c8639652cbdf57e707e68fb2a88
Submitter: Zuul
Branch: master

commit a240c68022d96c8639652cbdf57e707e68fb2a88
Author: Miguel Lavalle <email address hidden>
Date: Sun Jun 16 19:59:03 2019 -0500

    Fix list security groups performance with RBAC

    After change [1], if the system has a high number of security groups
    with no associated RBAC entries, a non admin user owning only one
    security group will experience unacceptable response times when
    listing her security groups.

    Change [1] added methods get_object and get_objects to class
    RbacNeutronDbObjectMixin in neutron.objects.rbac_db, which retrieve with
    and admin context all the objects (networks, subnets or security groups)
    in the DB and then decide in memory whether the project that made the
    query has access to them or not, based on their associated RBAC
    policies. This change proposes to remove those methods and revert to
    their counterparts in NeutronDbObject (neutron.objects.base), which use
    a DB query scoped to the project to retrieve the objects based on their
    associated RBAC policies by calling [2]. In this way, the potential
    number of objects that are retrieved from the DB and that have to be
    converted to OVOs is greatly reduced, improving significantly the
    response time to the user.

    [1] https://review.opendev.org/#/c/635311
    [2] https://github.com/openstack/neutron-lib/blob/7a58374fde64fdc14e327940dde6bea4a8a39345/neutron_lib/db/model_query.py#L100

    Change-Id: Idd303778d83089da8fbeff40e3dda2bd19008d8e
    Closes-Bug: #1830679

Changed in neutron:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/670075
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=73e3f7d281ce606f25d4668b98a7002de34bf3c9
Submitter: Zuul
Branch: stable/stein

commit 73e3f7d281ce606f25d4668b98a7002de34bf3c9
Author: Miguel Lavalle <email address hidden>
Date: Sun Jun 16 19:59:03 2019 -0500

    Fix list security groups performance with RBAC

    After change [1], if the system has a high number of security groups
    with no associated RBAC entries, a non admin user owning only one
    security group will experience unacceptable response times when
    listing her security groups.

    Change [1] added methods get_object and get_objects to class
    RbacNeutronDbObjectMixin in neutron.objects.rbac_db, which retrieve with
    and admin context all the objects (networks, subnets or security groups)
    in the DB and then decide in memory whether the project that made the
    query has access to them or not, based on their associated RBAC
    policies. This change proposes to remove those methods and revert to
    their counterparts in NeutronDbObject (neutron.objects.base), which use
    a DB query scoped to the project to retrieve the objects based on their
    associated RBAC policies by calling [2]. In this way, the potential
    number of objects that are retrieved from the DB and that have to be
    converted to OVOs is greatly reduced, improving significantly the
    response time to the user.

    [1] https://review.opendev.org/#/c/635311
    [2] https://github.com/openstack/neutron-lib/blob/7a58374fde64fdc14e327940dde6bea4a8a39345/neutron_lib/db/model_query.py#L100

    Change-Id: Idd303778d83089da8fbeff40e3dda2bd19008d8e
    Closes-Bug: #1830679
    (cherry picked from commit a240c68022d96c8639652cbdf57e707e68fb2a88)

tags: added: in-stable-stein
tags: added: neutron-proactive-backport-potential

This issue was fixed in the openstack/neutron 15.0.0.0b1 development milestone.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers