[scale issue] regression for security group list between Newton and Rocky+

Bug #1865223 reported by James Denton
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

We recently upgraded an environment from Newton -> Rocky, and experienced a dramatic increase in the amount of time it takes to return a full security group list. For ~8,000 security groups, it takes nearly 75 seconds. This was not observed in Newton.

I was able to replicate this in the following 4 environments:

Newton (virtual machine)
Rocky (baremetal)
Stein (virtual machine)
Train (baremetal)

Command: openstack security group list

Security groups vs. seconds:

Qty     Newton VM   Rocky BM   Stein VM   Train BM
200     4.1         3.7        5.4        5.2
500     5.3         7          11         9.4
1000    7.2         12.4       19.2       16
2000    9.2         24.2       35.3       30.7
3000    12.1        36.5       52         44
4000    16.1        47.2       73         58.9

At this time, we do not know whether this increase in time extends to other 'list' commands at scale; the 'show' commands appear to be fairly performant. The slowdown has a negative impact on user perception, scripts, other dependent resources, etc. The Stein VM is slower than Train, but that could be down to VM vs. bare metal. The Newton environment is also virtual, so I would expect even better performance on bare metal.

Any assistance or insight into what might have changed between releases to cause this would be helpful.

Tags: api
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I think this may be a similar issue to https://bugs.launchpad.net/neutron/+bug/1863201.
Can you try applying https://review.opendev.org/#/c/708695/ in your environment and check whether that helps?

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

I think this is the same problem described in https://bugs.launchpad.net/neutron/+bug/1863201.

As commented in c#9 and c#10 of that bug, the patch [1], if valid, should be backported as far as Queens. If possible, can you validate it?

Regards.

[1] https://review.opendev.org/#/c/708695

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

(I neglected to say "as commented previously by Slawek".)

Revision history for this message
James Denton (james-denton) wrote :

Hi Slawek,

I applied the patch but it did not seem to have any impact. Just FYI: the bulk (99-100%) of the security groups in these test environments are owned by the same project and consist of the default egress rules only.

Command: openstack security group rule list

Before Patch: ~23 seconds
After Patch: ~23 seconds
Total Rules: ~12,000

In that regard, the patch did not have any effect. It did not help with the 'openstack security group list' command, either.

In these test environments, we can simulate the slowdown with a large number of security groups in at least Queens, Rocky, Stein, and Train. Each of the security groups would have the default egress rules only.

# for i in {1..5000}; do openstack security group create test-$i; done

<wait a while>

# time openstack security group list

Times in seconds:

Newton: 18.4
Queens (OSP13): 61.2
Rocky: 55
Stein: 90
Train: 84
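For anyone else trying to reproduce this, here is a rough timing loop in the spirit of the commands above, using openstacksdk; the "mycloud" cloud name, batch sizes, and "perf-test" prefix are arbitrary placeholders, not from the original tests:

# Sketch: create security groups in batches and time a full list after each
# batch. Assumes a configured clouds.yaml entry named "mycloud".
import time
import openstack

conn = openstack.connect(cloud="mycloud")

total = 0
for batch in (200, 300, 500, 1000, 1000, 2000):
    for i in range(batch):
        conn.network.create_security_group(name="perf-test-%d" % (total + i))
    total += batch
    start = time.monotonic()
    groups = list(conn.network.security_groups())
    print("groups=%d listed=%d list_time=%.1fs"
          % (total, len(groups), time.monotonic() - start))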

Thanks again for the assist.

tags: added: api
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

Continuing with http://lists.openstack.org/pipermail/openstack-discuss/2020-March/012931.html.

I've identified a (very good) feature that can help us with this regression. In [1][2], a lazy loader for the SG rules was implemented. That means that if "rules" is not requested via the "fields" parameter of "SecurityGroupDbMixin.get_security_groups()", the OVO does not load the rules, reducing the load time. In an environment with 1000 SGs (each with the default 2 SG rules), the load time goes from 11.7 seconds to 3.7 seconds.

I'll propose the needed patches for OSsdk and OSclient to include this "fields" parameter in the API call made by the "list" OSC command.

Regards.

[1]https://review.opendev.org/#/c/630401/
[2]https://review.opendev.org/#/c/637407/
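For illustration, the effect of limiting the requested fields can be seen with a plain API call; the endpoint, token handling, and exact field list below are assumptions for the sketch, not the proposed patch itself:

# Sketch: list security groups while asking Neutron only for a few fields,
# so the server-side lazy loader never has to fetch the rules.
# NEUTRON_URL and TOKEN are placeholders; in practice use keystoneauth.
import requests

NEUTRON_URL = "http://controller:9696"
TOKEN = "<keystone token>"

resp = requests.get(
    NEUTRON_URL + "/v2.0/security-groups",
    headers={"X-Auth-Token": TOKEN, "Accept": "application/json"},
    # Repeated 'fields' query parameters; 'rules' is deliberately omitted.
    params=[("fields", f) for f in ("id", "name", "description", "project_id", "tags")],
)
resp.raise_for_status()
for sg in resp.json()["security_groups"]:
    print(sg["id"], sg["name"])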

Revision history for this message
James Denton (james-denton) wrote :

Hi Rodolfo,

Happy to report that the SDK and Client patches you provided have drastically reduced the list time:

With 6,000 security groups (2-rules per group):

Pre-patch:  real 1m49.207s
Post-patch: real 0m12.327s

Very impressive. Thank you!

Changed in neutron:
status: New → In Progress
importance: Undecided → Medium
Revision history for this message
James Denton (james-denton) wrote :

Continuing from the mailing list...

In Train, when you perform an 'openstack security group delete <name>', the initial lookup (which treats the name as an ID) fails and the client falls back to filtering by the 'name' parameter (/security-groups?name=<name>). That lookup is quick, and the security group is found and deleted. However, on Rocky/Stein (e.g. client 3.18.1), instead of filtering by name, the client retrieves the full security group list without limiting the fields, which takes a long time.

'openstack security group list' with patch:
REQ: curl -g -i -X GET "http://10.0.236.150:9696/v2.0/security-groups?fields=set%28%5B%27description%27%2C+%27project_id%27%2C+%27id%27%2C+%27tags%27%2C+%27name%27%5D%29" -H "Accept: application/json" -H "User-Agent: openstacksdk/0.27.0 keystoneauth1/3.13.1 python-requests/2.21.0 CPython/2.7.17" -H "X-Auth-Token: {SHA256}3e747da939e8c4befe72d5ca7105971508bd56cdf36208ba6b960d1aee6d19b6"

'openstack security group delete <name>':

Train (notice the name param):
REQ: curl -g -i -X GET http://10.20.0.11:9696/v2.0/security-groups/train-test-1755 -H "User-Agent: openstacksdk/0.36.0 keystoneauth1/3.17.1 python-requests/2.22.0 CPython/3.6.7" -H "X-Auth-Token: {SHA256}bf291d5f12903876fc69151db37d295da961ba684a575e77fb6f4829b55df1bf"
http://10.20.0.11:9696 "GET /v2.0/security-groups/train-test-1755 HTTP/1.1" 404 125
REQ: curl -g -i -X GET "http://10.20.0.11:9696/v2.0/security-groups?name=train-test-1755" -H "Accept: application/json" -H "User-Agent: openstacksdk/0.36.0 keystoneauth1/3.17.1 python-requests/2.22.0 CPython/3.6.7" -H "X-Auth-Token: {SHA256}bf291d5f12903876fc69151db37d295da961ba684a575e77fb6f4829b55df1bf"
http://10.20.0.11:9696 "GET /v2.0/security-groups?name=train-test-1755 HTTP/1.1" 200 1365

Stein & below (notice the lack of any name filter or fields):
REQ: curl -g -i -X GET http://10.0.236.150:9696/v2.0/security-groups/stein-test-5189 -H "User-Agent: openstacksdk/0.27.0 keystoneauth1/3.13.1 python-requests/2.21.0 CPython/2.7.17" -H "X-Auth-Token: {SHA256}e9f87afe851ff5380d8402ee81199c466be9c84fe67ed0302e8b178f33aa1fc2"
http://10.0.236.150:9696 "GET /v2.0/security-groups/stein-test-5189 HTTP/1.1" 404 125
REQ: curl -g -i -X GET http://10.0.236.150:9696/v2.0/security-groups -H "Accept: application/json" -H "User-Agent: openstacksdk/0.27.0 keystoneauth1/3.13.1 python-requests/2.21.0 CPython/2.7.17" -H "X-Auth-Token: {SHA256}e9f87afe851ff5380d8402ee81199c466be9c84fe67ed0302e8b178f33aa1fc2"

<wait a while while it compiles and returns the full list, then the single SG object is deleted>

This can probably be fixed by piggybacking on the existing patch, but if not, I'm happy to open another bug.
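For reference, the fallback the newer client performs can be sketched with a plain API call (endpoint and token are placeholders, and the error handling is deliberately minimal): try the value as an ID first, and on a 404 filter by name on the server side rather than pulling the full list.

# Sketch of the ID-first, then name-filter lookup; not the client's actual code.
import requests

NEUTRON_URL = "http://controller:9696"   # placeholder endpoint
TOKEN = "<keystone token>"               # placeholder token
HEADERS = {"X-Auth-Token": TOKEN, "Accept": "application/json"}

def find_security_group(name_or_id):
    # 1) Treat the value as an ID.
    resp = requests.get(NEUTRON_URL + "/v2.0/security-groups/" + name_or_id,
                        headers=HEADERS)
    if resp.status_code == 200:
        return resp.json()["security_group"]
    # 2) Fall back to a server-side name filter instead of a full, unfiltered list.
    resp = requests.get(NEUTRON_URL + "/v2.0/security-groups",
                        headers=HEADERS, params={"name": name_or_id})
    resp.raise_for_status()
    matches = resp.json()["security_groups"]
    return matches[0] if matches else None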

Thanks!

Revision history for this message
James Denton (james-denton) wrote :

I missed the ML comment:

"Yes, this is a known issue in OSclient: most of the "objects" (networks, subnets, routers, etc) to
be retrieved, can usually can be retrieved by ID and by name. OSclient tries first to use the ID
because is unique and a DB key. Then, instead of asking the server for a unique register (filtered
by the name), the client retrieves the whole list and filters the results.

But this problem was resolved in Train: https://review.opendev.org/#/c/637238/. Can you check, in
openstacksdk, that you have this patch? At least in T."

FWIW, I successfully patched client 3.18.1 (Stein) which sped up the DELETE operation. Thanks again.
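At the SDK level, the call involved is the find helper; with a patched openstacksdk, something along these lines should resolve the name via a filtered query rather than a full list ("mycloud" and the group name are placeholders):

# With a patched openstacksdk, find_security_group() should resolve the name
# via a GET by ID followed by a filtered ?name= query rather than listing
# every group. "mycloud" and the group name below are placeholders.
import openstack

conn = openstack.connect(cloud="mycloud")
sg = conn.network.find_security_group("stein-test-5189", ignore_missing=True)
if sg:
    conn.network.delete_security_group(sg)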

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/python-openstackclient 5.1.0

This issue was fixed in the openstack/python-openstackclient 5.1.0 release.

Revision history for this message
Brian Haley (brian-haley) wrote :

Looks like this was fixed, will close.

Changed in neutron:
status: In Progress → Fix Released