Bug #2037107 “slow port creation with a large amount of networkr...” : Bugs : neutron

Revision history for this message

Max (maxlamprecht) wrote on 2023-09-22:

#1

sqlquery_and_neutron-profiling.txt Edit (810.3 KiB, text/plain)

Revision history for this message

Max (maxlamprecht) wrote on 2023-09-25:

#2

Hi,

After ralonsoh´s hint to the following patch-series [1], I started to debug these queries and found a few things:

For me it seems like the change [2] from lazy='subquery' to 'joined' makes queries much slower because sqlalchemy always joins the model to networkrbacs with all fields in the SELECT statement.
This creates a huge amount of results which results in huge response times(see first post)
=> query: https://paste.opendev.org/show/821796/

The group by patch[4] that was introduced for project scoped queries will not help for admin/elevated queries like for example when creating a fip.
It seems like that patch is also not helping for project scoped requests. At least in my setup when executing -> query.group_by(model.id)) (like introduced in [4]), I can see the following query which makes no sense in my eyes.
=> query: https://paste.opendev.org/show/821798/
Sqlalchemy is doing a subquery for the group_by which will not group the amount of networkrbacs in the outer query. Not sure why sqlalchemy is doing that :D

[1] https://review.opendev.org/q/topic:bug%252F1918145
[2] https://review.opendev.org/c/openstack/neutron/+/884877/4/neutron/db/models_v2.py
[3] https://docs.sqlalchemy.org/en/14/orm/loading_relationships.html
[4] https://review.opendev.org/c/openstack/neutron-lib/+/884878/1/neutron_lib/db/model_query.py

Steps to reproduce the issue:
1. create a fresh devstack(with master branches)
2. create a few subnets and rbacs
=> for i in $(seq 0 15); do openstack subnet create --network public --subnet-range 10.0.$i.0/24 test$i; done
=> for i in $(seq 0 2500); do openstack network rbac create --target-project $(openstack project create test$i -c id -f value) --action access_as_shared --type network public; done
3. create a fip (--network public)
+-----------------------------------------------------------------------------------------+-----------+
| URL | Seconds |
+-----------------------------------------------------------------------------------------+-----------+
| GET http://10.1.0.70/identity | 0.019584 |
| POST http://10.1.0.70/identity/v3/auth/tokens | 0.188901 |
| GET http://10.1.0.70:9696/networking/v2.0/networks/64a06492-c675-418b-8b6d-9f5a875795e9 | 12.323395 |
| POST http://10.1.0.70:9696/networking/v2.0/floatingips | 48.035449 |
| Total | 60.567329 |
+-----------------------------------------------------------------------------------------+-----------+

Hi,

After ralonsoh´s hint to the following patch-series [1], I started to debug these queries and found a few things:

For me it seems like the change [2] from lazy='subquery' to 'joined' makes queries much slower because sqlalchemy always joins the model to networkrbacs with all fields in the SELECT statement.
This creates a huge amount of results which results in huge response times(see first post)
=> query: https://paste.opendev.org/show/821796/

The group by patch[4] that was introduced for project scoped queries will not help for admin/elevated queries like for example when creating a fip.
It seems like that patch is also not helping for project scoped requests. At least in my setup when executing -> query.group_by(model.id)) (like introduced in [4]), I can see the following query which makes no sense in my eyes.
=> query: https://paste.opendev.org/show/821798/
Sqlalchemy is doing a subquery for the group_by which will not group the amount of networkrbacs in the outer query. Not sure why sqlalchemy is doing that :D
  
[1] https://review.opendev.org/q/topic:bug%252F1918145
[2] https://review.opendev.org/c/openstack/neutron/+/884877/4/neutron/db/models_v2.py
[3] https://docs.sqlalchemy.org/en/14/orm/loading_relationships.html
[4] https://review.opendev.org/c/openstack/neutron-lib/+/884878/1/neutron_lib/db/model_query.py

Steps to reproduce the issue:
1. create a fresh devstack(with master branches)
2. create a few subnets and rbacs
=> for i in $(seq 0 15); do openstack subnet create --network public --subnet-range 10.0.$i.0/24 test$i; done
=> for i in $(seq 0 2500); do openstack  network rbac create --target-project $(openstack project create test$i -c id -f value) --action access_as_shared --type network public; done
3. create a fip (--network public)
+-----------------------------------------------------------------------------------------+-----------+
| URL                                                                                     |   Seconds |
+-----------------------------------------------------------------------------------------+-----------+
| GET http://10.1.0.70/identity                                                           |  0.019584 |
| POST http://10.1.0.70/identity/v3/auth/tokens                                           |  0.188901 |
| GET http://10.1.0.70:9696/networking/v2.0/networks/64a06492-c675-418b-8b6d-9f5a875795e9 | 12.323395 |
| POST http://10.1.0.70:9696/networking/v2.0/floatingips                                  | 48.035449 |
| Total                                                                                   | 60.567329 |
+-----------------------------------------------------------------------------------------+-----------+

Bernard Cafarelli (bcafarel) on 2023-09-25

tags:	added: db loadimpact
Changed in neutron:
status:	New → Confirmed
importance:	Undecided → Medium

Revision history for this message

Max (maxlamprecht) wrote on 2023-09-25:

#3

I also discovered that we use subquery for the subnets in the network class -> [1]
This will load all subnets and filters afterwards for the right network. I would propose to change the lazy option to lazy="selectin". [2]
With that option we will move the filtering to mysql and will benefit from less sqlalchemy orm parsing.
before:
GET http://10.1.0.70:9696/networking/v2.0/networks/64a06492-c675-418b-8b6d-9f5a875795e9 = 2.932295 sec
after:
GET http://10.1.0.70:9696/networking/v2.0/networks/64a06492-c675-418b-8b6d-9f5a875795e9 = 0.886842 sec

I also would like to change the rbac filtering.
Things like [3] are not really performant and should be handeled by mysql.

[1]https://opendev.org/openstack/neutron/src/commit/8e38e57b8000ca6ce9ab84692a9aba6220556a3d/neutron/db/models_v2.py#L320
[2]https://docs.sqlalchemy.org/en/14/orm/loading_relationships.html
[3]https://opendev.org/openstack/neutron/src/commit/8e38e57b8000ca6ce9ab84692a9aba6220556a3d/neutron/db/db_base_plugin_common.py#L348-L356

Revision history for this message

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote on 2023-09-25:

#4

Hello:

I'm using master branch to reproduce this issue, I don't have a Yoga environment. I can't reproduce the issue using the steps provided. I've created up to 2000 RBAC registers and the time spent to create a FIP in the public network is around 5 seconds:
INFO neutron.wsgi [None req-685f2007-eecf-462c-972a-fd1ff52e604a None admin] 192.168.10.100 "POST /networking/v2.0/floatingips HTTP/1.1" status: 201 len: 753 time: 5.1634228

Considering that I'm running standalone devstack deployment in a virtual machine, this time is a content time.

Replying to some questions referred in previous comments:
1) The change of the RBAC registers loading strategy.
As commented in the commit message of [1], this is increasing the complexity of the resource queries because is joining the RBAC registers query to the resource query. Here we need to differentiate two situations:
a) When the query is performed by an admin, the query performed will retrieve the RBAC registers just matching the resource ID (check [3])
b) When the query is performed by a non-admin user, the query will be improved by [4]: the RBAC complex filter query will have a very low cardinality when the number of RBAC registers is high. However, the n-lib group-by clause will reduce the returned query to just the number of resources (number of subnets in your case).

2) The group-by clause won't help in the admin case.
That is replied in the previous question. It is not needed because in the admin case we don't filter the RBAC registers by action or target project. The admin has all permissions and we don't need to filter them (check [3]).

3) Change the network subnet load method to lazy="selectin".
Please check [5] to understand the rationale of this change. When the network is retrieved, the subnets are not by default. That is OK in most of the cases where we don't need the subnet values. If you change the load method, we'll be retrieving the subnet information any time we retrieve a network, reducing the API performance for some network get calls.

4) Change the RBAC filtering.
"Things like [3] are not really performant and should be handeled by mysql." Sorry, I don't understand this statement. We have already improved the RBAC retrieval; at this point the network resource has the RBAC registers stored in the OVO. This check, regardless of the number of RBAC registers, will be very fast.

To confirm that this is not a problem in your deployment, please enable the MySQL slow query register and report any query above 1-2 seconds. I've enabled that in my env and I don't see any. Please check the MySQL version you are running and the SQLAlchemy version installed.

Regards.

[1]https://review.opendev.org/c/openstack/neutron/+/884877
[2]https://github.com/openstack/neutron-lib/blob/bdebe1de3c9e4032a7a40785846065b676b45b64/neutron_lib/db/model_query.py#L123
[3]https://paste.opendev.org/show/bSPFHtntEymdGOzBAQzU/
[4]https://review.opendev.org/q/topic:bug%252F1918145
[5]https://review.opendev.org/c/openstack/neutron/+/409901
[6]https://opendev.org/openstack/neutron/src/commit/8e38e57b8000ca6ce9ab84692a9aba6220556a3d/neutron/db/db_base_plugin_common.py#L348-L356

Hello:

I'm using master branch to reproduce this issue, I don't have a Yoga environment. I can't reproduce the issue using the steps provided. I've created up to 2000 RBAC registers and the time spent to create a FIP in the public network is around 5 seconds:
INFO neutron.wsgi [None req-685f2007-eecf-462c-972a-fd1ff52e604a None admin] 192.168.10.100 "POST /networking/v2.0/floatingips HTTP/1.1" status: 201  len: 753 time: 5.1634228

Considering that I'm running standalone devstack deployment in a virtual machine, this time is a content time.

Replying to some questions referred in previous comments:
1) The change of the RBAC registers loading strategy.
As commented in the commit message of [1], this is increasing the complexity of the resource queries because is joining the RBAC registers query to the resource query. Here we need to differentiate two situations:
a) When the query is performed by an admin, the query performed will retrieve the RBAC registers just matching the resource ID (check [3])
b) When the query is performed by a non-admin user, the query will be improved by [4]: the RBAC complex filter query will have a very low cardinality when the number of RBAC registers is high. However, the n-lib  group-by clause will reduce the returned query to just the number of resources (number of subnets in your case).

2) The group-by clause won't help in the admin case.
That is replied in the previous question. It is not needed because in the admin case we don't filter the RBAC registers by action or target project. The admin has all permissions and we don't need to filter them (check [3]).

3) Change the network subnet load method to lazy="selectin".
Please check [5] to understand the rationale of this change. When the network is retrieved, the subnets are not by default. That is OK in most of the cases where we don't need the subnet values. If you change the load method, we'll be retrieving the subnet information any time we retrieve a network, reducing the API performance for some network get calls.

4) Change the RBAC filtering.
"Things like [3] are not really performant and should be handeled by mysql." Sorry, I don't understand this statement. We have already improved the RBAC retrieval; at this point the network resource has the RBAC registers stored in the OVO. This check, regardless of the number of RBAC registers, will be very fast.

To confirm that this is not a problem in your deployment, please enable the MySQL slow query register and report any query above 1-2 seconds. I've enabled that in my env and I don't see any. Please check the MySQL version you are running and the SQLAlchemy version installed.

Regards.

[1]https://review.opendev.org/c/openstack/neutron/+/884877
[2]https://github.com/openstack/neutron-lib/blob/bdebe1de3c9e4032a7a40785846065b676b45b64/neutron_lib/db/model_query.py#L123
[3]https://paste.opendev.org/show/bSPFHtntEymdGOzBAQzU/
[4]https://review.opendev.org/q/topic:bug%252F1918145
[5]https://review.opendev.org/c/openstack/neutron/+/409901
[6]https://opendev.org/openstack/neutron/src/commit/8e38e57b8000ca6ce9ab84692a9aba6220556a3d/neutron/db/db_base_plugin_common.py#L348-L356

Changed in neutron:
status:	Confirmed → Incomplete

Revision history for this message

Max (maxlamprecht) wrote on 2023-09-26:

#5

Hi,

for the network model the lazy=joined with rbac_entries makes sense because its a 1:N to mapping.
But for subnets it makes no sense to me (N:M mapping). For example:
with lazy="joined"
=> the results will be multiplied(subnets X networkrbacs)
=> 13 subnets X 1270 networkrbacs => ~16000 result rows (in our prod env we currently have 2600rbacs and 20 subnets)
With lazy="subquery"
=> we fetch the 13 subnets and with a seperate query the 1270 rbacs
=> ~1283 rows

And it´s not only that the fip/port creation is slow in this network, just the network show/subnet list takes ~5 seconds (prod env).

The problem is definitely not our database.
We see no slow queries in the db (executed by hand 0.01sec to get the response)
We traced the problem down to the OVO processing(sqlalchemy read rows...)
=> Pls have a look at comment #1 in this thread at the neutron profiling from our prod env (2600rbacs and 20 subnets)
ncalls cumtime
1 14.418 /usr/local/lib/python3.10/site-packages/neutron/db/ipam_backend_mixin.py:677(_ipam_get_subnets)
1 14.418 /usr/local/lib/python3.10/site-packages/neutron/objects/subnet.py:320(find_candidate_subnets)
46324 6.222 /usr/local/lib/python3.10/site-packages/pymysql/connections.py:1266(_read_rowdata_packet)
1 7.627 /usr/local/lib/python3.10/site-packages/sqlalchemy/orm/query.py:2889(_iter)
1 7.627 /usr/local/lib/python3.10/site-packages/sqlalchemy/engine/result.py:1353(all)
1 4.393 /usr/local/lib/python3.10/site-packages/sqlalchemy/orm/loading.py:1108(_populate_full)

Revision history for this message

Max (maxlamprecht) wrote on 2023-09-27:

#6

Additionally the rbac filtering still makes no sense for me.
I checked your query snipped [3] and it proves the problem (https://paste.opendev.org/show/bSPFHtntEymdGOzBAQzU/).

The group by is not working there in the non-admin-query
-> sqlalchemy builds a subquery for the group by (not sure why)
-> left outer join to networkrbacs happens afterwards again due to the orm model
Because of that left outer join we produce again the combinatorial blow-up mentioned in https://bugs.launchpad.net/neutron/+bug/1649317 for (subnet X networkrbacs)

I´m not sure how to get rid of that combinatorial blow-up without changing the the subnet.rbac_entries orm relationship lazy option to anything else than 'joined'.

Revision history for this message

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote on 2023-09-28:

#7

Hello:

The issue with the subnet network RBACs is, precisely, that the RBAC register don't belong to the resource itself, these are parent resource RBACs. The GROUP BY clauses introduced and the lazy="joined" strategy work for any parent resource having these RBACs associated registers.

In the case of the subnet we have the following issues:
* The RBACs don't have direct match with the resource but through the subnet network ID.
* In the IPAM module we are manually creating the SQL query, adding Subnet, SubnetServiceType and other resources (as RBACs) to this query. The GROUP BY clause is not applied here.
* If we move the loading method to lazy="subquery", the GROUP BY clause won't apply when using the OVO to retrieve the resources. For example the "get_subnets" query will return a non-optimized SQL query.

The solution here could be to modify the Subnet model and instead of adding "rbac_entries" to the model view, adding the network view instead, using lazy='subquery' loading strategy. The model query does retrieve the RBAC entries in an optimized way and then inherit the Network.shared.

Regards.

Rodolfo Alonso (rodolfo-alonso-hernandez) on 2023-09-28

Changed in neutron:
status:	Incomplete → Confirmed
assignee:	nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)

Revision history for this message

Max (maxlamprecht) wrote on 2023-10-04:

#8

Hi Rodolfo,

thx for the fast reply :)
I like the proposed solution to just get the subnet.shared field from the network view.

But we should also check the network_rbacs loading for networks.
I checked the code now a few times and it seems that the group_by is also not working in the network view.

I attached a PDB snipped with the sql queries [1].
Here we can see that the group_by does not work
-> It moves the grouping/filtering to a subquery (also the previous WHERE condition)
-> after the subquery there is still a left outer join to networkrbacs
=> 4000 results instead amount of networks

I guess that sqlalchemy does that because we specify lazy='joined' at the orm model and also have the outerjoin in line [2]
-> so sqlalchemy excepts a "full" join to the left table(networkrbacs)

This means that if we use group_by like in the current implementation it will not help.

[1] https://paste.opendev.org/show/bCCkZ01zquynsUm6lUSw/
[2] https://review.opendev.org/c/openstack/neutron-lib/+/884878/1/neutron_lib/db/model_query.py#125

Best regards
Max

Revision history for this message

Max (maxlamprecht) wrote on 2023-10-04:

#9

Addition to #8:

The sqlalchemy docs says:
"The joins produced by joinedload() are anonymously aliased. The criteria by which the join proceeds cannot be modified, nor can the ORM-enabled Select or legacy Query refer to these joins in any way, including ordering" [1]

I guess ordering or grouping is the same here.
Since we use in the orm model lazy=joined which is the joinedload() option the group_by will not work for the joined tables.

Same issue on stackoverflow -> [2]

[1] https://docs.sqlalchemy.org/en/20/orm/queryguide/relationships.html
[2] https://stackoverflow.com/questions/28763372/sqlalchemy-group-by-after-joinloaded

Revision history for this message

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote on 2023-10-04:

#10

Hello Max:

Yes, any RBAC resource is still returning a query result with a very low cardinality. But the main problem (not the only one of course) solved in [1] was the cardinality of the inner query. From your example: [2]. This query was returning millions of results with a big amount of projects and RBACs, all defined from the same project. Please check [3] commit message:

"""
This SQL result has a very low cardinality; that means there are many
duplicated registers. For example, with 10 external network, 1000
projects and 2500 RBAC rules, this query returns 1.4 million rows.
Instead if a "GROUP BY resource_id" (in this case network_id) clause is
added, the number of rows is reduced to 10 (considering this project
has a RBAC per network).
"""

This topic should be handled in other LP bug. Let's focus the current one in the subnet network RBAC optimization.

Regards.

[1]https://review.opendev.org/q/topic:bug%252F1918145
[2]https://paste.opendev.org/show/bPMAw60F7fZfnVYUPnGP/
[3]https://review.opendev.org/c/openstack/neutron/+/884877

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-10-06: Fix proposed to neutron (master)

#11

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/897578

Changed in neutron:
status:	Confirmed → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-12-18: Change abandoned on neutron (master)

#12

Change abandoned by "Rodolfo Alonso <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/897578

neutron

slow port creation with a large amount of networkrbacs

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches