Combinatorial blow-up with the Alchemy strategy lazy='joined'

Bug #1649317 reported by Pierre Crégut on 2016-12-12
This bug affects 2 people
Affects: neutron
Importance: High
Assigned to: Kevin Benton

Bug Description

A regular tenant can create objects that take a very long time to enumerate, because of the strategy the ORM uses to rebuild objects from the different tables in the database.

The attached script can be used to reproduce the problem. It creates a network with several subnetworks (each with several routes and DNS servers), several tags and several RBAC policies, without exceeding any typical quota. Because the network is retrieved each time it is modified, it is hard to run the script to completion in a typical setup.

With the lazy='joined' strategy, a single query retrieves an object together with all its parts, which may be spread over several tables. For example, when one asks for the network list, a complex query is issued that also retrieves subnets, subnetpools, dns agents, etc. The exact query is visible at http://paste.openstack.org/show/592120/

Unfortunately, the lazy='joined' strategy has another effect when the relation between the parent object and the sub-object has a ?-n arity. Rather than returning exactly the rows needed, the single query builds a kind of cross-product of the rows sharing the join keys. For example, a network with 4 tags and 4 subnetworks yields at least 16 rows, one for each combination of tag and subnetwork. Other fields such as RBAC rules, special routes and DNS servers can amplify the problem.
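The row multiplication described above can be reproduced with plain SQL. The following is a minimal sketch using Python's built-in sqlite3 module; the three-table schema and its column names are simplified assumptions for illustration, not the actual neutron schema, and the query is hand-written rather than the one the ORM emits.

```python
import sqlite3

# Hypothetical, simplified schema: one parent table and two 1-n child tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE networks (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE subnets  (id INTEGER PRIMARY KEY, network_id INTEGER, cidr TEXT);
    CREATE TABLE tags     (id INTEGER PRIMARY KEY, network_id INTEGER, tag TEXT);
""")

# One network with 4 subnets and 4 tags, as in the example above.
conn.execute("INSERT INTO networks VALUES (1, 'net1')")
conn.executemany(
    "INSERT INTO subnets (network_id, cidr) VALUES (1, ?)",
    [("10.0.%d.0/24" % i,) for i in range(4)],
)
conn.executemany(
    "INSERT INTO tags (network_id, tag) VALUES (1, ?)",
    [("tag%d" % i,) for i in range(4)],
)

# A single query joining both child tables on the parent key returns one
# row per (subnet, tag) pair instead of 4 + 4 rows.
joined_rows = conn.execute("""
    SELECT n.id, s.cidr, t.tag
    FROM networks n
    LEFT OUTER JOIN subnets s ON s.network_id = n.id
    LEFT OUTER JOIN tags t ON t.network_id = n.id
""").fetchall()

print(len(joined_rows))  # 16
```

Each additional 1-n table joined the same way multiplies the row count again, which is why routes, DNS servers and RBAC entries make the result so much larger.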

It is not clear if the heavy usage of the database server and neutron server could lead to a real denial of service for other users.

Pierre Crégut (pcregut) wrote :

The solution is to apply the lazy='subquery' strategy:
- it solves the combinatorial blowup by querying each table with exactly the ids that are useful for the objects (supplied as a list in the where clause).
- the number of requests is limited by the complexity of the data model (the number of tables involved) not by the number of elements to retrieve (that would be the case with lazy='select').
- testing it against the current solution shows at most 20% slowdown in the worst case (and better behaviour in a lot of cases).
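To make the trade-off in the list above concrete, here is a rough sketch in plain SQL (via Python's sqlite3) that follows the description given in this comment: one query per table, with the parent ids passed in the WHERE clause. The schema and the helper names (make_db, rows_single_join, rows_per_table) are hypothetical, and the SQL that the ORM really emits for lazy='subquery' is not reproduced here.

```python
import sqlite3


def make_db(children):
    """Build one network with `children` subnets and `children` tags."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE networks (id INTEGER PRIMARY KEY);
        CREATE TABLE subnets (id INTEGER PRIMARY KEY, network_id INTEGER);
        CREATE TABLE tags (id INTEGER PRIMARY KEY, network_id INTEGER);
    """)
    conn.execute("INSERT INTO networks (id) VALUES (1)")
    conn.executemany("INSERT INTO subnets (network_id) VALUES (?)", [(1,)] * children)
    conn.executemany("INSERT INTO tags (network_id) VALUES (?)", [(1,)] * children)
    return conn


def rows_single_join(conn):
    """Rows returned when everything is fetched with one joined query."""
    return len(conn.execute("""
        SELECT n.id, s.id, t.id
        FROM networks n
        LEFT JOIN subnets s ON s.network_id = n.id
        LEFT JOIN tags t ON t.network_id = n.id
    """).fetchall())


def rows_per_table(conn):
    """Return (total rows, query count) when each table is queried once."""
    net_ids = [row[0] for row in conn.execute("SELECT id FROM networks")]
    marks = ",".join("?" * len(net_ids))
    total, queries = len(net_ids), 1
    for table in ("subnets", "tags"):
        sql = "SELECT id, network_id FROM %s WHERE network_id IN (%s)" % (table, marks)
        total += len(conn.execute(sql, net_ids).fetchall())
        queries += 1
    return total, queries


for children in (4, 8):
    conn = make_db(children)
    print(children, rows_single_join(conn), rows_per_table(conn))
```

With 4 children per table the joined query returns 16 rows against 9 rows for the per-table approach; with 8 children it is 64 against 17. The query count stays at 3 in both cases because it depends only on the number of tables involved.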

The test case is the listing of 1000 networks having a single subnet, one dns server and a subnet pool. The results are given here: http://paste.openstack.org/show/592122/

Funny you just reported that, Kevin has been playing with this a bit:

https://review.openstack.org/#/c/408143/

Changed in neutron:
status: New → Confirmed
assignee: nobody → Kevin Benton (kevinbenton)
importance: Undecided → Low
Pierre Crégut (pcregut) on 2016-12-12
description: updated
description: updated
Thomas Morin (tmmorin-orange) wrote :

Cc'ing you Kevin, as a follow-up on our brief corridor discussion on the topic in Barcelona.

Kevin Benton (kevinbenton) wrote :

Yeah, this can get pretty bad. I will clean up my patch to switch to subqueries so we pay a higher query count price but it's constant.

Changed in neutron:
importance: Low → High

Fix proposed to branch: master
Review: https://review.openstack.org/409901

Changed in neutron:
status: Confirmed → In Progress
Kevin Benton (kevinbenton) wrote :

I think I've identified the problematic relationships as the subnets on networks and network via subnet in conjunction with rbac_entries. The former two and the latter each impose a penalty on their own, but together they seem to be responsible for the explosion in result size.

Thomas Morin (tmmorin-orange) wrote :

Doesn't it make sense to address the general issue by generalizing lazy='subquery' (at least on ?-n relations), rather than fixing only this specific subcase?

(Or perhaps you intend to do precisely this, but in separate changes?)

Pierre Crégut (pcregut) wrote :

Here are some other relations (on port, router and subnet objects) that may be the root causes for the same problem.

subnet <-> subnetpool <-> subnetpoolprefixes
       <-> ipallocationpools
       <-> dnsnameservers
       <-> subnetroutes
       <-> network_rbacs
       <-> subnetservice_types

port <-> ipallocations
     <-> allowedaddresspairs
     <-> subports
     <-> extra_dhcp_opts
     <-> trunks <-> subports
     <-> securitygroupportbindings
     <-> portdnses

router <-gw-> port (gateway_port)
       <-> routerroutes
       <-> routerl3agentbindings

Kevin Benton (kevinbenton) wrote :

Unfortunately lazy='subquery' is not 1:1 transactionally with the 'joined' approach. 'subquery' reads outside of transactions can result in things like the network relationship on subnet being None because the network was deleted between the subnet read and the subquery (as happened in [1]).

We are going to have to slowly move away from 'joined', but it's not just a search and replace operation so we need to prioritize problematic relationships like joins between top-level objects.

1. https://review.openstack.org/409901
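The race described above can be imitated with two separate reads and a delete in between. This is only a sketch under stated assumptions: it uses Python's sqlite3 with a hypothetical two-table schema, the "concurrent" delete is simulated on the same connection, and the second read is a hand-written query rather than what the ORM issues.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE networks (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE subnets (id INTEGER PRIMARY KEY, network_id INTEGER);
    INSERT INTO networks VALUES (1, 'net1');
    INSERT INTO subnets VALUES (10, 1);
""")

# Read 1: the subnets themselves.
subnets = conn.execute("SELECT id, network_id FROM subnets").fetchall()

# Another API request deletes the network (and its subnet) between the reads.
conn.execute("DELETE FROM subnets WHERE network_id = 1")
conn.execute("DELETE FROM networks WHERE id = 1")

# Read 2: the follow-up query that loads the related networks by id.
net_ids = [network_id for _, network_id in subnets]
marks = ",".join("?" * len(net_ids))
networks = {
    row[0]: row[1]
    for row in conn.execute(
        "SELECT id, name FROM networks WHERE id IN (%s)" % marks, net_ids
    )
}

# The subnet from read 1 now has no matching network.
loaded = [(subnet_id, networks.get(network_id)) for subnet_id, network_id in subnets]
print(loaded)  # [(10, None)]
```

A single joined query cannot produce this state because the subnet and its network come from the same result row.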

Kevin Benton (kevinbenton) wrote :

The patch I proposed does eliminate the problem of the server not responding, as triggered by the provided script.

Piotr Misiak (piotr-misiak) wrote :

Kevin,
is your patch safe with respect to the transaction issue you wrote about in comment #10?

Reviewed: https://review.openstack.org/409901
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a802b382d331423bf3449665dcfa7c48d7b8e094
Submitter: Jenkins
Branch: master

commit a802b382d331423bf3449665dcfa7c48d7b8e094
Author: Kevin Benton <email address hidden>
Date: Mon Dec 12 11:36:51 2016 -0800

    Use subqueries for rbac_entries and subnets<->network

    Loading subnets as part of the networks list and networks
    as part of the subnets list appears to have a significant
    impact when the network has tags and the subnets have
    extra routes entries. This is even further compounded by
    the network having rbac entries (likely due to the subnet
    inheriting the RBAC entries of the network with the custom
    join condition in the model).

    This patch converts rbac_entries on both subnet and network
    to use a subquery and converts the network and subnets
    relationships on the subnet and network models (respectively)
    to use subqueries as well.

    On my dev environment after running the script in the report,
    a network list took 5 minutes. Converting just the rbac_entries
    or just the network/subnet relationship to subqueries reduced it
    to 3-5 seconds. Converting both (as this patch does), reduces it
    back down to a couple of hundred milliseconds (normal perf of my
    development env with the current network count).

    Subqueries will just cost us a constant number of queries and
    won't scale up with result count so this should not impact scalability
    in any way.

    None of these fields are queryable from the API, so we don't need
    to worry about breaking queries against the models.

    Partial-Bug: #1649317
    Change-Id: Ic1947e3d78d58a79b21344b10cb7ab0e573e419f

Reviewed: https://review.openstack.org/413979
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ff1d93db355730c84a4bcaff5dd0e10d6670667e
Submitter: Jenkins
Branch: stable/newton

commit ff1d93db355730c84a4bcaff5dd0e10d6670667e
Author: Kevin Benton <email address hidden>
Date: Mon Dec 12 11:36:51 2016 -0800

    Use subqueries for rbac_entries and subnets<->network

    Loading subnets as part of the networks list and networks
    as part of the subnets list appears to have a significant
    impact when the network has tags and the subnets have
    extra routes entries. This is even further compounded by
    the network having rbac entries (likely due to the subnet
    inheriting the RBAC entries of the network with the custom
    join condition in the model).

    This patch converts rbac_entries on both subnet and network
    to use a subquery and converts the network and subnets
    relationships on the subnet and network models (respectively)
    to use subqueries as well.

    On my dev environment after running the script in the report,
    a network list took 5 minutes. Converting just the rbac_entries
    or just the network/subnet relationship to subqueries reduced it
    to 3-5 seconds. Converting both (as this patch does), reduces it
    back down to a couple of hundred milliseconds (normal perf of my
    development env with the current network count).

    Subqueries will just cost us a constant number of queries and
    won't scale up with result count so this should not impact scalability
    in any way.

    None of these fields are queryable from the API, so we don't need
    to worry about breaking queries against the models.

    Partial-Bug: #1649317
    Change-Id: Ic1947e3d78d58a79b21344b10cb7ab0e573e419f
    (cherry picked from commit a802b382d331423bf3449665dcfa7c48d7b8e094)

tags: added: in-stable-newton

Reviewed: https://review.openstack.org/410502
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3ea5f7ce5627599b7e1f0f1c1d583dd5466b7d31
Submitter: Jenkins
Branch: master

commit 3ea5f7ce5627599b7e1f0f1c1d583dd5466b7d31
Author: Kevin Benton <email address hidden>
Date: Tue Dec 13 18:07:42 2016 -0800

    Get rid of ml2 port model hook join

    The binding is already joined to the port via a backref relationship
    so we can just utilize that rather than join to the table an additional
    time.

    Partial-Bug: #1649317
    Change-Id: I267a808b411f44b2128955dc93bd8da34d1fac91

Reviewed: https://review.openstack.org/417878
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=495b7863a0c9c1f4ab319bb114ff0bec442376df
Submitter: Jenkins
Branch: master

commit 495b7863a0c9c1f4ab319bb114ff0bec442376df
Author: Kevin Benton <email address hidden>
Date: Mon Jan 9 05:30:56 2017 -0800

    Get rid of _network_model_hook for external_net

    The network already has a joined relationship to the external
    network table so we can leverage that instead of causing an
    additional join for the filtering criteria.

    Partial-Bug: #1649317
    Change-Id: Idfee69b124f4ab8e2998da8492c5fa627f705bb9

Reviewed: https://review.openstack.org/418136
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f204728e37d4f18e741dfa295d6b3da5529efd6c
Submitter: Jenkins
Branch: master

commit f204728e37d4f18e741dfa295d6b3da5529efd6c
Author: Kevin Benton <email address hidden>
Date: Mon Jan 9 14:14:57 2017 -0800

    Get rid of additional fixed_ip filter join

    Partial-Bug: #1649317
    Change-Id: I692b4b85d539af3465a48eed83e40f2ad5b87e51

tags: added: neutron-proactive-backport-potential
Thomas Morin (tmmorin-orange) wrote :

Not sure why openstack infra bot is not notifying launchpad.

For the record: https://review.openstack.org/#/c/408143/ is in progress.

----
Switch to 'subquery' for 1-M relationships

This switches to the use of subqueries for 1-m relationships which will result in a higher constant query factor but will eliminate the potential for cross-product explosions.

[...]

Reviewed: https://review.openstack.org/410501
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=72154cc0cfa3c51ca8e0fcaf62875546a59bc0ef
Submitter: Jenkins
Branch: master

commit 72154cc0cfa3c51ca8e0fcaf62875546a59bc0ef
Author: Kevin Benton <email address hidden>
Date: Tue Dec 13 17:50:31 2016 -0800

    Elminate join for network owner filter

    Coverage provided by test_list_ports_for_network_owner
    and test_get_ports_count.

    Partial-Bug: #1649317
    Change-Id: I56a15f4b5b47f46fa75b6b5174478cf62b3d0670

tags: added: neutron-easy-proactive-backport-potential
tags: removed: neutron-easy-proactive-backport-potential neutron-proactive-backport-potential
tags: added: neutron-proactive-backport-potential
tags: added: ocata-rc-potential
Changed in neutron:
milestone: none → ocata-rc1

Reviewed: https://review.openstack.org/408143
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3ffe006743b33f48bca6fce541e0a8f64f844fb7
Submitter: Jenkins
Branch: master

commit 3ffe006743b33f48bca6fce541e0a8f64f844fb7
Author: Kevin Benton <email address hidden>
Date: Mon Jan 9 05:02:42 2017 -0800

    Switch to 'subquery' for 1-M relationships

    This switches to the use of subqueries for 1-m relationships
    which will result in a higher constant query factor but will
    eliminate the potential for cross-product explosions.

    Closes-Bug: #1649317
    Change-Id: I6952c48236153a8e2f2f155375b70573ddc2cf0f

Changed in neutron:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/430406
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=54a3dd605a058f4ca18a783e8df46b67e0bd3489
Submitter: Jenkins
Branch: stable/ocata

commit 54a3dd605a058f4ca18a783e8df46b67e0bd3489
Author: Kevin Benton <email address hidden>
Date: Mon Jan 9 05:02:42 2017 -0800

    Switch to 'subquery' for 1-M relationships

    This switches to the use of subqueries for 1-m relationships
    which will result in a higher constant query factor but will
    eliminate the potential for cross-product explosions.

    Closes-Bug: #1649317
    Change-Id: I6952c48236153a8e2f2f155375b70573ddc2cf0f
    (cherry picked from commit 3ffe006743b33f48bca6fce541e0a8f64f844fb7)

tags: added: in-stable-ocata

This issue was fixed in the openstack/neutron 10.0.0.0rc2 release candidate.

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/446916

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/446933

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/446937

Reviewed: https://review.openstack.org/446933
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=acebb8f3f51d5a21c6b07860eb92c35ad3e342fb
Submitter: Jenkins
Branch: stable/newton

commit acebb8f3f51d5a21c6b07860eb92c35ad3e342fb
Author: Kevin Benton <email address hidden>
Date: Mon Jan 9 05:30:56 2017 -0800

    Get rid of _network_model_hook for external_net

    The network already has a joined relationship to the external
    network table so we can leverage that instead of causing an
    additional join for the filtering criteria.

    Conflicts:
     neutron/db/external_net_db.py

    Partial-Bug: #1649317
    Change-Id: Idfee69b124f4ab8e2998da8492c5fa627f705bb9
    (cherry picked from commit 495b7863a0c9c1f4ab319bb114ff0bec442376df)

Reviewed: https://review.openstack.org/446937
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5935b617fbda72a79ed5d1148d1b176ef8adade7
Submitter: Jenkins
Branch: stable/newton

commit 5935b617fbda72a79ed5d1148d1b176ef8adade7
Author: Kevin Benton <email address hidden>
Date: Tue Dec 13 18:07:42 2016 -0800

    Get rid of ml2 port model hook join

    The binding is already joined to the port via a backref relationship
    so we can just utilize that rather than join to the table an additional
    time.

    Partial-Bug: #1649317
    Change-Id: I267a808b411f44b2128955dc93bd8da34d1fac91
    (cherry picked from commit 3ea5f7ce5627599b7e1f0f1c1d583dd5466b7d31)

Reviewed: https://review.openstack.org/446916
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1215eb5769f8e8e2355985f115783432622931bc
Submitter: Jenkins
Branch: stable/newton

commit 1215eb5769f8e8e2355985f115783432622931bc
Author: Kevin Benton <email address hidden>
Date: Mon Jan 9 14:14:57 2017 -0800

    Get rid of additional fixed_ip filter join

    Partial-Bug: #1649317
    Change-Id: I692b4b85d539af3465a48eed83e40f2ad5b87e51
    (cherry picked from commit f204728e37d4f18e741dfa295d6b3da5529efd6c)

Reviewed: https://review.openstack.org/446915
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=00785f68691df232049ed530d306a80408207ef6
Submitter: Jenkins
Branch: stable/newton

commit 00785f68691df232049ed530d306a80408207ef6
Author: Kevin Benton <email address hidden>
Date: Mon Jan 9 05:02:42 2017 -0800

    Switch to 'subquery' for 1-M relationships

    This switches to the use of subqueries for 1-m relationships
    which will result in a higher constant query factor but will
    eliminate the potential for cross-product explosions.

    Conflicts:
     neutron/db/models/l3.py
     neutron/db/models/metering.py
     neutron/db/models/segment.py
     neutron/db/models/tag.py
     neutron/db/models_v2.py
     neutron/plugins/ml2/models.py

    Closes-Bug: #1649317
    Change-Id: I6952c48236153a8e2f2f155375b70573ddc2cf0f
    (cherry picked from commit 54a3dd605a058f4ca18a783e8df46b67e0bd3489)

Daniel Alvarez (dalvarezs) wrote :

All patches related to this bug have been backported to stable/newton

tags: removed: neutron-proactive-backport-potential

This issue was fixed in the openstack/neutron 9.3.1 release.

This issue was fixed in the openstack/neutron 11.0.0.0b1 development milestone.
