Bug #1838025 “Blazar resource allocations API call very slow (15...” : Bugs : Blazar

Pierre Riteau (priteau) on 2019-09-02

Changed in blazar:
importance:	Undecided → High

Revision history for this message

Tetsuro Nakamura (tetsuro0907) wrote on 2019-09-10:

#1

Raw SQL provided in [1] is:

SELECT reservations.id AS reservations_id, reservations.lease_id AS reservations_lease_id, computehost_allocations.compute_host_id AS computehost_allocations_compute_host_id
FROM reservations JOIN leases ON leases.id = reservations.lease_id JOIN computehost_allocations ON reservations.id = computehost_allocations.reservation_id
WHERE computehost_allocations.compute_host_id IN (?, ?) AND leases.end_date >= ? AND leases.start_date <= ?

Raw SQL provided in [2] is:

SELECT reservations.id AS reservations_id, reservations.lease_id AS reservations_lease_id, computehost_allocations.compute_host_id AS computehost_allocations_compute_host_id
FROM reservations JOIN leases ON leases.id = reservations.lease_id JOIN computehost_allocations ON reservations.id = computehost_allocations.reservation_id
WHERE computehost_allocations.compute_host_id IN (?, ?) AND (leases.status = ? OR leases.status = ?) AND (reservations.status = ? OR reservations.status = ?)

[1] https://review.opendev.org/#/c/678755
[2] https://review.opendev.org/#/c/679935

-----

It would be nice to see the difference of the result of the following command in your environment for debugging

mysql> EXPLAIN ANALYZE [the raw sql above]

Revision history for this message

Jacob Colleran (jcolleran) wrote on 2019-09-10:

#2

Had a chance now to look at this a little closer. We got some pretty big gains from breaking up the 3-table join into two queries with sqlaclhemy in Blazar. With the original code it was taking 40+ seconds for us to get a return. When I was testing this out with raw SQL, I was getting similar long 40+ second results. However, I noticed Tetsuro was specifying what ids to join the tables on, and when I test this out in raw sql it is more efficient then breaking up the queries into two. However, this is not what our code in Blazar is currently doing. I think any of the optimization updates need to add what fields to join on, example below:

    query = (session.query(models.Reservation, models.ComputeHostAllocation)
             .join(models.Lease, models.Lease.id == models.Reservation.lease_id)
             .join(models.ComputeHostAllocation,
                   models.ComputeHostAllocation.reservation_id == models.Reservation.id)
             .filter(models.ComputeHostAllocation.compute_host_id
                     .in_(host_ids))
             .filter(sa.and_(border0, border1)))

I'm not totally sure what the ORM is doing when we do not specify what fields to join the tables on, but it's so inefficient that I wouldn't be surprised if it were essentially running a cross join instead of an inner join. Below is a raw sql representation of what I believe sqlalchemy is actually running in blazar/db/sqlalchemy/utils/get_reservation_allocations_by_host_ids():

    SELECT reservations.*, computehost_allocations.* FROM reservations
    JOIN leases
    JOIN computehost_allocations
    WHERE computehost_allocations.compute_host_id IN (?,?)
    AND leases.end_date <= ?
    AND leases.start_date >= ?;

Run on Chameleon's UC baremetal cluster (155 nodes):

Two Queries: 0.050 seconds
example 1 Tetsuro: 0.024 seconds
example 2 Tetsuro: 0.047 seconds
original (raw above)*: 74.433 seconds

* this was run with a limit of 1000000 because I got impatient waiting for it to finish.

I'll leave comments in the patches Tetsuro opened up.

Had a chance now to look at this a little closer. We got some pretty big gains from breaking up the 3-table join into two queries with sqlaclhemy in Blazar. With the original code it was taking 40+ seconds for us to get a return. When I was testing this out with raw SQL, I was getting similar long 40+ second results. However, I noticed Tetsuro was specifying what ids to join the tables on, and when I test this out in raw sql it is more efficient then breaking up the queries into two. However, this is not what our code in Blazar is currently doing. I think any of the optimization updates need to add what fields to join on, example below:

query = (session.query(models.Reservation, models.ComputeHostAllocation)
             .join(models.Lease, models.Lease.id == models.Reservation.lease_id)
             .join(models.ComputeHostAllocation,
                   models.ComputeHostAllocation.reservation_id == models.Reservation.id)
             .filter(models.ComputeHostAllocation.compute_host_id
                     .in_(host_ids))
             .filter(sa.and_(border0, border1)))

I'm not totally sure what the ORM is doing when we do not specify what fields to join the tables on, but it's so inefficient that I wouldn't be surprised if it were essentially running a cross join instead of an inner join. Below is a raw sql representation of what I believe sqlalchemy is actually running in blazar/db/sqlalchemy/utils/get_reservation_allocations_by_host_ids():

SELECT reservations.*, computehost_allocations.* FROM reservations
    JOIN leases
    JOIN computehost_allocations
    WHERE computehost_allocations.compute_host_id IN (?,?)
    AND leases.end_date <= ?
    AND leases.start_date >= ?;
 
Run on Chameleon's UC baremetal cluster (155 nodes):

Two Queries:               0.050 seconds
example 1 Tetsuro:         0.024 seconds
example 2 Tetsuro:         0.047 seconds
original (raw above)*:    74.433 seconds

* this was run with a limit of 1000000 because I got impatient waiting for it to finish.

I'll leave comments in the patches Tetsuro opened up.

Revision history for this message

Pierre Riteau (priteau) wrote on 2019-09-11:

#3

FYI, this is the SQL that I see generated in blazar master, before any of the patches. It's joining with lots of other tables! Maybe this is due to the lazy='joined' setting.

http://paste.openstack.org/show/775135/

Revision history for this message

Pierre Riteau (priteau) wrote on 2019-09-11:

#4

With https://review.opendev.org/#/c/678755/ patch set 4: http://paste.openstack.org/show/775138/

Revision history for this message

Pierre Riteau (priteau) wrote on 2019-09-11:

#5

And if I removed the join on ComputeHostAllocation: http://paste.openstack.org/show/775139/

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-12: Change abandoned on blazar (master)

#6

Change abandoned by Jacob Colleran (<email address hidden>) on branch: master
Review: https://review.opendev.org/672994
Reason: Abandoned in favor of https://review.opendev.org/#/c/678755/

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-09-13: Fix merged to blazar (master)

#7

Reviewed: https://review.opendev.org/678755
Committed: https://git.openstack.org/cgit/openstack/blazar/commit/?id=ddfa988e7b6df115606d68bbf65322cb050f9e4e
Submitter: Zuul
Branch: master

commit ddfa988e7b6df115606d68bbf65322cb050f9e4e
Author: Tetsuro Nakamura <email address hidden>
Date: Thu Aug 22 16:22:43 2019 +0000

Select only necessary columns in allocation API

    For performance optimization, this patch refactors the database API,
    get_reservation_allocations_by_host_ids() to select only required
    columns.

    Co-Authored-By: Jacob Colleran <email address hidden>
    Change-Id: If8c1eb2ccad39ab70933e51166f4ae979877fb52
    Partial-Bug: #1838025

Blazar

Blazar resource allocations API call very slow (15+ seconds)

Bug Description

Other bug subscribers

Remote bug watches