Blazar resource allocations API call very slow (15+ seconds)

Bug #1838025 reported by Jacob Colleran
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Blazar
New
High
Unassigned

Bug Description

On our OS deployment (~500 nodes in two regions), the API call to retrieve host allocations can take longer than 15-20 seconds. This is way longer than is necessary for just a few hundred nodes. The problems seems to be with the ORM. The query_host_allocations function in the host plugin grabs reservation and allocation information from get_reservation_allocation_by_host_ids. It looks like it is trying to join three tables and only applies a few filters. Some optimization may be in order.

def get_reservation_allocations_by_host_ids(host_ids, start_date, end_date,
                                            lease_id=None,
                                            reservation_id=None):
    session = get_session()
    border0 = models.Lease.end_date < start_date
    border1 = models.Lease.start_date > end_date
    query = (session.query(models.Reservation, models.ComputeHostAllocation)
             .join(models.Lease).join(models.ComputeHostAllocation)
             .filter(models.ComputeHostAllocation.compute_host_id
                     .in_(host_ids))
             .filter(~sa.or_(border0, border1)))
    if lease_id:
        query = query.filter(models.Reservation.lease_id == lease_id)
    if reservation_id:
        query = query.filter(models.Reservation.id == reservation_id)
    return query.all()

Pierre Riteau (priteau)
Changed in blazar:
importance: Undecided → High
Revision history for this message
Tetsuro Nakamura (tetsuro0907) wrote :

Raw SQL provided in [1] is:

SELECT reservations.id AS reservations_id, reservations.lease_id AS reservations_lease_id, computehost_allocations.compute_host_id AS computehost_allocations_compute_host_id
FROM reservations JOIN leases ON leases.id = reservations.lease_id JOIN computehost_allocations ON reservations.id = computehost_allocations.reservation_id
WHERE computehost_allocations.compute_host_id IN (?, ?) AND leases.end_date >= ? AND leases.start_date <= ?

Raw SQL provided in [2] is:

SELECT reservations.id AS reservations_id, reservations.lease_id AS reservations_lease_id, computehost_allocations.compute_host_id AS computehost_allocations_compute_host_id
FROM reservations JOIN leases ON leases.id = reservations.lease_id JOIN computehost_allocations ON reservations.id = computehost_allocations.reservation_id
WHERE computehost_allocations.compute_host_id IN (?, ?) AND (leases.status = ? OR leases.status = ?) AND (reservations.status = ? OR reservations.status = ?)

[1] https://review.opendev.org/#/c/678755
[2] https://review.opendev.org/#/c/679935

-----

It would be nice to see the difference of the result of the following command in your environment for debugging

mysql> EXPLAIN ANALYZE [the raw sql above]

Revision history for this message
Jacob Colleran (jcolleran) wrote :

Had a chance now to look at this a little closer. We got some pretty big gains from breaking up the 3-table join into two queries with sqlaclhemy in Blazar. With the original code it was taking 40+ seconds for us to get a return. When I was testing this out with raw SQL, I was getting similar long 40+ second results. However, I noticed Tetsuro was specifying what ids to join the tables on, and when I test this out in raw sql it is more efficient then breaking up the queries into two. However, this is not what our code in Blazar is currently doing. I think any of the optimization updates need to add what fields to join on, example below:

    query = (session.query(models.Reservation, models.ComputeHostAllocation)
             .join(models.Lease, models.Lease.id == models.Reservation.lease_id)
             .join(models.ComputeHostAllocation,
                   models.ComputeHostAllocation.reservation_id == models.Reservation.id)
             .filter(models.ComputeHostAllocation.compute_host_id
                     .in_(host_ids))
             .filter(sa.and_(border0, border1)))

I'm not totally sure what the ORM is doing when we do not specify what fields to join the tables on, but it's so inefficient that I wouldn't be surprised if it were essentially running a cross join instead of an inner join. Below is a raw sql representation of what I believe sqlalchemy is actually running in blazar/db/sqlalchemy/utils/get_reservation_allocations_by_host_ids():

    SELECT reservations.*, computehost_allocations.* FROM reservations
    JOIN leases
    JOIN computehost_allocations
    WHERE computehost_allocations.compute_host_id IN (?,?)
    AND leases.end_date <= ?
    AND leases.start_date >= ?;

Run on Chameleon's UC baremetal cluster (155 nodes):

Two Queries: 0.050 seconds
example 1 Tetsuro: 0.024 seconds
example 2 Tetsuro: 0.047 seconds
original (raw above)*: 74.433 seconds

* this was run with a limit of 1000000 because I got impatient waiting for it to finish.

I'll leave comments in the patches Tetsuro opened up.

Revision history for this message
Pierre Riteau (priteau) wrote :

FYI, this is the SQL that I see generated in blazar master, before any of the patches. It's joining with lots of other tables! Maybe this is due to the lazy='joined' setting.

http://paste.openstack.org/show/775135/

Revision history for this message
Pierre Riteau (priteau) wrote :
Revision history for this message
Pierre Riteau (priteau) wrote :

And if I removed the join on ComputeHostAllocation: http://paste.openstack.org/show/775139/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on blazar (master)

Change abandoned by Jacob Colleran (<email address hidden>) on branch: master
Review: https://review.opendev.org/672994
Reason: Abandoned in favor of https://review.opendev.org/#/c/678755/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to blazar (master)

Reviewed: https://review.opendev.org/678755
Committed: https://git.openstack.org/cgit/openstack/blazar/commit/?id=ddfa988e7b6df115606d68bbf65322cb050f9e4e
Submitter: Zuul
Branch: master

commit ddfa988e7b6df115606d68bbf65322cb050f9e4e
Author: Tetsuro Nakamura <email address hidden>
Date: Thu Aug 22 16:22:43 2019 +0000

    Select only necessary columns in allocation API

    For performance optimization, this patch refactors the database API,
    get_reservation_allocations_by_host_ids() to select only required
    columns.

    Co-Authored-By: Jacob Colleran <email address hidden>
    Change-Id: If8c1eb2ccad39ab70933e51166f4ae979877fb52
    Partial-Bug: #1838025

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.