Blazar resource allocations API call very slow (15+ seconds)

Bug #1838025 reported by Jacob Colleran on 2019-07-26
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Blazar
High
Unassigned

Bug Description

On our OS deployment (~500 nodes in two regions), the API call to retrieve host allocations can take longer than 15-20 seconds. This is way longer than is necessary for just a few hundred nodes. The problems seems to be with the ORM. The query_host_allocations function in the host plugin grabs reservation and allocation information from get_reservation_allocation_by_host_ids. It looks like it is trying to join three tables and only applies a few filters. Some optimization may be in order.

def get_reservation_allocations_by_host_ids(host_ids, start_date, end_date,
                                            lease_id=None,
                                            reservation_id=None):
    session = get_session()
    border0 = models.Lease.end_date < start_date
    border1 = models.Lease.start_date > end_date
    query = (session.query(models.Reservation, models.ComputeHostAllocation)
             .join(models.Lease).join(models.ComputeHostAllocation)
             .filter(models.ComputeHostAllocation.compute_host_id
                     .in_(host_ids))
             .filter(~sa.or_(border0, border1)))
    if lease_id:
        query = query.filter(models.Reservation.lease_id == lease_id)
    if reservation_id:
        query = query.filter(models.Reservation.id == reservation_id)
    return query.all()

Pierre Riteau (priteau) on 2019-09-02
Changed in blazar:
importance: Undecided → High
Tetsuro Nakamura (tetsuro0907) wrote :

Raw SQL provided in [1] is:

SELECT reservations.id AS reservations_id, reservations.lease_id AS reservations_lease_id, computehost_allocations.compute_host_id AS computehost_allocations_compute_host_id
FROM reservations JOIN leases ON leases.id = reservations.lease_id JOIN computehost_allocations ON reservations.id = computehost_allocations.reservation_id
WHERE computehost_allocations.compute_host_id IN (?, ?) AND leases.end_date >= ? AND leases.start_date <= ?

Raw SQL provided in [2] is:

SELECT reservations.id AS reservations_id, reservations.lease_id AS reservations_lease_id, computehost_allocations.compute_host_id AS computehost_allocations_compute_host_id
FROM reservations JOIN leases ON leases.id = reservations.lease_id JOIN computehost_allocations ON reservations.id = computehost_allocations.reservation_id
WHERE computehost_allocations.compute_host_id IN (?, ?) AND (leases.status = ? OR leases.status = ?) AND (reservations.status = ? OR reservations.status = ?)

[1] https://review.opendev.org/#/c/678755
[2] https://review.opendev.org/#/c/679935

-----

It would be nice to see the difference of the result of the following command in your environment for debugging

mysql> EXPLAIN ANALYZE [the raw sql above]

Jacob Colleran (jcolleran) wrote :

Had a chance now to look at this a little closer. We got some pretty big gains from breaking up the 3-table join into two queries with sqlaclhemy in Blazar. With the original code it was taking 40+ seconds for us to get a return. When I was testing this out with raw SQL, I was getting similar long 40+ second results. However, I noticed Tetsuro was specifying what ids to join the tables on, and when I test this out in raw sql it is more efficient then breaking up the queries into two. However, this is not what our code in Blazar is currently doing. I think any of the optimization updates need to add what fields to join on, example below:

    query = (session.query(models.Reservation, models.ComputeHostAllocation)
             .join(models.Lease, models.Lease.id == models.Reservation.lease_id)
             .join(models.ComputeHostAllocation,
                   models.ComputeHostAllocation.reservation_id == models.Reservation.id)
             .filter(models.ComputeHostAllocation.compute_host_id
                     .in_(host_ids))
             .filter(sa.and_(border0, border1)))

I'm not totally sure what the ORM is doing when we do not specify what fields to join the tables on, but it's so inefficient that I wouldn't be surprised if it were essentially running a cross join instead of an inner join. Below is a raw sql representation of what I believe sqlalchemy is actually running in blazar/db/sqlalchemy/utils/get_reservation_allocations_by_host_ids():

    SELECT reservations.*, computehost_allocations.* FROM reservations
    JOIN leases
    JOIN computehost_allocations
    WHERE computehost_allocations.compute_host_id IN (?,?)
    AND leases.end_date <= ?
    AND leases.start_date >= ?;

Run on Chameleon's UC baremetal cluster (155 nodes):

Two Queries: 0.050 seconds
example 1 Tetsuro: 0.024 seconds
example 2 Tetsuro: 0.047 seconds
original (raw above)*: 74.433 seconds

* this was run with a limit of 1000000 because I got impatient waiting for it to finish.

I'll leave comments in the patches Tetsuro opened up.

Pierre Riteau (priteau) wrote :

FYI, this is the SQL that I see generated in blazar master, before any of the patches. It's joining with lots of other tables! Maybe this is due to the lazy='joined' setting.

http://paste.openstack.org/show/775135/

Pierre Riteau (priteau) wrote :

And if I removed the join on ComputeHostAllocation: http://paste.openstack.org/show/775139/

Change abandoned by Jacob Colleran (<email address hidden>) on branch: master
Review: https://review.opendev.org/672994
Reason: Abandoned in favor of https://review.opendev.org/#/c/678755/

Reviewed: https://review.opendev.org/678755
Committed: https://git.openstack.org/cgit/openstack/blazar/commit/?id=ddfa988e7b6df115606d68bbf65322cb050f9e4e
Submitter: Zuul
Branch: master

commit ddfa988e7b6df115606d68bbf65322cb050f9e4e
Author: Tetsuro Nakamura <email address hidden>
Date: Thu Aug 22 16:22:43 2019 +0000

    Select only necessary columns in allocation API

    For performance optimization, this patch refactors the database API,
    get_reservation_allocations_by_host_ids() to select only required
    columns.

    Co-Authored-By: Jacob Colleran <email address hidden>
    Change-Id: If8c1eb2ccad39ab70933e51166f4ae979877fb52
    Partial-Bug: #1838025

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers