Inefficient multi-cell instance list

Bug #1787977 reported by Matt Riedemann on 2018-08-20
This bug affects 1 person
Affects: OpenStack Compute (nova)
Assigned to: Dan Smith

Bug Description

This is based on some performance and scale testing done by Huawei, reported in this dev ML thread:

In that scenario, they have 10 cells with 10000 instances in each cell. They then run through a few GET /servers/detail scenarios with multiple cells and varying limits.

The thread discussion pointed out that they were wasting time pulling 1000 records (the default [api]/max_limit) from all 10 cells and then throwing away 9000 of those results, so the DB query time per cell was small, but the sqla/ORM/python was chewing up the time.

Dan Smith has a series of changes here:

These allow us to batch the DB queries per cell; when the limit is distributed across the 10 cells (e.g. 1000 / 10 = 100 batch size per cell), the time spent is cut roughly in half (from around 11 sec to around 6 sec).
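As a rough sketch of that arithmetic (illustrative Python, not nova's actual code; the function name is made up):

```python
def distributed_batch_size(total_limit, num_cells):
    # Split the API limit across the cells instead of asking every
    # cell for the full limit, as described in the thread above.
    return max(1, total_limit // num_cells)

# The scenario from the bug: [api]/max_limit of 1000 across 10 cells.
per_cell = distributed_batch_size(1000, 10)
print(per_cell)  # 100 records per batch, per cell
```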

This is clearly a performance issue for which we have a fix, and we should arguably backport that fix.

Note this is less of an issue for deployments that leverage the [api]/instance_list_per_project_cells option (like CERN):

Dan Smith (danms) wrote:

The only argument against backporting is that we identified this as a potential situation at PTG in Denver (the first one), and said we would deal with it if/when it came up. At the time we had the most information from CERN, which is mostly immune to this situation.

That said, the batching is a lot less complicated than I originally expected and there isn't really any technical reason not to backport it so I think we should.

Changed in nova:
status: Triaged → In Progress

Submitter: Zuul
Branch: master

commit 0a88916911e2b02055a2a707bda026c975f4472c
Author: Dan Smith <email address hidden>
Date: Thu Aug 16 13:52:01 2018 -0700

    Batch results per cell when doing cross-cell listing

    This extends the multi_cell_list module with batching support to avoid
    querying N*$limit total results when listing resources across cells.
    Instead, if our total limit is over a given threshold, we should query
    smaller batches in the per-cell thread until we reach the total limit
    or are stopped because the sort feeder has found enough across all cells
    to satisfy the requirements. In many cases, this can drop the total number
    of results we load and process from N*$limit to (best case) $limit+$batch
    or (usual case) $limit+(N*$batch).

    Since we return a generator from our scatter-gather function, this should
    mean we basically finish the scatter immediately after the first batch query
    to each cell database, keeping the threads alive until they produce all the
    results possible from their cell, or are terminated in the generator loop
    by the master loop hitting the total_limit condition. As a result, the
    checking over results that we do immediately after the scatter finishes
    will no longer do anything since we start running the query code for the
    first time as heapq.merge() starts hitting the generators. So, this brings
    a query_wrapper() specific to the multi_cell_list code which can mimic the
    timeout and error handling abilities of scatter_gather_cells, but inline
    as we're processing so that we don't interrupt the merge sort for a

    Related-Bug: #1787977
    Change-Id: Iaa4759822e70b39bd735104d03d4deec988d35a1
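The generator-based batching and merge sort described in this commit can be illustrated with a toy sketch (plain Python, not nova's actual multi_cell_list code; the cell data and `cell_feeder` helper are made up):

```python
import heapq

def cell_feeder(rows, batch_size):
    """Yield one cell's rows lazily, fetching batch_size at a time.

    `rows` stands in for a cell database; real code would issue a
    batched DB query per iteration instead of slicing a list.
    """
    for start in range(0, len(rows), batch_size):
        # One "DB query" per batch; this only runs if the consumer
        # (heapq.merge below) actually asks for more results.
        for record in rows[start:start + batch_size]:
            yield record

# Three fake cells with pre-sorted data, queried in batches of 2.
cells = [
    cell_feeder([1, 4, 7, 10], batch_size=2),
    cell_feeder([2, 5, 8, 11], batch_size=2),
    cell_feeder([3, 6, 9, 12], batch_size=2),
]

# heapq.merge pulls from each feeder lazily, so once the total limit
# is satisfied, later batches are never fetched from any cell.
limit = 5
merged = heapq.merge(*cells)
results = [next(merged) for _ in range(limit)]
print(results)  # [1, 2, 3, 4, 5]
```

Because `heapq.merge` only advances a feeder when it needs that feeder's next value, each cell's generator stays alive until the sorted stream has produced enough results, which mirrors the "sort feeder has found enough across all cells" behavior described above.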

Submitter: Zuul
Branch: master

commit c3a77f80b1863e114109af9c32ea01b205c1a735
Author: Dan Smith <email address hidden>
Date: Fri Aug 17 07:56:05 2018 -0700

    Make instance_list perform per-cell batching

    This makes the instance_list module support batching across cells
    with a couple of different strategies, and with room to add more
    in the future.

    Before this change, an instance list with limit 1000 to a
    deployment with 10 cells would generate a query to each cell
    database with the same limit. Thus, that API request could end
    up processing up to 10,000 instance records despite only
    returning 1000 to the user (because of the limit).

    This uses the batch functionality added in the previous change
    by providing a couple of strategies by which the batch size
    per cell can be determined. These should provide a lot of gain
    in the short term, and we can extend them with other strategies
    as we identify some with additional benefits.

    Closes-Bug: #1787977
    Change-Id: Ie3a5f5dc49f8d9a4b96f1e97f8a6ea0b5738b768
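The "couple of different strategies" mentioned above could be sketched like this (hypothetical names and values for illustration, not nova's actual strategy implementations):

```python
def fixed_batch_size(total_limit, num_cells, size=100):
    """Always fetch a fixed-size batch per cell (capped at the limit)."""
    return min(size, total_limit)

def distributed_batch_size(total_limit, num_cells):
    """Divide the requested limit evenly across the cells."""
    return max(1, total_limit // num_cells)

# For the 10-cell, limit-1000 case in this bug both strategies land
# on 100, but they diverge for small limits or many cells:
print(fixed_batch_size(50, 10))        # 50
print(distributed_batch_size(50, 10))  # 5
```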

Changed in nova:
status: In Progress → Fix Released

This issue was fixed in the openstack/nova release candidate.
