os-simple-tenant-usage performs poorly with many instances

Bug #1421471 reported by Richard Jones on 2015-02-13
This bug affects 3 people
Affects: OpenStack Compute (nova)
Importance: Wishlist
Assigned to: Diana Clarke

Bug Description

The SQL underlying the os-simple-tenant-usage API call results in very slow operations when the database has many (20,000+) instances. In testing, the objects.InstanceList.get_active_by_window_joined call in nova/api/openstack/compute/contrib/simple_tenant_usage.py:SimpleTenantUsageController._tenant_usages_for_period takes 24 seconds to run.

Some basic timing analysis has shown that the initial query in nova/db/sqlalchemy/api.py:instance_get_active_by_window_joined runs in *reasonable* time (though still 5-6 seconds). The bulk of the time is spent in the subsequent _instances_fill_metadata call, which pulls in system_metadata info using a SELECT with an IN clause listing all 20,000 uuids, resulting in execution times over 15 seconds.
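
To illustrate the access pattern described above, here is a minimal sketch using an in-memory SQLite database. The table names follow the ones mentioned in this report, but the schema, the helper names and the chunk size are assumptions made for illustration only; this is not nova's code. It contrasts a second-step lookup driven by IN lists (split into chunks so no single statement carries thousands of parameters) with a single JOIN that lets the database match the rows itself.

```python
import sqlite3


def make_db(n_instances):
    """Build a toy in-memory schema (an assumption, not nova's real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE instances (uuid TEXT PRIMARY KEY, project_id TEXT);
        CREATE TABLE instance_system_metadata (
            instance_uuid TEXT, key TEXT, value TEXT);
        CREATE INDEX ix_sysmeta_uuid
            ON instance_system_metadata (instance_uuid);
        """
    )
    for i in range(n_instances):
        uuid = "uuid-%05d" % i
        conn.execute("INSERT INTO instances VALUES (?, ?)", (uuid, "tenant-a"))
        conn.execute(
            "INSERT INTO instance_system_metadata VALUES (?, ?, ?)",
            (uuid, "instance_type_id", "1"),
        )
    return conn


def metadata_via_in_clause(conn, uuids, chunk_size=500):
    """Second-step lookup using IN lists, issued in chunks (hypothetical helper)."""
    rows = []
    for start in range(0, len(uuids), chunk_size):
        chunk = uuids[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        rows.extend(conn.execute(
            "SELECT instance_uuid, key, value FROM instance_system_metadata "
            "WHERE instance_uuid IN (%s)" % placeholders, chunk))
    return rows


def metadata_via_join(conn):
    """Single round trip: the database joins the two tables itself."""
    return conn.execute(
        "SELECT m.instance_uuid, m.key, m.value "
        "FROM instances AS i "
        "JOIN instance_system_metadata AS m ON m.instance_uuid = i.uuid"
    ).fetchall()


conn = make_db(2000)
uuids = [row[0] for row in conn.execute("SELECT uuid FROM instances")]
in_rows = metadata_via_in_clause(conn, uuids)
join_rows = metadata_via_join(conn)
```

Both helpers return the same rows; which one is faster on a real deployment would have to be measured against the actual database and indexes.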

Tony Breeds (o-tony) on 2015-02-13
Changed in nova:
status: New → Confirmed
Joe Gordon (jogo) wrote :

If we can fix some low-hanging fruit here, that is great, but the name simple-tenant-usage says it all: this isn't a feature that should be used in production.

Changed in nova:
importance: Undecided → Wishlist
Tony Breeds (o-tony) wrote :

Okay, the problem is that it's used by Horizon to show the stats on the login page. So while there may have been an intent for it to be niche, it's being used a lot ("Build it and they will come", I guess).

So we need to see what can be done here. The real solution may be a different API for Liberty, and if that's the case, knowing that ASAP is a good thing (TM).

Richard Jones (r1chardj0n3s) wrote :

I'm afraid this can't be marked "wishlist" - it has a direct impact on users of Horizon. Or, we just accept that simple-tenant-usage is irredeemably broken, and write a new API call for Horizon to consume :)

Fix proposed to branch: master
Review: https://review.openstack.org/159062

Changed in nova:
assignee: nobody → Ankit Agrawal (ankitagrawal)
status: Confirmed → In Progress

Change abandoned by Michael Still on branch: master
Review: https://review.openstack.org/159062
Reason: This patch has been stalled for a very long time, so I am going to abandon it to keep the review queue sane. Please restore the change when it's ready for review.

Changed in nova:
assignee: Ankit Agrawal (ankitagrawal) → nobody
status: In Progress → Confirmed

It's been a while since the performance was measured and there is no activity around this bug report. I'm closing it as "Opinion". If this issue is still observed with the latest release, then the report can be reopened.

Changed in nova:
status: Confirmed → Opinion
Tony Breeds (o-tony) wrote :

Confirmed with origin/master SHA:ced89e7b26b3cff323852e1d8a9c6db80334f4dd

Changed in nova:
status: Opinion → Confirmed
Changed in nova:
assignee: nobody → Diana Clarke (diana-clarke)
Matt Riedemann (mriedem) wrote :

Hmm, this bug says it's spending time doing the joins on the system_metadata table, but that should have been resolved with bug 1485025 and fix https://review.openstack.org/#/c/213340/, so that we're only loading up the instance_extra/flavor information. The REST API code doesn't need system_metadata for the flavors (assuming your instances have been migrated past Kilo, where flavors were moved out of the instance_system_metadata table and into instance_extra). That was fixed in Liberty.

Matt Riedemann (mriedem) wrote :

Given this bug was reported before https://review.openstack.org/#/c/213340/ landed, you wouldn't have had that fix, but it would be useful to know if it resolves your issue.

Diana Clarke (diana-clarke) wrote :

Yes, before proposing pagination for these endpoints I spent some time profiling the current queries generated by the simple tenant usage endpoints, and can confirm that they were significantly improved since this bug was initially reported.

That said, 1 tenant with 20,000+ instances is still going to be problematic without paging of some kind unless the server_usages details (via detailed=1) are removed from the API response and the aggregation is moved to the SQL (with a GROUP BY tenant_id clause).

As of stable/newton, the query generated looks like this (note: I replaced the individual fields with stars for brevity):

SELECT instances.*, instance_extra.*
FROM instances
LEFT OUTER JOIN instance_extra ON instance_extra.instance_uuid = instances.uuid
WHERE (instances.terminated_at IS NULL OR instances.terminated_at > '2016-09-28 21:02:51') AND instances.launched_at < '2016-09-28 21:02:51';
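
As a rough sketch of the alternative mentioned in this comment (moving the aggregation into SQL with GROUP BY tenant_id), the example below applies the same time-window filter to a toy SQLite table and returns one summary row per tenant instead of one row per instance. The schema, the `tenant_id` and `vcpus` columns, the sample rows and the `usage_summary` helper are all assumptions for illustration; this is not what nova generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy table (an assumption); only the columns needed for the sketch.
conn.execute(
    "CREATE TABLE instances ("
    "uuid TEXT PRIMARY KEY, tenant_id TEXT, vcpus INTEGER, "
    "launched_at TEXT, terminated_at TEXT)"
)
conn.executemany(
    "INSERT INTO instances VALUES (?, ?, ?, ?, ?)",
    [
        ("u1", "tenant-a", 2, "2016-09-01 00:00:00", None),
        ("u2", "tenant-a", 4, "2016-09-10 00:00:00", "2016-09-29 00:00:00"),
        ("u3", "tenant-b", 1, "2016-09-20 00:00:00", None),
        # Terminated before the window starts, so the filter excludes it.
        ("u4", "tenant-b", 8, "2016-08-01 00:00:00", "2016-09-01 00:00:00"),
    ],
)


def usage_summary(conn, begin, end):
    """Return (tenant_id, instance_count, total_vcpus) per tenant.

    Hypothetical helper: same window filter as the query above, but the
    counting and summing happen in the database rather than in Python.
    """
    return conn.execute(
        "SELECT tenant_id, COUNT(*), SUM(vcpus) FROM instances "
        "WHERE (terminated_at IS NULL OR terminated_at > ?) "
        "AND launched_at < ? "
        "GROUP BY tenant_id ORDER BY tenant_id",
        (begin, end),
    ).fetchall()


summary = usage_summary(conn, "2016-09-28 21:02:51", "2016-09-28 21:02:51")
```

With the sample rows above, `summary` holds two rows, `("tenant-a", 2, 6)` and `("tenant-b", 1, 1)`, regardless of how many instances each tenant has. The trade-off is that per-instance details (the server_usages list) are no longer available from this query.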

Richard Jones (r1chardj0n3s) wrote :

I don't have the system available for testing this bug out any longer - I'll have to re-investigate setting up a 20,000 instance setup to re-test, which I'll add to my TODO.

As I noted on the proposed fix patch https://review.openstack.org/213340 the usage of this API in Horizon is for summary purposes only - we count the results (for quota and usage summary display). This is in the absence of a more appropriate API call.

I will look into re-testing my scenario and check the performance of the Horizon page in question, and file a followup bug which is more specific about the problem if necessary.
