Nova list is extremely slow with lots of vms

Bug #1160487 reported by Joshua Harlow
This bug affects 15 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

When an admin (or anyone with a policy that allows it) performs a list for 'all-tenants', or when a single user has a large number of VMs, the command can block a whole nova-api process for a long time, which makes it easy for such users to DoS the whole system. This becomes more evident as you add more users and tenants, or as a user creates a lot of VMs. Likely some kind of pagination should be used (?), or the queries being performed should be analyzed to make sure they are optimal (and not repeated many times in for loops...).
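
For illustration, a minimal sketch of paging through GET /servers/detail instead of fetching everything in one request. The endpoint URL, token, and page size are placeholders/assumptions, not values from this bug; limit and marker are the standard paging query parameters for this call.

# Sketch: collect servers page by page using limit/marker.
import requests

def list_all_servers(endpoint, token, page_size=100):
    servers = []
    marker = None
    while True:
        params = {'limit': page_size, 'all_tenants': 1}
        if marker:
            params['marker'] = marker
        resp = requests.get(endpoint + '/servers/detail',
                            headers={'X-Auth-Token': token},
                            params=params)
        resp.raise_for_status()
        page = resp.json()['servers']
        if not page:
            break
        servers.extend(page)
        marker = page[-1]['id']  # the next page starts after this server
    return servers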

Tags: ops
Joshua Harlow (harlowja)
description: updated
Joshua Harlow (harlowja)
summary: - Nova list (for all tenants) is extremely slow
+ Nova list is extremely slow with alot of vms
Revision history for this message
Michael Still (mikal) wrote :

To give me a sense of scale, how many instances is "a lot"? Hundreds? Thousands?

summary: - Nova list is extremely slow with alot of vms
+ Nova list is extremely slow with lots of vms
Changed in nova:
status: New → Incomplete
Revision history for this message
Joshua Harlow (harlowja) wrote :

I'll get you a good number, but I think it was in the thousands when this started.

Michael Still (mikal)
tags: added: ops
Revision history for this message
Joshua Harlow (harlowja) wrote :

Get servers slowness...

Part #1: Form one big 'single' query to get all instances (call this Y)

Part #2:

If detailed:
    - Iterate over each instance in Y:
      - Check policy to get fault
    - Make one big 'single' query to get all instance faults (call this Z)
      - Iterate over result set Z, attach each fault to the appropriate
        instance in Y
    - Form detail view:
      - Call the view function show() on each instance in Y to get the basic
        information about the instance (note: it appears no extra calls are
        made here); the results form list X

*Now here is the iffy part*

Each extension then iterates over the resultant 'view' X and alters it:

- ExtendedServerAttributesController
  - for server in servers:
- ExtendedIpsController
  - for server in servers:
- ServerDiskConfigController
  - for server in servers:
- config_drive.Controller
  - for server in servers:
- ExtendedStatusController
  - for server in servers:
- ... others?

Now if that server list is in the 1000+ range, the process of doing those
iterations and adjusting the views via this extension chain (and re-iteration)
actually takes more time than the initial DB call itself. Hopefully there are
ways this can be improved (a cloud will only get bigger, not smaller). These
operations are especially slow in the 'admin' case, where it is useful to view
all instances (don't piss off the admins, haha).
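
To make the pattern concrete, here is a rough sketch of what each extension effectively does (illustrative class and attribute names, not the actual extension code):

# Illustrative sketch only; names are made up.
class FakeExtendedStatusController(object):
    def detail(self, req, resp_obj):
        servers = resp_obj.obj['servers']
        # Every extension repeats this loop over the full server list, so
        # N servers and M extensions means O(N * M) passes, plus whatever
        # per-server lookups an individual extension performs.
        for server in servers:
            instance = req.get_db_instance(server['id'])
            server['fake:vm_state'] = instance.get('vm_state')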

Some ideas:

It would be interesting to alter the extension mechanism to provide
a central controller that iterates over the server list once, lets
each extension locally accumulate its additions, and then, at the end
of the overall iteration, attaches the locally accumulated results to
the response itself (instead of having each extension do this).
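
A rough sketch of that idea, assuming hypothetical collect()/attach() hooks on each extension (these hooks don't exist today; they are the proposed change):

def build_detail_view(servers, extensions):
    # One pass over the server list; each extension accumulates locally.
    for server in servers:
        for ext in extensions:
            ext.collect(server)
    # One attach phase at the end, driven by the central controller.
    for ext in extensions:
        ext.attach(servers)
    return servers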

Another idea (and possible future work): extensions are currently limited to
the initial DB query and its cached entries via req.get_db_instance. It might
instead be useful to have this central controller form the SQL query for the
initial 'detail/show' calls with 'input' from the extensions that will be
activated. This way extensions can guarantee that their needed backing data is
included in the larger query, and the DB query will be exact for the extensions
and the root 'view' instead of being very generic (as it is right now, since it
is impossible to predict which fields an extension will need from the
underlying instance).
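
For example (a sketch only; the wants_columns attribute and the helper below are hypothetical), extensions could declare which columns/joins they need so the controller builds one exact query up front:

def build_instance_columns(base_columns, extensions):
    # Union of what the root view needs and what each active extension
    # declares, so the single SELECT (and joins such as faults or
    # block_device_mapping) can be built once, instead of a generic query
    # followed by per-extension lookups.
    columns = set(base_columns)
    for ext in extensions:
        columns.update(getattr(ext, 'wants_columns', ()))
    return sorted(columns)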

Michael Still (mikal)
Changed in nova:
status: Incomplete → Triaged
importance: Undecided → Medium
Revision history for this message
Hrushikesh (hrushikesh-gangur) wrote :

I am sure this has something to do with an inefficiency in nova (due to code or configuration) that is causing the overall response time to become sluggish, and it is a factor of the number of active VM instances within a project. The 40 second response time was with 100 VM instances. Now I have launched around 500, and the response time bumped up to 190 seconds:

2013-06-03 13:57:07.481 DEBUG nova.api.openstack.wsgi [req-3da658aa-6a39-4995-8e6f-2d5c7912549e ac5e4da2c17e4f669f8d3e82d7b751dd 5a19956a849542869ce710b9e51439e0] No Content-Type provided in request get_body /usr/lib/python2.7/dist-packages/nova/api/openstack/wsgi.py:791
 2013-06-03 13:57:07.482 DEBUG nova.api.openstack.wsgi [req-3da658aa-6a39-4995-8e6f-2d5c7912549e ac5e4da2c17e4f669f8d3e82d7b751dd 5a19956a849542869ce710b9e51439e0] Calling method <bound method Controller.detail of <nova.api.openstack.compute.servers.Controller object at 0x4132850>> _process_stack /usr/lib/python2.7/dist-packages/nova/api/openstack/wsgi.py:911
 2013-06-03 13:57:07.483 DEBUG nova.compute.api [req-3da658aa-6a39-4995-8e6f-2d5c7912549e ac5e4da2c17e4f669f8d3e82d7b751dd 5a19956a849542869ce710b9e51439e0] Searching by: {'deleted': False, u'project_id': u'5a19956a849542869ce710b9e51439e0'} get_all /usr/lib/python2.7/dist-packages/nova/compute/api.py:1373
 2013-06-03 14:00:15.336 INFO nova.osapi_compute.wsgi.server [req-3da658aa-6a39-4995-8e6f-2d5c7912549e ac5e4da2c17e4f669f8d3e82d7b751dd 5a19956a849542869ce710b9e51439e0] 10.1.56.12 "GET /v2/5a19956a849542869ce710b9e51439e0/servers/detail?project_id=5a19956a849542869ce710b9e51439e0 HTTP/1.1" status: 200 len: 536479 time: 187.857266

And this was not the case in Essex. If I can get someone to help me out with profiling what it does during this API call, we can get to some conclusion, as this can get worse once I have 10,000 instances.

Revision history for this message
Mohammed Naser (mnaser) wrote :

I've tested this and it takes 4.5s to list 377 instances.

Are you using Neutron?

Revision history for this message
Hrushikesh (hrushikesh-gangur) wrote :

This has been significantly improved in Icehouse. To list 550 Active instances across 15 projects, it takes 5.8 seconds.

2014-05-08 11:51:29.229 19815 INFO nova.osapi_compute.wsgi.server [req-c2e215c4-2fb9-4b66-be6f-b962b054d8e9 70c8e4a9473f4f2bae79f18605cbc664 87f209614f5144cd904c27b8f04b7d40] 10.1.56.79 "GET /v2/87f209614f5144cd904c27b8f04b7d40/servers/detail?all_tenants=1 HTTP/1.1" status: 200 len: 1102272 time: 5.7907159

Yes, I am using Neutron.

Revision history for this message
Sam Morrison (sorrison) wrote :

$ time nova list --all-tenants | wc -l
1004

real 0m21.856s
user 0m0.472s
sys 0m0.056s

Note we have over 10k instances, but nova only returns 1000 at a time; the client isn't smart enough to get them all.
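
A workaround is to page manually with the marker parameter; a sketch assuming a python-novaclient recent enough for servers.list() to accept marker/limit (check your client version), where 'client' is an already-authenticated novaclient instance:

def iter_all_servers(client, page_size=1000):
    marker = None
    while True:
        page = client.servers.list(search_opts={'all_tenants': 1},
                                   marker=marker, limit=page_size)
        if not page:
            return
        for server in page:
            yield server
        marker = page[-1].id  # continue from the last server seen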

Joe Gordon (jogo)
Changed in nova:
importance: Medium → High
Revision history for this message
Christopher Lefelhocz (christopher-lefelhoc) wrote :

Do we have any criteria for success to close this bug?

Revision history for this message
Christopher Lefelhocz (christopher-lefelhoc) wrote :

I did a bit of research on this issue this week. For testing in a large cloud I'm seeing similar results of 11-15 seconds to download 1000 instances. Hacking on curl and using the limit/marker options, I was able to determine the following...

At least in our case, ~1000 instances seems to be the upper bound on how many are returned in a single request. Requests with a new marker do take longer to load (instead of 11 seconds, it took between 15 and 22 seconds when the marker changed), but this wasn't proportional to where the marker was in the list.

Requesting detail vs. no detail didn't affect the load timing either.

Right now the suspects are the DB itself (though it is odd that we are seeing similar numbers across varying clouds), the building of the response from the DB data, or the caching of DB instances (as part of the list).

Revision history for this message
Joshua Harlow (harlowja) wrote :

I think your 1000 per 'page' seems like a good criterion. As for how fast, it'd be nice to have ~5 seconds. The amount of data this requires and produces shouldn't take more than 5 seconds IMHO (and that's pretty high). If it does, then it's taking too long and there either needs to be better caching or something needs to be fixed.

Revision history for this message
sonam soni (sonam-soni) wrote :

I was facing the same issue; it used to take a long time when I ran "nova list". But when I removed unnecessary nameservers from resolv.conf and cleaned up /etc/hosts, it was resolved.

Revision history for this message
Sean Dague (sdague) wrote :

I'm not convinced that nova list for all servers being slow is really a high priority bug. We have paging mechanisms for a reason.

Changed in nova:
importance: High → Low
Sean Dague (sdague)
Changed in nova:
status: Triaged → Confirmed
Revision history for this message
Choe, Cheng-Dae (whitekid) wrote :

In my case the most time-consuming extensions are the ones below, and I made some changesets.

- ExtendedVolumesController: https://review.openstack.org/#/c/211258/
  It executes a query for every instance to retrieve block device mappings. For performance I use one big single query with an IN operator.

- ExtendedAZController: https://review.openstack.org/#/c/211250/
  It retrieves host availability zone information from memcache for every host, one by one, with a memcache get() operation. For performance I use get_multi() instead of get().

- SecurityGroupsOutputController
  Its poor performance is caused by Neutron API calls, so I just disabled this extension :D
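
The shape of the first two fixes is the same, batch the lookups; roughly (illustrative SQLAlchemy/memcache calls, not the actual patches linked above):

# Block device mappings: one query with IN (...) instead of one per instance.
def bdms_by_instance(session, model, instance_uuids):
    rows = (session.query(model)
            .filter(model.instance_uuid.in_(instance_uuids))
            .all())
    grouped = {}
    for row in rows:
        grouped.setdefault(row.instance_uuid, []).append(row)
    return grouped

# Availability zones: one get_multi() round trip instead of one get() per host
# (the cache key prefix here is an assumption, not nova's actual key).
def cached_azs(memcache_client, hosts, key_prefix='azcache-'):
    return memcache_client.get_multi([key_prefix + host for host in hosts])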

Revision history for this message
Matt Riedemann (mriedem) wrote :

Bug 1359808 is related to this for listing block device mappings per instance.

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which led to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still supported release names are valid (LIBERTY, MITAKA, OCATA, NEWTON).
  Valid example: CONFIRMED FOR: LIBERTY

Changed in nova:
importance: Low → Undecided
status: Confirmed → Expired