"nova usage" taking too much time with many VMs in database

Bug #1481262 reported by Antonio Messina on 2015-08-04
This bug affects 5 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Medium
Assigned to: Guillaume Espanel
Milestone: —

Bug Description

Issue found on Kilo 2015.1.0 on Ubuntu Trusty (1:2015.1.0-0ubuntu1.1~cloud0) from http://ubuntu-cloud.archive.canonical.com/ubuntu

When running "nova usage" on a tenant that started many instances O(100k) during the current month, the following happens:

* nova-api is stuck at 100% for a long time
* as a consequence, nova CLI returns "ERROR (ConnectionRefused):
Unable to establish connection to ..."
* on MySQL slow query log I see there is a query like:

SELECT instance_system_metadata.created_at AS instance_system_metadata_created_at,
       instance_system_metadata.updated_at AS instance_system_metadata_updated_at,
       instance_system_metadata.deleted_at AS instance_system_metadata_deleted_at,
       instance_system_metadata.deleted AS instance_system_metadata_deleted,
       instance_system_metadata.id AS instance_system_metadata_id,
       instance_system_metadata.`key` AS instance_system_metadata_key,
       instance_system_metadata.value AS instance_system_metadata_value,
       instance_system_metadata.instance_uuid AS instance_system_metadata_instance_uuid
FROM instance_system_metadata
WHERE instance_system_metadata.deleted = 0
  AND instance_system_metadata.instance_uuid IN (<list of ~100k UUID>)

which took 1.8 seconds.
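
For a sense of scale, one quick check is to count how many live metadata rows that IN clause forces nova-api to pull for the tenant. A minimal sketch, assuming the standard Kilo nova schema (the instances table carries uuid and project_id; '<tenant-id>' is a placeholder):

-- count the metadata rows the usage query has to fetch for one tenant
SELECT COUNT(*)
FROM instance_system_metadata ism
JOIN instances i ON i.uuid = ism.instance_uuid
WHERE i.project_id = '<tenant-id>'
  AND ism.deleted = 0;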

Also, logging in from Horizon is very slow, and I get the error "Error: Unable to retrieve usage information."

Changed in nova:
assignee: nobody → Zhenzan Zhou (zhenzan-zhou)
tags: added: db
tags: added: performance
Cale Rath (ctrath) wrote :

Have prior instances been "deleted"? When an instance is deleted, its data is not actually removed from the DB; it is only soft-deleted. There's a patch here that hasn't landed yet to purge soft-deleted instance data: https://review.openstack.org/#/c/203751/
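
A quick way to check whether soft-deleted rows dominate the table (a sketch; nova marks soft-deleted rows with a non-zero deleted value rather than removing them):

-- compare live vs. soft-deleted row counts
SELECT IF(deleted = 0, 'live', 'soft-deleted') AS state,
       COUNT(*) AS row_count
FROM instance_system_metadata
GROUP BY state;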

@Zhenzan Zhou:

Are you still actively working on a patch for this bug? If "yes", please provide a patch in Gerrit in the near future; if "no", please remove yourself as assignee.

Changed in nova:
assignee: Zhenzan Zhou (zhenzan-zhou) → nobody
Sean Dague (sdague) on 2016-02-17
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
stgleb (gstepanov) wrote :

Could you provide a table dump? It would allow me to reproduce your problem without the tedium of creating and deleting instances in my environment.

Changed in nova:
assignee: nobody → stgleb (gstepanov)
Sean Dague (sdague) on 2016-04-18
Changed in nova:
assignee: stgleb (gstepanov) → nobody
Attila Fazekas (afazekas) wrote :

The situation can be even worse with the usage-list call (across all tenants): it can permanently grow the memory allocated by the n-api processes by a huge amount (multiple gigabytes per worker).

1. The aggregation should be done on the DB side (see the sketch below).
2. n-api should never fetch more than osapi_max_limit things at once.
3. Most of these statistics should be handled by the telemetry service, or by a service consuming the telemetry data, instead of having nova (re)do this job.
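
As a sketch of point 1, the per-tenant totals could be computed by the database itself instead of materializing every instance row in nova-api. This assumes the standard nova instances columns (vcpus, memory_mb, root_gb, ephemeral_gb); the real usage API also has to handle the reporting time window and instances deleted within it, which this omits:

-- DB-side aggregation sketch: one summary row per tenant
SELECT project_id,
       COUNT(*) AS instance_count,
       SUM(vcpus) AS total_vcpus,
       SUM(memory_mb) AS total_memory_mb,
       SUM(root_gb + ephemeral_gb) AS total_disk_gb
FROM instances
WHERE deleted = 0
GROUP BY project_id;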

Changed in nova:
assignee: nobody → Guillaume Espanel (guillaume-espanel)

Fix proposed to branch: master
Review: https://review.openstack.org/343734

Changed in nova:
status: Confirmed → In Progress