Slow Query on Instances Table

Bug #1553184 reported by Joseph bajin
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Matt Riedemann

Bug Description

We are currently running Juno (2014.2.4).
We have a few tenants that have a lot of instances that are created and deleted so they have a lot of entries in the instances table.

I see the following bug that was not brought into Juno but was delivered in Kilo that was suppose to help with this same type of issue, but it didn't seem like that worked at all. https://bugs.launchpad.net/nova/+bug/1378395

After reviewing the query a bit more, I found that another Index could be added that dramatically dropped the time the query took to run.

This the query in mention:

SELECT anon_1.instances_deleted_at AS anon_1_instances_deleted_at, anon_1.instances_deleted AS anon_1_instances_deleted, anon_1.instances_created_at AS anon_1_instances_created_at, anon_1.instances_updated_at AS anon_1_instances_updated_at, anon_1.instances_id AS anon_1_instances_id, anon_1.instances_user_id AS anon_1_instances_user_id, anon_1.instances_project_id AS anon_1_instances_project_id, anon_1.instances_image_ref AS anon_1_instances_image_ref, anon_1.instances_kernel_id AS anon_1_instances_kernel_id, anon_1.instances_ramdisk_id AS anon_1_instances_ramdisk_id, anon_1.instances_hostname AS anon_1_instances_hostname, anon_1.instances_launch_index AS anon_1_instances_launch_index, anon_1.instances_key_name AS anon_1_instances_key_name, anon_1.instances_key_data AS anon_1_instances_key_data, anon_1.instances_power_state AS anon_1_instances_power_state, anon_1.instances_vm_state AS anon_1_instances_vm_state, anon_1.instances_task_state AS anon_1_instances_task_state, anon_1.instances_memory_mb AS anon_1_instances_memory_mb, anon_1.instances_vcpus AS anon_1_instances_vcpus, anon_1.instances_root_gb AS anon_1_instances_root_gb, anon_1.instances_ephemeral_gb AS anon_1_instances_ephemeral_gb, anon_1.instances_ephemeral_key_uuid AS anon_1_instances_ephemeral_key_uuid, anon_1.instances_host AS anon_1_instances_host, anon_1.instances_node AS anon_1_instances_node, anon_1.instances_instance_type_id AS anon_1_instances_instance_type_id, anon_1.instances_user_data AS anon_1_instances_user_data, anon_1.instances_reservation_id AS anon_1_instances_reservation_id, anon_1.instances_scheduled_at AS anon_1_instances_scheduled_at, anon_1.instances_launched_at AS anon_1_instances_launched_at, anon_1.instances_terminated_at AS anon_1_instances_terminated_at, anon_1.instances_availability_zone AS anon_1_instances_availability_zone, anon_1.instances_display_name AS anon_1_instances_display_name, anon_1.instances_display_description AS anon_1_instances_display_description, anon_1.instances_launched_on AS anon_1_instances_launched_on, anon_1.instances_locked AS anon_1_instances_locked, anon_1.instances_locked_by AS anon_1_instances_locked_by, anon_1.instances_os_type AS anon_1_instances_os_type, anon_1.instances_architecture AS anon_1_instances_architecture, anon_1.instances_vm_mode AS anon_1_instances_vm_mode, anon_1.instances_uuid AS anon_1_instances_uuid, anon_1.instances_root_device_name AS anon_1_instances_root_device_name, anon_1.instances_default_ephemeral_device AS anon_1_instances_default_ephemeral_device, anon_1.instances_default_swap_device AS anon_1_instances_default_swap_device, anon_1.instances_config_drive AS anon_1_instances_config_drive, anon_1.instances_access_ip_v4 AS anon_1_instances_access_ip_v4, anon_1.instances_access_ip_v6 AS anon_1_instances_access_ip_v6, anon_1.instances_auto_disk_config AS anon_1_instances_auto_disk_config, anon_1.instances_progress AS anon_1_instances_progress, anon_1.instances_shutdown_terminate AS anon_1_instances_shutdown_terminate, anon_1.instances_disable_terminate AS anon_1_instances_disable_terminate, anon_1.instances_cell_name AS anon_1_instances_cell_name, anon_1.instances_internal_id AS anon_1_instances_internal_id, anon_1.instances_cleaned AS anon_1_instances_cleaned, instance_info_caches_1.deleted_at AS instance_info_caches_1_deleted_at, instance_info_caches_1.deleted AS instance_info_caches_1_deleted, instance_info_caches_1.created_at AS instance_info_caches_1_created_at, instance_info_caches_1.updated_at AS instance_info_caches_1_updated_at, instance_info_caches_1.id AS instance_info_caches_1_id, instance_info_caches_1.network_info AS instance_info_caches_1_network_info, instance_info_caches_1.instance_uuid AS instance_info_caches_1_instance_uuid, security_groups_1.deleted_at AS security_groups_1_deleted_at, security_groups_1.deleted AS security_groups_1_deleted, security_groups_1.created_at AS security_groups_1_created_at, security_groups_1.updated_at AS security_groups_1_updated_at, security_groups_1.id AS security_groups_1_id, security_groups_1.name AS security_groups_1_name, security_groups_1.description AS security_groups_1_description, security_groups_1.user_id AS security_groups_1_user_id, security_groups_1.project_id AS security_groups_1_project_id
FROM (SELECT instances.deleted_at AS instances_deleted_at, instances.deleted AS instances_deleted, instances.created_at AS instances_created_at, instances.updated_at AS instances_updated_at, instances.id AS instances_id, instances.user_id AS instances_user_id, instances.project_id AS instances_project_id, instances.image_ref AS instances_image_ref, instances.kernel_id AS instances_kernel_id, instances.ramdisk_id AS instances_ramdisk_id, instances.hostname AS instances_hostname, instances.launch_index AS instances_launch_index, instances.key_name AS instances_key_name, instances.key_data AS instances_key_data, instances.power_state AS instances_power_state, instances.vm_state AS instances_vm_state, instances.task_state AS instances_task_state, instances.memory_mb AS instances_memory_mb, instances.vcpus AS instances_vcpus, instances.root_gb AS instances_root_gb, instances.ephemeral_gb AS instances_ephemeral_gb, instances.ephemeral_key_uuid AS instances_ephemeral_key_uuid, instances.host AS instances_host, instances.node AS instances_node, instances.instance_type_id AS instances_instance_type_id, instances.user_data AS instances_user_data, instances.reservation_id AS instances_reservation_id, instances.scheduled_at AS instances_scheduled_at, instances.launched_at AS instances_launched_at, instances.terminated_at AS instances_terminated_at, instances.availability_zone AS instances_availability_zone, instances.display_name AS instances_display_name, instances.display_description AS instances_display_description, instances.launched_on AS instances_launched_on, instances.locked AS instances_locked, instances.locked_by AS instances_locked_by, instances.os_type AS instances_os_type, instances.architecture AS instances_architecture, instances.vm_mode AS instances_vm_mode, instances.uuid AS instances_uuid, instances.root_device_name AS instances_root_device_name, instances.default_ephemeral_device AS instances_default_ephemeral_device, instances.default_swap_device AS instances_default_swap_device, instances.config_drive AS instances_config_drive, instances.access_ip_v4 AS instances_access_ip_v4, instances.access_ip_v6 AS instances_access_ip_v6, instances.auto_disk_config AS instances_auto_disk_config, instances.progress AS instances_progress, instances.shutdown_terminate AS instances_shutdown_terminate, instances.disable_terminate AS instances_disable_terminate, instances.cell_name AS instances_cell_name, instances.internal_id AS instances_internal_id, instances.cleaned AS instances_cleaned
FROM instances
WHERE instances.deleted = 0 AND (instances.vm_state != 'soft-delete' OR instances.vm_state IS NULL) ORDER BY instances.created_at DESC, instances.created_at DESC, instances.created_at DESC, instances.id DESC
 LIMIT 1000) AS anon_1 LEFT OUTER JOIN instance_info_caches AS instance_info_caches_1 ON instance_info_caches_1.instance_uuid = anon_1.instances_uuid LEFT OUTER JOIN (security_group_instance_association AS security_group_instance_association_1 INNER JOIN security_groups AS security_groups_1 ON security_groups_1.id = security_group_instance_association_1.security_group_id AND security_group_instance_association_1.deleted = 0 AND security_groups_1.deleted = 0) ON security_group_instance_association_1.instance_uuid = anon_1.instances_uuid AND anon_1.instances_deleted = 0 ORDER BY anon_1.instances_created_at DESC, anon_1.instances_created_at DESC, anon_1.instances_created_at DESC, anon_1.instances_id DESC

This is what the explain plan looks like pre-indexes:
Explain Before Indexes:
+----+-------------+---------------------------------------+--------+-------------------------------------------------------------------------+-------------------------------------------------------+---------+--------------------------------------------------------------+--------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------------------------+--------+-------------------------------------------------------------------------+-------------------------------------------------------+---------+--------------------------------------------------------------+--------+-----------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 994 | Using filesort |
| 1 | PRIMARY | instance_info_caches_1 | eq_ref | uniq_instance_info_caches0instance_uuid | uniq_instance_info_caches0instance_uuid | 110 | anon_1.instances_uuid | 1 | |
| 1 | PRIMARY | security_group_instance_association_1 | ref | security_group_id,security_group_instance_association_instance_uuid_idx | security_group_instance_association_instance_uuid_idx | 111 | anon_1.instances_uuid | 1 | |
| 1 | PRIMARY | security_groups_1 | eq_ref | PRIMARY | PRIMARY | 4 | nova.security_group_instance_association_1.security_group_id | 1 | |
| 2 | DERIVED | instances | ALL | NULL | NULL | NULL | NULL | 993123 | Using where; Using filesort |
+----+-------------+---------------------------------------+--------+-------------------------------------------------------------------------+-------------------------------------------------------+---------+--------------------------------------------------------------+--------+-----------------------------+
5 rows in set (10.61 sec)

You can see that it takes 10.61 second to return this.

After reviewing the full query, I saw this derived query that was causing most of the issues. This was the query:
SELECT instances.deleted_at AS instances_deleted_at, instances.deleted AS instances_deleted, instances.created_at AS instances_created_at, instances.updated_at AS instances_updated_at, instances.id AS instances_id, instances.user_id AS instances_user_id, instances.project_id AS instances_project_id, instances.image_ref AS instances_image_ref, instances.kernel_id AS instances_kernel_id, instances.ramdisk_id AS instances_ramdisk_id, instances.hostname AS instances_hostname, instances.launch_index AS instances_launch_index, instances.key_name AS instances_key_name, instances.key_data AS instances_key_data, instances.power_state AS instances_power_state, instances.vm_state AS instances_vm_state, instances.task_state AS instances_task_state, instances.memory_mb AS instances_memory_mb, instances.vcpus AS instances_vcpus, instances.root_gb AS instances_root_gb, instances.ephemeral_gb AS instances_ephemeral_gb, instances.ephemeral_key_uuid AS instances_ephemeral_key_uuid, instances.host AS instances_host, instances.node AS instances_node, instances.instance_type_id AS instances_instance_type_id, instances.user_data AS instances_user_data, instances.reservation_id AS instances_reservation_id, instances.scheduled_at AS instances_scheduled_at, instances.launched_at AS instances_launched_at, instances.terminated_at AS instances_terminated_at, instances.availability_zone AS instances_availability_zone, instances.display_name AS instances_display_name, instances.display_description AS instances_display_description, instances.launched_on AS instances_launched_on, instances.locked AS instances_locked, instances.locked_by AS instances_locked_by, instances.os_type AS instances_os_type, instances.architecture AS instances_architecture, instances.vm_mode AS instances_vm_mode, instances.uuid AS instances_uuid, instances.root_device_name AS instances_root_device_name, instances.default_ephemeral_device AS instances_default_ephemeral_device, instances.default_swap_device AS instances_default_swap_device, instances.config_drive AS instances_config_drive, instances.access_ip_v4 AS instances_access_ip_v4, instances.access_ip_v6 AS instances_access_ip_v6, instances.auto_disk_config AS instances_auto_disk_config, instances.progress AS instances_progress, instances.shutdown_terminate AS instances_shutdown_terminate, instances.disable_terminate AS instances_disable_terminate, instances.cell_name AS instances_cell_name, instances.internal_id AS instances_internal_id, instances.cleaned AS instances_cleaned
FROM instances
WHERE instances.deleted = 0 AND (instances.vm_state != 'soft-delete' OR instances.vm_state IS NULL) ORDER BY instances.created_at DESC, instances.created_at DESC, instances.created_at DESC, instances.id DESC
 LIMIT 1000)

This is what the explain plan looked like with this query:
+----+-------------+-----------+------+---------------+------+---------+------+--------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------+------+---------+------+--------+-----------------------------+
| 1 | SIMPLE | instances | ALL | NULL | NULL | NULL | NULL | 993126 | Using where; Using filesort |
+----+-------------+-----------+------+---------------+------+---------+------+--------+-----------------------------+
1 row in set (0.00 sec)

That's a lot of data to sort through.
I went ahead and added this index which is found in the above mentioned bug:
CREATE INDEX `instances_project_id_deleted_idx` ON `instances` (`project_id`, `deleted`);
and that didn't seem to solve the problem.
As I kept digging into that derived query some more, I found that it didn't have any index it could use because of the ORDER BY statement.
Adding this index to the table looks to have resolved the issue:
CREATE INDEX `instances_deleted_created_at_idx` ON `instances` (`deleted`, `created_at`);

That made the explain plan for the derived query look like this:
+----+-------------+-----------+------+----------------------------------+----------------------------------+---------+-------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+----------------------------------+----------------------------------+---------+-------+------+-----------------------------+
| 1 | SIMPLE | instances | ref | instances_deleted_created_at_idx | instances_deleted_created_at_idx | 5 | const | 1490 | Using where; Using filesort |
+----+-------------+-----------+------+----------------------------------+----------------------------------+---------+-------+------+-----------------------------+

Instead of 993,126 rows, it only grabbed 1490.

This made the entire large query explain plan look like this:
+----+-------------+---------------------------------------+--------+-------------------------------------------------------------------------+-------------------------------------------------------+---------+--------------------------------------------------------------+---------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------------------------+--------+-------------------------------------------------------------------------+-------------------------------------------------------+---------+--------------------------------------------------------------+---------+-----------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1000 | Using filesort |
| 1 | PRIMARY | instance_info_caches_1 | eq_ref | uniq_instance_info_caches0instance_uuid | uniq_instance_info_caches0instance_uuid | 110 | anon_1.instances_uuid | 1 | |
| 1 | PRIMARY | security_group_instance_association_1 | ref | security_group_id,security_group_instance_association_instance_uuid_idx | security_group_instance_association_instance_uuid_idx | 111 | anon_1.instances_uuid | 1 | |
| 1 | PRIMARY | security_groups_1 | eq_ref | PRIMARY | PRIMARY | 4 | nova.security_group_instance_association_1.security_group_id | 1 | |
| 2 | DERIVED | instances | ALL | instances_deleted_created_at_idx | instances_deleted_created_at_idx | 5 | | 1328990 | Using where; Using filesort |
+----+-------------+---------------------------------------+--------+-------------------------------------------------------------------------+-------------------------------------------------------+---------+--------------------------------------------------------------+---------+-----------------------------+
5 rows in set (0.28 sec)

Bringing down the query time significantly and now I have no more slow queries at all.

tl;dr Very large derived query that was giving me 10+ second return times in nova was reduced down to under 1 second with adding an index to the instances table.

If someone could direct me to how to get this added, I'm happy to do all the work, I just don't know what the process is for an index.

Tags: db ops
Revision history for this message
Matt Riedemann (mriedem) wrote :

Nice, this will be awesome if there aren't any side effects.

Basically like the other bug, you add an index to the schema with a database migration script, see:

https://review.openstack.org/#/c/126603/11/nova/db/sqlalchemy/migrate_repo/versions/274_update_instances_project_id_index.py

We don't support 'downgrade' anymore so you'd omit that part.

But it's pretty simple, just get the table, check if the index already exists (so the migration is idempotent), and then create the index. Add a test to make sure the index exists and update the data model.

Changed in nova:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

Feel free to ping zzzeek or jaypipes in #openstack-nova on freenode IRC about this too, they are the resident mysql experts in nova.

Revision history for this message
Andrew Laski (alaski) wrote :

You would add a database migration in order to add the index, and update the sqlalchemy model within Nova. You can find example migrations in nova/db/sqlalchemy/migrate_repo/versions/ and there's likely one there that adds an index. The Instance model in nova/db/sqlalchemy/models.py would also need to be updated.

Revision history for this message
Joseph bajin (josephbajin) wrote :

Totally makes sense. Thanks for the help.. I'll get something out here in the next day and hit up the guys on IRC to take a look and get their opinions on it.

Changed in nova:
assignee: nobody → Joseph bajin (josephbajin)
Changed in nova:
assignee: Joseph bajin (josephbajin) → Augustina Ragwitz (auggy)
assignee: Augustina Ragwitz (auggy) → nobody
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/288765

Changed in nova:
status: Triaged → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

So, I think the order_by is happening way down in the oslo.db code here:

https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/utils.py#L149

But that's when passing in a sort key on the instance list API, so were you listing and sorting by 'created_at'? Because if that's the case, then the order by is just based on what you passed in, which could change, and then we still have the performance problem on whatever the other random sort key is.

Revision history for this message
Matt Riedemann (mriedem) wrote :

alaski pointed out that the default sort keys are created_at and id, here:

http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n2463

Revision history for this message
Joseph bajin (josephbajin) wrote :

This is nova code that is generating these queries not a user. I know it's a user because without the fixes, we had over 4000 slow queries during a 12 hour or so period. The only thing that would be doing that is our CI/CD teams and test instances that we are running 24x7.

So, yes maybe we would have problems with users doing those queries, but those would be far and few unlike the always happening ones we are seeing. It's probably part of our steps that we are doing to get the current vms running, so it's probably a nova list type of action.

Matt Riedemann (mriedem)
Changed in nova:
importance: Medium → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/288765
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9bc935d7d059c5ad9ff563b373691c3134e8f3ac
Submitter: Jenkins
Branch: master

commit 9bc935d7d059c5ad9ff563b373691c3134e8f3ac
Author: Matt Riedemann <email address hidden>
Date: Fri Mar 4 17:07:41 2016 -0500

    Add index on instances table across deleted/created_at columns

    The process_sort_params method, used by instance_get_all_by_filters_sort,
    defaults to ordering by the created_at and id columns.

    Adding an index on the deleted and created_at columns significantly
    reduces the time it takes to run the query on the instances table.

    From the bug report:

    Instead of 993,126 rows, it only grabbed 1490.

    And drops from 10.61 seconds to 0.28 seconds on the data set in
    the bug report.

    The index was provided by Joseph bajin <email address hidden>
    in the bug report.

    Change-Id: I25ec8a0b8124d77926d938dc911464cc8b9d2945
    Closes-Bug: #1553184

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/nova 13.0.0.0rc1

This issue was fixed in the openstack/nova 13.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.