OpenStack Compute (nova)

nova list --all-tenants fetches all instance faults but uses only latest

Bug #1632247 reported by Tina Kevin on 2016-10-11

This bug affects 4 people

Affects		Status	Importance	Assigned to	Milestone
	OpenStack Compute (nova)	Fix Released	Medium	Roman Podoliaka

Bug Description

Description
===========
There are 15 instances. I excute the command: nova --debug list --all-tenants,
But it took more than 40 seconds. I read the nova api code, it sends a Get request
and read the instance_faults table for detail information. The instance_faults table has
more than tens of thousands of records. Each instance has many records.

GET /v2/433288e1244046a9bd306658b732dded/servers/detail

I think instance_faults table needs to be optimized. A large number of records in the instance_faults
table are useless, only leaving the last three records should be on it, others can be deleted.

There are any other optimization program?

Steps to reproduce
==================
A chronological list of steps which will bring off the
issue you noticed:
* I excute the command: nova --debug list --all-tenants

A list of openstack client commands (with correct argument value)
$ nova --debug list --all-tenants

Expected result
===============
I expect to be back soon within 10 seconds

Actual result
=============
But the query took more than 40 seconds.

Environment
===========
1. version
Mitaka

2. Which hypervisor did you use?
Libvirt + KVM

See original description

Tags:

Tina Kevin (song-ruixia) on 2016-10-11

tags:

added: all-tenants list slow

Tina Kevin (song-ruixia) on 2016-10-29

Changed in nova:
importance:	Undecided → Wishlist

Sivasathurappan Radhakrishnan (siva-radhakrishnan) on 2016-11-02

Changed in nova:
assignee:	nobody → Sivasathurappan Radhakrishnan (siva-radhakrishnan)

Revision history for this message

Sivasathurappan Radhakrishnan (siva-radhakrishnan) wrote on 2016-11-08:

Hi Tina Kevin!
I tried reproducing the issue but I wasn't able to do so. I just dug into this issue in more detail and below are my findings.

(a) In instance_faults table we are soft deleting the records by setting the column to numeric value which can be later archived.

(b) Even if you have thousand of records in the table, the query selects the records based only deleted flag set

(c) instance_faults has column 'detail' which is of mediumtext format which can possibly slown down the query. Since we just put stack trace in detail column so I assume innodb wouldn't use external blob storage page to store it.

Can you please let me know how did you confirm if this method was the reason for the query being too slow ?

Thanks,
Siva

Revision history for this message

Tina Kevin (song-ruixia) wrote on 2016-11-10:

@siva
My steps:
I add logs in code, and then gradually narrowing the scope of the code,
the final positioning to the follow function.

    /nova/api/openstack/compute/servers.py
    def _get_servers(self, req, is_detail):
        """Returns a list of servers, based on any search options specified."""
        ……

        if is_detail:
             instance_list.fill_faults() #the function
             response = self._view_builder.detail(req, instance_list)

    /nova/objects/instance.py
    def fill_faults(self):
        """Batch query the database for our instances' faults.

        :returns: A list of instance uuids for which faults were found.
        """
        uuids = [inst.uuid for inst in self]
        LOG.info(_("before"))
        faults = objects.InstanceFaultList.get_by_instance_uuids(
            self._context, uuids) #the function
        LOG.info(_("end"))

I compare the printing time of the two logs, they differ by about 30 seconds.
I also test the other case, I remove this line of code and then query,
the total time is much less obvious than before. So I think it is caused by the
instance_faults table records

   def _get_servers(self, req, is_detail):
        """Returns a list of servers, based on any search options specified."""
        ……

        if is_detail:
             #instance_list.fill_faults() #remove
             response = self._view_builder.detail(req, instance_list)

Revision history for this message

Sivasathurappan Radhakrishnan (siva-radhakrishnan) wrote on 2016-11-21:

Hi Tina!
I was not able to reproduce the bug. Tried spinning up 15 instances as suggested by you. Need to wait for someone to confirm it.

Regards,
Siva.

Sean Dague (sdague) on 2016-12-09

tags:

added: db
removed: all-tenants list slow

Sean Dague (sdague) on 2016-12-09

Changed in nova:
status:	New → Confirmed
importance:	Wishlist → Medium
summary:	- The command "nova list --all-tenants" query is slow + We fetch all instance faults, then throw away all but the last ones on + "nova list --all-tenants", potentinally making the command slow

Revision history for this message

Sean Dague (sdague) wrote on 2016-12-09: Re: We fetch all instance faults, then throw away all but the last ones on "nova list --all-tenants", potentinally making the command slow

In looking at the code, we definitely pull over *all faults* sorted, then pretty much throw out all but the last ones in almost every situation.

I expect what we need to do is create an optimized query that only returns the latest faults for each uuid. That would dramatically eliminate content on the wire for long lived instances that might have seen issues over time. That needs more clever SQL than my brain can muster right now. Will engage Jay Pipes in that.

Jay Pipes (jaypipes) on 2016-12-12

summary:	- We fetch all instance faults, then throw away all but the last ones on - "nova list --all-tenants", potentinally making the command slow + nova list --all-tenants fetches all instance faults but uses only latest
Changed in nova:
assignee:	Sivasathurappan Radhakrishnan (siva-radhakrishnan) → Jay Pipes (jaypipes)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-12-12: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/409943

Changed in nova:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) on 2016-12-13

Changed in nova:
assignee:	Jay Pipes (jaypipes) → Roman Podoliaka (rpodolyaka)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-12-29: Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/415736

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-01-02: Related fix merged to nova (master)

Reviewed: https://review.openstack.org/415736
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dc911b8026a5cf949ca5a697bb802c495dc9ada2
Submitter: Jenkins
Branch: master

commit dc911b8026a5cf949ca5a697bb802c495dc9ada2
Author: Matt Riedemann <email address hidden>
Date: Thu Dec 29 12:17:10 2016 -0500

Fix assertion in test_instance_fault_get_by_instance

    The instance_fault_get_by_instance_uuids DB API method returns
    results sorted by created_at in descending order, so the newest
    fault should be at the front of the list for a given instance
    uuid in the resulting dict.

    However, the way the test for this was written the 'expected'
    list actually had the faults in order from oldest to newest. The
    reason this didn't fail the test was because it was using
    _assertEqualListsOfObjects to compare the results, which sorts
    the lists before comparing them, which is exactly NOT what we want
    when we care about the sort order of the results.

    The test is fixed by using _assertEqualOrderedListOfObjects which
    leaves the original sort order of the lists to compare and fixes
    the 'expected' list by inserting the newest faults at the front
    of the list.

Change-Id: I58990194016447b05b42df76141a193595cdcb9c
Related-Bug: #1632247

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-01-04: Fix merged to nova (master)

Reviewed: https://review.openstack.org/409943
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=176c5c8a65efbde01020bc69a97bb7d05720589e
Submitter: Jenkins
Branch: master

commit 176c5c8a65efbde01020bc69a97bb7d05720589e
Author: Jay Pipes <email address hidden>
Date: Mon Dec 12 16:39:23 2016 -0500

Only return latest instance fault for instances

    This patch addresses slowness that can occur when doing a list servers
    API operation when there are many thousands of records in the
    instance_faults table.

    Previously, in the Instance.fill_faults() method, we were getting all
    instance fault records for a set of instances having one of a set of
    supplied instance UUIDs and then iterating over those faults and
    returning a dict of instance UUID to the first fault returned (which
    happened to be the latest fault because of ordering the SQL query by
    created_at).

    This patch adds a new InstanceFaultList.get_latest_by_instance_uuids()
    method that does some SQL-fu to only return the latest fault records for
    each instance being inspected.

Closes-Bug: #1632247

Co-Authored-By: Roman Podoliaka <email address hidden>
Change-Id: I8f2227b3969791ebb2d04d74a316b9d97a4b1571