AggregateInstanceExtraSpecs very slow

Bug #1133495 reported by Cor Cornelisse
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Joe Gordon

Bug Description

This bug report is here to be able to link the patch (to be commited) to the bug reported in the mailinglist by Sam Morrison

I quote:

---------

I have been playing with the AggregateInstanceExtraSpecs filter and can't get it to work.

In our staging environment it works fine with 4 compute nodes, I have 2 aggregates to split them into 2.

When I try to do the same in our production environment which has 80 compute nodes (splitting them again into 2 aggregates) it doesn't work.

nova-scheduler starts to go very slow, I scheduled an instance and gave up after 5 minutes, it seemed to be taking ages and the host was at 100% cpu. Also got about 500 messages in rabbit that were unacknowledged.

We are running stable/folsom. Does anyone else have this issue or know if there have been any fixes in Grizzly relating to this? I couldn't see any bugs about it.

Thanks,
Sam

---------

A brief discussion followed, indicating this appears to be a bug indeed, not a missing index. We're experiencing and seeing the same behavior, and I think I've found the root cause.

From what I can see, there's a many-To-many relation defined in nova/db/sqlalchemy/models.py between Aggregate class and both AggregateHost and AggregateMetadata classes. This should be a one to many relation if I'm not mistaken, so I've created a patch to reflect this.

Revision history for this message
Vish Ishaya (vishvananda) wrote :

Excellent, please post the patch here or on gerrit.

Changed in nova:
status: New → Triaged
importance: Undecided → High
tags: added: folsom-backport-potential
Revision history for this message
Chris Behrens (cbehrens) wrote :

Here's a patch that helps the specific query being used:

http://paste.openstack.org/show/32534/

Revision history for this message
Sam Morrison (sorrison) wrote :
Download full text (8.4 KiB)

I've tested this patch in our test env. (4 hosts with one aggregate assigned to each host)

The original query is:

SELECT aggregates.created_at AS aggregates_created_at, aggregates.updated_at AS aggregates_updated_at, aggregates.deleted_at AS aggregates_deleted_at, aggregates.deleted AS aggregates_deleted, aggregates.id AS aggregates_id, aggregates.name AS aggregates_name, aggregates.availability_zone AS aggregates_availability_zone, aggregate_hosts_1.created_at AS aggregate_hosts_1_created_at, aggregate_hosts_1.updated_at AS aggregate_hosts_1_updated_at, aggregate_hosts_1.deleted_at AS aggregate_hosts_1_deleted_at, aggregate_hosts_1.deleted AS aggregate_hosts_1_deleted, aggregate_hosts_1.id AS aggregate_hosts_1_id, aggregate_hosts_1.host AS aggregate_hosts_1_host, aggregate_hosts_1.aggregate_id AS aggregate_hosts_1_aggregate_id FROM aggregates INNER JOIN aggregate_hosts AS aggregate_hosts_2 ON aggregates.id = aggregate_hosts_2.aggregate_id AND aggregate_hosts_2.deleted = 0 AND aggregates.deleted = 0 INNER JOIN aggregate_hosts ON aggregate_hosts.aggregate_id = aggregates.id AND aggregate_hosts.deleted = 0 AND aggregates.deleted = 0 INNER JOIN aggregate_metadata AS aggregate_metadata_1 ON aggregates.id = aggregate_metadata_1.aggregate_id AND aggregate_metadata_1.deleted = 0 AND aggregates.deleted = 0 INNER JOIN aggregate_metadata ON aggregate_metadata.aggregate_id = aggregates.id AND aggregate_metadata.deleted = 0 AND aggregates.deleted = 0 LEFT OUTER JOIN aggregate_hosts AS aggregate_hosts_3 ON aggregates.id = aggregate_hosts_3.aggregate_id AND aggregate_hosts_3.deleted = 0 AND aggregates.deleted = 0 LEFT OUTER JOIN aggregate_hosts AS aggregate_hosts_1 ON aggregate_hosts_1.aggregate_id = aggregates.id AND aggregate_hosts_1.deleted = 0 AND aggregates.deleted = 0 WHERE aggregates.deleted = 0 AND aggregate_hosts.host = 'cc2';

This returns 64 rows.

With the patch applied I get this SQL which returns 4 rows:

SELECT aggregates.created_at AS aggregates_created_at, aggregates.updated_at AS aggregates_updated_at, aggregates.deleted_at AS aggregates_deleted_at, aggregates.deleted AS aggregates_deleted, aggregates.id AS aggregates_id, aggregates.name AS aggregates_name, aggregates.availability_zone AS aggregates_availability_zone, aggregate_hosts_1.created_at AS aggregate_hosts_1_created_at, aggregate_hosts_1.updated_at AS aggregate_hosts_1_updated_at, aggregate_hosts_1.deleted_at AS aggregate_hosts_1_deleted_at, aggregate_hosts_1.deleted AS aggregate_hosts_1_deleted, aggregate_hosts_1.id AS aggregate_hosts_1_id, aggregate_hosts_1.host AS aggregate_hosts_1_host, aggregate_hosts_1.aggregate_id AS aggregate_hosts_1_aggregate_id, aggregate_metadata_1.created_at AS aggregate_metadata_1_created_at, aggregate_metadata_1.updated_at AS aggregate_metadata_1_updated_at, aggregate_metadata_1.deleted_at AS aggregate_metadata_1_deleted_at, aggregate_metadata_1.deleted AS aggregate_metadata_1_deleted, aggregate_metadata_1.id AS aggregate_metadata_1_id, aggregate_metadata_1.`key` AS aggregate_metadata_1_key, aggregate_metadata_1.value AS aggregate_metadata_1_value, aggregate_metadata_1.aggregate_id AS aggregate_metadata_1_aggregate_id FROM aggregate_hosts, aggre...

Read more...

Revision history for this message
Sam Morrison (sorrison) wrote :

This patch is currently breaking the following tests too so may have some bad side affects

======================================================================
FAIL: test_aggregate_filter_fails_extra_specs_deleted_host (nova.tests.scheduler.test_host_filters.HostFiltersTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/buildd/nova-2012.2.2+20130129/nova/tests/scheduler/test_host_filters.py", line 798, in test_aggregate_filter_fails_extra_specs_deleted_host
    self.assertFalse(filt_cls.host_passes(host, filter_properties))
AssertionError: True is not false
    'True is not false' = self._formatMessage('True is not false', "%s is not false" % safe_repr(True))
>> raise self.failureException('True is not false')

======================================================================
FAIL: Ensure we can get aggregates by host.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/buildd/nova-2012.2.2+20130129/nova/tests/test_db_api.py", line 688, in test_aggregate_metdata_get_by_host
    self.assertFalse('badkey' in r1)
AssertionError: True is not false
    'True is not false' = self._formatMessage('True is not false', "%s is not false" % safe_repr(True))
>> raise self.failureException('True is not false')

======================================================================
FAIL: Ensure we can get aggregates by host.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/buildd/nova-2012.2.2+20130129/nova/tests/test_db_api.py", line 704, in test_aggregate_metdata_get_by_host_with_key
    self.assertFalse('fake_key1' in r1)
AssertionError: True is not false
    'True is not false' = self._formatMessage('True is not false', "%s is not false" % safe_repr(True))
>> raise self.failureException('True is not false')

----------------------------------------------------------------------

Revision history for this message
Joe Gordon (jogo) wrote :

Sam what happens in your environment if you try:

diff --git a/nova/db/sqlalchemy/models.py b/nova/db/sqlalchemy/models.py
index 95456bf..49cf23e 100644
--- a/nova/db/sqlalchemy/models.py
+++ b/nova/db/sqlalchemy/models.py
@@ -842,28 +842,18 @@ class Aggregate(BASE, NovaBase):
     name = Column(String(255))
     _hosts = relationship(AggregateHost,
                           lazy="joined",
- secondary="aggregate_hosts",
                           primaryjoin='and_('
                                  'Aggregate.id == AggregateHost.aggregate_id,'
                                  'AggregateHost.deleted == 0,'
                                  'Aggregate.deleted == 0)',
- secondaryjoin='and_('
- 'AggregateHost.aggregate_id == Aggregate.id, '
- 'AggregateHost.deleted == 0,'
- 'Aggregate.deleted == 0)',
                          backref='aggregates')

     _metadata = relationship(AggregateMetadata,
- secondary="aggregate_metadata",
                          primaryjoin='and_('
                              'Aggregate.id == AggregateMetadata.aggregate_id,'
                              'AggregateMetadata.deleted == 0,'
                              'Aggregate.deleted == 0)',
- secondaryjoin='and_('
- 'AggregateMetadata.aggregate_id == Aggregate.id, '
- 'AggregateMetadata.deleted == 0,'
- 'Aggregate.deleted == 0)',
- backref='aggregates')
+ backref='aggregates')

     def _extra_keys(self):
         return ['hosts', 'metadetails', 'availability_zone']

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/23115

Changed in nova:
assignee: nobody → Joe Gordon (jogo)
status: Triaged → In Progress
Revision history for this message
Sam Morrison (sorrison) wrote :

Hi Joe,

Funny, this is exactly the same patch I went for in the end too. It reduces the sql queries down a lot but it still isn't 100% correct.

The number of rows will be the number of hosts that are in the same aggregate as the host you choose. Not ideal but still better than the 300K+ rows being returned originally.

Would be nice to get it to return the correct result. It can produce the correct sql when you remove the primaryjoins too but it breaks a lot of tests and I don't know enough about sqlalchemy to figure out what's going on.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/23164

Changed in nova:
assignee: Joe Gordon (jogo) → Cor Cornelisse (corcornelisse)
Revision history for this message
Cor Cornelisse (corcornelisse) wrote : Re: [Bug 1133495] Re: AggregateInstanceExtraSpecs very slow

Whoops, just noticed a fix was already committed here:

https://review.openstack.org/23115/

Please abandon:

https://review.openstack.org/23164

Sorry...

On Thu, Feb 28, 2013 at 12:15 PM, OpenStack Hudson <
<email address hidden>> wrote:

> Fix proposed to branch: master
> Review: https://review.openstack.org/23164
>
> ** Changed in: nova
> Assignee: Joe Gordon (jogo) => Cor Cornelisse (corcornelisse)
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1133495
>
> Title:
> AggregateInstanceExtraSpecs very slow
>
> Status in OpenStack Compute (Nova):
> In Progress
>
> Bug description:
> This bug report is here to be able to link the patch (to be commited)
> to the bug reported in the mailinglist by Sam Morrison
>
> I quote:
>
> ---------
>
> I have been playing with the AggregateInstanceExtraSpecs filter and
> can't get it to work.
>
> In our staging environment it works fine with 4 compute nodes, I have
> 2 aggregates to split them into 2.
>
> When I try to do the same in our production environment which has 80
> compute nodes (splitting them again into 2 aggregates) it doesn't
> work.
>
> nova-scheduler starts to go very slow, I scheduled an instance and
> gave up after 5 minutes, it seemed to be taking ages and the host was
> at 100% cpu. Also got about 500 messages in rabbit that were
> unacknowledged.
>
> We are running stable/folsom. Does anyone else have this issue or know
> if there have been any fixes in Grizzly relating to this? I couldn't
> see any bugs about it.
>
> Thanks,
> Sam
>
> ---------
>
> A brief discussion followed, indicating this appears to be a bug
> indeed, not a missing index. We're experiencing and seeing the same
> behavior, and I think I've found the root cause.
>
> From what I can see, there's a many-To-many relation defined in
> nova/db/sqlalchemy/models.py between Aggregate class and both
> AggregateHost and AggregateMetadata classes. This should be a one to
> many relation if I'm not mistaken, so I've created a patch to reflect
> this.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/nova/+bug/1133495/+subscriptions
>

--
A lie told often enough becomes the truth.

Lenin (1870 - 1924)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/23115
Committed: http://github.com/openstack/nova/commit/ce23d6c1d679355bafede816bca39ea1eb24073c
Submitter: Jenkins
Branch: master

commit ce23d6c1d679355bafede816bca39ea1eb24073c
Author: Joe Gordon <email address hidden>
Date: Wed Feb 27 20:59:08 2013 +0000

    Shrink size of aggregate_metadata_get_by_host sql query

    * Remove unnecessary secondary joins in Aggregate model.
    * Remove unnecessary backrefs (only needed to create bidirectional relationship)

    aggregate_metadata_get_by_host goes from 6 JOINS down to two JOINS.

    Fix bug 1133495

    Change-Id: I72966fa205973ec638816b07bfdcd54f1102feb5

Changed in nova:
status: In Progress → Fix Committed
Joe Gordon (jogo)
Changed in nova:
assignee: Cor Cornelisse (corcornelisse) → Joe Gordon (jogo)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/folsom)

Fix proposed to branch: stable/folsom
Review: https://review.openstack.org/23207

Thierry Carrez (ttx)
Changed in nova:
milestone: none → grizzly-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: grizzly-rc1 → 2013.1
Revision history for this message
Alan Pevec (apevec) wrote :

stable/folsom review was abandoned

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/folsom)

Reviewed: https://review.openstack.org/23207
Committed: http://github.com/openstack/nova/commit/e11d90af9c30d9af875974520cef207a839a0e21
Submitter: Jenkins
Branch: stable/folsom

commit e11d90af9c30d9af875974520cef207a839a0e21
Author: Joe Gordon <email address hidden>
Date: Wed Apr 3 20:25:53 2013 +0000

    Shrink size of aggregate_metadata_get_by_host sql query

    * Remove unnecessary secondary joins in Aggregate model.
    * Remove unnecessary backrefs (only needed to create bidirectional relationship)

    aggregate_metadata_get_by_host goes from 6 JOINS down to two JOINS.

    Fix bug 1133495

    Change-Id: I72966fa205973ec638816b07bfdcd54f1102feb5
    (cherry picked from commit ce23d6c1d679355bafede816bca39ea1eb24073c)

Sean Dague (sdague)
no longer affects: nova/folsom
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers