scheduler affinity doesn't work with multiple cells

Bug #1746863 reported by melanie witt
This bug affects 3 people
Affects                   Status         Importance  Assigned to   Milestone
OpenStack Compute (nova)  Fix Released   High        melanie witt
Pike                      Fix Committed  High        melanie witt
Queens                    Fix Committed  High        melanie witt
Rocky                     Fix Committed  Undecided   melanie witt

Bug Description

I happened upon this while hacking on my WIP CellDatabases fixture patch.

Some of the nova/tests/functional/test_server_group.py tests started failing with multiple cells, and I found that it's because the affinity check runs a database query, 'objects.InstanceList.get_by_filters', for all instances that are members of the server group. The query for instances doesn't check all cells, so it fails to return any hosts that group members are currently on.

This makes the ServerGroup[Anti|]AffinityFilter a no-op for multiple cells. Affinity is checked again via the late-affinity check in compute, but compute uses the same InstanceGroup.get_hosts method and will only find group members' hosts that are in its own cell.

This is the code that populates the RequestSpec.instance_group.hosts via a
lazy-load on first access:

nova/objects/instance_group.py:

    def obj_load_attr(self, attrname):
        ...
        self.hosts = self.get_hosts()
        self.obj_reset_changes(['hosts'])

    ...

    @base.remotable
    def get_hosts(self, exclude=None):
        """Get a list of hosts for non-deleted instances in the group
        This method allows you to get a list of the hosts where instances in
        this group are currently running. There's also an option to exclude
        certain instance UUIDs from this calculation.
        """
        filter_uuids = self.members
        if exclude:
            filter_uuids = set(filter_uuids) - set(exclude)
        filters = {'uuid': filter_uuids, 'deleted': False}
        instances = objects.InstanceList.get_by_filters(self._context,
                                                        filters=filters)
        return list(set([instance.host for instance in instances
                         if instance.host]))
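
The failure mode described above can be illustrated with a small toy model. This is only a sketch, not nova code: the per-cell tables, the `Instance` dataclass, and the `get_by_filters`/`get_hosts` helpers below are hypothetical stand-ins, and the model assumes that an untargeted query reaches a database holding no instance rows.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Instance:
    uuid: str
    host: Optional[str]


# Hypothetical per-cell instance tables; names are illustrative only.
CELL_DBS: Dict[str, List[Instance]] = {
    "cell1": [Instance("uuid-a", "host1")],
    "cell2": [Instance("uuid-b", "host2"), Instance("uuid-c", None)],
}


def get_by_filters(cell: Optional[str], uuids: Iterable[str]) -> List[Instance]:
    # Assumption of this model: an untargeted query (cell is None) reaches
    # a database that holds no instance rows.
    rows = CELL_DBS.get(cell, []) if cell is not None else []
    wanted = set(uuids)
    return [inst for inst in rows if inst.uuid in wanted]


def get_hosts(cell: Optional[str], members: Iterable[str],
              exclude: Optional[Iterable[str]] = None) -> List[str]:
    # Same shape as the method quoted above: filter members, query, and
    # return the de-duplicated hosts of instances that have a host set.
    filter_uuids = set(members) - set(exclude or ())
    instances = get_by_filters(cell, filter_uuids)
    return sorted({inst.host for inst in instances if inst.host})


members = ["uuid-a", "uuid-b", "uuid-c"]
untargeted_hosts = get_hosts(None, members)   # no cell targeted
cell1_hosts = get_hosts("cell1", members)     # only sees cell1 members
```

In this model the untargeted call finds no hosts at all, and a cell-targeted call sees only the members that live in that one cell, which mirrors the two symptoms in the description.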

melanie witt (melwitt)
description: updated
melanie witt (melwitt)
tags: removed: performance
summary: - database query via lazy-load in ServerGroup(Anti|)AffinityFilter
+ scheduler affinity doesn't work with multiple cells
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/540258

Changed in nova:
assignee: nobody → melanie witt (melwitt)
status: New → In Progress
Revision history for this message
melanie witt (melwitt) wrote :

I was wrong about any lazy-load of instance_group.hosts happening in a scheduler filter. It was occurring in nova/scheduler/utils.py, in the setup_instance_group method, which is called before select_destinations in conductor.

Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → High
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

It seems [anti-]affinity doesn't work at all with cells_v2. https://bugs.launchpad.net/nova/+bug/1746863

Changed in nova:
status: In Progress → Won't Fix
status: Won't Fix → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/585073

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/585073
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4f2c4993b1748d679bce8de6ffc672490ef3fa49
Submitter: Zuul
Branch: master

commit 4f2c4993b1748d679bce8de6ffc672490ef3fa49
Author: melanie witt <email address hidden>
Date: Tue Jul 24 02:28:47 2018 +0000

    Add functional test for affinity with multiple cells

    Related-Bug: #1746863

    Change-Id: I418bbe5c38faf3e460b0d62219507ed64e156682

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/540258
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=14f4c502f92b10b669044e5069ac3b3555a42ee0
Submitter: Zuul
Branch: master

commit 14f4c502f92b10b669044e5069ac3b3555a42ee0
Author: melanie witt <email address hidden>
Date: Fri Feb 2 05:41:20 2018 +0000

    Make scheduler.utils.setup_instance_group query all cells

    To check affinity and anti-affinity policies for scheduling instances,
    we use the RequestSpec.instance_group.hosts field to check the hosts
    that have group members on them. Access of the 'hosts' field calls
    InstanceGroup.get_hosts during a lazy-load and get_hosts does a query
    for all instances that are members of the group and returns their hosts
    after removing duplicates. The InstanceList query isn't targeting any
    cells, so it will return [] in a multi-cell environment in both the
    instance create case and the instance move case. In the move case, we
    do have a cell-targeted RequestContext when setup_instance_group is
    called *but* the RequestSpec.instance_group object is queried early in
    compute/api before we're targeted to a cell, so a call of
    RequestSpec.instance_group.get_hosts() will result in [] still, even
    for move operations.

    This makes setup_instance_group query all cells for instances that are
    members of the instance group if the RequestContext is untargeted, else
    it queries the targeted cell for the instances.

    Closes-Bug: #1746863

    Change-Id: Ia5f5a0d75953b1154a8de3e1eaa15f8042e32d77

Changed in nova:
status: In Progress → Fix Released
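
The behavior described in the commit message (query every cell when the context is untargeted, otherwise query only the targeted cell) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the merged patch: `CELL_DBS`, `Instance`, and `get_group_hosts` are hypothetical names, and the cells are plain in-memory lists rather than real cell databases.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Instance:
    uuid: str
    host: Optional[str]


# Hypothetical per-cell instance tables; names are illustrative only.
CELL_DBS: Dict[str, List[Instance]] = {
    "cell1": [Instance("uuid-a", "host1")],
    "cell2": [Instance("uuid-b", "host2"), Instance("uuid-c", None)],
}


def get_group_hosts(cell: Optional[str], members: Iterable[str],
                    exclude: Optional[Iterable[str]] = None) -> List[str]:
    filter_uuids = set(members) - set(exclude or ())
    # Untargeted context: visit every cell. Targeted context: that cell only.
    cells = [cell] if cell is not None else sorted(CELL_DBS)
    hosts = set()
    for name in cells:
        for inst in CELL_DBS.get(name, []):
            if inst.uuid in filter_uuids and inst.host:
                hosts.add(inst.host)
    return sorted(hosts)


members = ["uuid-a", "uuid-b", "uuid-c"]
all_cell_hosts = get_group_hosts(None, members)
cell2_hosts = get_group_hosts("cell2", members)
```

With the untargeted call now gathering hosts from both cells, an affinity or anti-affinity check in this model has the full list of hosts that group members occupy.
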
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/599731

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/599732

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/599765

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/599766

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/599840

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/599841

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/599731
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9dbe8cd5d31c8f63765d239392b0c0f38ee674d1
Submitter: Zuul
Branch: stable/rocky

commit 9dbe8cd5d31c8f63765d239392b0c0f38ee674d1
Author: melanie witt <email address hidden>
Date: Tue Jul 24 02:28:47 2018 +0000

    Add functional test for affinity with multiple cells

    Related-Bug: #1746863

    Change-Id: I418bbe5c38faf3e460b0d62219507ed64e156682
    (cherry picked from commit 4f2c4993b1748d679bce8de6ffc672490ef3fa49)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/599732
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3051a0d5d5639b451d6da5e9f490332399b18c54
Submitter: Zuul
Branch: stable/rocky

commit 3051a0d5d5639b451d6da5e9f490332399b18c54
Author: melanie witt <email address hidden>
Date: Fri Feb 2 05:41:20 2018 +0000

    Make scheduler.utils.setup_instance_group query all cells

    To check affinity and anti-affinity policies for scheduling instances,
    we use the RequestSpec.instance_group.hosts field to check the hosts
    that have group members on them. Access of the 'hosts' field calls
    InstanceGroup.get_hosts during a lazy-load and get_hosts does a query
    for all instances that are members of the group and returns their hosts
    after removing duplicates. The InstanceList query isn't targeting any
    cells, so it will return [] in a multi-cell environment in both the
    instance create case and the instance move case. In the move case, we
    do have a cell-targeted RequestContext when setup_instance_group is
    called *but* the RequestSpec.instance_group object is queried early in
    compute/api before we're targeted to a cell, so a call of
    RequestSpec.instance_group.get_hosts() will result in [] still, even
    for move operations.

    This makes setup_instance_group query all cells for instances that are
    members of the instance group if the RequestContext is untargeted, else
    it queries the targeted cell for the instances.

    Closes-Bug: #1746863

    Change-Id: Ia5f5a0d75953b1154a8de3e1eaa15f8042e32d77
    (cherry picked from commit 14f4c502f92b10b669044e5069ac3b3555a42ee0)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/599765
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c35f29001a2ce036b7281a125cb5077e0dddc605
Submitter: Zuul
Branch: stable/queens

commit c35f29001a2ce036b7281a125cb5077e0dddc605
Author: melanie witt <email address hidden>
Date: Tue Jul 24 02:28:47 2018 +0000

    Add functional test for affinity with multiple cells

    The 'az' keyword arg needed to be added to the
    _build_minimal_create_server_request method in the InstanceHelperMixin
    because we don't have 96f10711667603e7fbad57b151c6438cdd9ae270 in
    Queens.

    A fake_network.set_stub_network_methods(self) call needed to be added
    to the test because we don't have
    acd3216a8beac263aa87d785c21e816d29b1b6bb in Queens.

    Related-Bug: #1746863

    Change-Id: I418bbe5c38faf3e460b0d62219507ed64e156682
    (cherry picked from commit 4f2c4993b1748d679bce8de6ffc672490ef3fa49)
    (cherry picked from commit 9dbe8cd5d31c8f63765d239392b0c0f38ee674d1)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/599766
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ab939372cd0488a0a4e2a8a1764a42eff2e329c5
Submitter: Zuul
Branch: stable/queens

commit ab939372cd0488a0a4e2a8a1764a42eff2e329c5
Author: melanie witt <email address hidden>
Date: Fri Feb 2 05:41:20 2018 +0000

    Make scheduler.utils.setup_instance_group query all cells

    To check affinity and anti-affinity policies for scheduling instances,
    we use the RequestSpec.instance_group.hosts field to check the hosts
    that have group members on them. Access of the 'hosts' field calls
    InstanceGroup.get_hosts during a lazy-load and get_hosts does a query
    for all instances that are members of the group and returns their hosts
    after removing duplicates. The InstanceList query isn't targeting any
    cells, so it will return [] in a multi-cell environment in both the
    instance create case and the instance move case. In the move case, we
    do have a cell-targeted RequestContext when setup_instance_group is
    called *but* the RequestSpec.instance_group object is queried early in
    compute/api before we're targeted to a cell, so a call of
    RequestSpec.instance_group.get_hosts() will result in [] still, even
    for move operations.

    This makes setup_instance_group query all cells for instances that are
    members of the instance group if the RequestContext is untargeted, else
    it queries the targeted cell for the instances.

    Closes-Bug: #1746863

    Conflicts:
        nova/objects/instance_group.py

    NOTE(melwitt): The conflict was from oslo_utils jsonutils not being
    used in Queens.

    Change-Id: Ia5f5a0d75953b1154a8de3e1eaa15f8042e32d77
    (cherry picked from commit 14f4c502f92b10b669044e5069ac3b3555a42ee0)
    (cherry picked from commit 3051a0d5d5639b451d6da5e9f490332399b18c54)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.6

This issue was fixed in the openstack/nova 17.0.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.1

This issue was fixed in the openstack/nova 18.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/599840
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aa1966a70eb38bba1c7c3f88a77a2f2ea5fcc75f
Submitter: Zuul
Branch: stable/pike

commit aa1966a70eb38bba1c7c3f88a77a2f2ea5fcc75f
Author: melanie witt <email address hidden>
Date: Tue Jul 24 02:28:47 2018 +0000

    Add functional test for affinity with multiple cells

    The 'az' keyword arg needed to be added to the
    _build_minimal_create_server_request method in the InstanceHelperMixin
    because we don't have 96f10711667603e7fbad57b151c6438cdd9ae270 in
    Queens.

    A fake_network.set_stub_network_methods(self) call needed to be added
    to the test because we don't have
    acd3216a8beac263aa87d785c21e816d29b1b6bb in Queens.

    The post_aggregate_action method from commit
    8204b2492b07438d0569d3807e1c34f196756253 is also needed for this change
    to set the AZ for an aggregate.

    Related-Bug: #1746863

    Change-Id: I418bbe5c38faf3e460b0d62219507ed64e156682
    (cherry picked from commit 4f2c4993b1748d679bce8de6ffc672490ef3fa49)
    (cherry picked from commit 9dbe8cd5d31c8f63765d239392b0c0f38ee674d1)
    (cherry picked from commit c35f29001a2ce036b7281a125cb5077e0dddc605)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/599841
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1a266633e1077fe4a9101bb456a69ab89689be4b
Submitter: Zuul
Branch: stable/pike

commit 1a266633e1077fe4a9101bb456a69ab89689be4b
Author: melanie witt <email address hidden>
Date: Fri Feb 2 05:41:20 2018 +0000

    Make scheduler.utils.setup_instance_group query all cells

    To check affinity and anti-affinity policies for scheduling instances,
    we use the RequestSpec.instance_group.hosts field to check the hosts
    that have group members on them. Access of the 'hosts' field calls
    InstanceGroup.get_hosts during a lazy-load and get_hosts does a query
    for all instances that are members of the group and returns their hosts
    after removing duplicates. The InstanceList query isn't targeting any
    cells, so it will return [] in a multi-cell environment in both the
    instance create case and the instance move case. In the move case, we
    do have a cell-targeted RequestContext when setup_instance_group is
    called *but* the RequestSpec.instance_group object is queried early in
    compute/api before we're targeted to a cell, so a call of
    RequestSpec.instance_group.get_hosts() will result in [] still, even
    for move operations.

    This makes setup_instance_group query all cells for instances that are
    members of the instance group if the RequestContext is untargeted, else
    it queries the targeted cell for the instances.

    Closes-Bug: #1746863

    Change-Id: Ia5f5a0d75953b1154a8de3e1eaa15f8042e32d77
    (cherry picked from commit 14f4c502f92b10b669044e5069ac3b3555a42ee0)
    (cherry picked from commit 3051a0d5d5639b451d6da5e9f490332399b18c54)
    (cherry picked from commit ab939372cd0488a0a4e2a8a1764a42eff2e329c5)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.7

This issue was fixed in the openstack/nova 16.1.7 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
