migration of anti-affinity server fails due to stale scheduler instance info

Bug #1869050 reported by Balazs Gibizer
This bug affects 1 person
Affects                    Status         Importance   Assigned to       Milestone
OpenStack Compute (nova)   Fix Released   Low          Balazs Gibizer
Pike                       Invalid        Low          Unassigned
Queens                     Invalid        Low          Balazs Gibizer
Rocky                      Fix Released   Low          Balazs Gibizer
Stein                      Fix Released   Low          Elod Illes
Train                      Fix Released   Low          Elod Illes
Ussuri                     Fix Released   Low          Balazs Gibizer

Bug Description

Description
===========

Steps to reproduce
==================
Have a deployment with 3 compute nodes

* make sure that the deployment is configured with [filter_scheduler]track_instance_changes=True (True is the default)
* create a server group with anti-affinity policy
* boot server1 into the group
* boot server2 into the group
* migrate server2
* confirm the migration
* boot server3

Make sure that the periodic _sync_scheduler_instance_info task does not run between the last two steps on the compute that hosted server2 before the migration. This can be done by executing the last two steps right after each other, as the interval of that periodic (scheduler_instance_sync_interval) defaults to 120 seconds.

Expected result
===============
server3 is booted on the host that server2 was migrated away from

Actual result
=============
server3 cannot be booted (NoValidHost)

Triage
======

The confirm resize call on the source compute does not notify the scheduler that the instance has been removed from this host. This leaves the scheduler instance info stale and causes the subsequent scheduling error.
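
The failure mode can be illustrated with a small self-contained toy model. This is only a sketch under stated assumptions, not nova code: ToyScheduler, boot and migrate_and_confirm are hypothetical names, and the scheduler's per-host instance info is modeled as a plain dict of sets.

```python
class NoValidHost(Exception):
    """Toy counterpart of the NoValidHost error seen in the report."""


class ToyScheduler:
    """Hypothetical stand-in for the scheduler's cached per-host instance info."""

    def __init__(self, hosts):
        self.instance_info = {host: set() for host in hosts}

    def sync_host(self, host, instance_names):
        # Models the periodic sync: replace the cached view with the real one.
        self.instance_info[host] = set(instance_names)

    def select_host(self, group_members):
        # Anti-affinity: skip any host whose cached info lists a group member.
        for host in sorted(self.instance_info):
            if not (self.instance_info[host] & group_members):
                return host
        raise NoValidHost()


def boot(scheduler, actual, group, name):
    host = scheduler.select_host(group)
    actual[host].add(name)
    scheduler.instance_info[host].add(name)  # boot updates the scheduler
    group.add(name)
    return host


def migrate_and_confirm(scheduler, actual, group, name, source):
    dest = scheduler.select_host(group)
    actual[source].discard(name)
    actual[dest].add(name)
    scheduler.instance_info[dest].add(name)
    # Modeled bug: the cached entry for the source host is left untouched.
    return dest


if __name__ == "__main__":
    hosts = ["host1", "host2", "host3"]
    scheduler = ToyScheduler(hosts)
    actual = {host: set() for host in hosts}
    group = set()

    boot(scheduler, actual, group, "server1")
    source = boot(scheduler, actual, group, "server2")
    migrate_and_confirm(scheduler, actual, group, "server2", source)
    try:
        boot(scheduler, actual, group, "server3")
    except NoValidHost:
        print("NoValidHost: cached info for", source, "is stale")

    # Once the periodic sync runs on the source host, the boot succeeds.
    scheduler.sync_host(source, actual[source])
    print("server3 booted on", boot(scheduler, actual, group, "server3"))
```

In this model the third boot fails only while the cached entry of the source host still lists server2, which matches the window described in the steps to reproduce.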

Changed in nova:
status: New → Triaged
importance: Undecided → Low
assignee: nobody → Balazs Gibizer (balazs-gibizer)
tags: added: compute scheduler
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/714997

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/714998

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/714997
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b52c483308f32f3744dd8a5df424b9f518c13155
Submitter: Zuul
Branch: master

commit b52c483308f32f3744dd8a5df424b9f518c13155
Author: Balazs Gibizer <email address hidden>
Date: Wed Mar 25 17:38:14 2020 +0100

    Reproduce bug 1869050

    This patch adds a functional test that reproduce the bug when stale
    scheduler instance info prevents booting server with anti-affinity.

    Change-Id: If485330b48ae2671651aafabc93f92a8999f7ca2
    Related-Bug: #1869050

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/714998
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=738110db7492b1360f5f197e8ecafd69a3b141b4
Submitter: Zuul
Branch: master

commit 738110db7492b1360f5f197e8ecafd69a3b141b4
Author: Balazs Gibizer <email address hidden>
Date: Wed Mar 25 17:48:23 2020 +0100

    Update scheduler instance info at confirm resize

    When a resize is confirmed the instance does not belong to the source
    compute any more. In the past the scheduler instance info is only
    updated by the _sync_scheduler_instance_info periodic. This caused that
    server boots with anti-affinity did not consider the source host.
    But now at the end of the confirm_resize call the compute also updates
    the scheduler about the move.

    Change-Id: Ic50e72e289b56ac54720ad0b719ceeb32487b8c8
    Closes-Bug: #1869050
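
The idea described in the commit message above can be sketched with a toy model. This is not the actual nova change: ToySchedulerCache, ToyCompute and the notify_on_confirm flag are hypothetical names introduced only for illustration, and the flag exists just to compare the old and new behaviour side by side.

```python
class ToySchedulerCache:
    """Hypothetical scheduler-side cache: host name -> set of instance UUIDs."""

    def __init__(self):
        self.instance_info = {}

    def update_instance_info(self, host, uuid):
        self.instance_info.setdefault(host, set()).add(uuid)

    def delete_instance_info(self, host, uuid):
        self.instance_info.get(host, set()).discard(uuid)


class ToyCompute:
    """Hypothetical compute service that reports to the scheduler cache."""

    def __init__(self, host, scheduler, notify_on_confirm):
        self.host = host
        self.scheduler = scheduler
        self.notify_on_confirm = notify_on_confirm
        self.instances = set()

    def spawn(self, uuid):
        self.instances.add(uuid)
        self.scheduler.update_instance_info(self.host, uuid)

    def confirm_resize(self, uuid):
        # The instance no longer belongs to this (source) compute.
        self.instances.discard(uuid)
        if self.notify_on_confirm:
            # New behaviour: tell the scheduler right away instead of
            # waiting for the periodic sync.
            self.scheduler.delete_instance_info(self.host, uuid)

    def sync_scheduler_instance_info(self):
        # Periodic task: overwrite the cached view with the real one.
        self.scheduler.instance_info[self.host] = set(self.instances)


def stale_after_confirm(notify_on_confirm):
    """Return True if the cache still lists the instance on the source host."""
    cache = ToySchedulerCache()
    source = ToyCompute("source", cache, notify_on_confirm)
    source.spawn("uuid-1")
    source.confirm_resize("uuid-1")
    return "uuid-1" in cache.instance_info["source"]


if __name__ == "__main__":
    print("old behaviour stale:", stale_after_confirm(False))
    print("new behaviour stale:", stale_after_confirm(True))
```

With the notification in place the cache is correct immediately after the confirm, so the result no longer depends on when the periodic task runs.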

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/728781

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/728782

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/728781
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=016eeec9841116bbbbc6c3019850c18012e3781a
Submitter: Zuul
Branch: stable/ussuri

commit 016eeec9841116bbbbc6c3019850c18012e3781a
Author: Balazs Gibizer <email address hidden>
Date: Wed Mar 25 17:38:14 2020 +0100

    Reproduce bug 1869050

    This patch adds a functional test that reproduce the bug when stale
    scheduler instance info prevents booting server with anti-affinity.

    Change-Id: If485330b48ae2671651aafabc93f92a8999f7ca2
    Related-Bug: #1869050
    (cherry picked from commit b52c483308f32f3744dd8a5df424b9f518c13155)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/728782
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e8b3927c92d29c74fd0c79b5a51b7a34e9d66236
Submitter: Zuul
Branch: stable/ussuri

commit e8b3927c92d29c74fd0c79b5a51b7a34e9d66236
Author: Balazs Gibizer <email address hidden>
Date: Wed Mar 25 17:48:23 2020 +0100

    Update scheduler instance info at confirm resize

    When a resize is confirmed the instance does not belong to the source
    compute any more. In the past the scheduler instance info is only
    updated by the _sync_scheduler_instance_info periodic. This caused that
    server boots with anti-affinity did not consider the source host.
    But now at the end of the confirm_resize call the compute also updates
    the scheduler about the move.

    Change-Id: Ic50e72e289b56ac54720ad0b719ceeb32487b8c8
    Closes-Bug: #1869050
    (cherry picked from commit 738110db7492b1360f5f197e8ecafd69a3b141b4)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/729162

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/train
Review: https://review.opendev.org/729163

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/729162
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=66e4d8218133a5e4a68f68a3017446cb585675c4
Submitter: Zuul
Branch: stable/train

commit 66e4d8218133a5e4a68f68a3017446cb585675c4
Author: Balazs Gibizer <email address hidden>
Date: Wed Mar 25 17:38:14 2020 +0100

    Reproduce bug 1869050

    This patch adds a functional test that reproduce the bug when stale
    scheduler instance info prevents booting server with anti-affinity.

    Some adjustment was needed due to I8c96b337f32148f8f5899c9b87af331b1fa41424
    is missing from stable/train

    Change-Id: If485330b48ae2671651aafabc93f92a8999f7ca2
    Related-Bug: #1869050
    (cherry picked from commit b52c483308f32f3744dd8a5df424b9f518c13155)
    (cherry picked from commit 016eeec9841116bbbbc6c3019850c18012e3781a)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/729163
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e34b375a6161b15d92beba64fa281f40634ffeab
Submitter: Zuul
Branch: stable/train

commit e34b375a6161b15d92beba64fa281f40634ffeab
Author: Balazs Gibizer <email address hidden>
Date: Wed Mar 25 17:48:23 2020 +0100

    Update scheduler instance info at confirm resize

    When a resize is confirmed the instance does not belong to the source
    compute any more. In the past the scheduler instance info is only
    updated by the _sync_scheduler_instance_info periodic. This caused that
    server boots with anti-affinity did not consider the source host.
    But now at the end of the confirm_resize call the compute also updates
    the scheduler about the move.

    Change-Id: Ic50e72e289b56ac54720ad0b719ceeb32487b8c8
    Closes-Bug: #1869050
    (cherry picked from commit 738110db7492b1360f5f197e8ecafd69a3b141b4)
    (cherry picked from commit e8b3927c92d29c74fd0c79b5a51b7a34e9d66236)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/729505

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/729527

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/stein)

Change abandoned by Qiu Fossen (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/729505

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/rocky)

Change abandoned by Qiu Fossen (<email address hidden>) on branch: stable/rocky
Review: https://review.opendev.org/729527

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/729530

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/729538

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/729530
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2a15e0096bc87234a930bb75b73d4874f0f7ec87
Submitter: Zuul
Branch: stable/stein

commit 2a15e0096bc87234a930bb75b73d4874f0f7ec87
Author: Balazs Gibizer <email address hidden>
Date: Wed Mar 25 17:38:14 2020 +0100

    Reproduce bug 1869050

    This patch adds a functional test that reproduce the bug when stale
    scheduler instance info prevents booting server with anti-affinity.

    Change-Id: If485330b48ae2671651aafabc93f92a8999f7ca2
    Related-Bug: #1869050
    (cherry picked from commit b52c483308f32f3744dd8a5df424b9f518c13155)
    (cherry picked from commit 016eeec9841116bbbbc6c3019850c18012e3781a)
    (cherry picked from commit 66e4d8218133a5e4a68f68a3017446cb585675c4)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/729538
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e1116ee3b776ec84e4ce7d6ac9346fa0d43269b5
Submitter: Zuul
Branch: stable/stein

commit e1116ee3b776ec84e4ce7d6ac9346fa0d43269b5
Author: Balazs Gibizer <email address hidden>
Date: Wed Mar 25 17:48:23 2020 +0100

    Update scheduler instance info at confirm resize

    When a resize is confirmed the instance does not belong to the source
    compute any more. In the past the scheduler instance info is only
    updated by the _sync_scheduler_instance_info periodic. This caused that
    server boots with anti-affinity did not consider the source host.
    But now at the end of the confirm_resize call the compute also updates
    the scheduler about the move.

    Conflicts:
          nova/tests/unit/compute/test_compute_mgr.py
          due to Ib50b6b02208f5bd2972de8a6f8f685c19745514c and
          Ia6d8a7909081b0b856bd7e290e234af7e42a2b38 are missing from
          stable/stein

    Change-Id: Ic50e72e289b56ac54720ad0b719ceeb32487b8c8
    Closes-Bug: #1869050
    (cherry picked from commit 738110db7492b1360f5f197e8ecafd69a3b141b4)
    (cherry picked from commit e8b3927c92d29c74fd0c79b5a51b7a34e9d66236)
    (cherry picked from commit e34b375a6161b15d92beba64fa281f40634ffeab)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/730343

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/730344

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/730343
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=44c47421785ba07d4b238bba06eacc42827c84ab
Submitter: Zuul
Branch: stable/rocky

commit 44c47421785ba07d4b238bba06eacc42827c84ab
Author: Balazs Gibizer <email address hidden>
Date: Wed Mar 25 17:38:14 2020 +0100

    Reproduce bug 1869050

    This patch adds a functional test that reproduce the bug when stale
    scheduler instance info prevents booting server with anti-affinity.

    Change-Id: If485330b48ae2671651aafabc93f92a8999f7ca2
    Related-Bug: #1869050
    (cherry picked from commit b52c483308f32f3744dd8a5df424b9f518c13155)
    (cherry picked from commit 016eeec9841116bbbbc6c3019850c18012e3781a)
    (cherry picked from commit 66e4d8218133a5e4a68f68a3017446cb585675c4)
    (cherry picked from commit 2a15e0096bc87234a930bb75b73d4874f0f7ec87)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/730344
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=abe04f048c432fed5726af8244bb055e6e44657e
Submitter: Zuul
Branch: stable/rocky

commit abe04f048c432fed5726af8244bb055e6e44657e
Author: Balazs Gibizer <email address hidden>
Date: Wed Mar 25 17:48:23 2020 +0100

    Update scheduler instance info at confirm resize

    When a resize is confirmed the instance does not belong to the source
    compute any more. In the past the scheduler instance info is only
    updated by the _sync_scheduler_instance_info periodic. This caused that
    server boots with anti-affinity did not consider the source host.
    But now at the end of the confirm_resize call the compute also updates
    the scheduler about the move.

    Conflicts:
      nova/compute/manager.py due to
      I933687891abef4878de09481937d576ce5899511 is a stable only patch
      nova/tests/unit/compute/test_compute_mgr.py due to
      35ce77835bb271bad3c18eaf22146edac3a42ea0 is missing from stable/rocky

    Change-Id: Ic50e72e289b56ac54720ad0b719ceeb32487b8c8
    Closes-Bug: #1869050
    (cherry picked from commit 738110db7492b1360f5f197e8ecafd69a3b141b4)
    (cherry picked from commit e8b3927c92d29c74fd0c79b5a51b7a34e9d66236)
    (cherry picked from commit e34b375a6161b15d92beba64fa281f40634ffeab)
    (cherry picked from commit e1116ee3b776ec84e4ce7d6ac9346fa0d43269b5)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/731563

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/731564

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

The problem cannot be reproduced on stable/queens. The rocky patch [1] changed the logic of the affinity filter to use the host_state. As the host_state can be stale, the bug exists since rocky. On queens the filter queries instance.host from the database, and that information is up to date after the migration, so the bug is not reproducible there.

[1] https://review.opendev.org/#/c/571166/27/nova/scheduler/filters/affinity_filter.py@101
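
The difference between the two approaches can be sketched as follows. This is a simplified illustration, not the real filter code: both function names and the dict-based data model are assumptions made for this example.

```python
def db_based_filter(host, group_members, instance_host_db):
    """Queens-style sketch: look up each member's current host in the database."""
    return all(instance_host_db.get(member) != host for member in group_members)


def host_state_based_filter(host, group_members, cached_instances):
    """Rocky-style sketch: consult the scheduler's cached per-host instance info."""
    return not (cached_instances.get(host, set()) & group_members)


if __name__ == "__main__":
    group = {"server1", "server2"}

    # State right after server2 was migrated from host2 to host3 and confirmed.
    # The database is already up to date ...
    instance_host_db = {"server1": "host1", "server2": "host3"}
    # ... but the cached info for host2 has not been refreshed yet.
    cached_instances = {
        "host1": {"server1"},
        "host2": {"server2"},
        "host3": {"server2"},
    }

    print("db based, host2 passes:",
          db_based_filter("host2", group, instance_host_db))
    print("host_state based, host2 passes:",
          host_state_based_filter("host2", group, cached_instances))
```

In this sketch only the host_state based variant rejects host2, because it is the only one that depends on data that can lag behind the migration.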

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

The same is true for stable/pike

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/queens)

Change abandoned by Balazs Gibizer (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/731563
Reason: Bug is not valid for stable/queens

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Balazs Gibizer (<email address hidden>) on branch: stable/queens
Review: https://review.opendev.org/731564
Reason: Bug is not valid for stable/queens

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova rocky-eol

This issue was fixed in the openstack/nova rocky-eol release.
