unshelve leaks allocation if port update fails

Bug #1862633 reported by Balazs Gibizer
This bug affects 1 person
Affects                   Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released  Medium      Balazs Gibizer
Pike                      Fix Released  Undecided   Unassigned
Queens                    Fix Released  Undecided   Unassigned
Rocky                     Fix Released  Undecided   Unassigned
Stein                     Fix Released  Undecided   Unassigned
Train                     Fix Released  Undecided   Unassigned

Bug Description

If updating the port binding fails during unshelve of an offloaded server, nova leaks the placement allocation on the target host.

Steps to reproduce
==================
1) Boot a server with a neutron port.
2) Shelve and offload the server.
3) Disable the original host of the server to force scheduling during unshelve to select a different host. This is important because it triggers a non-empty port update during unshelve.
4) Unshelve the server and inject a network fault into the communication between nova and neutron. You can also simply shut down neutron-server at the right moment, that is, just before the target compute tries to send the port update.
5) Observe that the unshelve fails and the server goes back to the offloaded state, but the placement allocation on the target host remains.

Triage: the problem is caused by missing fault handling code in the compute manager [1]. The compute manager has proper error handling if the unshelve fails in the virt driver spawn call, but it does not handle a failure of the communication with neutron. The compute manager method simply logs and re-raises the neutron exception. Because the unshelve_instance compute RPC is a cast, the re-raised exception is dropped and nothing cleans up the allocation.

[1] https://github.com/openstack/nova/blob/1fcd74730d343b7cee12a0a50ea537dc4ff87f65/nova/compute/manager.py#L6473
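The missing handling described in the triage can be illustrated with a minimal sketch. This is not nova's actual code: FakePlacement, PortBindingFailed, and unshelve_on_target are hypothetical names introduced only for this illustration, and the sketch assumes the scheduler has already created the allocation on the target host before the compute side runs.

```python
class PortBindingFailed(Exception):
    """Hypothetical stand-in for a neutron communication error."""


class FakePlacement:
    """Hypothetical in-memory stand-in for the placement service."""

    def __init__(self):
        # Set of (instance uuid, host) pairs that currently hold resources.
        self.allocations = set()

    def allocate(self, instance_uuid, host):
        self.allocations.add((instance_uuid, host))

    def delete_allocation(self, instance_uuid, host):
        self.allocations.discard((instance_uuid, host))


def unshelve_on_target(instance_uuid, target_host, placement,
                       update_port_binding):
    """Sketch of the compute-side unshelve step with cleanup on failure.

    Assumes the allocation on target_host already exists when this runs.
    """
    try:
        update_port_binding(instance_uuid, target_host)
    except Exception:
        # The caller used a cast, so nobody upstream will see this
        # exception. The cleanup therefore has to happen here, before
        # the exception is re-raised.
        placement.delete_allocation(instance_uuid, target_host)
        raise
    return target_host
```

Without the except branch, a failing update_port_binding callable would leave the (instance, target host) pair in placement.allocations, which is the leak reported here.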

Changed in nova:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Balazs Gibizer (balazs-gibizer)
tags: added: shelve
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/706867

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/706868

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/706867
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c33ebdafbd633578a0a4b6f1b118c756510acea6
Submitter: Zuul
Branch: master

commit c33ebdafbd633578a0a4b6f1b118c756510acea6
Author: Balazs Gibizer <email address hidden>
Date: Mon Feb 10 15:23:02 2020 +0100

    Reproduce bug 1862633

    If port update fails during unshelve of an offloaded server then
    placement allocation on the target host is leaked.

    Change-Id: I7be32e4fc2e69f805535e0a437931516f491e5cb
    Related-Bug: #1862633

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/706868
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e65d4a131a7ebc02261f5df69fa1b394a502f268
Submitter: Zuul
Branch: master

commit e65d4a131a7ebc02261f5df69fa1b394a502f268
Author: Balazs Gibizer <email address hidden>
Date: Mon Feb 10 15:48:04 2020 +0100

    Clean up allocation if unshelve fails due to neutron

    When port binding update fails during unshelve of a shelve offloaded
    instance compute manager has to catch the exception and clean up the
    destination host allocation.

    Change-Id: I4c3fbb213e023ac16efc0b8561f975a659311684
    Closes-Bug: #1862633

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/709166

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/train
Review: https://review.opendev.org/709167

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/709166
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bd1bfc13d7e2c418afc409871ab56da454a1334d
Submitter: Zuul
Branch: stable/train

commit bd1bfc13d7e2c418afc409871ab56da454a1334d
Author: Balazs Gibizer <email address hidden>
Date: Mon Feb 10 15:23:02 2020 +0100

    Reproduce bug 1862633

    If port update fails during unshelve of an offloaded server then
    placement allocation on the target host is leaked.

    Changes in test_bug_1862633.py are needed because
    I84c58de90dad6d86271767363aef90ddac0f1730 and
    I8c96b337f32148f8f5899c9b87af331b1fa41424 are missing from train.

    Change-Id: I7be32e4fc2e69f805535e0a437931516f491e5cb
    Related-Bug: #1862633
    (cherry picked from commit c33ebdafbd633578a0a4b6f1b118c756510acea6)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/709167
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e6b749dbdd735e2cd0054654b5da7a02280a080b
Submitter: Zuul
Branch: stable/train

commit e6b749dbdd735e2cd0054654b5da7a02280a080b
Author: Balazs Gibizer <email address hidden>
Date: Mon Feb 10 15:48:04 2020 +0100

    Clean up allocation if unshelve fails due to neutron

    When port binding update fails during unshelve of a shelve offloaded
    instance compute manager has to catch the exception and clean up the
    destination host allocation.

    Change-Id: I4c3fbb213e023ac16efc0b8561f975a659311684
    Closes-Bug: #1862633
    (cherry picked from commit e65d4a131a7ebc02261f5df69fa1b394a502f268)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/711626

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/711629

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/711626
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f960d1751d752d559ea18604bfd1fcaf1a3283cd
Submitter: Zuul
Branch: stable/stein

commit f960d1751d752d559ea18604bfd1fcaf1a3283cd
Author: Balazs Gibizer <email address hidden>
Date: Mon Feb 10 15:23:02 2020 +0100

    Reproduce bug 1862633

    If port update fails during unshelve of an offloaded server then
    placement allocation on the target host is leaked.

    Change-Id: I7be32e4fc2e69f805535e0a437931516f491e5cb
    Related-Bug: #1862633
    (cherry picked from commit c33ebdafbd633578a0a4b6f1b118c756510acea6)
    (cherry picked from commit bd1bfc13d7e2c418afc409871ab56da454a1334d)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/711629
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=405a35587a2291e3cf9eb4efc8f102c91bb4ef76
Submitter: Zuul
Branch: stable/stein

commit 405a35587a2291e3cf9eb4efc8f102c91bb4ef76
Author: Balazs Gibizer <email address hidden>
Date: Mon Feb 10 15:48:04 2020 +0100

    Clean up allocation if unshelve fails due to neutron

    When port binding update fails during unshelve of a shelve offloaded
    instance compute manager has to catch the exception and clean up the
    destination host allocation.

    Conflicts:
      nova/compute/manager.py due to I59aec72e158eb2859bb6178b2a42d3f3438ab0f3
      is missing in stein

    Change-Id: I4c3fbb213e023ac16efc0b8561f975a659311684
    Closes-Bug: #1862633
    (cherry picked from commit e65d4a131a7ebc02261f5df69fa1b394a502f268)
    (cherry picked from commit e6b749dbdd735e2cd0054654b5da7a02280a080b)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/713187

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/713196

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/713243

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/713243
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1dfb72e0487c12553a50899d6aed292dea4dcd7f
Submitter: Zuul
Branch: master

commit 1dfb72e0487c12553a50899d6aed292dea4dcd7f
Author: Balazs Gibizer <email address hidden>
Date: Mon Mar 16 14:58:46 2020 +0100

    Fix intermittently failing regression case

    The test_unshelve_offloaded_fails_due_to_neutron test could fail due to
    a race condition. The test case only waits for the first
    instance.save() call at [1], but the allocation is deleted after that
    call. As a result, the test case can still see the allocation of the
    offloaded server in placement.

    The fix makes sure that the test waits for the second instance.save() by
    checking for the host of the instance.

    [1] https://github.com/openstack/nova/blob/stable/rocky/nova/compute/manager.py#L5274-L5288

    Related-Bug #1862633

    Change-Id: Ic1c3d35749fbdc7f5b6f6ec1e16b8fcf37c10de8
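The idea behind this test fix, waiting for the last observable state change instead of the first one, can be sketched with a generic polling helper. This is only an illustration and not the code of the regression test: wait_until is a hypothetical helper, and the sketch assumes the instance host being cleared is the signal that the second save has happened.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll predicate until it returns a truthy value or timeout expires.

    Returns True if the predicate became true in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    # One last check so a state change right at the deadline is not missed.
    return bool(predicate())
```

A test would call something like wait_until(lambda: server["host"] is None) before asserting on the allocations, so the assertion runs only after the cleanup had a chance to finish.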

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/713384

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/713384
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c48d62184348a5bd77df9e54cdac3cf641049eb8
Submitter: Zuul
Branch: stable/train

commit c48d62184348a5bd77df9e54cdac3cf641049eb8
Author: Balazs Gibizer <email address hidden>
Date: Mon Mar 16 14:58:46 2020 +0100

    Fix intermittently failing regression case

    The test_unshelve_offloaded_fails_due_to_neutron test could fail due to
    a race condition. The test case only waits for the first
    instance.save() call at [1], but the allocation is deleted after that
    call. As a result, the test case can still see the allocation of the
    offloaded server in placement.

    The fix makes sure that the test waits for the second instance.save() by
    checking for the host of the instance.

    Conflicts:
        nova/tests/functional/regressions/test_bug_1862633.py

    Note: Changes in test_bug_1862633.py are needed because
    I8c96b337f32148f8f5899c9b87af331b1fa41424 is missing from train.

    [1] https://github.com/openstack/nova/blob/stable/rocky/nova/compute/manager.py#L5274-L5288

    Related-Bug #1862633

    Change-Id: Ic1c3d35749fbdc7f5b6f6ec1e16b8fcf37c10de8
    (cherry picked from commit 1dfb72e0487c12553a50899d6aed292dea4dcd7f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/713682

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/713682
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=763b7897f40140b04e9f0bfcc376b975b85c2381
Submitter: Zuul
Branch: stable/stein

commit 763b7897f40140b04e9f0bfcc376b975b85c2381
Author: Balazs Gibizer <email address hidden>
Date: Mon Mar 16 14:58:46 2020 +0100

    Fix intermittently failing regression case

    The test_unshelve_offloaded_fails_due_to_neutron test could fail due to
    a race condition. The test case only waits for the first
    instance.save() call at [1], but the allocation is deleted after that
    call. As a result, the test case can still see the allocation of the
    offloaded server in placement.

    The fix makes sure that the test waits for the second instance.save() by
    checking for the host of the instance.

    [1] https://github.com/openstack/nova/blob/stable/rocky/nova/compute/manager.py#L5274-L5288

    Related-Bug #1862633

    Change-Id: Ic1c3d35749fbdc7f5b6f6ec1e16b8fcf37c10de8
    (cherry picked from commit 1dfb72e0487c12553a50899d6aed292dea4dcd7f)
    (cherry picked from commit c48d62184348a5bd77df9e54cdac3cf641049eb8)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/713187
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5e452f8eb743c226af4f4998835ece8dd142a011
Submitter: Zuul
Branch: stable/rocky

commit 5e452f8eb743c226af4f4998835ece8dd142a011
Author: Balazs Gibizer <email address hidden>
Date: Mon Feb 10 15:23:02 2020 +0100

    Reproduce bug 1862633

    If port update fails during unshelve of an offloaded server then
    placement allocation on the target host is leaked.

    Changes in test_bug_1862633.py are needed because
    Idaed39629095f86d24a54334c699a26c218c6593 is missing from stable/rocky.

    Change-Id: I7be32e4fc2e69f805535e0a437931516f491e5cb
    Related-Bug: #1862633
    (cherry picked from commit c33ebdafbd633578a0a4b6f1b118c756510acea6)
    (cherry picked from commit bd1bfc13d7e2c418afc409871ab56da454a1334d)
    (cherry picked from commit f960d1751d752d559ea18604bfd1fcaf1a3283cd)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/713196
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aeeab5d064492e112cd626a2988a6808250fb029
Submitter: Zuul
Branch: stable/rocky

commit aeeab5d064492e112cd626a2988a6808250fb029
Author: Balazs Gibizer <email address hidden>
Date: Mon Feb 10 15:48:04 2020 +0100

    Clean up allocation if unshelve fails due to neutron

    When port binding update fails during unshelve of a shelve offloaded
    instance compute manager has to catch the exception and clean up the
    destination host allocation.

    Conflicts:
      nova/compute/manager.py due to #Ibb8c12fb2799bb5ceb9e3d72a2b86dbb4f14451e
      is missing in rocky

    Squashed Ic1c3d35749fbdc7f5b6f6ec1e16b8fcf37c10de8 into this to avoid
    intermittently failing test case.

    Change-Id: I4c3fbb213e023ac16efc0b8561f975a659311684
    Closes-Bug: #1862633
    (cherry picked from commit e65d4a131a7ebc02261f5df69fa1b394a502f268)
    (cherry picked from commit e6b749dbdd735e2cd0054654b5da7a02280a080b)
    (cherry picked from commit 405a35587a2291e3cf9eb4efc8f102c91bb4ef76)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/729539

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/729540

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/729539
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=eb4f0a5aa93feb2dc7730987207f052a04bc33db
Submitter: Zuul
Branch: stable/queens

commit eb4f0a5aa93feb2dc7730987207f052a04bc33db
Author: Balazs Gibizer <email address hidden>
Date: Mon Feb 10 15:23:02 2020 +0100

    Reproduce bug 1862633

    If port update fails during unshelve of an offloaded server then
    placement allocation on the target host is leaked.

    Changes in test_bug_1862633.py are needed because
    I2cf2fcbaebc706f897ce5dfbff47d32117064f9c is missing from stable/queens.

    Change-Id: I7be32e4fc2e69f805535e0a437931516f491e5cb
    Related-Bug: #1862633
    (cherry picked from commit c33ebdafbd633578a0a4b6f1b118c756510acea6)
    (cherry picked from commit bd1bfc13d7e2c418afc409871ab56da454a1334d)
    (cherry picked from commit f960d1751d752d559ea18604bfd1fcaf1a3283cd)
    (cherry picked from commit 5e452f8eb743c226af4f4998835ece8dd142a011)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/729540
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9a073c9edc993525e896f67eeda1639a248fe2df
Submitter: Zuul
Branch: stable/queens

commit 9a073c9edc993525e896f67eeda1639a248fe2df
Author: Balazs Gibizer <email address hidden>
Date: Mon Feb 10 15:48:04 2020 +0100

    Clean up allocation if unshelve fails due to neutron

    When port binding update fails during unshelve of a shelve offloaded
    instance compute manager has to catch the exception and clean up the
    destination host allocation.

    Change-Id: I4c3fbb213e023ac16efc0b8561f975a659311684
    Closes-Bug: #1862633
    (cherry picked from commit e65d4a131a7ebc02261f5df69fa1b394a502f268)
    (cherry picked from commit e6b749dbdd735e2cd0054654b5da7a02280a080b)
    (cherry picked from commit 405a35587a2291e3cf9eb4efc8f102c91bb4ef76)
    (cherry picked from commit aeeab5d064492e112cd626a2988a6808250fb029)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/740634

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/740636

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.opendev.org/740634
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=618dd9bdc24b3a1de6a5581d4e7201efdf1f86bd
Submitter: Zuul
Branch: stable/pike

commit 618dd9bdc24b3a1de6a5581d4e7201efdf1f86bd
Author: Balazs Gibizer <email address hidden>
Date: Mon Feb 10 15:23:02 2020 +0100

    Reproduce bug 1862633

    If port update fails during unshelve of an offloaded server then
    placement allocation on the target host is leaked.

    Changes in test_bug_1862633.py due to:
    * the NeutronFixture improvement done in
      Id8d2c48c9c864554a917596e377d30515465fec1 is missing from stable/pike
      therefore the fault injection mock needed to be moved to a higher
      level function.
    * the Ie4676eed0039c927b35af7573f0b57fd762adbaa refactor is also missing
      and causing the name change of wait_for_versioned_notification

    Change-Id: I7be32e4fc2e69f805535e0a437931516f491e5cb
    Related-Bug: #1862633
    (cherry picked from commit c33ebdafbd633578a0a4b6f1b118c756510acea6)
    (cherry picked from commit bd1bfc13d7e2c418afc409871ab56da454a1334d)
    (cherry picked from commit f960d1751d752d559ea18604bfd1fcaf1a3283cd)
    (cherry picked from commit 5e452f8eb743c226af4f4998835ece8dd142a011)
    (cherry picked from commit eb4f0a5aa93feb2dc7730987207f052a04bc33db)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/740636
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=89312275607ea21e1c35a393ad1b00fe5591e30a
Submitter: Zuul
Branch: stable/pike

commit 89312275607ea21e1c35a393ad1b00fe5591e30a
Author: Balazs Gibizer <email address hidden>
Date: Mon Feb 10 15:48:04 2020 +0100

    Clean up allocation if unshelve fails due to neutron

    When port binding update fails during unshelve of a shelve offloaded
    instance compute manager has to catch the exception and clean up the
    destination host allocation.

    Change-Id: I4c3fbb213e023ac16efc0b8561f975a659311684
    Closes-Bug: #1862633
    (cherry picked from commit e65d4a131a7ebc02261f5df69fa1b394a502f268)
    (cherry picked from commit e6b749dbdd735e2cd0054654b5da7a02280a080b)
    (cherry picked from commit 405a35587a2291e3cf9eb4efc8f102c91bb4ef76)
    (cherry picked from commit aeeab5d064492e112cd626a2988a6808250fb029)
    (cherry picked from commit 9a073c9edc993525e896f67eeda1639a248fe2df)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova pike-eol

This issue was fixed in the openstack/nova pike-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova queens-eol

This issue was fixed in the openstack/nova queens-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova rocky-eol

This issue was fixed in the openstack/nova rocky-eol release.
