Volume remains in attaching/reserved status, if the instance is deleted after TooManyInstances exception in nova-conductor

Bug #1806064 reported by s10 on 2018-11-30
This bug affects 2 people
Affects | Importance | Assigned to
OpenStack Compute (nova) | Medium | Matt Riedemann
Pike | Medium | Matt Riedemann
Queens | Medium | Matt Riedemann
Rocky | Medium | s10

Bug Description

If several instances are booted from volumes in parallel and some of the build requests fail in nova-conductor with the TooManyInstances exception [1] because quota.recheck_quota=True is set in nova.conf, those instances end up in the ERROR state.

If we delete these instances, their volumes remain in the attaching (Pike) or reserved (Queens) state.

This bug is related to https://bugs.launchpad.net/nova/+bug/1404867

Steps to reproduce:

0. Set quota.recheck_quota=True and start several nova-conductors.

1. Set VCPU quota limits for the project to 1.

2. Create two instances with 1 VCPU in parallel.

3. One of these instances will be created and the other will end up in the ERROR state, or both of them will end up in the ERROR state.

4. Delete instances.

5. Volumes from the errored instances will not be available: they can't be attached, and they can't be deleted without the volume:force_delete permission in the cinder policy.

This bug exists at least in Pike (7ff1b28) and Queens (c5fe051).
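The "one or both" outcome in step 3 follows from the order in which the conductors write instance records and recheck quota. The following is a minimal sketch in plain Python, not nova code: the `Cell` class, the `run` function, and the schedule format are hypothetical, and it assumes that every instance record, including one in ERROR, counts as one VCPU until it is deleted.

```python
from dataclasses import dataclass, field

VCPU_LIMIT = 1  # project quota from step 1


@dataclass
class Cell:
    """Hypothetical stand-in for a cell database: instance name -> state."""
    instances: dict = field(default_factory=dict)

    def vcpus_used(self) -> int:
        # Assumption: each record uses 1 VCPU and counts until it is deleted.
        return len(self.instances)


def run(schedule):
    """Replay (step, instance) pairs as two conductors might interleave them."""
    cell = Cell()
    for step, name in schedule:
        if step == "create":
            # The instance record is written before the quota recheck.
            cell.instances[name] = "BUILD"
        elif step == "recheck":
            over = cell.vcpus_used() > VCPU_LIMIT
            cell.instances[name] = "ERROR" if over else "ACTIVE"
    return cell.instances


# One conductor finishes before the other starts: only the second fails.
serial = run([("create", "a"), ("recheck", "a"),
              ("create", "b"), ("recheck", "b")])
# Both records exist before either recheck runs: both fail.
interleaved = run([("create", "a"), ("create", "b"),
                   ("recheck", "a"), ("recheck", "b")])
print(serial)
print(interleaved)
```

In the serial schedule only the second instance goes to ERROR; in the interleaved schedule both do, because each recheck already sees two records.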

---
[1] https://github.com/openstack/nova/blob/stable/rocky/nova/conductor/manager.py#L1308

s10 (vlad-esten) on 2018-11-30
summary: - Volume remains in attaching/reserved status after TooManyInstances
- exception in nova-conductor
+ Volume remains in attaching/reserved status, if the instance is deleted
+ after TooManyInstances exception in nova-conductor
Matt Riedemann (mriedem) wrote :

This is closer to this change https://review.openstack.org/#/c/544748/

Matt Riedemann (mriedem) wrote :

It should be relatively easy to write a functional regression test similar to https://review.openstack.org/#/c/545123/5/nova/tests/functional/wsgi/test_servers.py but for this scenario.

Changed in nova:
status: New → Triaged
importance: Undecided → Medium
Matt Riedemann (mriedem) on 2018-11-30
tags: added: cells volumes
Matt Riedemann (mriedem) on 2018-12-03
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)

Fix proposed to branch: master
Review: https://review.openstack.org/621692

Changed in nova:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/621664
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5d514b33e28964b38aeb42a8dd5b93f3fc8ae239
Submitter: Zuul
Branch: master

commit 5d514b33e28964b38aeb42a8dd5b93f3fc8ae239
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 3 12:55:11 2018 -0500

    Add functional regression test for bug 1806064

    Change I9269ffa2b80e48db96c622d0dc0817738854f602 in Pike
    introduced a race condition where creating multiple
    servers concurrently can fail the second instance's quota
    check which happens in conductor after the instance record
    is created in the cell database but its related BDMs and
    tags are not stored in the cell DB. When deleting the
    server from the API, since the BDMs are not in the cell
    database with the instance, they are not "seen" and thus
    the volume attachments are not deleted and the volume is
    orphaned. As for tags, you should be able to see the tags
    on the server in ERROR status from the API before deleting
    it.

    This change adds a functional regression test to show both
    the volume attachment and tag issue when we fail the quota
    check in conductor.

    Change-Id: I21c2189cc1de6b8e4857de77acd9f1ef8b6ea9f6
    Related-Bug: #1806064
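To illustrate why the volume is orphaned, here is a toy model of the delete path described in the commit message. It is only a sketch: `delete_server` and the dictionaries are hypothetical stand-ins, not nova or cinder APIs. The point is that the delete only releases volumes for BDMs it can find in the cell database.

```python
def delete_server(instance_uuid, cell_bdms, volumes):
    """Hypothetical API delete: release volumes for BDMs found in the cell DB."""
    for bdm in cell_bdms.get(instance_uuid, []):
        volumes[bdm["volume_id"]] = "available"
    cell_bdms.pop(instance_uuid, None)


# Buggy flow: the BDM lives only with the build request, which is discarded
# after the failed quota recheck, so the cell has no BDM rows for the instance.
volumes = {"vol-1": "reserved"}
build_request_bdms = {"inst-1": [{"volume_id": "vol-1"}]}
cell_bdms = {}
build_request_bdms.pop("inst-1")
delete_server("inst-1", cell_bdms, volumes)
print(volumes)

# Contrast: if the BDM had been stored in the cell, the delete would see it.
volumes_seen = {"vol-1": "reserved"}
cell_bdms_seen = {"inst-1": [{"volume_id": "vol-1"}]}
delete_server("inst-1", cell_bdms_seen, volumes_seen)
print(volumes_seen)
```

In the first case the volume stays in `reserved` because the delete never iterates over a BDM for it; in the second it is released.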

Reviewed: https://review.openstack.org/621692
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6d0386058b9628bbfcf64abdd707ad87ee19353c
Submitter: Zuul
Branch: master

commit 6d0386058b9628bbfcf64abdd707ad87ee19353c
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 3 15:12:12 2018 -0500

    Create BDMs/tags in cell with instance when over-quota

    If the server create build request fails the quota check
    after the instance record has been created in a cell, we also
    need to create the BDMs and tags in that cell so that users
    can still see the tags on the server and so the API can
    properly clean up volume attachments when the server is deleted.

    This change updates _cleanup_build_artifacts to create BDMs
    and tags in the same cell as the instance prior to deleting the
    build request and request spec and adjusts the assertions in the
    related functional test to show the bug is fixed.

    As for instances that get buried in cell0 due to scheduling
    failures, the tags are not created there so comments are left
    in those code paths to fix that issue as well, but that can be
    done separately from this patch.

    Change-Id: I1a9bdb596f74605ab4613c9cb2574e976aebbd8c
    Closes-Bug: #1806064
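The ordering this commit describes can be sketched as follows. This is a simplified illustration under stated assumptions, not the real `_cleanup_build_artifacts` from nova: the function signature and the dictionary-based "databases" are hypothetical.

```python
def cleanup_build_artifacts(build_request, cell_db, api_db):
    """Hypothetical simplification of the ordering described above."""
    uuid = build_request["instance_uuid"]
    # First copy BDMs and tags into the cell that already holds the instance.
    cell_db["bdms"][uuid] = list(build_request["bdms"])
    cell_db["tags"][uuid] = list(build_request["tags"])
    # Only then drop the build request and request spec from the API DB.
    api_db["build_requests"].pop(uuid, None)
    api_db["request_specs"].pop(uuid, None)


build_request = {
    "instance_uuid": "inst-1",
    "bdms": [{"volume_id": "vol-1"}],
    "tags": ["web"],
}
api_db = {
    "build_requests": {"inst-1": build_request},
    "request_specs": {"inst-1": {}},
}
cell_db = {"bdms": {}, "tags": {}}

cleanup_build_artifacts(build_request, cell_db, api_db)
print(cell_db)
print(api_db)
```

After the call, the cell holds the BDM and the tag, so a later delete can find the volume attachment and the API can still show the tags on the ERROR server.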

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/623931
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5b7e904126f68a8c9f5250620393dcdee54336d8
Submitter: Zuul
Branch: stable/rocky

commit 5b7e904126f68a8c9f5250620393dcdee54336d8
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 3 12:55:11 2018 -0500

    Add functional regression test for bug 1806064

    Change I9269ffa2b80e48db96c622d0dc0817738854f602 in Pike
    introduced a race condition where creating multiple
    servers concurrently can fail the second instance's quota
    check which happens in conductor after the instance record
    is created in the cell database but its related BDMs and
    tags are not stored in the cell DB. When deleting the
    server from the API, since the BDMs are not in the cell
    database with the instance, they are not "seen" and thus
    the volume attachments are not deleted and the volume is
    orphaned. As for tags, you should be able to see the tags
    on the server in ERROR status from the API before deleting
    it.

    This change adds a functional regression test to show both
    the volume attachment and tag issue when we fail the quota
    check in conductor.

    Change-Id: I21c2189cc1de6b8e4857de77acd9f1ef8b6ea9f6
    Related-Bug: #1806064
    (cherry picked from commit 5d514b33e28964b38aeb42a8dd5b93f3fc8ae239)

tags: added: in-stable-rocky

Reviewed: https://review.openstack.org/623932
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3028d25705bbd54cdf0b7ba13859a809d401fc70
Submitter: Zuul
Branch: stable/rocky

commit 3028d25705bbd54cdf0b7ba13859a809d401fc70
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 3 15:12:12 2018 -0500

    Create BDMs/tags in cell with instance when over-quota

    If the server create build request fails the quota check
    after the instance record has been created in a cell, we also
    need to create the BDMs and tags in that cell so that users
    can still see the tags on the server and so the API can
    properly clean up volume attachments when the server is deleted.

    This change updates _cleanup_build_artifacts to create BDMs
    and tags in the same cell as the instance prior to deleting the
    build request and request spec and adjusts the assertions in the
    related functional test to show the bug is fixed.

    As for instances that get buried in cell0 due to scheduling
    failures, the tags are not created there so comments are left
    in those code paths to fix that issue as well, but that can be
    done separately from this patch.

    Change-Id: I1a9bdb596f74605ab4613c9cb2574e976aebbd8c
    Closes-Bug: #1806064
    (cherry picked from commit 6d0386058b9628bbfcf64abdd707ad87ee19353c)

Reviewed: https://review.openstack.org/623933
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3f37d0bba46fd7f39265863b227c5fd504fa7864
Submitter: Zuul
Branch: stable/queens

commit 3f37d0bba46fd7f39265863b227c5fd504fa7864
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 3 12:55:11 2018 -0500

    Add functional regression test for bug 1806064

    Change I9269ffa2b80e48db96c622d0dc0817738854f602 in Pike
    introduced a race condition where creating multiple
    servers concurrently can fail the second instance's quota
    check which happens in conductor after the instance record
    is created in the cell database but its related BDMs and
    tags are not stored in the cell DB. When deleting the
    server from the API, since the BDMs are not in the cell
    database with the instance, they are not "seen" and thus
    the volume attachments are not deleted and the volume is
    orphaned. As for tags, you should be able to see the tags
    on the server in ERROR status from the API before deleting
    it.

    This change adds a functional regression test to show both
    the volume attachment and tag issue when we fail the quota
    check in conductor.

    Change-Id: I21c2189cc1de6b8e4857de77acd9f1ef8b6ea9f6
    Related-Bug: #1806064
    (cherry picked from commit 5d514b33e28964b38aeb42a8dd5b93f3fc8ae239)
    (cherry picked from commit 5b7e904126f68a8c9f5250620393dcdee54336d8)

tags: added: in-stable-queens

Reviewed: https://review.openstack.org/623934
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3dd42c2cd658fd3f73b11fbf5e81ccccd748b450
Submitter: Zuul
Branch: stable/queens

commit 3dd42c2cd658fd3f73b11fbf5e81ccccd748b450
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 3 15:12:12 2018 -0500

    Create BDMs/tags in cell with instance when over-quota

    If the server create build request fails the quota check
    after the instance record has been created in a cell, we also
    need to create the BDMs and tags in that cell so that users
    can still see the tags on the server and so the API can
    properly clean up volume attachments when the server is deleted.

    This change updates _cleanup_build_artifacts to create BDMs
    and tags in the same cell as the instance prior to deleting the
    build request and request spec and adjusts the assertions in the
    related functional test to show the bug is fixed.

    As for instances that get buried in cell0 due to scheduling
    failures, the tags are not created there so comments are left
    in those code paths to fix that issue as well, but that can be
    done separately from this patch.

    Change-Id: I1a9bdb596f74605ab4613c9cb2574e976aebbd8c
    Closes-Bug: #1806064
    (cherry picked from commit 6d0386058b9628bbfcf64abdd707ad87ee19353c)
    (cherry picked from commit 3028d25705bbd54cdf0b7ba13859a809d401fc70)

This issue was fixed in the openstack/nova 18.1.0 release.

This issue was fixed in the openstack/nova 17.0.9 release.

Reviewed: https://review.openstack.org/623935
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6b85dafd87a242b3d312f01eadc9e67376349ba4
Submitter: Zuul
Branch: stable/pike

commit 6b85dafd87a242b3d312f01eadc9e67376349ba4
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 3 12:55:11 2018 -0500

    Add functional regression test for bug 1806064

    Change I9269ffa2b80e48db96c622d0dc0817738854f602 in Pike
    introduced a race condition where creating multiple
    servers concurrently can fail the second instance's quota
    check which happens in conductor after the instance record
    is created in the cell database but its related BDMs and
    tags are not stored in the cell DB. When deleting the
    server from the API, since the BDMs are not in the cell
    database with the instance, they are not "seen" and thus
    the volume attachments are not deleted and the volume is
    orphaned. As for tags, you should be able to see the tags
    on the server in ERROR status from the API before deleting
    it.

    This change adds a functional regression test to show both
    the volume attachment and tag issue when we fail the quota
    check in conductor.

    NOTE(s10): Three changes have been made in this test for Pike:
    1. CONF.glance.api_servers is mandatory in Pike, so
    nova.tests.unit.image.fake.stub_out_image_service has been added.
    2. CinderFixture is being used instead of CinderFixtureNewAttachFlow
    introduced in Queens.
    3. In Pike volume will be in 'attaching' state without attachment
    created, so just check for this state without listing attachments.

    Change-Id: I21c2189cc1de6b8e4857de77acd9f1ef8b6ea9f6
    Related-Bug: #1806064
    (cherry picked from commit 5d514b33e28964b38aeb42a8dd5b93f3fc8ae239)
    (cherry picked from commit 5b7e904126f68a8c9f5250620393dcdee54336d8)
    (cherry picked from commit 3f37d0bba46fd7f39265863b227c5fd504fa7864)
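The release-specific check from item 3 of the note above can be sketched as a small helper. This is illustrative only: `volume_is_orphaned` is a hypothetical name, and the Queens branch assumes that the leftover state is a `reserved` volume with an attachment record still present, as the bug description and commit message suggest.

```python
def volume_is_orphaned(release, volume_status, attachments):
    """Hypothetical check for a volume left behind by this bug."""
    if release == "pike":
        # Pike: the volume is 'attaching' and no attachment record exists,
        # so only the status is checked.
        return volume_status == "attaching"
    # Assumption for Queens and later: the volume is 'reserved' and its
    # attachment record was never deleted.
    return volume_status == "reserved" and len(attachments) > 0


print(volume_is_orphaned("pike", "attaching", []))
print(volume_is_orphaned("queens", "reserved", ["attachment-1"]))
print(volume_is_orphaned("queens", "available", []))
```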

tags: added: in-stable-pike

Reviewed: https://review.openstack.org/623937
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3eb9006b3e7e3e9fd296b6c15cceebb36b0495a5
Submitter: Zuul
Branch: stable/pike

commit 3eb9006b3e7e3e9fd296b6c15cceebb36b0495a5
Author: Matt Riedemann <email address hidden>
Date: Sun Dec 9 19:54:16 2018 +0300

    Create BDMs/tags in cell with instance when over-quota

    If the server create build request fails the quota check
    after the instance record has been created in a cell, we also
    need to create the BDMs and tags in that cell so that users
    can still see the tags on the server and so the API can
    properly clean up volume attachments when the server is deleted.

    This change updates _cleanup_build_artifacts to create BDMs
    and tags in the same cell as the instance prior to deleting the
    build request and request spec and adjusts the assertions in the
    related functional test to show the bug is fixed.

    As for instances that get buried in cell0 due to scheduling
    failures, the tags are not created there so comments are left
    in those code paths to fix that issue as well, but that can be
    done separately from this patch.

    NOTE(s10): Conflict is caused by I70b11dd489d222be3d70733355bfe7966df556aa
    not being in Pike.

    Change-Id: I1a9bdb596f74605ab4613c9cb2574e976aebbd8c
    Closes-Bug: #1806064
    (cherry picked from commit 6d0386058b9628bbfcf64abdd707ad87ee19353c)
    (cherry picked from commit 3028d25705bbd54cdf0b7ba13859a809d401fc70)
    (cherry picked from commit 3dd42c2cd658fd3f73b11fbf5e81ccccd748b450)

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.

This issue was fixed in the openstack/nova 16.1.8 release.
