Compute node soft undelete raises inactive session error

Bug #1853159 reported by Mark Goddard
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Undecided
Mark Goddard

Bug Description

In the fix for bug 1839560, soft-deleted compute nodes may be restored, to ensure we can reuse ironic node UUIDs as compute node UUIDs. While this seems to largely work, it results in some nasty errors being generated.

Steps to reproduce
==================

Same as bug 1839560.

Expected results
================

Compute node is restored in an orderly fashion.

Actual results
==============

Example error:

http://paste.openstack.org/show/786350/

Despite the error, all my testing shows that the node is actually restored. However, the exception raised breaks the execution of the resource tracker update, which no doubt has some unintended consequences.

Environment
===========

Seen on Rocky 18.2.0, and master (in functional testing).

Mark Goddard (mgoddard)
Changed in nova:
assignee: nobody → Mark Goddard (mgoddard)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/695012

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/695189

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/695012
Committed: https://opendev.org/openstack/nova/commit/59d9871e8a0672538f8ffc43ae99b3d1c4b08909
Submitter: "Zuul (22348)"
Branch: master

commit 59d9871e8a0672538f8ffc43ae99b3d1c4b08909
Author: Mark Goddard <email address hidden>
Date: Tue Nov 19 14:45:02 2019 +0000

    Add functional regression test for bug 1853009

    Bug 1853009 describes a race condition involving multiple nova-compute
    services with ironic. As the compute services start up, the hash ring
    rebalances, and the compute services have an inconsistent view of which
    is responsible for a compute node.

    The sequence of actions here is adapted from a real world log [1], where
    multiple nova-compute services were started simultaneously. In some
    cases mocks are used to simulate race conditions.

    There are three main issues with the behaviour:

    * host2 deletes the orphan node compute node after host1 has taken
      ownership of it.

    * host1 assumes that another compute service will not delete its nodes.
      Once a node is in rt.compute_nodes, it is not removed again unless the
      node is orphaned. This prevents host1 from recreating the compute
      node.

    * host1 assumes that another compute service will not delete its
      resource providers. Once an RP is in the provider tree, it is not
      removed.

    This functional test documents the current behaviour, with the idea that
    it can be updated as this behaviour is fixed.

    [1] http://paste.openstack.org/show/786272/

    Co-Authored-By: Matt Riedemann <email address hidden>

    Change-Id: Ice4071722de54e8d20bb8c3795be22f1995940cd
    Related-Bug: #1853009
    Related-Bug: #1853159

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/695189
Committed: https://opendev.org/openstack/nova/commit/2383cbb4a518821d245fce316b3778c8ba8e5246
Submitter: "Zuul (22348)"
Branch: master

commit 2383cbb4a518821d245fce316b3778c8ba8e5246
Author: Mark Goddard <email address hidden>
Date: Wed Nov 20 12:01:33 2019 +0000

    Fix inactive session error in compute node creation

    In the fix for bug 1839560 [1][2], soft-deleted compute nodes may be
    restored, to ensure we can reuse ironic node UUIDs as compute node
    UUIDs. While this seems to largely work, it results in some nasty errors
    being generated [3]:

        InvalidRequestError This session is in 'inactive' state, due to the
        SQL transaction being rolled back; no further SQL can be emitted
        within this transaction.

    This happens because compute_node_create is decorated with
    pick_context_manager_writer, which begins a transaction. While
    _compute_node_get_and_update_deleted claims that calling a second
    pick_context_manager_writer decorated function will begin a new
    subtransaction, this does not appear to be the case.

    This change removes pick_context_manager_writer from the
    compute_node_create function, and adds a new _compute_node_create
    function which ensures the transaction is finished if
    _compute_node_get_and_update_deleted is called.

    The new unit test added here fails without this change.

    This change marks the removal of the final FIXME from the functional
    test added in [4].

    [1] https://bugs.launchpad.net/nova/+bug/1839560
    [2] https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
    [3] http://paste.openstack.org/show/786350/
    [4] https://review.opendev.org/#/c/695012/

    Change-Id: Iae119ea8776bc7f2e5dbe2e502a743217beded73
    Closes-Bug: #1853159
    Related-Bug: #1853009

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.0.0.0rc1

This issue was fixed in the openstack/nova 24.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/811805

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/811809

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/811810

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/811814

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/811815

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/811819

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811821

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811825

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/811805
Committed: https://opendev.org/openstack/nova/commit/c260e75d012cc4fae596d5de185afad6fb24068c
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit c260e75d012cc4fae596d5de185afad6fb24068c
Author: Mark Goddard <email address hidden>
Date: Tue Nov 19 14:45:02 2019 +0000

    Add functional regression test for bug 1853009

    Bug 1853009 describes a race condition involving multiple nova-compute
    services with ironic. As the compute services start up, the hash ring
    rebalances, and the compute services have an inconsistent view of which
    is responsible for a compute node.

    The sequence of actions here is adapted from a real world log [1], where
    multiple nova-compute services were started simultaneously. In some
    cases mocks are used to simulate race conditions.

    There are three main issues with the behaviour:

    * host2 deletes the orphan node compute node after host1 has taken
      ownership of it.

    * host1 assumes that another compute service will not delete its nodes.
      Once a node is in rt.compute_nodes, it is not removed again unless the
      node is orphaned. This prevents host1 from recreating the compute
      node.

    * host1 assumes that another compute service will not delete its
      resource providers. Once an RP is in the provider tree, it is not
      removed.

    This functional test documents the current behaviour, with the idea that
    it can be updated as this behaviour is fixed.

    [1] http://paste.openstack.org/show/786272/

    Co-Authored-By: Matt Riedemann <email address hidden>

    Change-Id: Ice4071722de54e8d20bb8c3795be22f1995940cd
    Related-Bug: #1853009
    Related-Bug: #1853159
    (cherry picked from commit 59d9871e8a0672538f8ffc43ae99b3d1c4b08909)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/811809
Committed: https://opendev.org/openstack/nova/commit/665c053315439e1345aa131f4839945d662fb3f3
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 665c053315439e1345aa131f4839945d662fb3f3
Author: Mark Goddard <email address hidden>
Date: Wed Nov 20 12:01:33 2019 +0000

    Fix inactive session error in compute node creation

    In the fix for bug 1839560 [1][2], soft-deleted compute nodes may be
    restored, to ensure we can reuse ironic node UUIDs as compute node
    UUIDs. While this seems to largely work, it results in some nasty errors
    being generated [3]:

        InvalidRequestError This session is in 'inactive' state, due to the
        SQL transaction being rolled back; no further SQL can be emitted
        within this transaction.

    This happens because compute_node_create is decorated with
    pick_context_manager_writer, which begins a transaction. While
    _compute_node_get_and_update_deleted claims that calling a second
    pick_context_manager_writer decorated function will begin a new
    subtransaction, this does not appear to be the case.

    This change removes pick_context_manager_writer from the
    compute_node_create function, and adds a new _compute_node_create
    function which ensures the transaction is finished if
    _compute_node_get_and_update_deleted is called.

    The new unit test added here fails without this change.

    This change marks the removal of the final FIXME from the functional
    test added in [4].

    [1] https://bugs.launchpad.net/nova/+bug/1839560
    [2] https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
    [3] http://paste.openstack.org/show/786350/
    [4] https://review.opendev.org/#/c/695012/

    Conflicts:
        nova/db/sqlalchemy/api.py

    NOTE(melwitt): The conflict is because change
    I9f414cf831316b624132d9e06192f1ecbbd3dd78 (db: Copy docs from
    'nova.db.*' to 'nova.db.sqlalchemy.*') is not in Wallaby.

    NOTE(melwitt): Difference from the cherry picked change from calling
    nova.db.api => nova.db.sqlalchemy.api directly are due to the alembic
    migration in Xena which looks to have made the nova.db.api interface
    obsolete.

    Change-Id: Iae119ea8776bc7f2e5dbe2e502a743217beded73
    Closes-Bug: #1853159
    Related-Bug: #1853009
    (cherry picked from commit 2383cbb4a518821d245fce316b3778c8ba8e5246)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.2.1

This issue was fixed in the openstack/nova 23.2.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/train)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811821
Reason: stable/train branch of nova projects' have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/811825
Reason: stable/train branch of nova projects' have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ussuri)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/811819
Reason: stable/ussuri branch of openstack/nova transitioned to End of Life and is about to be deleted. To be able to do that, all open patches need to be abandoned.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/811815
Reason: stable/ussuri branch of openstack/nova transitioned to End of Life and is about to be deleted. To be able to do that, all open patches need to be abandoned.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/victoria)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/811814
Reason: stable/victoria branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/victoria if you want to further work on this patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/811810
Reason: stable/victoria branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/victoria if you want to further work on this patch.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.