tempest plugin can deadlock on test_delete_introspection

Bug #2028866 reported by Julia Kreger
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ironic
Fix Released
Medium
Unassigned

Bug Description

The ironic-tempest-plugin test test_delete_allocation can create a deadlock by trying create an allocation, and then turn around and then *delete* an allocation[0].

Which is fine if your only creating API records.

But... creating an allocation launches a worker task[1], and also immediately returns back to the caller with the base allocation.

Under the hood, the allocation attempts to identify candidates, and then allocate a node. The act of allocating a node holds a lock on the node, and in that entire process, calls allocation.save().

Allocation.save() internally, on the object calls update_allocation[2] in the DB API which pulls a row level lock for the entry in the allocations table, and then attempts to resolve if it *can* lock the node or not, and so on and so forth. To put icing on the cake, it returns the session query result ref from a write transaction, which means that transaction itself can also derail the sequence if it doesn't get garbage collected quickly.

Anyhow, this row will occur locked multiple times, and has the deadlock decorator on it today. Except going and trying to delete the row in a new transaction can cause issues to begin with.

The allocation is only safe to delete, realistically when it's state is in ERROR or ACTIVE. In flight might conflict.

So a few things are broken here:
1) The tempest plugin is just a tad bit too aggressive.
2) The update_allocation db_api call should re-query with a read and hand that back instead of the row from the writer transaction. While it flushes, it is still an open writer.
3) We *likely* need constrain the ability to delete an allocation that is still in flight.

[0]: https://github.com/openstack/ironic-tempest-plugin/blob/3c8235ed02c10940fc3889ae04d9ba7a6a8bbb34/ironic_tempest_plugin/tests/api/admin/test_allocations.py#L121
[1]: https://github.com/openstack/ironic/blob/40356e2eae020ffe2ec778500e362769071f1559/ironic/conductor/manager.py#L3499
[2]: https://github.com/openstack/ironic/blob/master/ironic/db/sqlalchemy/api.py#L2308

Changed in ironic:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic 22.1.0

This issue was fixed in the openstack/ironic 22.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/ironic/+/895038

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/ironic/+/895039

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/ironic/+/895038
Committed: https://opendev.org/openstack/ironic/commit/b35a9c78c6a5f1163df10738f799229144282e30
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit b35a9c78c6a5f1163df10738f799229144282e30
Author: Julia Kreger <email address hidden>
Date: Thu Jul 27 09:33:03 2023 -0700

    DB: Streamline allocation interactions

    We've discovered we can deadlock on allocations, and reviewing
    the code of both the test and the underlying db, it is sort of a
    "multiple things contribute scenario", but first up here is to
    streamline the allocations update process so we re-query after
    closing out the transaction.

    Change-Id: I46e78813787703819a61f69d4243271ec07e0983
    Partial-Bug: #2028866
    (cherry picked from commit cc9af373e7120bdf7699c370e118d5f11171b573)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/ironic/+/895039
Committed: https://opendev.org/openstack/ironic/commit/58251eb3730e4d49322db3e5c7ddc169fc9539b0
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 58251eb3730e4d49322db3e5c7ddc169fc9539b0
Author: Julia Kreger <email address hidden>
Date: Thu Jul 27 10:01:31 2023 -0700

    DB: Select upon delete for allocations

    When an allocation has been created, it is still being created
    in the background. The client may request deletion of the
    allocation *prior* to the creation of the allocation is entirely
    completed. That is fine, but the challenge is we may encounter a
    locked row if we're asked to delete while in process.

    So, we'll query with with_for_update[0] which should be held until
    the lock is released, which is only released when the original
    locking transaction closes out[1][2].

    [0]: https://docs.sqlalchemy.org/en/14/core/selectable.html#sqlalchemy.sql.expression.GenerativeSelect.with_for_update
    [1]: https://dev.mysql.com/doc/refman/5.7/en/select.html
    [2]: https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-reads.html

    Change-Id: I0b68f054c951655b01f0cd776feb5a8c768471ab
    Closes-Bug: #2028866
    (cherry picked from commit 6ac0bdab1de8219fed419a838663aa82973e93e9)

Changed in ironic:
status: Confirmed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic 21.4.1

This issue was fixed in the openstack/ironic 21.4.1 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.