Quota driver "DbQuotaNoLockDriver" can lock when removing the expired reservations

Bug #1954662 reported by Rodolfo Alonso
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Rodolfo Alonso
Milestone: (none listed)

Bug Description

For context, this bug is related to [1].

In [1] we found that we were always deleting the reservations for a specific resource and project, regardless of their age. The solution was to introduce a timeout (with a reasonable value of 20 seconds) to filter the existing reservations. Any recent reservation, created by an ongoing request transaction, is kept in the DB.
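
A minimal sketch of that timeout filter, using plain Python objects rather than the Neutron models. The "Reservation" dataclass, the "split_reservations" helper and the "TIMEOUT" constant are hypothetical names used only for illustration; only the 20 second value comes from this report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Value quoted in this report; the constant name is an assumption.
TIMEOUT = timedelta(seconds=20)


@dataclass(frozen=True)
class Reservation:
    # Hypothetical stand-in, not the Neutron "Reservation" model.
    id: int
    project_id: str
    resource: str
    created_at: datetime


def split_reservations(reservations, now, timeout=TIMEOUT):
    """Return (recent, stale); only stale entries are deletion candidates."""
    cutoff = now - timeout
    recent = [r for r in reservations if r.created_at >= cutoff]
    stale = [r for r in reservations if r.created_at < cutoff]
    return recent, stale


now = datetime(2021, 12, 13, 14, 30, 0, tzinfo=timezone.utc)
reservations = [
    # Created 5 seconds ago by an ongoing request: must be kept.
    Reservation(1, "project-a", "port", now - timedelta(seconds=5)),
    # Created 45 seconds ago: older than the timeout, can be removed.
    Reservation(2, "project-a", "port", now - timedelta(seconds=45)),
]
recent, stale = split_reservations(reservations, now)
print([r.id for r in recent], [r.id for r in stale])
```

Filtering by age is what protects the reservations of requests that are still in flight; without the cutoff, both rows above would be deleted.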

This bug shows another problem that appears under very high concurrency. The deletion of the expired reservations [2] cannot be executed at the same time by two or more concurrent transactions. If this happens, only one transaction succeeds and the others fail, triggering the DB retry and ending in a DB lock state.

Error log: https://paste.opendev.org/show/811637/

[1] https://bugs.launchpad.net/neutron/+bug/1940311
[2] https://github.com/openstack/neutron/blob/e99d9a9d0697a21ba7ec84465f415f60041f3767/neutron/db/quota/driver_nolock.py#L53-L58

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
importance: Undecided → High
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/821592

Changed in neutron:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/821592
Committed: https://opendev.org/openstack/neutron/commit/2dd3ffa271d68b4e042ff64fcc2657af6990e95f
Submitter: "Zuul (22348)"
Branch: master

commit 2dd3ffa271d68b4e042ff64fcc2657af6990e95f
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Mon Dec 13 14:29:47 2021 +0000

    Remove the expired reservations in a separate DB transaction

    In "DbQuotaNoLockDriver", when a new reservation is being made,
    first the expired reservations are removed. That guarantees the
    freshness of the existing reservations.

    In systems with high concurrency of operations, the
    "DbQuotaNoLockDriver.make_reservation" method will be called in
    parallel. The expired reservations removal implies a deletion
    on the "reservation" table that could be executed by several
    workers at the same time (in the same controller or not). That
    could lead to a "DBDeadlock" exception if multiple workers try
    to delete the same records.

    If an API worker receives this exception, it should continue,
    as the expired reservations have been deleted by another worker.
    It should not retry this operation.

    If the reservations are not deleted, the quota engine will filter
    out those expired reservations when counting the current number of
    reservations [1][2][3]. That means even if in a particular request
    the expired reservations are not deleted, these won't count in the
    resource quota calculation.

    The default reservation expiration timeout is now set to 120
    seconds (as it should have been set initially), which has been the
    default expiration delta for a reservation since 2015.

    [1] https://github.com/openstack/neutron/blob/e99d9a9d0697a21ba7ec84465f415f60041f3767/neutron/quota/resource.py#L340
    [2] https://github.com/openstack/neutron/blob/e99d9a9d0697a21ba7ec84465f415f60041f3767/neutron/db/quota/api.py#L226
    [3] https://github.com/openstack/neutron/blob/e99d9a9d0697a21ba7ec84465f415f60041f3767/neutron/objects/quota.py#L100-L101

    Closes-Bug: #1954662
    Change-Id: I8af6565d2537db7f0df2e8e567ea046a0a6e003a
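
The behaviour described in this commit message can be sketched as follows. This is not the Neutron implementation: it uses the standard library "sqlite3" module, a made-up "reservations" table, and a hypothetical "DeadlockError" class standing in for the "DBDeadlock" exception. The deadlock itself is simulated by injecting a failing delete function, since a real one depends on the database backend.

```python
import sqlite3


class DeadlockError(Exception):
    """Hypothetical stand-in for the "DBDeadlock" exception."""


def _delete_expired(conn, now):
    cur = conn.execute(
        "DELETE FROM reservations WHERE expiration < ?", (now,))
    return cur.rowcount


def remove_expired(conn, now, delete=_delete_expired):
    """Delete expired rows in their own transaction, without retrying.

    Returns the number of deleted rows, or None when the deletion was
    skipped because another worker is assumed to have done it.
    """
    try:
        with conn:  # commits on success, rolls back on an exception
            return delete(conn, now)
    except DeadlockError:
        return None


def count_active(conn, now):
    """Count only unexpired rows, so leftover expired rows do not matter."""
    row = conn.execute(
        "SELECT COUNT(*) FROM reservations WHERE expiration >= ?",
        (now,)).fetchone()
    return row[0]


def _raise_deadlock(conn, now):
    # Simulates losing the race against another worker.
    raise DeadlockError()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reservations (id INTEGER PRIMARY KEY, expiration REAL)")
now = 1000.0
with conn:
    conn.executemany(
        "INSERT INTO reservations (expiration) VALUES (?)",
        [(now - 30,), (now + 90,), (now + 120,)])

skipped = remove_expired(conn, now, delete=_raise_deadlock)
active_after_skip = count_active(conn, now)
deleted = remove_expired(conn, now)
active_after_delete = count_active(conn, now)
print(skipped, active_after_skip, deleted, active_after_delete)
```

The count is the same whether or not the clean up ran, which is why the worker can safely ignore the failed deletion instead of retrying it.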

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/821967

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/821967
Committed: https://opendev.org/openstack/neutron/commit/05666791e104de8d8120070bd73aa83205d78eb9
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 05666791e104de8d8120070bd73aa83205d78eb9
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Mon Dec 13 14:29:47 2021 +0000

    Remove the expired reservations in a separate DB transaction

    Conflicts:
            neutron/db/quota/driver_nolock.py

    Closes-Bug: #1954662
    Change-Id: I8af6565d2537db7f0df2e8e567ea046a0a6e003a
    (cherry picked from commit 2dd3ffa271d68b4e042ff64fcc2657af6990e95f)

tags: added: in-stable-xena
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.1.0

This issue was fixed in the openstack/neutron 19.1.0 release.

tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-lib (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-lib/+/825490

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/825521

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-lib (master)

Reviewed: https://review.opendev.org/c/openstack/neutron-lib/+/825490
Committed: https://opendev.org/openstack/neutron-lib/commit/04a87664c21824b69799326452fcb45c11c06b7e
Submitter: "Zuul (22348)"
Branch: master

commit 04a87664c21824b69799326452fcb45c11c06b7e
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Sun Jan 16 13:09:52 2022 +0000

    Add "get_workers" method to "QuotaDriverAPI" class

    This method returns the quota driver workers that need to be
    spawned during the plugin initialization.

    Change-Id: Id9840912b9d0018d008b6961d24dadbfaafc9f8e
    Related-Bug: #1954662
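
A rough sketch of what such a hook could look like. The class names and the "collect_workers" helper below are hypothetical and do not reproduce the neutron-lib code; only the "get_workers" method name comes from the commit message, and the workers are represented by plain strings.

```python
import abc


class QuotaDriverSketch(abc.ABC):
    """Hypothetical abstract driver, loosely modelled on the description."""

    @abc.abstractmethod
    def get_workers(self):
        """Return the workers the plugin should spawn at initialization."""


class NoWorkerDriver(QuotaDriverSketch):
    # A driver that needs no background work returns an empty list.
    def get_workers(self):
        return []


class CleanupDriver(QuotaDriverSketch):
    # A driver that wants a periodic clean up advertises its worker.
    def get_workers(self):
        return ["reservation-cleanup"]


def collect_workers(drivers):
    """Gather the workers of every driver, as a plugin might do at start."""
    workers = []
    for driver in drivers:
        workers.extend(driver.get_workers())
    return workers


workers = collect_workers([NoWorkerDriver(), CleanupDriver()])
print(workers)
```

Making the method abstract forces every driver to state explicitly whether it needs background workers.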

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/825521
Committed: https://opendev.org/openstack/neutron/commit/6ea6fdd874eed70d8c103f6bc91783921c817cc8
Submitter: "Zuul (22348)"
Branch: master

commit 6ea6fdd874eed70d8c103f6bc91783921c817cc8
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Sun Jan 16 16:32:18 2022 +0000

    Create a PeriodicWorker for DbQuotaNoLockDriver clean up

    The clean up of the "DbQuotaNoLockDriver" quota driver
    "Reservation" records is now done in a "PeriodicWorker" spawned by
    ML2Plugin during the initialization. The "Reservation" records are
    no longer deleted synchronously during the API calls.

    That will prevent possible database deadlocks when concurrent
    delete operations clash (as seen in very busy systems, with more
    than 500 parallel requests). Although those database deadlocks
    were recoverable, this new implementation avoids them by allowing
    only a single thread to execute this command periodically.

    Related-Bug: #1954662
    Change-Id: I50bab57830ce4c1d123b2cbd9d9832690bd4c8f9
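
The idea of moving the clean up to a single periodic thread can be illustrated with the standard library alone. The "PeriodicCleanup" class below is a hypothetical sketch, not the Neutron "PeriodicWorker"; the interval and the callback are placeholders.

```python
import threading


class PeriodicCleanup:
    """Hypothetical worker that runs one callback from a single thread."""

    def __init__(self, cleanup, interval):
        self._cleanup = cleanup
        self._interval = interval
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()

    def _run(self):
        # Event.wait() returns False on timeout and True once stop() is
        # called, so the loop body runs once per interval until stopped.
        while not self._stop_event.wait(self._interval):
            self._cleanup()


runs = []
done = threading.Event()


def cleanup():
    # Placeholder for "delete the expired reservations".
    runs.append(threading.current_thread().name)
    if len(runs) >= 3:
        done.set()


worker = PeriodicCleanup(cleanup, interval=0.01)
worker.start()
done.wait(timeout=5)
worker.stop()
print(len(runs) >= 3, len(set(runs)))
```

Because every call comes from the same thread, two deletions from this worker can never overlap, which is the property the commit relies on to avoid the clash between concurrent deletes.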

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/828172

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/828172
Committed: https://opendev.org/openstack/neutron/commit/8dc7db260866cda18327a95134b79e5274084588
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 8dc7db260866cda18327a95134b79e5274084588
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Sun Jan 16 16:32:18 2022 +0000

    Create a PeriodicWorker for DbQuotaNoLockDriver clean up

    Related-Bug: #1954662

    Conflicts:
            neutron/db/quota/api.py
            neutron/db/quota/driver_nolock.py

    Change-Id: I50bab57830ce4c1d123b2cbd9d9832690bd4c8f9
    (cherry picked from commit 6ea6fdd874eed70d8c103f6bc91783921c817cc8)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.0.0.0rc1

This issue was fixed in the openstack/neutron 20.0.0.0rc1 release candidate.

tags: removed: neutron-proactive-backport-potential