DB Deadlock on quota reserve

Bug #1518936 reported by Valeriy Ponomaryov
This bug affects 2 people
Affects: OpenStack Shared File Systems Service (Manila)
Status: Fix Released
Importance: Low
Assigned to: Yiping Huang

Bug Description

See the logs of the Rally job results [1]. They contain the following error:

2015-11-23 11:22:39.134 12164 ERROR oslo_db.api DBDeadlock: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') [SQL: u'INSERT INTO quota_usages (created_at, updated_at, deleted_at, deleted, project_id, user_id, resource, in_use, reserved, until_refresh) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'] [parameters: (datetime.datetime(2015, 11, 23, 11, 22, 39, 117932), datetime.datetime(2015, 11, 23, 11, 22, 39, 117688), None, 0, u'4c36967975304f639790e12e783a0c72', u'6490ad4bc45240e4b89958152e23b251', 'share_networks', 0, 0, None)]

[1] http://logs.openstack.org/15/248615/1/check/gate-rally-dsvm-manila-multibackend/b9889c1/logs/screen-m-api.txt.gz?level=TRACE#_2015-11-23_11_22_39_134
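
For context, the INSERT above comes from the quota reservation path (manila/quota.py, reserve()). A minimal sketch of the calling pattern; the caller and helper names below are illustrative, not a verbatim copy of manila's API code:

    from manila import quota

    QUOTAS = quota.QUOTAS

    def create_share_network_with_quota(context):
        # reserve() runs a DB transaction that reads or creates quota_usages
        # rows and inserts reservation rows; two concurrent requests for the
        # same project can deadlock on those inserts (MySQL error 1213).
        reservations = QUOTAS.reserve(context, share_networks=1)
        try:
            # ... create the share network here ...
            QUOTAS.commit(context, reservations)
        except Exception:
            QUOTAS.rollback(context, reservations)
            raise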

tags: added: api db
Changed in manila:
importance: Undecided → Low
milestone: none → newton-1
Revision history for this message
Peter Wang (peter.wang) wrote :

Is there anyone working on this issue?

I'd appreciate it if someone could suggest a workaround.

Thanks
Peter

Tom Barron (tpb)
tags: added: quotas
Changed in manila:
milestone: newton-1 → none
Revision history for this message
Goutham Pacha Ravi (gouthamr) wrote :

Sorry to be chiming in on this really old bug report.
Peter: do you still see this issue occurring? If so, please feel free to set this bug's status to "Confirmed". I haven't seen this issue in our rally jobs; we'll keep a lookout for it.

tags: added: concurrency races
Changed in manila:
status: New → Incomplete
Revision history for this message
Tom Barron (tpb) wrote :

As Goutham just pointed out to me, we've just had an instance of what is almost certainly the same issue at root. The original report is for an INSERT, but here's a failure just like it, except that the deadlock is on deletion of quota usages:

From the manila-api log:

Mar 21 18:08:46.263907 ubuntu-bionic-ovh-bhs1-0004095398 manila-api[16852]: DEBUG manila.quota [None req-91556767-fbbd-43ba-81f2-5a49c7eb7f86 None None] Created reservations ['e2ffd686-3ec9-4129-bf46-909eca546638'] {{(pid=16959) reserve /opt/stack/new/manila/manila/quota.py:916}}
Mar 21 18:08:46.278873 ubuntu-bionic-ovh-bhs1-0004095398 manila-api[16852]: DEBUG eventlet.wsgi.server [-] (16959) accepted ('::ffff:158.69.74.30', 57108, 0, 0) {{(pid=16959) server /usr/local/lib/python3.6/dist-packages/eventlet/wsgi.py:956}}
Mar 21 18:08:46.279440 ubuntu-bionic-ovh-bhs1-0004095398 manila-api[16852]: DEBUG eventlet.wsgi.server [-] (16959) accepted ('::ffff:158.69.74.30', 57110, 0, 0) {{(pid=16959) server /usr/local/lib/python3.6/dist-packages/eventlet/wsgi.py:956}}
Mar 21 18:08:46.295613 ubuntu-bionic-ovh-bhs1-0004095398 manila-api[16852]: INFO manila.api.openstack.wsgi [None req-fa8107c9-1bd2-423c-ba8d-e2a6e93a5620 None None] DELETE https://158.69.74.30:8786/v1/081d92bd03fb49288548b31f48c36340/share-networks/7c850b58-8ce8-433e-8f53-fba572dc4644
Mar 21 18:08:46.296256 ubuntu-bionic-ovh-bhs1-0004095398 manila-api[16852]: DEBUG manila.api.openstack.wsgi [None req-fa8107c9-1bd2-423c-ba8d-e2a6e93a5620 None None] Empty body provided in request {{(pid=16959) get_body /opt/stack/new/manila/manila/api/openstack/wsgi.py:653}}
Mar 21 18:08:46.296789 ubuntu-bionic-ovh-bhs1-0004095398 manila-api[16852]: DEBUG manila.api.openstack.wsgi [None req-fa8107c9-1bd2-423c-ba8d-e2a6e93a5620 None None] Calling method 'ShareNetworkController.delete' {{(pid=16959) _process_stack /opt/stack/new/manila/manila/api/openstack/wsgi.py:799}}
Mar 21 18:08:46.347740 ubuntu-bionic-ovh-bhs1-0004095398 manila-api[16852]: ERROR manila.api.middleware.fault [None req-3d6475a9-72fb-42c5-9f66-8deaa54d4e49 None None] Caught error: (pymysql.err.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') [SQL: 'UPDATE quota_usages SET updated_at=quota_usages.updated_at, deleted_at=%(deleted_at)s, deleted=quota_usages.id WHERE quota_usages.deleted = %(deleted_1)s AND quota_usages.share_type_id = %(share_type_id_1)s'] [parameters: {'deleted_at': datetime.datetime(2019, 3, 21, 18, 8, 46, 267266), 'deleted_1': 0, 'share_type_id_1': 'd01f19cc-c877-4404-b34f-52fd743e3c5e'}] (Background on this error at: http://sqlalche.me/e/e3q8): oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') [SQL: 'UPDATE quota_usages SET updated_at=quota_usages.updated_at, deleted_at=%(deleted_at)s, deleted=quota_usages.id WHERE quota_usages.deleted = %(deleted_1)s AND quota_usages.share_type_id = %(share_type_id_1)s'] [parameters: {'deleted_at': datetime.datetime(2019, 3, 21, 18, 8, 46, 267266), 'deleted_1': 0, 'share_type_id_1': 'd01f19cc-c877-4404-b34f-52fd743e3c5e'}] (Background on this error at: http://sqlalche.me/e/e3q8)
Mar 21 18:08:46.347...
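
For reference, the UPDATE in the trace above is the statement oslo.db's Query.soft_delete() emits when the quota_usages rows for a share type are soft-deleted. A rough, self-contained sketch under that assumption (the model and session wiring below are trimmed-down stand-ins, not manila's actual DB code):

    from oslo_db.sqlalchemy import models, orm as oslo_orm
    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class QuotaUsage(Base, models.ModelBase, models.SoftDeleteMixin,
                     models.TimestampMixin):
        # Trimmed-down stand-in for manila's quota_usages table.
        __tablename__ = 'quota_usages'
        id = Column(Integer, primary_key=True)
        share_type_id = Column(String(36))

    # Placeholder engine URL; oslo.db's Query class is what adds .soft_delete().
    Session = sessionmaker(bind=create_engine('sqlite://'),
                           query_cls=oslo_orm.Query)

    def soft_delete_quota_usages(share_type_id):
        session = Session()
        with session.begin():
            # Emits: UPDATE quota_usages SET updated_at=updated_at,
            #   deleted_at=<now>, deleted=id
            #   WHERE deleted = 0 AND share_type_id = <share_type_id>
            (session.query(QuotaUsage)
             .filter_by(deleted=0, share_type_id=share_type_id)
             .soft_delete(synchronize_session=False))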


Changed in manila:
status: Incomplete → Confirmed
Revision history for this message
Jason Grosso (jgrosso) wrote :

Tom, this looks like an active problem. Why is this a low priority? Is it hard to reproduce, or is it something most customers would not run into very often?

Revision history for this message
Tom Barron (tpb) wrote :

Jason, yeah, I don't think we have a good sense of how to reproduce the issue, and so far the reports are from developers seeing random-ish failures in CI rather than from operators with field deployments.

Revision history for this message
Goutham Pacha Ravi (gouthamr) wrote :

One more failure: http://logs.openstack.org/34/671134/5/gate/manila-tempest-minimal-dsvm-dummy/02a769c/logs/screen-m-api.txt.gz#_Jul_25_19_58_14_319488

m-api logs attached.

This is happening very sporadically, though. The test class we were trying to execute was SchedulerStatsAdminTest:

2019-07-25 19:58:14.321 28321 INFO tempest.lib.common.rest_client [req-785d8f85-4bb0-47da-b0eb-3461e8e697af ] Request (SchedulerStatsAdminTest:tearDownClass): 500 DELETE https://104.130.132.195:8786/v2/09ac67e686dd45f5a3db0940a38b4b34/types/be13b002-75e7-4a17-bede-f7a813f1854b 0.228s
2019-07-25 19:58:14.322 28321 DEBUG tempest.lib.common.rest_client [req-785d8f85-4bb0-47da-b0eb-3461e8e697af ] Request - Headers: {'x-openstack-manila-api-version': '2.49', 'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'status': '500', u'content-length': '128', 'content-location': 'https://104.130.132.195:8786/v2/09ac67e686dd45f5a3db0940a38b4b34/types/be13b002-75e7-4a17-bede-f7a813f1854b', u'x-compute-request-id': 'req-785d8f85-4bb0-47da-b0eb-3461e8e697af', u'vary': 'X-OpenStack-Manila-API-Version', u'x-openstack-manila-api-version': '2.49', u'server': 'Apache/2.4.29 (Ubuntu)', u'connection': 'close', u'date': 'Thu, 25 Jul 2019 19:58:14 GMT', u'content-type': 'application/json'}
        Body: {"computeFault": {"code": 500, "message": "The server has either erred or is incapable of performing the requested operation."}} _log_request_full tempest/lib/common/rest_client.py:440
2019-07-25 19:58:14.322 28321 ERROR tempest.test [-] teardown of resources failed: Got server fault
Details: The server has either erred or is incapable of performing the requested operation.: ServerFault: Got server fault
Details: The server has either erred or is incapable of performing the requested operation.
2019-07-25 19:58:14.322 28321 ERROR tempest.test Traceback (most recent call last):
2019-07-25 19:58:14.322 28321 ERROR tempest.test File "tempest/test.py", line 197, in tearDownClass
2019-07-25 19:58:14.322 28321 ERROR tempest.test teardown()
2019-07-25 19:58:14.322 28321 ERROR tempest.test File "/opt/stack/new/manila-tempest-plugin/manila_tempest_tests/tests/api/base.py", line 317, in resource_cleanup
2019-07-25 19:58:14.322 28321 ERROR tempest.test cls.clear_resources(cls.class_resources)
2019-07-25 19:58:14.322 28321 ERROR tempest.test File "/opt/stack/new/manila-tempest-plugin/manila_tempest_tests/tests/api/base.py", line 962, in clear_resources
2019-07-25 19:58:14.322 28321 ERROR tempest.test client.delete_share_type(res_id)
2019-07-25 19:58:14.322 28321 ERROR tempest.test File "/opt/stack/new/manila-tempest-plugin/manila_tempest_tests/services/share/v2/json/shares_client.py", line 928, in delete_share_type
2019-07-25 19:58:14.322 28321 ERROR tempest.test resp, body = self.delete("types/%s" % share_type_id, version=version)
2019-07-25 19:58:14.322 28321 ERROR tempest.test File "/opt/stack/new/manila-tempest-plugin/manila_tempest_tests/services/share/v2/json/shares_client.py", line 87, in delete
2019-07-25 19:58:14.322 28321 ERROR tempest.test body=body)
2019-07-25 19:58:14...


Revision history for this message
Goutham Pacha Ravi (gouthamr) wrote :

Another failure observed on a stable/rocky job: https://review.opendev.org/#/c/691006/2
Log from manila-api for the request ID that failed: http://paste.openstack.org/show/785563/

Interestingly, the last three occurrences of this issue have all been while deleting a share type.

Revision history for this message
Goutham Pacha Ravi (gouthamr) wrote :

Another failure observed on a stable/train job, again while cleaning up a share type: https://review.opendev.org/#/c/722531/

Revision history for this message
Goutham Pacha Ravi (gouthamr) wrote :

Revisiting this bug while we're triaging stale bugs - we haven't recorded a recent CI run that reproduces this issue; however, going by our prior analysis, I think the "share_type_destroy" method here could use a DB retry on deadlock. If multiple quota operations occur at the exact same instant, with no retry capability, we're open to hitting a deadlock - we have retry decorators on many other "update"/"delete" operations in the database code. We should probably add one here and see if that helps (see the sketch below).
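
Roughly, the idea is to wrap the DB API call with oslo.db's retry decorator, which re-runs the function whenever it raises DBDeadlock. A minimal sketch, assuming the decorator pattern used elsewhere in manila's DB layer; the function body here is a placeholder, not manila's actual share_type_destroy implementation:

    from oslo_db import api as oslo_db_api

    @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True)
    def share_type_destroy(context, type_id):
        # Soft-delete the share type, its extra specs and its quota usages
        # in one transaction; if MySQL aborts it with error 1213, oslo.db
        # raises DBDeadlock and the decorator retries the whole call with
        # incremental backoff.
        ...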

I'm going to set this to "low-hanging-fruit" to see if any new contributors can pick this up.

tags: added: low-hanging-fruit
Changed in manila:
assignee: nobody → Goutham Pacha Ravi (gouthamr)
Changed in manila:
assignee: Goutham Pacha Ravi (gouthamr) → nobody
Yiping Huang (yhcode)
Changed in manila:
assignee: nobody → Yiping Huang (yhcode)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to manila (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/manila/+/812232

Changed in manila:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to manila (master)

Reviewed: https://review.opendev.org/c/openstack/manila/+/812232
Committed: https://opendev.org/openstack/manila/commit/3935fe5e6ae3e1488f31beba2de8da277925ee89
Submitter: "Zuul (22348)"
Branch: master

commit 3935fe5e6ae3e1488f31beba2de8da277925ee89
Author: Yiping Huang <email address hidden>
Date: Fri Oct 1 17:32:49 2021 -0400

    retry deadlock for share_type_destroy

    by adding a decorator.

    Closes-Bug: 1518936
    Change-Id: I788d34a88a79058c5ab35a8441f0a9dd4b667a2b

Changed in manila:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/manila 14.0.0.0rc1

This issue was fixed in the openstack/manila 14.0.0.0rc1 release candidate.
