PostgreSQL database deadlocks under high load

Bug #1785459 reported by ITD27M01
26
This bug affects 6 people
Affects Status Importance Assigned to Milestone
Designate
New
Undecided
Unassigned

Bug Description

We are using standard nova and neutron notifications handlers for records create. We have three controllers with the following packages:

[root@srv-os-ctl01 ~]# rpm -qa | grep designate
openstack-designate-api-3.0.1-3.el7ost.noarch
openstack-designate-worker-3.0.1-3.el7ost.noarch
python-designate-3.0.1-3.el7ost.noarch
openstack-designate-mdns-3.0.1-3.el7ost.noarch
openstack-designate-common-3.0.1-3.el7ost.noarch
openstack-designate-producer-3.0.1-3.el7ost.noarch
python-designateclient-2.3.0-1.el7ost.noarch
openstack-designate-central-3.0.1-3.el7ost.noarch
openstack-designate-agent-3.0.1-3.el7ost.noarch
openstack-designate-sink-3.0.1-3.el7ost.noarch

Memcached is used as a coordination backend.

Under high load we get the following errors in central.log:

WARNING designate.storage [req-75f6b550-ff4c-4bb1-8d33-f96d6326866d - - - - -] Deadlock detected. Retrying...

And on the postgresql side:

< 2018-08-02 16:23:04.286 MSK > ERROR: deadlock detected
< 2018-08-02 16:23:04.286 MSK > DETAIL: Process 40320 waits for AccessExclusiveLock on tuple (141,10) of relation 17034 of database 16396; blocked by process 40159.
        Process 40159 waits for ShareLock on transaction 109671456; blocked by process 40102.
        Process 40102 waits for ShareLock on transaction 109671770; blocked by process 40320.
        Process 40320: UPDATE records SET version=(records.version + 1), updated_at='2018-08-02T13:23:03.284743'::timestamp, data='srv-sdn-osns1.example.local. devops.example.local. 1533216183 3565 600 86400 3600', hash='76e0db0719565cfa6a40cbdb8ed6b3d7', serial=1533216183 WHERE records.id = '65120dd1-300a-480e-972a-a664666fc702'::uuid
        Process 40159: UPDATE records SET version=(records.version + 1), updated_at='2018-08-02T13:22:57.797790'::timestamp, status='ACTIVE', action='NONE' WHERE records.id = '65120dd1-300a-480e-972a-a664666fc702'::uuid
        Process 40102: UPDATE zones SET version=(zones.version + 1), updated_at='2018-08-02T13:23:04.207338'::timestamp, status='ACTIVE', action='NONE' WHERE zones.id = 'eee4274e-020f-4077-bc60-d5f76b598cb2'::uuid AND zones.deleted = '0'
< 2018-08-02 16:23:04.286 MSK > HINT: See server log for query details.
< 2018-08-02 16:23:04.286 MSK > STATEMENT: UPDATE records SET version=(records.version + 1), updated_at='2018-08-02T13:23:03.284743'::timestamp, data='srv-sdn-osns1.example.local. devops.example.local. 1533216183 3565 600 86400 3600', hash='76e0db0719565cfa6a40cbdb8ed6b3d7', serial=1533216183 WHERE records.id = '65120dd1-300a-480e-972a-a664666fc702'::uuid

As a workaround we have moved central and mdns services to pacemaker and are using only one instances of each currently.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to designate (master)

Reviewed: https://review.openstack.org/647711
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=f828654a3d40476cac7eb24a09a36e9978c2d708
Submitter: Zuul
Branch: master

commit f828654a3d40476cac7eb24a09a36e9978c2d708
Author: Takahito Hirose <email address hidden>
Date: Tue Mar 26 19:52:33 2019 +0900

    Fix DBDeadLock error resulting into 500

    When user requests the record registration request continuously,
    sometimes designate hits DBDeadLock resuting into 500 InternalServerError.

    We get below error:

    2019-02-21 21:30:39.925 49752 ERROR designate.api.middleware RemoteError:
    Remote error: DBDeadlock (pymysql.err.InternalError)
    (1213, u'Deadlock found when trying to get lock; try restarting transaction')
    [SQL: u'UPDATE records SET version=(records.version + %(version_1)s),
    updated_at=%(updated_at)s, data=%(data)s, hash=%(hash)s, status=%(status)s,
    action=%(action)s, serial=%(serial)s WHERE records.id = %(id_1)s']
    [parameters: {'status': 'PENDING', 'hash': '39795ee18c6e3c9ad1c0190c6a3d8d4f',
    'updated_at': datetime.datetime(2019, 2, 21, 12, 30, 39, 909846), u'version_1': 1,
    u'id_1': '7a655eeda4d446cdaa81caf19ab55fcc', 'action': 'UPDATE',
    'serial': 1550752338,
    'data': u'ns2.example.jp. domain.example.com. 1550752338 3552 600 86400 3600'}]

    In the process of record registeration, designate first tried to update
    the reocrd and then update the zone status.

    Updating the zone_status and registering the record process[1] and after synced
    update record_status and zone_status process[2] are in reverse order. So If user
    request the registering record many time and same time, Designate will get the
    DBDeadLock, when these processes run the same time.

    We observed that changing the order of the operations solves this issue.

    [1] https://github.com/openstack/designate/blob/master/designate/central/service.py#L1292-L1320
    [2] https://github.com/openstack/designate/blob/master/designate/central/service.py#L2310-L2322

    1. transaction [1]-1 updating zone status process <- run ---> table_name-zone
    2. transaction [2]-1 updating record status process <- run ---> table_name-record
    3. transaction [1]-2 registering record process <- run and wait ---> table_name-record
    4. transaction [2]-2 updating zone process <-deadlock! ---> table_name-zone

    Change-Id: Icd6e690ac84a2fe0db0f4a8a513de47f7916f5ea
    Related-Bug: #1785459

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to designate (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.openstack.org/650603

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to designate (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/650604

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to designate (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/650605

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to designate (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/650606

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to designate (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/650607

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to designate (stable/stein)

Reviewed: https://review.opendev.org/650603
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=9b4b876d6ae3298110e15b6998306a2cf5660c3f
Submitter: Zuul
Branch: stable/stein

commit 9b4b876d6ae3298110e15b6998306a2cf5660c3f
Author: Takahito Hirose <email address hidden>
Date: Tue Mar 26 19:52:33 2019 +0900

    Fix DBDeadLock error resulting into 500

    When user requests the record registration request continuously,
    sometimes designate hits DBDeadLock resuting into 500 InternalServerError.

    We get below error:

    2019-02-21 21:30:39.925 49752 ERROR designate.api.middleware RemoteError:
    Remote error: DBDeadlock (pymysql.err.InternalError)
    (1213, u'Deadlock found when trying to get lock; try restarting transaction')
    [SQL: u'UPDATE records SET version=(records.version + %(version_1)s),
    updated_at=%(updated_at)s, data=%(data)s, hash=%(hash)s, status=%(status)s,
    action=%(action)s, serial=%(serial)s WHERE records.id = %(id_1)s']
    [parameters: {'status': 'PENDING', 'hash': '39795ee18c6e3c9ad1c0190c6a3d8d4f',
    'updated_at': datetime.datetime(2019, 2, 21, 12, 30, 39, 909846), u'version_1': 1,
    u'id_1': '7a655eeda4d446cdaa81caf19ab55fcc', 'action': 'UPDATE',
    'serial': 1550752338,
    'data': u'ns2.example.jp. domain.example.com. 1550752338 3552 600 86400 3600'}]

    In the process of record registeration, designate first tried to update
    the reocrd and then update the zone status.

    Updating the zone_status and registering the record process[1] and after synced
    update record_status and zone_status process[2] are in reverse order. So If user
    request the registering record many time and same time, Designate will get the
    DBDeadLock, when these processes run the same time.

    We observed that changing the order of the operations solves this issue.

    [1] https://github.com/openstack/designate/blob/master/designate/central/service.py#L1292-L1320
    [2] https://github.com/openstack/designate/blob/master/designate/central/service.py#L2310-L2322

    1. transaction [1]-1 updating zone status process <- run ---> table_name-zone
    2. transaction [2]-1 updating record status process <- run ---> table_name-record
    3. transaction [1]-2 registering record process <- run and wait ---> table_name-record
    4. transaction [2]-2 updating zone process <-deadlock! ---> table_name-zone

    Change-Id: Icd6e690ac84a2fe0db0f4a8a513de47f7916f5ea
    Related-Bug: #1785459
    (cherry picked from commit f828654a3d40476cac7eb24a09a36e9978c2d708)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to designate (stable/rocky)

Reviewed: https://review.opendev.org/650604
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=2b4fbbf4d8c72a785cb1f8efe3ebdc2ee07a6a69
Submitter: Zuul
Branch: stable/rocky

commit 2b4fbbf4d8c72a785cb1f8efe3ebdc2ee07a6a69
Author: Takahito Hirose <email address hidden>
Date: Tue Mar 26 19:52:33 2019 +0900

    Fix DBDeadLock error resulting into 500

    When user requests the record registration request continuously,
    sometimes designate hits DBDeadLock resuting into 500 InternalServerError.

    We get below error:

    2019-02-21 21:30:39.925 49752 ERROR designate.api.middleware RemoteError:
    Remote error: DBDeadlock (pymysql.err.InternalError)
    (1213, u'Deadlock found when trying to get lock; try restarting transaction')
    [SQL: u'UPDATE records SET version=(records.version + %(version_1)s),
    updated_at=%(updated_at)s, data=%(data)s, hash=%(hash)s, status=%(status)s,
    action=%(action)s, serial=%(serial)s WHERE records.id = %(id_1)s']
    [parameters: {'status': 'PENDING', 'hash': '39795ee18c6e3c9ad1c0190c6a3d8d4f',
    'updated_at': datetime.datetime(2019, 2, 21, 12, 30, 39, 909846), u'version_1': 1,
    u'id_1': '7a655eeda4d446cdaa81caf19ab55fcc', 'action': 'UPDATE',
    'serial': 1550752338,
    'data': u'ns2.example.jp. domain.example.com. 1550752338 3552 600 86400 3600'}]

    In the process of record registeration, designate first tried to update
    the reocrd and then update the zone status.

    Updating the zone_status and registering the record process[1] and after synced
    update record_status and zone_status process[2] are in reverse order. So If user
    request the registering record many time and same time, Designate will get the
    DBDeadLock, when these processes run the same time.

    We observed that changing the order of the operations solves this issue.

    [1] https://github.com/openstack/designate/blob/master/designate/central/service.py#L1292-L1320
    [2] https://github.com/openstack/designate/blob/master/designate/central/service.py#L2310-L2322

    1. transaction [1]-1 updating zone status process <- run ---> table_name-zone
    2. transaction [2]-1 updating record status process <- run ---> table_name-record
    3. transaction [1]-2 registering record process <- run and wait ---> table_name-record
    4. transaction [2]-2 updating zone process <-deadlock! ---> table_name-zone

    Change-Id: Icd6e690ac84a2fe0db0f4a8a513de47f7916f5ea
    Related-Bug: #1785459
    (cherry picked from commit f828654a3d40476cac7eb24a09a36e9978c2d708)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to designate (stable/queens)

Reviewed: https://review.opendev.org/650605
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=00d0cb71e63e870fdf09833eec02f29b95542b18
Submitter: Zuul
Branch: stable/queens

commit 00d0cb71e63e870fdf09833eec02f29b95542b18
Author: Takahito Hirose <email address hidden>
Date: Tue Mar 26 19:52:33 2019 +0900

    Fix DBDeadLock error resulting into 500

    When user requests the record registration request continuously,
    sometimes designate hits DBDeadLock resuting into 500 InternalServerError.

    We get below error:

    2019-02-21 21:30:39.925 49752 ERROR designate.api.middleware RemoteError:
    Remote error: DBDeadlock (pymysql.err.InternalError)
    (1213, u'Deadlock found when trying to get lock; try restarting transaction')
    [SQL: u'UPDATE records SET version=(records.version + %(version_1)s),
    updated_at=%(updated_at)s, data=%(data)s, hash=%(hash)s, status=%(status)s,
    action=%(action)s, serial=%(serial)s WHERE records.id = %(id_1)s']
    [parameters: {'status': 'PENDING', 'hash': '39795ee18c6e3c9ad1c0190c6a3d8d4f',
    'updated_at': datetime.datetime(2019, 2, 21, 12, 30, 39, 909846), u'version_1': 1,
    u'id_1': '7a655eeda4d446cdaa81caf19ab55fcc', 'action': 'UPDATE',
    'serial': 1550752338,
    'data': u'ns2.example.jp. domain.example.com. 1550752338 3552 600 86400 3600'}]

    In the process of record registeration, designate first tried to update
    the reocrd and then update the zone status.

    Updating the zone_status and registering the record process[1] and after synced
    update record_status and zone_status process[2] are in reverse order. So If user
    request the registering record many time and same time, Designate will get the
    DBDeadLock, when these processes run the same time.

    We observed that changing the order of the operations solves this issue.

    [1] https://github.com/openstack/designate/blob/master/designate/central/service.py#L1292-L1320
    [2] https://github.com/openstack/designate/blob/master/designate/central/service.py#L2310-L2322

    1. transaction [1]-1 updating zone status process <- run ---> table_name-zone
    2. transaction [2]-1 updating record status process <- run ---> table_name-record
    3. transaction [1]-2 registering record process <- run and wait ---> table_name-record
    4. transaction [2]-2 updating zone process <-deadlock! ---> table_name-zone

    Change-Id: Icd6e690ac84a2fe0db0f4a8a513de47f7916f5ea
    Related-Bug: #1785459
    (cherry picked from commit f828654a3d40476cac7eb24a09a36e9978c2d708)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on designate (stable/ocata)

Change abandoned by "Michael Johnson <email address hidden>" on branch: stable/ocata
Review: https://review.opendev.org/c/openstack/designate/+/650607
Reason: Abandoning this patch in preparation for stable/ocata end of life. This has not had an update for two years.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on designate (stable/pike)

Change abandoned by "Erik Olof Gunnar Andersson <email address hidden>" on branch: stable/pike
Review: https://review.opendev.org/c/openstack/designate/+/650606

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.