Concurrent requests can cause designate-central to lock up

Bug #1392762 reported by Kiall Mac Innes
This bug affects 2 people
Affects     Status         Importance   Assigned to       Milestone
Designate   Fix Released   High         Kiall Mac Innes
  Juno      Triaged        High         Unassigned

Bug Description

Concurrent requests to designate-central can, under certain circumstances, cause it to lock up.

In most current production deployments, with X designate-central instances, each running Y workers, this issue is unlikely to occur when fewer than X*Y concurrent API calls are processed for a single zone simultaneously.

If two requests to, for example, add records to a zone are received approximately simultaneously, we can end up with a code deadlock (i.e. not a true DB deadlock). Consider the following example:

1) Two API calls to add a record to a single zone come in
2) Request 1 ("R1") is received by Central, a DB TX is opened, and work begins, causing a DB lock to be obtained.
3) Eventlet performs a context switch, allowing R2 to begin.
4) Request 2 ("R2") is received by Central, a DB TX is opened, and work begins. The DB query blocks, as R1 holds the requisite locks.
5) Neither R1 nor R2 can complete: MySQL-Python is C based, so eventlet is unable to make the "blocking" query asynchronous, and R1 never gets scheduled again to release its locks.
6) After 30 seconds or so, at least 1 of the 2 open TXs will be aborted by MySQL due to a timeout obtaining the requisite locks.
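
The sequence above can be reproduced in miniature without MySQL or eventlet. The following is only an analogy, under stated assumptions: asyncio stands in for eventlet's cooperative scheduler, a `threading.Lock` stands in for the DB row lock, and a blocking `acquire()` with a timeout stands in for the C driver call and MySQL's lock wait timeout. All names are hypothetical and none of this is Designate code.

```python
import asyncio
import threading

# Assumption: this lock stands in for the MySQL row lock on the zone.
db_row_lock = threading.Lock()
events = []


async def request(name, lock_wait_timeout):
    # Blocking acquire: like a C driver call, it never yields to the
    # scheduler, so no other task can run while it waits.
    if not db_row_lock.acquire(timeout=lock_wait_timeout):
        events.append(f"{name}: lock wait timeout")
        return
    try:
        events.append(f"{name}: lock acquired")
        # Context switch while the "transaction" still holds the lock.
        await asyncio.sleep(0)
        events.append(f"{name}: committed")
    finally:
        db_row_lock.release()


async def main():
    await asyncio.gather(request("R1", 0.2), request("R2", 0.2))


asyncio.run(main())
print(events)
```

R2 stalls the whole process until its lock wait times out, because R1 cannot be scheduled to finish and release the lock while R2 is blocked.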

Using a pure-Python MySQL driver (e.g. PyMySQL) will prevent this issue, as eventlet is capable of monkey patching the driver. The downside is that it's a slow pure-Python implementation rather than a C implementation like MySQL-Python.
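
As a rough illustration of why a driver that cooperates with the scheduler avoids the lockup, here is a minimal sketch, again using asyncio as a stand-in for eventlet. The assumption is that a monkey-patched pure-Python driver behaves like the `asyncio.Lock` below: waiting for the lock yields control instead of blocking the process. The names are hypothetical.

```python
import asyncio

events = []


async def request(name, row_lock):
    # Waiting on this lock yields to other tasks, which is the behaviour
    # assumed here for a driver that eventlet can monkey patch.
    async with row_lock:
        events.append(f"{name}: lock acquired")
        await asyncio.sleep(0)  # context switch inside the "transaction"
        events.append(f"{name}: committed")


async def main():
    row_lock = asyncio.Lock()
    await asyncio.gather(request("R1", row_lock), request("R2", row_lock))


asyncio.run(main())
print(events)
```

Both requests complete, one after the other, because R2's wait lets R1 run to completion and release the lock.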

I believe the correct solution is to "tighten up" our DB TX window, avoiding any code that may cause a context switch during the TX window. This has the added advantage of having a much smaller transaction window than we currently do.

Kiall Mac Innes (kiall)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to designate (master)

Fix proposed to branch: master
Review: https://review.openstack.org/134707

Changed in designate:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/134942

OpenStack Infra (hudson-openstack) wrote : Fix merged to designate (master)

Reviewed: https://review.openstack.org/134942
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=768ee18830c5eba554344117509e2caa9980ee34
Submitter: Jenkins
Branch: master

commit 768ee18830c5eba554344117509e2caa9980ee34
Author: Kiall Mac Innes <email address hidden>
Date: Mon Nov 17 13:41:59 2014 +0000

    Add synchronized_domain decorator

    As a temporary fix for bug 1392762, we serialize concurrent
    modifications to a domain, preventing the issue described in
    the bug. We choose a temporary fix, as pools will make a no
    locking fix significantly easier.

    Closes-Bug: 1392762
    Change-Id: Ifb1bba170983023aedbc63ed3559fe8b28359efa

Changed in designate:
status: In Progress → Fix Committed
Kiall Mac Innes (kiall) wrote :

Looks like we'll also need to merge https://review.openstack.org/#/c/134707/ before we can call this fixed

Changed in designate:
status: Fix Committed → Won't Fix
status: Won't Fix → In Progress
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/134707
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=2c781c694e89df0bb304de1ff4b7ff6a35914aea
Submitter: Jenkins
Branch: master

commit 2c781c694e89df0bb304de1ff4b7ff6a35914aea
Author: Kiall Mac Innes <email address hidden>
Date: Sat Nov 15 14:03:27 2014 +0000

    Move Central notifications to a decorator

    We now emit the notifications outside of the database transaction,
    which removes one place where eventlet may choose to context
    switch.

    Change-Id: I95bf89d0a0605e63c29380961df483686ffb3092
    Partial-Bug: 1392762
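
The idea of emitting notifications outside the transaction can be sketched as follows. This is a simplified illustration, not the code from the review: the `transaction()` context manager, the `log` list, and the event type string are hypothetical stand-ins.

```python
import functools
from contextlib import contextmanager

log = []


@contextmanager
def transaction():
    # Hypothetical stand-in for a DB transaction.
    log.append("BEGIN")
    try:
        yield
        log.append("COMMIT")
    except Exception:
        log.append("ROLLBACK")
        raise


def notification(event_type):
    """Emit a notification only after the wrapped call has returned."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The transaction opens and closes inside func().
            result = func(*args, **kwargs)
            # By now the transaction is committed, so a context switch
            # while sending cannot happen with DB locks held.
            log.append(f"NOTIFY {event_type}")
            return result
        return wrapper
    return decorator


@notification("dns.record.create")
def create_record(name):
    with transaction():
        log.append(f"INSERT {name}")
        return name


create_record("www.example.org.")
print(log)
```

If the wrapped call raises, the exception propagates before the notification line is reached, so nothing is emitted for a rolled-back change.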

Changed in designate:
assignee: Kiall Mac Innes (kiall) → Ron Rickard (rjrjr)
Changed in designate:
assignee: Ron Rickard (rjrjr) → Kiall Mac Innes (kiall)
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/138406
Committed: https://git.openstack.org/cgit/openstack/designate/commit/?id=4e75f5b54c149bfd1aeea139bd8108c3031d049c
Submitter: Jenkins
Branch: master

commit 4e75f5b54c149bfd1aeea139bd8108c3031d049c
Author: rjrjr <email address hidden>
Date: Tue Dec 2 08:44:20 2014 -0700

    Pool Manager Integration with Central

    Full integration of Pool Manager with Central (no longer using the proxy
    backend driver.)

    This patch fixes:

    - Fix concurrent requests that cause lockup issue (bug #1392762)
    - Fixed bug where creating a domain fails the first time in mdns
    - Fixed bug where records in recordsets do not have the correct
      status/action/serial
    - Changed 'ADD' to 'CREATE' for ACTION column
    - Ported Fake backend to pools
    - Removed transitional pool_manager_proxy backend

    Change-Id: Icb40448f760ff2a573d08a04bb4dec1f550119bb
    Closes-Bug: 1392762

Changed in designate:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in designate:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in designate:
milestone: kilo-1 → 2015.1.0