DB lock timeout errors with parallel operations

Bug #1283522 reported by Salvatore Orlando
30
This bug affects 6 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Critical
Kevin Benton
Havana
Fix Released
Undecided
Unassigned

Bug Description

Since the neutron full job has been enabled in non-voting mode a worrying number of lock timeout errors are appearing.
An analysis of 60 random failures revealed that this errors are responsible for 15 (25%) of failures of the full jobs.

Some examples here:
http://paste.openstack.org/show/68417/
http://paste.openstack.org/show/68413/

It is worth noting that offending queries are seldom the same, and that the root cause apparently lies in the well-known eventlet/mysql deadlock condition, which is exacerbated by the fact that the there are now a consistent number of agents reporting to the neutron server.

This bug should be regarded as an "umbrella bug" whose main purpose is to track failure frequency with elastic recheck.
Feel free to submit new bugs to fix specific lock timeout issues; or use this bug report specifying "partial-bug" in the commit message.

A rough logstash query is here: http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiKE9wZXJhdGlvbmFsRXJyb3IpICgxMjA1LCAnTG9jayB3YWl0IHRpbWVvdXQgZXhjZWVkZWQ7IHRyeSByZXN0YXJ0aW5nIHRyYW5zYWN0aW9uJylcIiBBTkQgTk9UIG1lc3NhZ2U6XCJUcmFjZWJhY2sgKG1vc3QgcmVjZW50IGNhbGwgbGFzdFwiIEFORCBidWlsZF9uYW1lOlwiY2hlY2stdGVtcGVzdC1kc3ZtLW5ldXRyb24tZnVsbFwiIEFORCBidWlsZF9icmFuY2g6XCJtYXN0ZXJcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MzA5NDgyOTUzMiwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==

The query (as of now) reports 106 hits in 48 hours. In some tests the failure happens multiple times; scoring by build_uuid reveals that there are 25 failing builds, which is still a lot.

This bug need an elastic-recheck query

Here is a flow-chart describing the eventlet/mysql deadlock.
https://docs.google.com/drawings/d/13A2x4AWbf8zmzeGApUmYVlBrW8CMTPFTCBGSP_nTzDA

Tags: neutron-core
Changed in neutron:
importance: Undecided → High
Changed in neutron:
assignee: Salvatore Orlando (salvatore-orlando) → nobody
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

I'll take this for evaluating

Changed in neutron:
assignee: nobody → Eugene Nikanorov (enikanorov)
Revision history for this message
Akihiro Motoki (amotoki) wrote :

Is bug 1282925 related to this bug?

The original bug is bug 1282922 (nec plugin bug). I frequently encountered "(OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction'" and I found db_plugin.delete_port*s* calls all plugin.delete_port() under a single transaction. This leads to a long transaction. After removing a transaction with-clause from delete_ports, this bug disappears completely in nec plugin CI.

I am not sure it is related to this bug because ML2 plugin with OVS agent (which is used in the gate) only operates on the db and it does not talk with an external controller.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I am pretty sure 1282295 is one of the errors causing this failure. Probably one of the most common.
Frankly we've been aware of it for a few weeks since we found it often on vmware mine sweeper, but you were quicker than us in fixing it!

Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-3 → icehouse-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/78880

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :
Changed in neutron:
importance: High → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/80354

Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → Salvatore Orlando (salvatore-orlando)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/80356

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/80413

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/80670

Changed in neutron:
assignee: Salvatore Orlando (salvatore-orlando) → Kevin Benton (kevinbenton)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/80688

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Salvatore Orlando (salvatore-orlando)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/78880
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5e4b0c6fc6670ea036d801ce53444272bc311929
Submitter: Jenkins
Branch: master

commit 5e4b0c6fc6670ea036d801ce53444272bc311929
Author: Akihiro Motoki <email address hidden>
Date: Fri Mar 7 15:58:46 2014 +0900

    Avoid long transaction in plugin.delete_ports()

    db_plugin.delete_ports() called plugin.delete_port() under
    a transaction. It leads to long transaction if plugin.delete_port
    talks with external systems. This commit changes each delete_port
    outside of a transaction to avoid longer transaction.

    plugin.delete_ports is now called by release_dhcp_ports and
    dhcp-agent ports can be deleted separately, so this changes
    does not break the existing behavior.

    delete_ports is renamed to delete_ports_by_device_id
    to clarify the usage of this method.

    NEC plugin already has this change and it is no longer needed.

    _do_side_effect helper method in test_db_plugin is renamed
    to more self-descriptive name.

    Change-Id: Ied5883a57c7774c3b0778453d84c717b337f88c0
    Closes-Bug: #1282925
    Related-Bug: #1283522

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/81196

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/81196
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2648aa3561d23e1215e0cc6f446253e5df56c8f6
Submitter: Jenkins
Branch: master

commit 2648aa3561d23e1215e0cc6f446253e5df56c8f6
Author: Maru Newby <email address hidden>
Date: Fri Mar 14 22:14:09 2014 +0000

    Stop removing ip allocations on port delete

    The _delete_port() method was manually removing related
    IPAllocation instances despite the existence of a perfectly
    good cascade deletion relationship in the model. This patch
    puts an end to that nonsense and the potential for deadlock that
    it represented.

    Closes-bug: #1288379
    Related-Bug: #1283522

    Change-Id: Ib31550fa9000fc75768a327cb6cc1c419e06568f

Changed in neutron:
assignee: Salvatore Orlando (salvatore-orlando) → Kevin Benton (kevinbenton)
Changed in neutron:
assignee: Kevin Benton (kevinbenton) → nobody
Robert Kukura (rkukura)
Changed in neutron:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
Changed in neutron:
assignee: Salvatore Orlando (salvatore-orlando) → Kevin Benton (kevinbenton)
Changed in neutron:
assignee: Kevin Benton (kevinbenton) → nobody
description: updated
Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/80413
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=688e05d4b000e7cbfdfa1097287a0ff398e02780
Submitter: Jenkins
Branch: master

commit 688e05d4b000e7cbfdfa1097287a0ff398e02780
Author: Salvatore Orlando <email address hidden>
Date: Thu Mar 13 15:06:12 2014 -0700

    Add a semaphore to some ML2 operations

    This patch adds a semaphore for serializing the delete port API
    calls and update_port_status calls to avoid simultaneous
    attempts to acquire SQL update locks on the port table from
    the same thread pool.

    This semaphore has been introduced to avoid undesired eventlet
    yields which trigger 'lock wait timeout' errors in the DB backend;
    it should have a negligible impact on the overall performance.

    Change-Id: I0ec947086e533945942ae93402943dec69b218ef
    Partial-Bug: 1283522

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/81877

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Eugene Nikanorov (enikanorov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/81877
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fe961d14c5821bcad6b0b7ae0e072269c1b1e516
Submitter: Jenkins
Branch: master

commit fe961d14c5821bcad6b0b7ae0e072269c1b1e516
Author: Kevin Benton <email address hidden>
Date: Thu Mar 20 00:24:50 2014 -0700

    Get rid of additional db contention on fetching VIP

    This fix gets rid of fetching related record from ports table
    that could potentially be blocked by concurrent operation
    This is to prevent it from updating the
    vip table while a port SQL lock is in place, which
    vip entries have a reference to.

    Partial-Bug: 1283522
    Change-Id: Ic64cc268ff4817a31971734a44671c744047231b

Changed in neutron:
status: In Progress → Fix Committed
Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → Kevin Benton (kevinbenton)
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/havana)

Related fix proposed to branch: stable/havana
Review: https://review.openstack.org/84586

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/havana)

Reviewed: https://review.openstack.org/84586
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=eb1e7d50a4cca21d2f58e3811dc4572e549944b0
Submitter: Jenkins
Branch: stable/havana

commit eb1e7d50a4cca21d2f58e3811dc4572e549944b0
Author: Maru Newby <email address hidden>
Date: Fri Mar 14 22:14:09 2014 +0000

    Stop removing ip allocations on port delete

    The _delete_port() method was manually removing related
    IPAllocation instances despite the existence of a perfectly
    good cascade deletion relationship in the model. This patch
    puts an end to that nonsense and the potential for deadlock that
    it represented.

    Closes-bug: #1288379
    Related-Bug: #1283522

    Change-Id: Ib31550fa9000fc75768a327cb6cc1c419e06568f
    (cherry picked from commit 2648aa3561d23e1215e0cc6f446253e5df56c8f6)

tags: added: in-stable-havana
Alan Pevec (apevec)
tags: removed: in-stable-havana
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-rc1 → 2014.1
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.