Reservations code triggers deadlocks and lock wait timeouts

Bug #1486134 reported by Salvatore Orlando
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Salvatore Orlando

Bug Description

Switching the gate tests to multiple workers with the pymysql driver triggers many errors like [1] in the reservation logic.
These errors are not failing the gate at the moment (a logstash search shows no lock wait timeout or deadlock errors).

Nevertheless, the Rally failure rate has now jumped to 100%. This means that the issue with the reservation logic will almost certainly end up affecting production environments, and is a time bomb waiting to explode in the upstream gate.

The logic must be fixed or, failing that, reverted.
Please also cut the fingers of the developer that wrote that code.

[1] http://logs.openstack.org/60/213360/4/check/gate-rally-dsvm-neutron-neutron/4297681/logs/screen-q-svc.txt.gz?level=TRACE#_2015-08-18_10_28_01_314

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/214282

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/214282
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a8bddee4f43c2772e4ca96acdee9b95feec733a9
Submitter: Jenkins
Branch: master

commit a8bddee4f43c2772e4ca96acdee9b95feec733a9
Author: Salvatore Orlando <email address hidden>
Date: Tue Aug 18 10:01:50 2015 -0700

    Stop using quota reservations on base controller

    The reservation engine is subject to failures due to concurrency;
    the switch to pymysql is likely to also have a part in observed
    failures. While no gate failures have been observed so far, this
    is a time bomb waiting to explode and must be addressed.

    For this reason this patch acts conservatively by ensuring the
    API controllers no longer use reservations. The code for
    reservation management is preserved, and will be wired again on
    the controller when these issues are sorted.

    The devref for neutron quotas is updated accordingly as a part
    of this patch.

    Related bug: #1486134

    Change-Id: I2a95fef0fdf64ef8781bef99be0fdc743346c17a

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/214602

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/214602
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=09852988d131bdd61e5685541fd2cec1e0e7b73d
Submitter: Jenkins
Branch: master

commit 09852988d131bdd61e5685541fd2cec1e0e7b73d
Author: Salvatore Orlando <email address hidden>
Date: Wed Aug 19 06:10:08 2015 -0700

    Do not query reservations table when counting resources

    Reservations are temporarily disabled, and therefore querying them
    is pointless, and potentially harmful.

    Change-Id: Iab1d0ffdc54cb5bd06a0d4fbd4eb095ac4b754b8
    Related-Bug: #1486134

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/216640

Changed in neutron:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (feature/pecan)

Related fix proposed to branch: feature/pecan
Review: https://review.openstack.org/218710

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (feature/pecan)

Reviewed: https://review.openstack.org/218710
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2c5f44e1b3bd4ed8a0b7232fd293b576cc8c1c87
Submitter: Jenkins
Branch: feature/pecan

commit f35d1c5c50dccbef1a2e079f967b82f0df0e22e9
Author: Adelina Tuvenie <email address hidden>
Date: Thu Aug 27 02:27:28 2015 -0700

    Fixes wrong neutron Hyper-V Agent name in constants

    Change Id03fb147e11541be309c1cd22ce27e70fadc28b5 moved the
    AGENT_TYPE_HYPERV constant from common.constants to
    plugins.ml2.drivers.hyperv.constants but changed the value of the
    constant from 'HyperV agent' to 'hyperv'. This patch changes
    the name back to 'HyperV agent'

    Change-Id: If74b4b2a84811e266c8b12e70bf6bfe74ed4ea21
    Partial-Bug: #1487598

commit de604de334854e2eb6b4312ff57920564cbd4459
Author: OpenStack Proposal Bot <email address hidden>
Date: Sun Aug 30 01:39:06 2015 +0000

    Updated from global requirements

    Change-Id: Ie52aa3b59784722806726e4046bd07f4a4d97328

commit f0415ac20eaf5ab4abb9bd4839bf6d04ceee85d0
Author: armando-migliaccio <email address hidden>
Date: Fri Aug 28 13:53:04 2015 -0700

    Revert "Add support for unaddressed port"

    This implementation may expose a vulnerability where a malicious
    user can seize the opportunity of a time window where a port
    may land unaddressed on a shared network, thus allowing him/her
    to suck up all the tenant traffic he/she wants....oh the shivers.

    This reverts commit d4c52b7f5a36a103a92bf9dcda7f371959112292.

    Change-Id: I7ebdaa8d3defa80eab90e460fde541a5bdd8864c

commit 013fdcd2a6d45dbe4de5d6e7077e5e9b60985ef9
Author: Assaf Muller <email address hidden>
Date: Fri Aug 28 16:41:07 2015 -0400

    Improve logging upon failure in iptables functional tests

    This will help us nail down a more accurate and efficient logstash
    query.

    Change-Id: Iee4238e358f7b056e373c7be8d6aa3202117a680
    Related-Bug: #1478847

commit 622dea818d851224a43d5276a81d5ce8a6eebb76
Author: Ivar Lazzaro <email address hidden>
Date: Mon Aug 17 17:17:42 2015 -0700

    handle gw_info outside of the db transaction on router creation

    Move the gateway interface creation outside the DB transaction
    to avoid lock timeout.

    Change-Id: I5a78d7f32e8ca912016978105221d5f34618af19
    Closes-bug: 1485809

commit 5b27d290a0a95f6247fc5a0fe6da1e7d905e6b2d
Author: Assaf Muller <email address hidden>
Date: Wed Aug 26 10:07:03 2015 -0400

    Remove ml2 resource extension success logging

    This is the cause of a tremendous amount of logs, for no
    perceivable gain. A normal dvr run in the gate shows this debug
    message around 120K times, which is way too much.

    Closes-Bug: #1489952

    Change-Id: I26fca8515d866a7cc1638d07fa33bc04479ae221

commit 8d3faf549cba2f58c872ef4121b2481e73464010
Author: huangpengtao <email address hidden>
Date: Fri Aug 28 23:20:46 2015 +0800

    Replace "prt" variable by "port"

    The local variable name prt is meaningless,
    and port is the commonly used name.

    Change-Id: I20849102cf5b4d84433c46791b4b1e2a22dc4739

commit ee374e7a5f4dea538fcd942f5...

tags: added: in-feature-pecan
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/216640
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7da1724d446b6804c6be7a602532fbae58d9f008
Submitter: Jenkins
Branch: master

commit 7da1724d446b6804c6be7a602532fbae58d9f008
Author: Salvatore Orlando <email address hidden>
Date: Tue Aug 25 02:21:06 2015 -0700

    Improve DB operations for quota reservation

    This patch deals with the lock wait timeout and the deadlock errors
    observed under high concurrency (api_workers >= 4) with the pymysql
    driver. It includes the following changes:

    - Stop setting dirty status for resource usage when creating
      reservation, as usage of reserved resources is not tracked anymore;
    - Add a variable, increasing delay when retrying make_reservation
      upon a DBDeadlock error in order to reduce the chances of further
      collisions;
    - Enable transaction retry upon DBDeadlock errors for set_quota_usage;
    - Do not resync quota usage while making reservation. This puts a lot
      of stress on the database and is also wasteful since resource usage
      is very likely to change again once the transaction is committed;
    - Use autonested_transaction to simplify logic around when the
      nested flag should be used.

    Change-Id: I7a335f9ebea3c0d6fee6e6b757554e045a66075c
    Closes-Bug: #1486134
    Related-Blueprint: better-quotas
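
The second item in the list above (a variable, increasing delay between retries after a deadlock) can be illustrated with a small sketch. This is not the neutron code: `DeadlockError`, `retry_on_deadlock`, the delay values and the `make_reservation` stub below are all hypothetical names chosen for the example, and the real fix retries on oslo.db's DBDeadlock error. The idea is only that each retry sleeps for a random time under a cap that doubles, so workers that collided once are less likely to collide again.

```python
import functools
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a database deadlock error."""


def retry_on_deadlock(max_retries=5, base_delay=0.05, max_delay=1.0,
                      sleep=time.sleep):
    """Retry the wrapped function when it raises DeadlockError.

    The sleep before each retry is drawn at random from [0, delay],
    and delay doubles after every failed attempt (capped at max_delay).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DeadlockError:
                    if attempt == max_retries:
                        raise  # out of retries: let the caller see the error
                    # Random jitter keeps colliding workers from retrying
                    # in lockstep.
                    sleep(random.uniform(0, delay))
                    delay = min(delay * 2, max_delay)
        return wrapper
    return decorator


# Usage: a stub that deadlocks twice, then succeeds. The sleep function is
# replaced by list.append so the delays are recorded instead of waited for.
attempts = []
delays = []


@retry_on_deadlock(max_retries=3, sleep=delays.append)
def make_reservation():
    attempts.append(1)
    if len(attempts) < 3:
        raise DeadlockError("simulated deadlock")
    return "reserved"
```

In this sketch the whole function is re-run on each retry, so it only makes sense when the wrapped call starts its own transaction and leaves nothing half-done after a failure.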

Changed in neutron:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/214660
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=14ef151fe0ca193c341098fcd3910d5e523c140c
Submitter: Jenkins
Branch: master

commit 14ef151fe0ca193c341098fcd3910d5e523c140c
Author: Salvatore Orlando <email address hidden>
Date: Tue Aug 25 02:28:08 2015 -0700

    Restore reservations in API controller

    This patch restores the reservation logic in the API controller,
    as the DB issues arising from the pymysql switch have been solved.

    Change-Id: I98b40925fdceba13d6a2b5a4d0c5793aeb5cf077
    Related-Bug: #1486134
    Related-Blueprint: better-quotas

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (feature/pecan)

Related fix proposed to branch: feature/pecan
Review: https://review.openstack.org/224334

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: feature/pecan
Review: https://review.openstack.org/224357

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (feature/pecan)

Reviewed: https://review.openstack.org/224357
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fdc3431ccd219accf6a795079d9b67b8656eed8e
Submitter: Jenkins
Branch: feature/pecan

commit fe236bdaadb949661a0bfb9b62ddbe432b4cf5f1
Author: Miguel Angel Ajo <email address hidden>
Date: Thu Sep 3 15:40:12 2015 +0200

    No network devices on network attached qos policies

    Network devices, like internal router legs, or dhcp ports
    should not be affected by bandwidth limiting rules.

    This patch disables application of network attached policies
    to network/neutron owned ports.

    Closes-bug: #1486039
    DocImpact

    Change-Id: I75d80227f1e6c4b3f5fa7762b8dc3b0c0f1abd46

commit db4a06f7caa20a4c7879b58b20e95b223ed8eeaf
Author: Ken'ichi Ohmichi <email address hidden>
Date: Wed Sep 16 10:04:32 2015 +0000

    Use tempest-lib's token_client

    tempest-lib now provides the token_client modules as a library and
    the interface is stable, so the neutron repository no longer needs
    to contain these modules.
    This patch makes neutron use tempest-lib's token_client and removes
    neutron's own copies to ease maintenance.

    Change-Id: Ieff7eb003f6e8257d83368dbc80e332aa66a156c

commit 78aed58edbe6eb8a71339c7add491fe9de9a0546
Author: Jakub Libosvar <email address hidden>
Date: Thu Aug 13 09:08:20 2015 +0000

    Fix establishing UDP connection

    Previously, in establish_connection() for UDP protocol data were sent
    but never read on peer socket. That led to successful read on peer side
    if this connection was filtered. Having constant testing string masked
    this issue as we can't distinguish to which test of connectivity data
    belong.

    This patch makes unique data string per test_connectivity() and
    also makes establish_connection() create an ASSURED entry in
    conntrack table. Finally, in last test after firewall filter was
    removed, connection is re-established in order to avoid troubles with
    terminated processes or TCP continuing sending packets which weren't
    successfully delivered.

    Closes-Bug: 1478847
    Change-Id: I2920d587d8df8d96dc1c752c28f48ba495f3cf0f

commit e6292fcdd6262434a7b713ad8802db6bc8a6d3dc
Author: YAMAMOTO Takashi <email address hidden>
Date: Wed Sep 16 13:20:51 2015 +0900

    ovsdb: Fix a few docstring

    Change-Id: I53e1e21655b28fe5da60e58aeeb7cbbd103ae014

commit c22949a4449d96a67caa616290cf76b67b182917
Author: fumihiko kakuma <email address hidden>
Date: Wed Sep 16 11:52:59 2015 +0900

    Remove requirements.txt for the ofagent mechanism driver

    It is no longer used.

    Related-Blueprint: core-vendor-decomposition
    https://blueprints.launchpad.net/neutron/+spec/core-vendor-decomposition

    Change-Id: Ib31fb3febf8968e50d86dd66e1e6e1ea2313f8ac

commit d1d4de19d85f961d388c91e70f31b3bafec418c5
Author: Kevin Benton <email address hidden>
Date: Thu Sep 3 20:25:57 2015 -0700

    Always return iterables in L3 get_candidates

    The caller of this function expects iterables.

    Closes-Bug: #1494996
    Change-Id: I3d103e63f4e127a77268502415c0ddb0d804b54a

commit 1ad6ac448067306...

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (feature/pecan)

Change abandoned by Doug Wiegley (<email address hidden>) on branch: feature/pecan
Review: https://review.openstack.org/224334

Thierry Carrez (ttx)
Changed in neutron:
milestone: none → liberty-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: liberty-rc1 → 7.0.0