DBDeadlock on subnet allocation

Bug #1440183 reported by Armando Migliaccio
This bug affects 3 people
Affects | Status | Importance | Assigned to | Milestone
neutron | Fix Released | Critical | Dane LeBlanc |
Kilo | Fix Released | Critical | Unassigned |
description: updated
Changed in neutron:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Culprit seems to be https://review.openstack.org/#/c/160622/, but more investigation is due.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/170631

Changed in neutron:
assignee: nobody → Armando Migliaccio (armando-migliaccio)
status: Confirmed → In Progress
Kyle Mestery (mestery)
Changed in neutron:
milestone: none → kilo-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/170631
Reason: Great!

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

There's another, more effective, fix lined up:

https://review.openstack.org/#/c/170690/

Hence the abandonment of https://review.openstack.org/170631

Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → Dane LeBlanc (leblancd)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/170968

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/170690
Reason: I think this should be superseded by [1]. Please reopen/retarget if in disagreement.

Changed in neutron:
importance: High → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/171761

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Even though https://review.openstack.org/#/c/170690/ targets this bug, the fix unveils another failure mode:

https://bugs.launchpad.net/neutron/+bug/1441382

However, we chose to revert the culprit with https://review.openstack.org/#/c/171761/, so the original change needs to be resubmitted.

Changed in neutron:
status: In Progress → Incomplete
milestone: kilo-rc1 → none
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/171761
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0107bdd5f03e3d0fef6be88b8b586f735f610522
Submitter: Jenkins
Branch: master

commit 0107bdd5f03e3d0fef6be88b8b586f735f610522
Author: armando-migliaccio <email address hidden>
Date: Wed Apr 8 10:57:13 2015 -0700

    Revert "IPv6 SLAAC subnet create should update ports on net"

    This reverts commit 81f4469b620ec221f53d3ffb4d00b90896dc5ce1.

    Change-Id: I63a392fccda29ceff3e91c0a4de741d263bd0e8e
    Related-bug: #1441382
    Related-bug: #1440183

Kyle Mestery (mestery)
Changed in neutron:
milestone: none → kilo-rc1
status: Incomplete → In Progress
status: In Progress → Confirmed
milestone: kilo-rc1 → liberty-1
Changed in neutron:
assignee: Dane LeBlanc (leblancd) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/170690
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=aeb5efe3fbeae82a2d65f6bb68710d14156c58bf
Submitter: Jenkins
Branch: master

commit aeb5efe3fbeae82a2d65f6bb68710d14156c58bf
Author: Dane LeBlanc <email address hidden>
Date: Sat Apr 4 18:50:36 2015 -0400

    Re-use context session in ML2 DB get_port_binding_host

    This patch modifies ML2 DB get_port_binding_host method so that it
    reuses the existing context session to do the database query
    rather than creating a new database session.

    Note that there are other methods in ML2 DB that do not re-use
    the caller's session (get_port_from_device_mac() and
    get_sg_ids_grouped_by_port()). These will be modified using
    a separate bug (https://bugs.launchpad.net/neutron/+bug/1441205).
    Change-Id: I8aafb0a70f40f9306ccc366e5db6860c92c48cce
    Closes-Bug: #1440183

Changed in neutron:
status: Confirmed → Fix Committed
Revision history for this message
Baodong (Robert) Li (baoli) wrote :

I investigated the root cause of this bug, and it now looks clear to me why it happened. The procedure that Dane provided to reproduce the issue ( https://bugs.launchpad.net/neutron/+bug/1440192 ) involves two threads. The test runs two iterations, each of which adds a SLAAC subnet and a DHCP subnet to a network. When an iteration ends, the subnets are removed. Removing the DHCP subnet causes the corresponding DHCP port to be deleted. The second iteration then adds the SLAAC subnet again. When the timing is right, a deadlock results. The two threads do the following, which causes the deadlock:

   -- One thread deletes the DHCP port on the same network.
        . get_locked_port_and_binding() places update locks on the DB records: one on the port, one on the binding.
        . get_port_binding_host() starts a new session and queries the port binding. At the end of the session, the oslo_db method _thread_yield() is called to yield. Keep in mind that at this point the main session still holds the lock on the DHCP port.

   -- The other thread runs create_subnet_from_implicit_pool() and _add_auto_addrs_on_network_ports(). If the timing is right, this thread starts running when the first thread yields and tries to allocate an IP for the DHCP port that the first thread is deleting. Because of the foreign key from the IPallocation table to the port table, and because the port is locked for update, the DB operation that allocates a new IP for the DHCP port is blocked. This thread therefore stays blocked until the DB lock wait times out (after 50s, which is the default timeout value). In effect, the whole neutron server is stalled for 50s.

The purpose of _thread_yield() is to give other threads a chance to run after the current thread completes its expensive DB operations. Yielding is not a good idea, though, while the current thread still holds locks on DB records.
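The ordering problem can be shown with a small stand-alone simulation. This is only a sketch, not neutron or oslo_db code: a threading.Lock stands in for the row lock on the port, an Event stands in for the yield, and the names run_scenario and LOCK_WAIT_TIMEOUT are made up for the illustration (the timeout is shortened from 50s to keep the run fast).

```python
import threading

# Assumption: a short stand-in for the 50s DB lock wait timeout.
LOCK_WAIT_TIMEOUT = 0.2


def run_scenario(yield_while_locked):
    """Return True if the allocating thread obtained the port row lock."""
    port_row_lock = threading.Lock()   # models the update lock on the port row
    yielded = threading.Event()        # models the first thread yielding
    allocator_done = threading.Event()
    result = {}

    def delete_port():
        port_row_lock.acquire()
        if yield_while_locked:
            # Problematic ordering: yield while the row lock is still held.
            yielded.set()
            # The deleting thread cannot resume until the blocked DB call
            # in the other thread returns.
            allocator_done.wait()
            port_row_lock.release()
        else:
            # Safe ordering: release the lock first, then yield.
            port_row_lock.release()
            yielded.set()

    def allocate_ip():
        yielded.wait()
        # Needs the port row because of the foreign key to the port table.
        got_lock = port_row_lock.acquire(timeout=LOCK_WAIT_TIMEOUT)
        result["allocated"] = got_lock
        if got_lock:
            port_row_lock.release()
        allocator_done.set()

    threads = [threading.Thread(target=delete_port),
               threading.Thread(target=allocate_ip)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["allocated"]


if __name__ == "__main__":
    print("yield while locked, allocation succeeded:", run_scenario(True))
    print("yield after release, allocation succeeded:", run_scenario(False))
```

In the first scenario the allocating thread waits out the whole timeout and gives up; in the second it gets the lock immediately.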

So the fix is good in the sense that it no longer uses a new session for the port binding query, and therefore no longer yields. On the other hand, using another subtransaction for the query is questionable. The try block should be enough to catch DB operation exceptions, so the DB subtransaction is not needed.
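As a rough sketch of the difference between the two shapes of the query helper: the Session class below is a hypothetical stand-in, not the SQLAlchemy or oslo_db API, and the function signatures are assumptions made for illustration rather than the actual ML2 code. The counter records how often closing a session would have yielded.

```python
class Session:
    """Hypothetical stand-in for a DB session (not the SQLAlchemy API)."""

    yields = 0  # how many times closing a session yielded to other threads

    def __init__(self, bindings):
        self._bindings = bindings

    def query_host(self, port_id):
        return self._bindings.get(port_id)

    def close(self):
        # Models the yield that happens at the end of a session.
        Session.yields += 1


# Assumed sample data for the illustration.
BINDINGS = {"port-1": "compute-1"}


def get_port_binding_host_new_session(port_id):
    """Pre-fix shape: open a separate session just for this query."""
    session = Session(BINDINGS)
    try:
        return session.query_host(port_id)
    finally:
        # Yields here even if the caller's session still holds row locks.
        session.close()


def get_port_binding_host(session, port_id):
    """Post-fix shape: run the query on the caller's existing session."""
    return session.query_host(port_id)


if __name__ == "__main__":
    caller_session = Session(BINDINGS)
    print(get_port_binding_host(caller_session, "port-1"), Session.yields)
    print(get_port_binding_host_new_session("port-1"), Session.yields)
```

With the caller's session reused, no session is closed in the middle of the locked section, so nothing yields until the caller finishes its own transaction.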

The ML2 code should be examined closely for similar situations in order to eliminate other potential deadlocks.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/170968
Reason: Looks like the original patch was reverted and resubmitted here:

https://review.openstack.org/#/c/172092

Kyle Mestery (mestery)
tags: added: kilo-rc-potential
Changed in neutron:
assignee: nobody → Dane LeBlanc (leblancd)
tags: added: kilo-backport-potential
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/172092
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bd1044ba0e9d7d0f4752c891ac340b115f0019c4
Submitter: Jenkins
Branch: master

commit bd1044ba0e9d7d0f4752c891ac340b115f0019c4
Author: Dane LeBlanc <email address hidden>
Date: Thu Apr 9 10:32:33 2015 -0400

    IPv6 SLAAC subnet create should update ports on net

    If ports are first created on a network, and then an IPv6 SLAAC
    or DHCPv6-stateless subnet is created on that network, then the
    ports created prior to the subnet create are not getting
    automatically updated (associated) with addresses for the
    SLAAC/DHCPv6-stateless subnet, as required.

    Change-Id: I88d04a13ce5b8ed4c88eac734e589e8a90e986a0
    Closes-Bug: 1427474
    Closes-Bug: 1441382
    Closes-Bug: 1440183

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/174373

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/kilo)

Reviewed: https://review.openstack.org/174373
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1dc98e414f200a78a6b1dc78f222c588646e6935
Submitter: Jenkins
Branch: stable/kilo

commit 1dc98e414f200a78a6b1dc78f222c588646e6935
Author: Dane LeBlanc <email address hidden>
Date: Thu Apr 9 10:32:33 2015 -0400

    IPv6 SLAAC subnet create should update ports on net

    If ports are first created on a network, and then an IPv6 SLAAC
    or DHCPv6-stateless subnet is created on that network, then the
    ports created prior to the subnet create are not getting
    automatically updated (associated) with addresses for the
    SLAAC/DHCPv6-stateless subnet, as required.

    Change-Id: I88d04a13ce5b8ed4c88eac734e589e8a90e986a0
    Closes-Bug: 1427474
    Closes-Bug: 1441382
    Closes-Bug: 1440183
    (cherry picked from commit bd1044ba0e9d7d0f4752c891ac340b115f0019c4)

Thierry Carrez (ttx)
tags: removed: kilo-backport-potential kilo-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/179286

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (neutron-pecan)

Fix proposed to branch: neutron-pecan
Review: https://review.openstack.org/185072

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/179286
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b8a09568b5bde8b71ad00468c27b757c98d9c0da
Submitter: Jenkins
Branch: master

commit 7260e0e3fc2ea479e80e0962624aca7fd38a1f60
Author: Henry Gessau <email address hidden>
Date: Mon Apr 27 09:59:21 2015 -0400

    Run radvd as root

    During the refactoring of external process management radvd lost
    its root privileges.

    Closes-bug: 1448813

    Change-Id: I84883fe81684afafac9b024282a03f447c8f825a
    (cherry picked from commit a5e54338770fc074e01fa88dbf909ee1af1b66b2)

commit d37e566dcadf8a540eb5f84b668847fa192393a1
Author: Kevin Benton <email address hidden>
Date: Fri Apr 24 00:35:31 2015 -0700

    Don't resync on DHCP agent setup failure

    There are various cases where the DHCP agent will try to
    create a DHCP port for a network and there will be a failure.
    This has primarily been caused by a lack of available IP addresses
    in the allocation pool. Trying to fix all availability corner cases
    on the server side will be very difficult due to race conditions between
    multiple ports being created, the dhcp_agents_per_network parameter, etc.

    This patch just stops the resync attempt on the agent side if a failure
    is caused by an IP address generation problem. Future updates to the subnet
    will cause another attempt so if the tenant does fix the issue they will
    get DHCP service.

    Change-Id: I0896730126d6dca13fe9284b4d812cfb081b6218
    Closes-Bug: #1447883
    (cherry picked from commit db9ac7e0110a0c2ef1b65213317ee8b7f1053ddc)

commit 38211ae67cb76ade85b08c028b6e88bfc867afc9
Author: Ihar Hrachyshka <email address hidden>
Date: Mon Apr 20 17:06:38 2015 +0200

    tests: confirm that _output_hosts_file does not log too often

    I3ad7864eeb2f959549ed356a1e34fa18804395cc didn't include any regression unit
    tests to validate that the method won't ever log too often again,
    reintroducing performance drop in later patches. It didn't play well
    with stable backports of the fix, where context was lost when doing the
    backport, that left the bug unfixed in stable/juno even though the patch
    was merged there [1].

    The patch adds an explicit note in the code that suggests not to add new
    log messages inside the loop to avoid regression, and a unit test was
    added to capture it.

    Once the test is merged in master, it will be proposed for stable/juno
    inclusion, with additional changes that would fix the regression again.

    Related-Bug: #1414218
    Change-Id: I5d43021932d6a994638c348eda277dd8337cf041
    (cherry picked from commit 3b74095a935f6d2027e6bf04cc4aa21f8a1b46f2)

commit 53b3e751f3c7b32bed48c14742d3dd3a1178d00d
Author: Maru Newby <email address hidden>
Date: Thu Apr 9 17:00:57 2015 +0000

    Double functional testing timeout to 180s

    The increase in ovs testing is resulting in job failure due to
    timeouts in test_killed_monitor_respawns. Giving the test more
    time to complete should reduce the failure rate.

    Change-Id: I2ba9b1eb388bfbbebbd6b0f3edb6d5a5ae0bfead
    Closes-Bug: #1442272
    (c...

Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/qos)

Fix proposed to branch: feature/qos
Review: https://review.openstack.org/196097

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (feature/qos)

Reviewed: https://review.openstack.org/196097
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1cfed745d54a6ce9cb3dd4e6f454666d9e6676c2
Submitter: Jenkins
Branch: feature/qos

commit ba7d673d1ddd5bfa5aa1be5b26a59e9a8cd78a9f
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 18:31:38 2015 -0700

    Remove duplicated call to setup_coreplugin

    The test case for vlan_transparent was calling setup_coreplugin
    before calling the super setUp method which already calls
    setup_coreplugin. This was causing duplicate core plugin fixtures
    which resulted in patching the dhcp periodic check twice.

    Change-Id: Ide4efad42748e799d8e9c815480c8ffa94b27b38
    Partial-Bug: #1468998

commit e64062efa3b793f7c4ce4ab9e62918af4f1bfcc9
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 18:29:37 2015 -0700

    Remove double mock of dhcp agent periodic check

    The test case for the periodic check was patching a target
    that the core plugin fixture already patched out. This removes
    that and exposes the mock from the fixture so the test case
    can reference it.

    Change-Id: I3adee6a875c497e070db4198567b52aa16b81ce8
    Partial-Bug: #1468998

commit 25ae0429a713143d42f626dd59ed4514ba25820c
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 18:24:10 2015 -0700

    Remove double fanout mock

    The test_mech_driver was duplicating a fanout mock already setup
    in the setUp routine.

    Change-Id: I5b88dff13113d55c72241d3d5025791a76672ac2
    Partial-Bug: #1468998

commit 993771556332d9b6bbf7eb3f0300cf9d8a2cb464
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 17:55:16 2015 -0700

    Remove double callback manager mocks

    setup_test_registry_instance() in the base test case class gives
    each test its own registry by mocking out the get_callback_manager.
    The L3 agent test cases were duplicating this.

    Partial-Bug: #1468998
    Change-Id: I7356daa846524611e9f92365939e8ad15d1e1cd8

commit 0be1efad93734f11cd63fb3b7bd2983442ce1268
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 16:57:30 2015 -0700

    Remove ensure_dirs double-patch

    test_spawn_radvd called mock.patch on ensure_dirs after the
    setup method already patched it out. This causes issues when
    mock.patch.stopall() is called because the mocks are stored
    as a set and are unwound in a non-deterministic fashion.[1]
    So some of the time they will be undone correctly, but others
    will leave a monkey-patched in mock, causing the ensure_dir
    test to fail.

    1. http://bugs.python.org/issue21239

    Closes-Bug: #1467908
    Change-Id: I321b5fed71dc73bd19b5099311c6f43640726cd4

commit 0a2238e34e72c17ca8a75e36b1f56e41a3ece74e
Author: Sukhdev Kapur <email address hidden>
Date: Thu Jun 25 15:11:28 2015 -0700

    Fix tenant-id in Arista ML2 driver to support HA router

    When an HA router is created, the framework creates a network and does
    not specify the tenant-id. This causes the Arista ML2 driver to fail.
    This patch sets the tenant-id when it is not passed explicitly
    by the network_create() call from the HA r...

tags: added: in-feature-qos
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/pecan)

Fix proposed to branch: feature/pecan
Review: https://review.openstack.org/196701

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (feature/pecan)

Change abandoned by Kyle Mestery (<email address hidden>) on branch: feature/pecan
Review: https://review.openstack.org/196701
Reason: This is lacking the functional fix [1], so I'll propose a new merge commit which includes that one.

[1] https://review.openstack.org/#/c/196711/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/pecan)

Fix proposed to branch: feature/pecan
Review: https://review.openstack.org/196920

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (feature/pecan)

Reviewed: https://review.openstack.org/196920
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7f759c077f8f860c13db92d2ea6b353ef6b70900
Submitter: Jenkins
Branch: feature/pecan

commit 8123144fadd7c5d5e6e56a76ea860512619a2cf6
Author: Moshe Levi <email address hidden>
Date: Sun Jun 28 14:37:14 2015 +0300

    Fix Consolidate sriov agent and driver code

    This patch adds the missing __init to mech_sriov/mech_driver/
    and updates setup.cfg to the new agent entrypoint

    Trivial Fix

    Change-Id: I53a527081feb78472f496675bbb3c5121d38a14a

commit 8942fccf02e6e179d47582fdb2792a1ca972da21
Author: Assaf Muller <email address hidden>
Date: Mon Jun 29 11:38:51 2015 -0400

    Remove failing SafeFixture tests

    The fixtures 1.3 release attempted to fix the fixtures resource
    leak issue, but failed to do so completely. Our own SafeFixture
    is still needed: The 1.3 release broke our SafeFixture tests,
    but not the usage of SafeFixture itself. This patch removes
    those failing tests for now to unbreak the gate. Jakub reported
    a bug on fixtures 1.3:
    https://bugs.launchpad.net/python-fixtures/+bug/1469759

    We will continue to use SafeFixture until that bug is fixed
    in fixtures, at which point we will be able to require
    fixtures > 1.3.

    Change-Id: I59457c3bb198ff86d5ad55a1e623d008f0034b8f
    Closes-Bug: #1469734

commit 71dffb0a2c1720cd8233a329d32958a0160dd6f5
Author: Kevin Benton <email address hidden>
Date: Mon Jun 29 08:27:41 2015 +0000

    Revert "Removed test_lib module"

    This reverts commit 9a6536de6e1a7fe9b2552adc142e254426b82b6f.

    We pulled all of the plugins out of the tree, many of which still inherit
    from neutron test classes. This change then stated that we no longer
    support testing other plugins. I think this is a bit premature and should
    have been discussed under the subject
    "Neutron plugins can't use neutron plugin unit tests" or something
    similar.

    Change-Id: I68318589f010b731574ea3bfa8df98492bab31fc

commit b20fd81dbd497e058384a0af065dd0f1fdc4c728
Author: Jakub Libosvar <email address hidden>
Date: Fri Jun 5 14:32:51 2015 +0000

    Refactor NetcatTester class

    Following capabilities were added:
       - used transport protocol is passed as a constant instead of bool
       - src port for testing was added
       - connection can be established explicitly
       - change constructor parameters of NetcatTester

    As a part of removing bool for protocol definition
    get_free_namespace_port() was also modified to match the behavior.

    Change-Id: Id2ec322e7f731c05a3754a65411c9a5d8b258126

commit 83e37980dcd0b2bad6d64dd2cb23bcd2891cafca
Author: jingliuqing <email address hidden>
Date: Sat Jun 27 13:41:54 2015 +0800

    Use REST rather than ReST

    Change-Id: I06c9deaab58c5ec13bfeec39fb8fd4b1fe21f42d

commit 1b60df85ba3ad442c2e4e7e52538e1b9a1bf9378
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 18:34:38 2015 -0700

    Add a double-mock guard to the base test case

    Use mock to patch mock with a check to prevent multiple active
    patches to the...

tags: added: in-feature-pecan
Thierry Carrez (ttx)
Changed in neutron:
milestone: liberty-1 → 7.0.0