Spike in DBDeadlock errors in update_floatingip_statuses since 2/27

Bug #1426543 reported by Matt Riedemann
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Kevin Benton

Bug Description

http://logs.openstack.org/40/122240/19/gate/gate-tempest-dsvm-neutron-full/4ef0a02/logs/screen-n-cpu.txt.gz?level=TRACE#_2015-02-27_18_05_22_444

2015-02-27 18:05:22.444 8433 ERROR nova.compute.manager [-] Instance failed network setup after 1 attempt(s)
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager Traceback (most recent call last):
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager File "/opt/stack/new/nova/nova/compute/manager.py", line 1684, in _allocate_network_async
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager dhcp_options=dhcp_options)
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager File "/opt/stack/new/nova/nova/network/neutronv2/api.py", line 395, in allocate_for_instance
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager net_ids, neutron=neutron)
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager File "/opt/stack/new/nova/nova/network/neutronv2/api.py", line 226, in _get_available_networks
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager nets = neutron.list_networks(**search_opts).get('networks', [])
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 99, in with_params
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager ret = self.function(instance, *args, **kwargs)
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 524, in list_networks
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager **_params)
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 304, in list
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager for r in self._pagination(collection, path, **params):
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 317, in _pagination
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager res = self.get(path, params=params)
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 290, in get
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager headers=headers, params=params)
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 267, in retry_request
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager headers=headers, params=params)
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 197, in do_request
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager content_type=self.content_type())
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/neutronclient/client.py", line 172, in do_request
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager **kwargs)
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/neutronclient/client.py", line 108, in _cs_request
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager raise exceptions.ConnectionFailed(reason=e)
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager ConnectionFailed: Connection to neutron failed: HTTPConnectionPool(host='127.0.0.1', port=9696): Max retries exceeded with url: /v2.0/networks.json?tenant_id=1e707ab2d2be40a1902f3352c91a615a&shared=False (Caused by ReadTimeoutError("HTTPConnectionPool(host='127.0.0.1', port=9696): Read timed out. (read timeout=30)",))
2015-02-27 18:05:22.444 8433 TRACE nova.compute.manager

http://goo.gl/UMI2vZ

120 hits in the last 24 hours. There aren't any new python-neutronclient releases in that time so that's not it and there are no changes to nova.network.neutronv2 in the last 24 hours, so have to look at recent compute manager changes.

There were a few releases of the requests library this week but those were on 2/23 and 2/24 so if that was it we should have seen this by now.

Matt Riedemann (mriedem)
tags: added: network
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :
summary: - Spike in neutronclient connection failures due to read timeouts since
- 2/27
+ Spike in DBDeadlock errors in update_floatingip_statuses since 2/27
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Kyle Mestery (mestery) wrote :

Looks like Carl found the culprit: https://review.openstack.org/#/c/143865/

The revert is here: https://review.openstack.org/#/c/159939/

Revision history for this message
Matt Riedemann (mriedem) wrote :

There was also an oslo.db release yesterday, 1.5.0, let's check the git log of that for related issues.

Revision history for this message
Matt Riedemann (mriedem) wrote :

mriedem@ubuntu:~/git/oslo.db$ git log --oneline --no-merges 1.4.1..1.5.0
7bfdb6a Make DBAPI class work with mocks correctly
96cabf4 Updated from global requirements
a3a1bdd Imported Translations from Transifex
ab20754 Fix PyMySQL reference error detection
99e2ab6 Updated from global requirements
6ccea34 Organize provisioning to use testresources
eeb7ea2 Add retry decorator allowing to retry DB operations on request
d78e3aa Imported Translations from Transifex
dcd137a Implement backend-specific drop_all_objects for provisioning.
3fb5098 Refactor database migration manager to use given engine
afcc3df Fix 0 version handling in migration_cli manager
f81653b Updated from global requirements
efdefa9 Fix PatchStacktraceTest for pypy
c0a4373 Update Oslo imports to remove namespace package
1b7c295 Retry query if db deadlock error is received
046e576 Ensure DBConnectionError is raised on failed revalidate

Revision history for this message
Matt Riedemann (mriedem) wrote :

These look possibly related to the issue:

1b7c295 Retry query if db deadlock error is received
eeb7ea2 Add retry decorator allowing to retry DB operations on request

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/159997

Revision history for this message
Assaf Muller (amuller) wrote :

If it's Oslo DB, there's this patch that pins to the previous release:
https://review.openstack.org/#/c/159997/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Assaf Muller (<email address hidden>) on branch: master
Review: https://review.openstack.org/159997
Reason: Oslo DB is not the issue.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

delete_floating ip starts a transaction and then calls into the core plugin, which uses locks. During the port delete, the core plugin thinks it's done with the transaction and starts making external calls which cause yields. Because of the outer transaction, the locks haven't been freed up yet. This yield with a lock behavior is known to cause deadlocks.

In the particular case I was looking at, it's caused by a floating IP status update happening during a floating IP delete. The delete starts, locks, yields and then the status update can't commit because of the locks.

Revision history for this message
Kevin Benton (kevinbenton) wrote :
Revision history for this message
Kevin Benton (kevinbenton) wrote :

Recent merges have just increased the probability of this happening by adding another notification during the delete_port operation which results in another yield with the transaction still open. I do not think recent merges are the real root cause.

Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
Changed in neutron:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/159939
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=294019139d575bd7144cfcc229c98c8497bfbf7c
Submitter: Jenkins
Branch: master

commit 294019139d575bd7144cfcc229c98c8497bfbf7c
Author: armando-migliaccio <email address hidden>
Date: Fri Feb 27 10:08:56 2015 -0800

    Revert "Remove port from ovsdb after its deletion"

    Sentinel patch to establish whether commit [1] is at
    fault of the sudden spike of DBDeadlock errors.

    [1] d6a55c17360d1aa8ca91849199987ae71e8600ee

    Change-Id: I021e94e322f9b5eca665ea3aede41e034d3047cc
    Closed-Bug: #1426543

Kyle Mestery (mestery)
Changed in neutron:
importance: Undecided → High
milestone: none → kilo-3
Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Armando Migliaccio (armando-migliaccio)
Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → Kevin Benton (kevinbenton)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/160036
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4840fe79d7d58d26cd65113f896e766ddf55f201
Submitter: Jenkins
Branch: master

commit 4840fe79d7d58d26cd65113f896e766ddf55f201
Author: Kevin Benton <email address hidden>
Date: Fri Feb 27 14:56:39 2015 -0800

    Don't start transaction during floating IP delete

    The previous floating IP code was starting a transaction to
    delete the floating IP object from the database and then
    calling the core plugin's delete_port method which could make
    many yielding calls, risking deadlocks.

    This patch updates the relationship between floating IP and the
    ports table to allow the port deletion to automatically clear
    the floating IP object via a cascade.

    Closes-Bug: #1426543
    Closes-Bug: #1426482
    Change-Id: I4b1cdc198c18d5c673a35aaf9c7e960f97f1e20b

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Joe Gordon (jogo)
Changed in nova:
status: New → Invalid
no longer affects: nova
Thierry Carrez (ttx)
Changed in neutron:
milestone: kilo-3 → 2015.1.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.