Neutron DBDeadlocks a ridiculous amount in successful CI runs

Bug #1494886 reported by Matt Riedemann
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Critical
Kevin Benton

Bug Description

This came up in the -qa channel when trying to figure out why a neutron test failed and there is a big fat DBDeadlock in the q-svc logs:

http://logs.openstack.org/18/220218/5/check/gate-tempest-dsvm-neutron-dvr/3899ebf/logs/screen-q-svc.txt.gz?level=ERROR#_2015-09-11_17_22_42_284

We find that this shows up a ton in a 7 day check/gate run:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiX2dldF9kbnNfbmFtZXNfZm9yX3BvcnRcIiBBTkQgbWVzc2FnZTpcIkRCRGVhZGxvY2tcIiBBTkQgbWVzc2FnZTpcImlwYXZhaWxhYmlsaXR5cmFuZ2VzXCIgQU5EIHRhZ3M6XCJzY3JlZW4tcS1zdmMudHh0XCIgQU5EIChidWlsZF9xdWV1ZTpcImNoZWNrXCIgT1IgYnVpbGRfcXVldWU6XCJnYXRlXCIpIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDQxOTk2Mjk2ODcxfQ==

498 hits in 7 days, check and gate.

The interesting thing is that 85% of those are successful runs.

Like this was a successful run where the DBDeadlock shows up:

http://logs.openstack.org/20/195820/11/gate/gate-tempest-dsvm-neutron-full/35f6716/logs/screen-q-svc.txt.gz?level=TRACE

This is a serviceability / QA issue for anyone trying to deploy neutron at scale - when things go back, how is an operator supposed to be able to cut through the noise in the logs to determine what's actually a real failure and what can be ignored?

If these DBDeadlocks are just getting retried with a retry decorator, there should be a way to only trace when we fail and raise up the DBDeadlock error, we shouldn't be logging each time. For example, if we DBDeadlock and retry and then it's OK, don't trace that first DB error. If we retry like 5 times and eventually punt, then trace the error.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Download full text (11.5 KiB)

Example trace:

2015-09-11 16:36:08.026 ERROR oslo_db.api [req-545d550d-a494-405b-bbc1-561d2eb918ba admin admin] DB error.
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api Traceback (most recent call last):
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/usr/local/lib/python2.7/dist-packages/oslo_db/api.py", line 136, in wrapper
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api return f(*args, **kwargs)
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/api/v2/base.py", line 491, in create
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api obj = obj_creator(request.context, **kwargs)
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_hamode_db.py", line 386, in create_router
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api self).create_router(context, router)
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_db.py", line 186, in create_router
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api self.delete_router(context, router_db.id)
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api six.reraise(self.type_, self.value, self.tb)
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_db.py", line 181, in create_router
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api gw_info, router=router_db)
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_gwmode_db.py", line 70, in _update_router_gw_info
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api context, router_id, info, router=router)
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_db.py", line 402, in _update_router_gw_info
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api ext_ips)
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_dvr_db.py", line 180, in _create_gw_port
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api ext_ips)
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_db.py", line 378, in _create_gw_port
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api self._create_router_gw_port(context, router, new_network, ext_ips)
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/db/l3_db.py", line 296, in _create_router_gw_port
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api context.elevated(), {'port': port_data})
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/opt/stack/new/neutron/neutron/plugins/common/utils.py", line 140, in create_port
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api return core_plugin.create_port(context, {'port': port_data})
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/usr/local/lib/python2.7/dist-packages/oslo_db/api.py", line 146, in wrapper
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api ectxt.value = e.inner_exc
2015-09-11 16:36:08.026 26473 ERROR oslo_db.api File "/usr/local/lib/python2.7/dist-packages/oslo_...

Changed in neutron:
status: New → Confirmed
tags: added: db
Revision history for this message
Kyle Mestery (mestery) wrote :

Seems like a good issue to address, assigning to Kevin to triage a bit more.

Changed in neutron:
importance: Undecided → Critical
milestone: none → liberty-rc1
assignee: nobody → Kevin Benton (kevinbenton)
Revision history for this message
Matt Riedemann (mriedem) wrote :

OK, the retry decorator is in:

File "/opt/stack/new/neutron/neutron/plugins/ml2/plugin.py", line 1031, in create_port

    @oslo_db_api.wrap_db_retry(max_retries=db_api.MAX_RETRIES,
                               retry_on_request=True)
    def create_port(self, context, port):

In this case we're not passing retry_on_deadlock=True to the decorator, so when we hit the DBDeadlock error it's actually oslo.db that's logging the exception, not neutron:

    def _is_exception_expected(self, exc):
        if isinstance(exc, self.db_error):
            # RetryRequest is application-initated exception
            # and not an error condition in case retries are
            # not exceeded
            if not isinstance(exc, exception.RetryRequest):
                LOG.exception(_LE('DB error.'))
            return True
        return self.exception_checker(exc)

I'm still puzzled as to why no tests failed here then:

http://logs.openstack.org/20/195820/11/gate/gate-tempest-dsvm-neutron-full/35f6716/logs/screen-q-svc.txt.gz?level=TRACE#_2015-09-11_16_36_07_813

What was trying to create the router and why didn't it matter that it failed?

Revision history for this message
Matt Riedemann (mriedem) wrote :

(3:10:05 PM) HenryG: mriedem: we retry on db errors. That looks like a create, so: https://github.com/openstack/neutron/blob/master/neutron/api/v2/base.py#L403

Revision history for this message
Matt Riedemann (mriedem) wrote :

(3:44:07 PM) kevinbenton: mriedem: https://github.com/openstack/neutron/commit/622dea818d851224a43d5276a81d5ce8a6eebb76
(3:44:15 PM) kevinbenton: mriedem: i'm not sure
(3:44:41 PM) kevinbenton: mriedem: so the general retry protection for deadlocks we have at the API layer no longer helps there

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/222776

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/222777

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/222776
Reason: actually this shouldn't be an issue. these decorators won't log an dbdeadlock exception because they don't catch it

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/222777
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2d65cccba29220e46b490871210014b94f086984
Submitter: Jenkins
Branch: master

commit 2d65cccba29220e46b490871210014b94f086984
Author: Kevin Benton <email address hidden>
Date: Thu Sep 3 17:43:37 2015 -0700

    Don't log deadlock or retry exceptions in L3 DB

    We don't want to log exceptions in the l3 DB that will be retried
    by the DB retry decorator because it will look like a failure in
    the log when it actually ends up being retried.

    Change-Id: I024fc2db9022809194178c227d994bc6ed33c78b
    Closes-Bug: #1494886

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/222871

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/222871
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e959e474d65211991c12f9495b227da5e4d99ed7
Submitter: Jenkins
Branch: master

commit e959e474d65211991c12f9495b227da5e4d99ed7
Author: Kevin Benton <email address hidden>
Date: Fri Sep 4 04:22:35 2015 -0700

    Don't log exceptions in GW update on router create

    The LOG.exception statement is not necessary because the exception
    is re-raised so if it's a real error it will be logged like any
    other failure.

    Related-Bug: #1494886
    Change-Id: I29aacd8c1187ddf8653009865ed9a62be948c5a7

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (feature/pecan)

Related fix proposed to branch: feature/pecan
Review: https://review.openstack.org/224334

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: feature/pecan
Review: https://review.openstack.org/224357

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (feature/pecan)
Download full text (73.6 KiB)

Reviewed: https://review.openstack.org/224357
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fdc3431ccd219accf6a795079d9b67b8656eed8e
Submitter: Jenkins
Branch: feature/pecan

commit fe236bdaadb949661a0bfb9b62ddbe432b4cf5f1
Author: Miguel Angel Ajo <email address hidden>
Date: Thu Sep 3 15:40:12 2015 +0200

    No network devices on network attached qos policies

    Network devices, like internal router legs, or dhcp ports
    should not be affected by bandwidth limiting rules.

    This patch disables application of network attached policies
    to network/neutron owned ports.

    Closes-bug: #1486039
    DocImpact

    Change-Id: I75d80227f1e6c4b3f5fa7762b8dc3b0c0f1abd46

commit db4a06f7caa20a4c7879b58b20e95b223ed8eeaf
Author: Ken'ichi Ohmichi <email address hidden>
Date: Wed Sep 16 10:04:32 2015 +0000

    Use tempest-lib's token_client

    Now tempest-lib provides token_client modules as library and the
    interface is stable. So neutron repogitory doesn't need to contain
    these modules.
    This patch makes neutron use tempest-lib's token_client and removes
    the own modules for the maintenance.

    Change-Id: Ieff7eb003f6e8257d83368dbc80e332aa66a156c

commit 78aed58edbe6eb8a71339c7add491fe9de9a0546
Author: Jakub Libosvar <email address hidden>
Date: Thu Aug 13 09:08:20 2015 +0000

    Fix establishing UDP connection

    Previously, in establish_connection() for UDP protocol data were sent
    but never read on peer socket. That lead to successful read on peer side
    if this connection was filtered. Having constant testing string masked
    this issue as we can't distinguish to which test of connectivity data
    belong.

    This patch makes unique data string per test_connectivity() and
    also makes establish_connection() to create an ASSURED entry in
    conntrack table. Finally, in last test after firewall filter was
    removed, connection is re-established in order to avoid troubles with
    terminated processes or TCP continuing sending packets which weren't
    successfully delivered.

    Closes-Bug: 1478847
    Change-Id: I2920d587d8df8d96dc1c752c28f48ba495f3cf0f

commit e6292fcdd6262434a7b713ad8802db6bc8a6d3dc
Author: YAMAMOTO Takashi <email address hidden>
Date: Wed Sep 16 13:20:51 2015 +0900

    ovsdb: Fix a few docstring

    Change-Id: I53e1e21655b28fe5da60e58aeeb7cbbd103ae014

commit c22949a4449d96a67caa616290cf76b67b182917
Author: fumihiko kakuma <email address hidden>
Date: Wed Sep 16 11:52:59 2015 +0900

    Remove requirements.txt for the ofagent mechanism driver

    It is no longer used.

    Related-Blueprint: core-vendor-decomposition
    https://blueprints.launchpad.net/neutron/+spec/core-vendor-decomposition

    Change-Id: Ib31fb3febf8968e50d86dd66e1e6e1ea2313f8ac

commit d1d4de19d85f961d388c91e70f31b3bafec418c5
Author: Kevin Benton <email address hidden>
Date: Thu Sep 3 20:25:57 2015 -0700

    Always return iterables in L3 get_candidates

    The caller of this function expects iterables.

    Closes-Bug: #1494996
    Change-Id: I3d103e63f4e127a77268502415c0ddb0d804b54a

commit 1ad6ac448067306...

tags: added: in-feature-pecan
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (feature/pecan)

Change abandoned by Doug Wiegley (<email address hidden>) on branch: feature/pecan
Review: https://review.openstack.org/224334

Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: liberty-rc1 → 7.0.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.