Some DB operations aren't handling errors like Dead Locks correctly

Bug #1714898 reported by Michel Peterson
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
networking-odl
Fix Released
Critical
Michel Peterson

Bug Description

Looking at a specific log searching for ERROR [1], I can see that the DBDeadlocks are propagating up needlessly and I think it is and should be our responsibility to retry before asking Neutron to trigger a rollback.

We are getting hit by very natural deadlocks because we have the journal and the maintenance working at the same time. That's not the problem, however we really need to retry this deadlocks without triggering a rollback. I think we should add `is_retriable` [2] to the decorator, as that's what we want. That way we will do our best to allow the operation to succeed.

As you can see in the deadlock, it is ultimately leading to an ERROR in Neutron. We really don't want that.

Also, if we don't do this, we are triggering this bug/1590298 [3] (you can see it in the same log), which the `is_retriable` from `neutron_lib` is aware and handles flawlessly.

[1]: http://logs.openstack.org/81/498881/1/check/gate-rally-dsvm-networking-odl-carbon-snapshot/d2fcd3e/logs/screen-q-svc.txt.gz
[2]: https://github.com/openstack/neutron-lib/blob/b74981203704517a3c7e770c7c7049bc8f3e9887/neutron_lib/db/utils.py#L73
[3]: https://bugs.launchpad.net/neutron/+bug/1590298

Changed in networking-odl:
assignee: nobody → Michel Peterson (mpeterson)
summary: - Some DB operations aren't handling things like Dead Locks correctly
+ Some DB operations aren't handling errors like Dead Locks correctly
Revision history for this message
Michel Peterson (mpeterson) wrote :

Attaching log that is linked in order to not lose it when it's deleted from the CIs machines.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to networking-odl (master)

Fix proposed to branch: master
Review: https://review.openstack.org/500584

Changed in networking-odl:
status: New → In Progress
Changed in networking-odl:
importance: Undecided → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to networking-odl (master)

Reviewed: https://review.openstack.org/500584
Committed: https://git.openstack.org/cgit/openstack/networking-odl/commit/?id=75c8962918d47978ca3fcf7fbfcfcb40a156d802
Submitter: Zuul
Branch: master

commit 75c8962918d47978ca3fcf7fbfcfcb40a156d802
Author: Michel Peterson <email address hidden>
Date: Mon Sep 4 17:45:38 2017 +0300

    Fixes error handling of DB calls

    There are functions that are wrapped with a `wrap_db_retry` because they
    are part of a transaction happening at a higher level inside Neutron. The
    problem with wrapping it with that decorator is that without specifying
    an `exception_checker` then it only handles `RetryRequests` and not
    other types of exceptions that are expected to be received (ie:
    `DBDeadLock`). This was an oversight that allowed the related bug to
    ocurr and lead to errors reaching Neutron when they could have been
    handled at a lower level. In addition to this problem, some of the
    functions tried to retry without a proper `SAVEPOINT` or transaction
    that would allow the retry. Moreover, the retries were often made at a
    very low level, when it should have been at a higher level where the
    operation is retried as a whole and not "atomic" operations.

    This patch allows the inclusion of an `exception_checker` based on the
    Neutron library that handles these DB exceptions + a MySQL issue that
    is triggered by `DBDeadLock`s happening (see neutron bug #1590298). It
    also keeps the handling of these exceptions at the networking_odl
    level because we need to do a best effort of allowing to succeed without
    having to trigger a rollback of the Neutron action which would be much
    costly performance wise. This retry, however is done whenever possible
    at the highest level that makes sense, including `SAVEPOINT`s and
    transactions depending on the need.

    Also, fixes some issues where some of the operations that the driver
    does were not able to be retried because of incorrect handling of
    exceptions.

    Change-Id: I31085cf73618df48f55f3169e071d2cb64c9b018
    Closes-Bug: #1714898

Changed in networking-odl:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to networking-odl (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/516001

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to networking-odl (stable/pike)

Reviewed: https://review.openstack.org/516001
Committed: https://git.openstack.org/cgit/openstack/networking-odl/commit/?id=7f9a439aea15f02c2c2822231f6604c5f7435bc8
Submitter: Zuul
Branch: stable/pike

commit 7f9a439aea15f02c2c2822231f6604c5f7435bc8
Author: Michel Peterson <email address hidden>
Date: Mon Sep 4 17:45:38 2017 +0300

    Fixes error handling of DB calls

    There are functions that are wrapped with a `wrap_db_retry` because they
    are part of a transaction happening at a higher level inside Neutron. The
    problem with wrapping it with that decorator is that without specifying
    an `exception_checker` then it only handles `RetryRequests` and not
    other types of exceptions that are expected to be received (ie:
    `DBDeadLock`). This was an oversight that allowed the related bug to
    ocurr and lead to errors reaching Neutron when they could have been
    handled at a lower level. In addition to this problem, some of the
    functions tried to retry without a proper `SAVEPOINT` or transaction
    that would allow the retry. Moreover, the retries were often made at a
    very low level, when it should have been at a higher level where the
    operation is retried as a whole and not "atomic" operations.

    This patch allows the inclusion of an `exception_checker` based on the
    Neutron library that handles these DB exceptions + a MySQL issue that
    is triggered by `DBDeadLock`s happening (see neutron bug #1590298). It
    also keeps the handling of these exceptions at the networking_odl
    level because we need to do a best effort of allowing to succeed without
    having to trigger a rollback of the Neutron action which would be much
    costly performance wise. This retry, however is done whenever possible
    at the highest level that makes sense, including `SAVEPOINT`s and
    transactions depending on the need.

    Also, fixes some issues where some of the operations that the driver
    does were not able to be retried because of incorrect handling of
    exceptions.

    Change-Id: I31085cf73618df48f55f3169e071d2cb64c9b018
    Closes-Bug: #1714898
    (cherry picked from commit 75c8962918d47978ca3fcf7fbfcfcb40a156d802)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-odl 12.0.0.0b2

This issue was fixed in the openstack/networking-odl 12.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-odl 11.0.1

This issue was fixed in the openstack/networking-odl 11.0.1 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.