Deadlock detection code can be stale

Bug #1887523 reported by Mohammed Naser
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Won't Fix
High
Unassigned

Bug Description

oslo.db has plenty of infrastructure for detecting deadlocks, however, it seems that at the moment, neutron has it's own implementation of it which is missing a bunch of deadlocks, causing issues when doing work at scale.

this bug is to track the work in refactoring all of this to use the native oslo retry.

Tags: db oslo
Revision history for this message
Lajos Katona (lajos-katona) wrote :

I'm wondering if it should be an RFE, and discuss on drivers meeting, what do you think?

tags: added: db oslo
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/740977

Changed in neutron:
importance: Undecided → High
status: New → In Progress
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

What is described here is a very important problem and hits the core of the API, which is the DB. The patch proposed, only for quota, is a good example of the direction we should take.

As commented by Lajos, this should be commented in the drivers meeting along with https://review.opendev.org/#/c/739139/ (Loki testing for Neutron).

I think both, the testing FW class and the refactor, should be done closely in parallel.

Regards.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I don't think that this should be an RFE. It's just about refactoring our code and use code from oslo instead of our (similar like with enginefacade IMO).
I think we can open Blueprint to track progress on it but I don't think we need to discuss anything related to it in the drivers team meeting.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I just created BP: https://blueprints.launchpad.net/neutron/+spec/handle-deadlocks-in-oslo-way to track progress of this refactoring.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.opendev.org/740977
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/762178

Revision history for this message
Oleg Bondarev (obondarev) wrote :

Neutron db retry decorators do actually use native oslo db retries underneath, so nothing should be missed from oslo (though there were not many changes in oslo recently https://github.com/openstack/oslo.db/commits/master/oslo_db/api.py)

On top of oslo retries neutron has a plenty of checks and neutron specific handlings, which I don't think could be easily skipped without regressions.

So I'm wondering which exact problem are we trying to address here. Any examples would be good for better understanding.

Revision history for this message
Lajos Katona (lajos-katona) wrote :

The story came up from operators scaling issues, on PTG it was discussed Wednesday.
You are totally right that at the back oslo retries are used, the idea behind this refactoring
- to make the code simpler, easier to debug
- perhaps avoid the culprits in the neutron specific code around the oslo decorator

To avoid regressions Loki is to run in the check queue (https://review.opendev.org/739139 ).

Would be good to have a discussion on the reasons why Neutron needed the Neutron specific "spicing" around oslo db retry decorator, and list those and see if native oslo retry can now cover all of those (personally I have no such historical background).

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

We discussed that on our last drivers meeting (http://eavesdrop.openstack.org/meetings/neutron_drivers/2020/neutron_drivers.2020-11-13-14.00.log.html) and here is what we decided:

- first we should finish migration to the new engine facade - patch https://review.opendev.org/#/c/715315/ - with that we should get rid of subtransactions and will have one single transaction per thread. That may help with some of the deadlocks and db errors which we observe now,

- Limit this bug to the retries in the quota related code only - if this will still be the case after engine facade migration, we will need to fix it - and lets use this LP to track only that issue with quota,

- if we will have other retries issues, we should open new LPs for that and treat them as regular bugs.

We will not migrate to use oslo db retries directly. There are good reasons why we are using our own decorators now and as Oleg mentioned above such migration would very likely introduce new regressions.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Lajos Katona (<email address hidden>) on branch: master
Review: https://review.opendev.org/762178
Reason: Based on last drivers' meeting discussion (http://eavesdrop.openstack.org/irclogs/%23openstack-meeting/latest.log.html#t2020-11-13T14:09:08 ) let's wait till new db engine facad migration is ready, and see how that affects issues like the one mentioned in related bug

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Bug closed due to lack of activity, please feel free to reopen if needed.

Changed in neutron:
status: In Progress → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.