[RFE] Add support for DLM

Bug #1552680 reported by John Schwarz on 2016-03-03
This bug affects 3 people
Affects: neutron
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Neutron has many code paths that can collide and race with each other. Current ongoing work can mitigate and minimize these races, but that work is slow and it's very hard to fight against what you don't know (i.e. there can always be more races you're not aware of). A DLM (Distributed Lock Manager) such as tooz [1] can help mitigate this greatly.

An excellent example of this racefulness in Neutron is the L3 auto_schedule_routers functionality. When creating a tenant's first HA router, more resources must also be created (such as an HA network and HA ports). This specific flow of creating the resources can be invoked simultaneously by 2 codepaths: the original create_router (invoked from the REST API) and the L3 agent's get_router_ids/sync_routers. These simultaneous runs can produce many races, such as creating 2 HA networks (where only one should exist), accidentally deleting valid port bindings, and more. Instead of hunting down these races (which can be a long and inaccurate task, since more races can always exist), this can be solved much more easily by locking the operations done on a single router_id.

Using tooz [1] allows for a distributed lock, which crosses all the API/RPC workers on a single server and even crosses multiple neutron-servers. This will also help mitigate all sorts of races with different resources (a lock can be keyed by a uuid, so it won't matter whether the uuid is a router_id, network_id, ...).

[1]: https://github.com/openstack/tooz/tree/master/
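To make the per-UUID locking idea concrete, here is a minimal, process-local sketch. A plain threading.Lock stands in for the distributed lock tooz would provide (with tooz, the lock would be backed by ZooKeeper/etcd/redis and shared across all workers and servers); the function names and the `state` dict are illustrative assumptions, not Neutron's actual API.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager

# Process-local stand-in for the distributed lock tooz would provide.
_registry_guard = threading.Lock()
_locks = defaultdict(threading.Lock)

@contextmanager
def resource_lock(resource_uuid):
    """Serialize all operations on a single resource (router, network, ...)."""
    with _registry_guard:
        lock = _locks[resource_uuid]   # one lock per uuid
    with lock:
        yield

def schedule_router(router_id, state):
    # Check-then-create is safe here because no other holder of this
    # router's lock can interleave between the check and the create.
    with resource_lock(router_id):
        if router_id not in state:
            state[router_id] = 'ha-network-created'
    return state[router_id]
```

The key point is that the lock is keyed by the resource uuid, so unrelated routers never contend with each other.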

I believe there was a decision in the community lately to introduce a DLM as a component of the OpenStack reference architecture, using tooz. If that's the case, there is no drawback in making Neutron depend on it, since a DLM is supposed to be there.

https://specs.openstack.org/openstack/openstack-specs/specs/chronicles-of-a-dlm.html
https://lwn.net/Articles/662140/

Fix proposed to branch: master
Review: https://review.openstack.org/290688

Changed in neutron:
assignee: nobody → John Schwarz (jschwarz)
status: New → In Progress
John Schwarz (jschwarz) wrote :

Reverting the "status" to New as per the devref detailing new RFEs (to make clear this has not been decided upon yet).

Changed in neutron:
status: In Progress → New
Kevin Benton (kevinbenton) wrote :

Copying comment I left on the patch here for RFE discussion:

Many of the races we have been dealing with are not necessarily caused by multiple schedulers but are instead caused by concurrent create/update/deletes of the routers. For this to be an effective strategy, we would need to lock every operation that can mutate the state that is contentious (i.e. basically every router operation).

A lock only guarantees that everyone else attempting to acquire the same lock will not be executing at the same time. If just one operation doesn't attempt to acquire the lock before mutating state, every other operation is now at risk of breaking in a difficult-to-debug way (because the thing that breaks won't actually be the thing with the bug).

In other words, using locks changes our strategy from being defensive to making assumptions about everything that might change an object (e.g. other service plugins, core plugins, ML2 drivers, extensions, etc). This isn't something we can adopt lightly. It needs to be a fundamental shift in the way we allow everything (even out-of-tree things) to modify state.
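A small deterministic sketch of the failure mode described above: one path honors the lock around its check-then-create, while a second ("forgetful") path mutates the same state without acquiring it. Events force the bad interleaving so the outcome is reproducible; all names here are illustrative, not Neutron code.

```python
import threading

lock = threading.Lock()
networks = []                      # invariant: at most one HA network
in_critical = threading.Event()
resume = threading.Event()

def locked_create():
    # "Correct" path: honors the lock around its check-then-create.
    with lock:
        should_create = not networks   # check
        in_critical.set()              # let the forgetful path run now
        resume.wait(timeout=5)
        if should_create:
            networks.append('ha-net')  # acts on a now-stale check

def forgetful_create():
    # Path that forgot the lock: mutates the same state unguarded.
    in_critical.wait(timeout=5)
    if not networks:
        networks.append('ha-net')
    resume.set()

t1 = threading.Thread(target=locked_create)
t2 = threading.Thread(target=forgetful_create)
t1.start(); t2.start()
t1.join(); t2.join()
# The invariant is broken (two HA networks exist), and the duplicate
# surfaces from the locked -- i.e. innocent -- code path.
```

This is exactly the "the thing that breaks won't be the thing with the bug" scenario: the traceback points at the lock-bounded code.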

John Schwarz (jschwarz) wrote :

I'm not interested in locking every router operation, since that's counter-productive in more than one way (in addition to what you wrote, doing so would eventually become a severe bottleneck we'd like to avoid). I am interested in locking specific codepaths which we know to be raceful - such as the schedule() function, for example. Since not every router create/update/delete calls schedule(), this mitigates the issue a bit while keeping the lowest common denominator (i.e. where we create HA networks and ports, and where we bind a router to an agent) safe.

Sure, a lock only guarantees safety if used properly, and starting to use locks will necessarily mean changing our thinking to make sure everything that is raceful and difficult to solve without locks is locked (i.e. not all of Neutron). But consider the following:
a. we probably all agree that locking huge chunks of codepaths is a bad idea, so we can simply avoid that in favor of the above;
b. what we are doing now (i.e. solving races with retries, etc.) makes the code a lot more complicated and is subject to the same problems (a part of the code that changes behavior in a raceful way we are not aware of is practically the same as locking the problematic code and forgetting just one operation we are not aware of).

With that in mind, I strongly believe that while both approaches fail for "forgotten paths", the paths we do remember will be a lot safer with a DLM in place (since it *should* block the entire logic instead of just retrying specific things inside the logic).

Kevin Benton (kevinbenton) wrote :

There is a big difference between the two approaches. Dealing with races (e.g. catching errors caused by concurrently mutated resources) is defensive because you have an execution path regardless of whatever else is going on. This means the code that fails to deal with the race is the code that breaks so responsibility stays isolated.

Conversely, locking makes it everyone else's responsibility not to accidentally break your code. If they forget to acquire the lock and they deal with race conditions via recovery logic, their code works just fine, but all of the code using locking becomes exposed to a race. This makes it look like the lock-bounded code is buggy when the cause could be something else entirely.

Consider the bug reports we will get in the above scenario. It will be a traceback in the code that you wrapped with a lock. This is useless to actually fix the issue. You now need to request the whole environment configuration from the user and dig through everything else that could have modified the resources your lock was supposed to protect. It's not the end of the world, but it's not pleasant code to debug because visible errors become side effects of the real errors.

So if you are only proposing this lock be in the scheduler, what race is this addressing? We will still then need to handle concurrent ha network and port creation because of the path in the create/delete routers calls, right? Doesn't that just leave the agent binding race (i.e. a single try/except statement to catch duplicates) that this would protect us from?

John Schwarz (jschwarz) wrote :

AFAIK, all the logic that deals with actually creating the HA network and HA ports is inside the schedule() logic, so locking that should be enough for that case. This should hopefully fix (or go a long way toward fixing) the following bugs: [1] [2] [3] [4] [5]. Catching duplicates practically everywhere is possible, but I don't think wrapping everything in a try/except is an excellent solution.

I'm not looking at DLM as a "solve all problems" button. While I do appreciate that debugging issues as described above is complicated, debugging races is *already* complicated enough and the real cause of issues can be very difficult to track down as well. Since we shouldn't be using it all over the code anyway (but rather only where we really need it, like auto_schedule_routers) I think we can only benefit from using DLMs.

As a side note, I hope that actually using DLMs in specific parts of the code will make future code that touches them safer (people will see that the related code is using DLMs and will think "how can my change hurt this?"). If not by the code writers, then at least by the reviewers. The code is only as good as the coders (and that's true for both approaches).

[1]: https://bugs.launchpad.net/neutron/+bug/1550886
[2]: https://bugs.launchpad.net/neutron/+bug/1548285
[3]: https://bugs.launchpad.net/neutron/+bug/1533460
[4]: https://bugs.launchpad.net/neutron/+bug/1499647
[5]: https://bugs.launchpad.net/neutron/+bug/1533441

Terry Wilson (otherwiseguy) wrote :

> a. we probably all agree that locking huge chunks of codepaths is a bad idea, so we can just avoid it in favor of the above,

I don't necessarily agree. Is there really that much contention for an object-level lock?

I'm very skeptical about piecemeal solutions where we just try to be consistent in known problem areas.

John Schwarz (jschwarz) wrote :

> AFAIK, all the logic that deals with actually creating the ha network and ha ports is inside the schedule() logic

This is actually false: it's done in 3 places: the auto_schedule_routers() logic, the create_router() logic and the schedule() logic.

John Schwarz (jschwarz) wrote :

I'm lost in my own train of thought, so a clarification - creating the actual HA network and ports happens in the create_router() logic, but binding them to an agent happens in the schedule()/auto_schedule() logic. A patch proposal I wrote a while back changed the behaviour of schedule() so that if the network doesn't exist, it creates it - hence my confusion. Still, auto_schedule() shouldn't assume the network exists (if it doesn't, it will fail). Changing the code so that an HA network is created if it doesn't exist yet is racy - and a DLM would go a long way to fix it (as would retrying...).

Changed in neutron:
importance: Undecided → Wishlist

Let's discuss it during the next meeting to see if we can get an agreement on this one.

Changed in neutron:
status: New → Confirmed
status: Confirmed → Triaged

More brainstorming needed.

No brainstorming these past few weeks? Busy, huh?

John Schwarz (jschwarz) wrote :

During the drivers meeting where this was discussed (link by armax a few comments up), a suggestion was made to begin using DLMs only on new features/code that is written (as opposed to existing code which might not be very forgiving for this kind of change). With this option, we'll implement the DLMs using tooz as intended but only use it in a select (and small) number of places in the code. This will allow us to have real-life understanding of what the tooz package brings with it (deployment, failovers, backend...) while not hurting existing functionality.

I would like to get some opinions from the core drivers team about this idea.

Doug Wiegley (dougwig) wrote :

Is it possible to put this behind a config toggle, so it's not a giant switch with no soak period?

Assaf Muller (amuller) wrote :

It's possible to introduce it with a toggle, but I think that if we do that, we do that with the intention of eventually removing the toggle and have it always on. The reason is that the main selling point of adding a DLM is that you can make assumptions in the code that is locked. If the existence of the lock is controlled via config option, you can't really make any assumptions anymore, and the whole thing is useless.

Miguel Angel Ajo (mangelajo) wrote :

@doug, @amuller put in words what I was about to say. IMO we should avoid such an approach if we decide to use a DLM in any feature.

John Schwarz (jschwarz) wrote :

This was discussed during the last day of the summit. We decided to do some benchmarks to see how much of a negative effect using zookeeper/etcd has. Once we have benchmarks we'll be all the wiser about the benefits of this.

Status update:

http://eavesdrop.openstack.org/meetings/neutron_drivers/2016/neutron_drivers.2016-05-12-22.01.log.html#l-42

Until we hear back on the results, it's safe to err on the side of caution.

No update thus far.

John Schwarz (jschwarz) wrote :

Aye, apologies for taking my time with this one, I'm a bit swamped with work atm.

I'll begin working on the benchmarks shortly (within a week or 2).

No update since last week.

John Schwarz (jschwarz) wrote :

I've finally gotten around to working on the benchmarks as discussed last month in Austin. I'm putting my thoughts into [1] and the idea is to present the results in that doc. If this indeed goes forward, we can also put this information in a .rst file in-tree.

Please find the time to have a look and comment if you have any ideas on what I should/shouldn't cover. I'm in the midst of getting the deployment ready, and once I do (probably next week) I will begin the runs. Hopefully results will be available for next week's drivers meeting.

[1]: https://docs.google.com/document/d/1jdI8gkQKBE0G9koR0nLiW02d5rwyWv_-gAp7yavt4w8/edit?usp=sharing

Doug Wiegley (dougwig) wrote :

In your doc, I don't see a bullet point for 'no DLM' in the list of tests.

Assaf Muller (amuller) wrote :

@John, can you enable comments on the Google Doc? That'd probably be much more efficient than adding comments here.

John Schwarz (jschwarz) wrote :

@Doug, the first test bullet is intended as a no DLM test. I've put this comment in for clarity.

@Assaf, Done.

Carl Baldwin (carl-baldwin) wrote :

Discussed in the Drivers' meeting: http://eavesdrop.openstack.org/meetings/neutron_drivers/2016/neutron_drivers.2016-06-02-22.00.log.html#l-12

tl;dr: We are going to review the google doc with comments turned on by next week.

I reviewed the doc, but I was expecting to see the results broken down by tooz backend. Having said that, I think that as a first step it would perhaps be nice to bring in the distributed coordination in such a way that operators who do not want the extra deployment complexity of tooz/(redis|zookeeper|etcd) can disable it (effectively making coordinator.lock() a no-op).

So in this RFE we can track the work to make Neutron Tooz-aware and provide it as a mechanism for developers to use at will.
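A minimal sketch of what that disable switch could look like: a coordinator that hands out a no-op lock when no backend is configured, so lock-bounded code runs unchanged either way. The class and method names loosely mirror tooz's shape (get_lock(), a backend URL) but are assumptions for illustration, not Neutron's or tooz's actual API.

```python
class NoopLock:
    """Context manager that grants the 'lock' unconditionally."""
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False

class Coordinator:
    def __init__(self, backend_url=None):
        # backend_url would come from config, e.g. 'zookeeper://...'
        self.backend_url = backend_url

    def get_lock(self, name):
        if self.backend_url is None:
            return NoopLock()        # DLM disabled: locking is a no-op
        raise NotImplementedError(
            'a real deployment would return a tooz-backed lock here')

coord = Coordinator()                # no backend configured
with coord.get_lock('router-<uuid>'):
    pass                             # proceeds without coordination
```

Note that Kevin's objection below still applies: with a no-op fallback, the lock-bounded code can't drop its race-recovery logic, which is the main argument against making this optional.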

I realized that I made the same point as comment #16-17... now it's coming back to me!!

I think one point worth considering in embracing a DLM is the operational complexity it induces on administrators, who now have to manage/maintain/upgrade/strengthen/secure yet another system. Do we have an idea of what other core OpenStack components use Tooz? I think it would be a high price if Neutron were the only one.

Kevin Benton (kevinbenton) wrote :

I'm pretty strongly against making this optional. We need to either have it as a deployment requirement or not adopt it at all.

If it's optional we can't eliminate the code that deals with all of the race conditions, which defeats the point of adopting DLM.

So it has to come down to whether or not we feel comfortable forcing deployers to adopt tooz with Neutron.

Yes, I think now that the whole conversation has come back to me, I agree with comments #31/#17. I suspect operators and distros would oppose the adoption of Tooz if it were just for Neutron: why would they want to make developers' lives simpler if it makes theirs more miserable? Sure enough, a DLM makes a system more reliable, but the same can be achieved with a bit of extra code complexity. So can we reach the conclusion that if, say, Nova does not intend to adopt Tooz (and I know that for a fact, having already discussed this with the Nova team), nor do the other core services like Keystone or Glance, then Neutron would be the only one at the party?

John Schwarz (jschwarz) wrote :

Actually, Cinder is using Tooz for DLMs ([1], [2], [3]), so it would seem that Neutron wouldn't be dancing alone at this party. In this case the added complexity Armando mentioned, with admins managing/maintaining/..., is already needed by Cinder, and the overhead we get by adding Tooz support to Neutron isn't so big.

I talked with Gorka a while back about problems they are encountering with Tooz in this arena. There are obviously problems, but since the Cinder folks have more experience with DLMs, we can only go up from here.

On the subject of the results document - once I finish the benchmarks, I promise it will look a lot better and more understandable :)

[1]: https://specs.openstack.org/openstack/cinder-specs/specs/mitaka/ha-aa-tooz-locks.html
[2]: https://review.openstack.org/#/q/topic:bp/cinder-volume-active-active-support
[3]: https://github.com/openstack/cinder/search?utf8=%E2%9C%93&q=synchronized

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/290688
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

I poked around, and even though Cinder is on track to adopting Tooz, they are not quite there yet. ZooKeeper, and Tooz on top of it, are becoming more adopted by other projects in the ecosystem and consequently by distros that want to provide these services. As far as HOS goes, adopting Tooz/ZooKeeper would not be a technical problem today.

John Schwarz (jschwarz) wrote :

Similar status at RDO - while etcd and redis are packaged for RHEL, ZooKeeper is something that's going to take some effort.

Start now, to be made available in Ocata. No need for a blueprint at this point.

Changed in neutron:
milestone: none → newton-2

Brainfart of mine during tag change.

tags: added: rfe-postponed
removed: rfe-approved
Changed in neutron:
milestone: newton-2 → none
John Schwarz (jschwarz) on 2017-01-31
Changed in neutron:
assignee: John Schwarz (jschwarz) → nobody
Changed in neutron:
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired