Problem in l3-agent tenant-network interface would cause split-brain in HA router

Bug #1375625 reported by Yair Fried
This bug affects 7 people
Affects             Status         Importance   Assigned to   Milestone
neutron             Fix Released   High         Unassigned
openstack-manuals   Won't Fix      Low          Unassigned

Bug Description

Assume each l3-agent host has one NIC (e.g. eth0) assigned to tenant-network (tunnel) traffic and another (e.g. eth1) assigned to the external network.
Disconnecting eth0 prevents keepalived advertisements from reaching the peers and triggers one of the slaves to become master. However, since the failure is outside the router namespace, the original master is unaware of it and does not enter the "fault" state. Instead, it continues to receive traffic on the still-active external network interface, eth1.
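
For illustration, a minimal Python sketch (not Neutron's actual code) of the kind of keepalived configuration the l3-agent renders inside the router namespace; the interface name, VRID and VIP are placeholders. Because only the in-namespace HA interface is tracked, a failure of eth0 in the root namespace is invisible to keepalived: it silences the advertisements to the peers without ever putting the original master into the "fault" state.

    # Illustrative only: approximates the shape of the per-router keepalived
    # config; the real template is generated by the l3-agent.
    HA_IFACE = "ha-1234abcd"   # interface inside qrouter-<id> carrying VRRP

    config = """
    vrrp_instance VR_1 {
        state BACKUP
        interface %(ha)s            # advertisements sent/received here only
        virtual_router_id 1
        priority 50
        nopreempt
        advert_int 2
        track_interface {
            %(ha)s                  # eth0 (root namespace) cannot be listed
        }
        virtual_ipaddress {
            169.254.0.1/24 dev %(ha)s
        }
    }
    """ % {"ha": HA_IFACE}

    print(config)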

Tags: ha-guide
Yair Fried (yfried)
summary: - Problem in l3-agent tunnel interface would cause split-brain in HA
- router
+ Problem in l3-agent tenant-network interface would cause split-brain in
+ HA router
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
Assaf Muller (amuller) wrote :

This issue is inherent to the way we create routers in namespaces and to the choice to carry VRRP messages in-band. It's a design-level issue and I don't see it ever getting fixed. I think this can be set to Won't Fix.

Assaf Muller (amuller)
Changed in neutron:
importance: Medium → Wishlist
status: New → Triaged
Lubosz Kosnik (diltram)
Changed in neutron:
assignee: nobody → Lubosz Kosnik (diltram)
Revision history for this message
Matt Kassawara (ionosphere80) wrote :

Didn't we solve part of this issue by allowing the operator to specify a separate network for VRRP traffic?

Revision history for this message
Lubosz Kosnik (diltram) wrote :

Unfortunately it will not solve this problem. Adding a new interface gives us the same result: right now keepalived monitors only the one interface specified in the router's namespace, so adding additional interfaces does not help. We need to extend connectivity verification to multiple NICs, but because of the namespace we cannot add those interfaces to the track_interface section - they are not visible from inside that namespace.
Because of that we need to introduce this functionality another way. track_script gives us that possibility, since that config section can run an arbitrary script, so we can run ip/ping/ping6 commands to verify connectivity with the GW, check whether interfaces are up in multiple namespaces, and check OVS if available.
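
As a rough sketch of what such a track_script check could look like (the interface name is a placeholder and this is not Neutron code), a Python script that verifies from the root namespace that the tenant-network NIC is up:

    #!/usr/bin/env python3
    # Exit 0 if the tenant-network NIC reports state UP, non-zero otherwise;
    # keepalived treats a non-zero exit from a tracked script as a failure.
    import subprocess
    import sys

    TENANT_NIC = "eth0"  # hypothetical NIC carrying tenant/tunnel traffic

    def nic_is_up(ifname):
        try:
            out = subprocess.check_output(
                ["ip", "-o", "link", "show", "dev", ifname], text=True)
        except subprocess.CalledProcessError:
            return False
        return "state UP" in out

    sys.exit(0 if nic_is_up(TENANT_NIC) else 1)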

Revision history for this message
Assaf Muller (amuller) wrote :

track_scripts are tricky to get right but they're probably the best approach here.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This could be easily left up to an out-of-band health-checking mechanism, but if it's trivial enough, I imagine we can take it in.

Changed in neutron:
importance: Wishlist → Medium
milestone: none → mitaka-3
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Optimistically targeting M3.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/273546

Changed in neutron:
status: Triaged → In Progress
Changed in neutron:
milestone: mitaka-3 → mitaka-rc1
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

As much as I hate postponing these to N-1, targeting a fix for RC1 is an ambitious goal. That said, I believe in miracles and there's still a chance this can make it into Mitaka.

Changed in neutron:
milestone: mitaka-rc1 → newton-1
importance: Medium → High
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

The severity of this issue is high: there's no reasonable workaround (as far as I am aware), and not being able to rely on a robust HA solution kind of defeats the point of HA.

Revision history for this message
Assaf Muller (amuller) wrote :

> not being able to rely on a robust HA solution

I disagree.

What's being described here is a very specific scenario. If the NIC to the tenant networks goes down, all of the routers lose connection to their VMs. Router replicas on other nodes will become active; however, the originals are still active and still have connections to the external network via another NIC, duplicating FIPs. So this is a scenario of one NIC failing on one node, but not the other NIC on the same node. This scenario is not covered by L3 HA, while other scenarios are. Luckily, in most HA solutions (and any HA solution I've encountered myself) a tool like Pacemaker is used. In the RDO HA architecture, for example, Pacemaker is configured to fence a node with a dead NIC, which would resolve this specific error scenario. I don't see this as a high priority bug.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

What happens to the existing connections to the router that continues to be master? What I mean is that there's no remedy (i.e. workaround) once you incur the failure, or am I mistaken?

What you suggest is sensible (Pacemaker detecting NIC failures and resetting the host) and would prevent the situation from occurring, but is it deemed acceptable? You recommended Won't Fix once; are you still firmly of the same opinion?

Revision history for this message
Assaf Muller (amuller) wrote :

Every time I've brainstormed about this bug I concluded that it should not be fixed within Neutron (And the solution is Pacemaker fencing the node). If anyone can come up with an elegant solution in Neutron I'd love to hear it.

Revision history for this message
Lubosz Kosnik (diltram) wrote : Re: [Bug 1375625] Re: Problem in l3-agent tenant-network interface would cause split-brain in HA router

Like Assaf wrote, it may be a very specific situation, but in my opinion the L3 HA implementation has a huge number of things that work only with Pacemaker. There is no check of what is going on with connectivity to the GW, and VRRP is not working completely - when a router is rescheduled, the FIPs are left untouched. In my opinion a huge number of fixes are needed to make this solution independent of Pacemaker, and this is one of the first steps to making L3 HA a production-ready solution without such a heavy dependency on Pacemaker.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

It doesn't look like you reviewed [1]. Do you mind looking into it? I trust your expertise on the matter.

[1] https://review.openstack.org/#/c/273546/

Revision history for this message
Assaf Muller (amuller) wrote :

"It doesn't look like you reviewed [1]. Do you mind looking into it? I trust your expertise on the matter."

I will.

"prepare L3 HA as a production ready solution without so huge dependency on Pacemaker."

I've never seen an OpenStack HA architecture that doesn't use Pacemaker or an equivalent solution. What do you do about fencing? Do you propose that OpenStack itself should take care of that? There will always be requirements of an HA solution that the individual OpenStack projects should not deal with (out of scope); that's where something like Pacemaker comes in. Therefore, I've always assumed it's there, and never considered pursuing a solution that is completely independent of Pacemaker.

Revision history for this message
Assaf Muller (amuller) wrote :

About the health checks: I think that makes sense and is within the scope of Neutron, because what happens for example if the node loses connectivity to its gateway? Pacemaker is not usually configured to check something like that (Only that the nodes can talk to one another, NICs are up, processes are up, etc). There's information that Neutron is better positioned to deal with, and I think keepalived health checks in the data plane is one of those situations.
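
As an illustration of the kind of data-plane check meant here (the namespace name, gateway address and timeouts are placeholders, not Neutron code), a Python script that pings the external gateway from inside the router namespace and reports failure via its exit code:

    #!/usr/bin/env python3
    # Ping the gateway once from inside the router namespace; a non-zero exit
    # signals the failure to whatever invokes the check (e.g. a vrrp_script).
    import subprocess
    import sys

    ROUTER_NS = "qrouter-1234abcd"   # hypothetical router namespace
    GATEWAY = "203.0.113.1"          # hypothetical external gateway address

    cmd = ["ip", "netns", "exec", ROUTER_NS,
           "ping", "-c", "1", "-W", "1", GATEWAY]
    ok = subprocess.call(cmd, stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL) == 0
    sys.exit(0 if ok else 1)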

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I tend to agree with Assaf that, if one is really serious about HA, one may end up going beyond simply relying on the built-in mechanisms coming from the platform (in this case Neutron). That said, I wonder how much low-hanging fruit there is in the context of this bug report to improve the existing L3 HA solution.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

So this boils down to: let's see if diltram's patch is worth pursuing and if so, we'd better nail it down, and then we can document that other failure modes are left to out-of-band detection mechanisms. Sounds like a plan?

Revision history for this message
Lubosz Kosnik (diltram) wrote :

Assaf - I'm not talking about completely removing Pacemaker, but about doing as much as possible in Neutron. Those checks and validations should be done internally; as you said, Pacemaker does not validate whether things are working at that level, so I would like to implement these fixes to provide that functionality.

Revision history for this message
Assaf Muller (amuller) wrote :

There's a difference between validating connectivity to the external network, and validating connectivity to the internal network (What this bug is about).

Revision history for this message
Assaf Muller (amuller) wrote :

To validate behavior toward the internal network, you need a quorum protocol (like Corosync & Pacemaker); keepalived is not well suited for this.

Revision history for this message
Lubosz Kosnik (diltram) wrote :

It's true, I misread the description. But what about adding Tooz to implement that functionality?
It's an OpenStack Big Tent project, so I would prefer to implement the solution using it.

Revision history for this message
Assaf Muller (amuller) wrote :

Ah ha, that's a whole different discussion. There's currently an RFE bug floating around for Neutron adopting Tooz for its locking API (https://bugs.launchpad.net/neutron/+bug/1552680). There are critical questions not yet resolved in the wider OpenStack context: Can a core project like Neutron assume that a Tooz backend is available in any OpenStack deployment? What does packaging / downstream consumption look like for Tooz backends? Does the API sufficiently hide the differences between Tooz's backends (it doesn't) so that users (Neutron) can be agnostic of redis/etcd/zookeeper implementation details / bugs / workarounds?

I'm trying to say that I'm not convinced of our ability to adopt Tooz in the short term in the OpenStack context, but we will see :)
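
For context, a minimal sketch of how a Tooz lock could in principle back the kind of coordination discussed here; the backend URL, member and lock names are placeholders, and whether Neutron can assume a Tooz backend at all is exactly the open question above.

    # Requires the tooz library and a reachable backend (etcd, ZooKeeper, ...).
    from tooz import coordination

    coordinator = coordination.get_coordinator(
        "etcd3+http://192.0.2.10:2379",   # hypothetical backend URL
        b"l3-agent-host-1")               # this node's member id
    coordinator.start()

    # Only the holder of the lock would be allowed to keep the router master,
    # so two isolated nodes cannot both claim mastership.
    lock = coordinator.get_lock(b"router-1234abcd-master")
    if lock.acquire(blocking=False):
        try:
            pass  # remain (or become) master for this router
        finally:
            lock.release()
    else:
        pass  # another node holds mastership; stay backup

    coordinator.stop()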

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Lubosz: do you intend to keep on working on to achieve a solution for this issue?

Revision history for this message
Lubosz Kosnik (diltram) wrote :

Yes, I would like to do that.

Revision history for this message
Alok Kumar Maurya (alok-kumar-maurya) wrote :

A thought: can we try to solve the problem by adding multiple HA interfaces, each having a VIP, so that keepalived sends advertisements on both interfaces (does keepalived support that?)

We can let the user define how many physnets should have an HA VIP, for example physnet1,physnet2

physnet1 could be the physnet of the tenant data network physical network
physnet2 could be the physnet of the tenant data network physical network

so the router will have two HA interfaces, one from each physnet

Revision history for this message
Lubosz Kosnik (diltram) wrote :

Matt Kassawara already proposed that solution, but unfortunately it will not solve this issue; my explanation is in the previous messages.
My idea is to implement a quorum mechanism in Neutron and additionally increase the stability of this solution by implementing a VRRP load-balancing feature: multiple routers are active in the same subnet, but each only for a specific set of hosts. That would increase network throughput in the SNAT case, and on a router failure only part of the machines would lose connectivity, for around 5-8 seconds.


tags: added: mitaka-rc-potential
tags: removed: mitaka-rc-potential
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

If we are unable to solve this within the context of Neutron we should at least document this limitation in the networking guide under the HA scenario section.

Changed in neutron:
status: In Progress → New
status: New → Confirmed
Changed in neutron:
status: Confirmed → In Progress
Changed in openstack-manuals:
status: New → Confirmed
Changed in neutron:
assignee: Lubosz Kosnik (diltram) → nobody
status: In Progress → Confirmed
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

After some consideration, and brainstorming, Assaf and I have reached a reasonable consensus on a path to partially resolving this issue. The proposal is two-pronged:

The first part is properly documenting L3 HA and how to overcome its current/future limitations. For instance, it is my understanding that the system environment conditions where this issue can be reproduced are such that the hardware configuration does not provide any form of redundancy, for instance by means of NIC teaming (i.e. link aggregation). Under this premise, networking failures may indeed lead to Neutron control plane failures. As for L3, when an HA configuration is desired, current Neutron support is such that even a temporary connection loss in the data plane may lead to an unrecoverable invalid state of the VRRP group associated with the HA router, where multiple replicas are marked as master. There is currently no way to rectify the situation other than manually. One possible corrective measure would be the implementation of a STONITH solution. Pursuing this approach in Neutron is obviously a non-starter because of the great deal of development and maintenance complexity that would result over time; using something like Pacemaker/Corosync is more appropriate. Alternatively, where the aforementioned corrective measure is not feasible or not desirable, preventive measures can be put in place so that a broken VRRP state is made less likely, for instance by relying on hardware redundancy (e.g. NIC link aggregation). Documenting this would go a long way toward setting the right expectations when using L3 HA.

The second part to solving this issue is also a preventive measure, but one that requires enhancing the existing Neutron HA framework to implement a more elaborate error detection mechanism, so that the chance of multiple master replicas in a group is indeed reduced. Since it is still tricky to do this within Neutron itself, one solution is to leverage keepalived's check script and piggyback on the fix for bug 1365461, where a user-supplied check script is used to determine whether a keepalived replica is healthy or not depending on user-provided logic. If a failure is detected, a failover should occur. This script will have to be invoked with a number of parameters (interface names, IP addresses, router IDs, etc.) to augment the existing fault detection strategy. Neutron's sole job is to generate a keepalived configuration that allows the user-supplied script to be invoked; it is up to the user to ensure the correctness of the logic implemented. Neutron itself can also be extended to emit an error (warning) log any time more than one agent associated with an HA router is both in master and in alive (dead) state. To an administrator who monitors logs, this should help provide an alert to initiate manual corrective actions, in case the enhanced detection mechanism proves itself ineffective.
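
A minimal Python sketch of the server-side warning described above; the binding attributes and helper are hypothetical stand-ins for whatever the plugin actually exposes - the point is only the shape of the check.

    import logging

    LOG = logging.getLogger(__name__)

    def warn_on_split_brain(router_id, bindings):
        """bindings: objects with .ha_state ('master'/'backup') and
        .agent_alive (bool), one per agent hosting router_id."""
        live_masters = [b for b in bindings
                        if b.ha_state == "master" and b.agent_alive]
        if len(live_masters) > 1:
            LOG.warning("HA router %s has %d live agents in master state; "
                        "possible split-brain, manual action may be needed.",
                        router_id, len(live_masters))
        return live_masters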

It is noteworthy that the approach of making Neutron more resilient to hardware failures by extending its detection/reporting capabilities has itself potential limitations (that should also be documented):

 * it may introduce a potential secu...


Changed in neutron:
assignee: nobody → Lubosz Kosnik (diltram)
status: Confirmed → In Progress
Changed in neutron:
milestone: newton-1 → newton-2
Lubosz Kosnik (diltram)
Changed in neutron:
assignee: Lubosz Kosnik (diltram) → nobody
Changed in neutron:
milestone: newton-2 → newton-3
Changed in neutron:
milestone: newton-3 → newton-rc1
Changed in neutron:
milestone: newton-rc1 → ocata-1
Changed in neutron:
milestone: ocata-1 → ocata-2
Revision history for this message
Adam Spiers (adam.spiers) wrote :

Thank you Armando for comment #31 which was an extremely helpful summary of the status quo! Has anything significant changed since you wrote that in April?

I will be at the Atlanta PTG and I'm very keen to meet up to discuss how we can make progress on this. Until now SUSE has been using a completely different approach to neutron HA which has the capability for fencing to avoid the split brain scenario, but has its own drawbacks. I'd be keen to see us converge on an upstream best-of-breed solution.

After some brief chats with Rossella, I got the impression that a reasonable approach might be to introduce a driver-based architecture to the HA code, so that there is one driver for keepalived, and then another one could be added for Pacemaker which could harness Pacemaker's STONITH capabilities. Does this sound plausible?
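
For what it's worth, a rough Python sketch of what such a driver split might look like; the class and method names are invented for illustration, and nothing like this exists in Neutron today.

    import abc

    class RouterHADriver(abc.ABC):
        """Hypothetical interface an l3-agent could program against."""

        @abc.abstractmethod
        def enable_ha(self, router_id):
            """Set up HA for the router on this node."""

        @abc.abstractmethod
        def get_ha_state(self, router_id):
            """Return 'master', 'backup' or 'fault' for this node's replica."""

    class KeepalivedDriver(RouterHADriver):
        def enable_ha(self, router_id):
            ...  # render keepalived.conf in the namespace and spawn keepalived

        def get_ha_state(self, router_id):
            ...  # read the state file written by the notifier scripts

    class PacemakerDriver(RouterHADriver):
        def enable_ha(self, router_id):
            ...  # define a cluster resource so Pacemaker can fence/fail over

        def get_ha_state(self, router_id):
            ...  # query cluster status (e.g. crm_mon) for the resource state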

Revision history for this message
Assaf Muller (amuller) wrote :

@Adam, we have a ton of RDO users using the keepalived-based solution, for what it's worth. It's turned on by default and has been for many releases. Fixing this bug has never been a priority and has never been requested by an actual user thus far.

Revision history for this message
Adam Spiers (adam.spiers) wrote :

Hey Assaf :-) Thanks a lot for the quick reply. Yes, I remember thinking the keepalived solution had a lot of really nice characteristics when you presented it in the Tokyo talk we did together, and I'm not surprised to hear that it's been working well since then.

Hope you don't mind if I check my understanding of the status quo. I notice that this bug also references openstack-manuals, and together with Andrew Beekhof I'm supposed to be helping to ensure that the upstream HA guide documents all this stuff correctly :-)

IIUC, the main failure scenario which could cause this multiple master split brain issue is a loss of connectivity on the data plane where the VRRP traffic is supposed to flow. This could be caused by a dead NIC, or a failure somewhere on the path in between two NICs (e.g. a switch dying, or more likely, getting misconfigured). And IIUC there are two ways to mitigate these failures:

1. As you noted in comment #10, configuring Pacemaker to monitor NICs and fence nodes with failing NICs takes care of this first failure case at least. Depending on exactly how this monitoring is configured, I guess it could also detect failures on the network path. How are you performing the monitoring in RDO - with something like ocf:pacemaker:ping as described here? http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/_moving_resources_due_to_connectivity_changes.html

2. As Armando observed in comment #31, another way to mitigate these failure cases is with the use of redundant hardware, although this would require both NIC teaming and multiple paths through separate switches in order to avoid single points of failure, which might be too expensive for some users' tastes.

Presumably we should document both of these techniques in the HA guide, right?

Finally, you said that "fixing this bug [...] was never requested by an actual user thus far" - do you mean RDO users, or any neutron users in general? I'm trying to understand why this would not be a more common problem. Is it because

- users are typically deploying one or both of the two techniques listed above?
- connectivity failures simply don't happen often?
- if this split brain happens, it doesn't tend to cause problems?
- some other reason(s) I missed?

Thanks a lot!

Changed in neutron:
milestone: ocata-2 → ocata-3
Revision history for this message
Alexandra Settle (alexandra-settle) wrote :

Adam and Assaf - what's the update here?

Marking as incomplete until we have an understanding for docs and an appropriate outline.

Changed in openstack-manuals:
status: Confirmed → Incomplete
importance: Undecided → High
tags: added: ha-guide
Changed in neutron:
milestone: ocata-3 → ocata-rc1
Revision history for this message
Adam Spiers (adam.spiers) wrote :

@Alex https://review.openstack.org/#/c/273546/ has now been merged, but I haven't had a chance to look at it and grok the impact on this bug. That'll probably happen in Atlanta, unless someone more knowledgeable is kind enough to explain it before then. But currently the questions I asked in #34 are still open (for me, at least).

Changed in neutron:
milestone: ocata-rc1 → pike-1
Changed in openstack-manuals:
status: Incomplete → Confirmed
importance: High → Low
Revision history for this message
Miguel Lavalle (minsel) wrote :

Code fix for this bug was committed here: https://review.openstack.org/#/c/273546/

Changed in neutron:
status: In Progress → Fix Committed
Changed in neutron:
status: Fix Committed → Fix Released
tags: removed: l3-ha
Revision history for this message
Adam Spiers (adam.spiers) wrote :

@Alex Shouldn't this bug keep the l3-ha tag?

Also, I just noticed that apparently I never updated this bug with the results from the Atlanta PTG, but they are summarised here:

http://lists.openstack.org/pipermail/openstack-dev/2017-February/112868.html

In particular, this ethercalc summarises how the various failure modes are or aren't covered depending on how L3 HA is set up:

https://ethercalc.openstack.org/Pike-Neutron-L3-HA

Having said that, rereading the comments here makes me wonder if the spreadsheet was 100% accurate - at the very least it was missing the possibility that Pacemaker can monitor networks other than the corosync network, and I have a vague memory of Assaf *maybe* mentioning that RH has it set up to do that (although I see no evidence of that in the https://github.com/beekhof/osp-ha-deploy repo).

Revision history for this message
Frank Kloeker (f-kloeker) wrote :

We won't track this issue any further here. Please open a new one on Storyboard for the HA Guide if required.

Changed in openstack-manuals:
status: Confirmed → Won't Fix