Bug #1375625 “Problem in l3-agent tenant-network interface would...” : Bugs : openstack-manuals

Yair Fried (yfried) on 2014-09-30

summary:

- Problem in l3-agent tunnel interface would cause split-brain in HA
- router
+ Problem in l3-agent tenant-network interface would cause split-brain in
+ HA router

Eugene Nikanorov (enikanorov) on 2014-10-03

Changed in neutron:
importance:	Undecided → Medium

Revision history for this message

Assaf Muller (amuller) wrote on 2014-11-21:

#1

This is an issue inherent to the way we create routers in namespaces, and that we chose to carry VRRP messages in-band. This is a design level issue and I don't see this ever getting fixed. I think this can be set to won't fix.

Assaf Muller (amuller) on 2015-04-30

Changed in neutron:
importance:	Medium → Wishlist
status:	New → Triaged

Lubosz Kosnik (diltram) on 2015-12-28

Changed in neutron:
assignee:	nobody → Lubosz Kosnik (diltram)

Revision history for this message

Matt Kassawara (ionosphere80) wrote on 2016-01-21:

#2

Didn't we solve part of this issue by allowing the operator to specify a separate network for VRRP traffic?

Revision history for this message

Lubosz Kosnik (diltram) wrote on 2016-01-21:

#3

Unfortunately it will not solve this problem. Adding new interface will give us the same result. Right now keepalived monitor only one interface specified in NS for router. Adding any additional will not solve the problem. We need extend the verification of connectivity on multiple NICs but because of the NS we cannot add any additional interfaces to track_interface section because they're not available from that NS.
Because of that we need to introduce that functionality in other way - track_script gives us that possibility because we can run any bash script using that config file section - so we can run different ip/ping/ping6 commands to verify connectivity with GW, is interface is UP in multiple NS and in OVS if available.

Revision history for this message

Assaf Muller (amuller) wrote on 2016-01-21:

#4

track_scripts are tricky to get right but they're probably the best approach here.

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2016-01-25:

#5

This could be easily left up to an out-of-band health-checking mechanism, but if it's trivial enough, I imagine we can take it in.

Changed in neutron:
importance:	Wishlist → Medium
milestone:	none → mitaka-3

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2016-01-25:

#6

Optimistically targeting M3.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-01-28: Fix proposed to neutron (master)

#7

Fix proposed to branch: master
Review: https://review.openstack.org/273546

Changed in neutron:
status:	Triaged → In Progress

Armando Migliaccio (armando-migliaccio) on 2016-03-03

Changed in neutron:
milestone:	mitaka-3 → mitaka-rc1

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2016-03-14:

#8

As much as I hate postponing these to N-1, targeting a fix for RC1 is an ambitious goal. That said, I believe in miracles and there's still a chance this can make into Mitaka.

Changed in neutron:
milestone:	mitaka-rc1 → newton-1
importance:	Medium → High

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2016-03-14:

#9

The severity of this issue is high: there's no reasonable workaround (as far as I am aware) and not being able to rely on a robust HA solution, kinda defeats the points of HA.

Revision history for this message

Assaf Muller (amuller) wrote on 2016-03-14:

#10

> not being able to rely on a robust HA solution

I disagree.

What's being described here is a very specific scenario. If the NIC to the tenant networks goes down, all of the routers lose connection to their VMs. Router replicas on other nodes will be become active, however the originals are still active, and still have connections to the external network via another NIC, duplicating FIPs. So this is a scenario of one NIC failing on one node, but not the other NIC on the same node. This scenario is not covered by L3 HA, while other scenarios are. Luckily, in most HA solutions (And any HA solution I've encountered myself) a tool like Pacemaker is used. In the RDO HA architecture, for example, Pacemaker is configured to fence a node with a dead NIC, which would resolve this specific error scenario. I don't see this as a high priority bug.

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2016-03-14:

#13

What happens to the existing connections to the router that continues to be master? What I mean is that there's no remedy (i.e. workaround) action once you incur in the failure, or am I mistaken?

What you suggest is sensible (Pacemaker to detect NICs failures and reset the host) and would prevent the situation from occurring, but is it deemed acceptable? You made the recommendation of wontfix once, are you still firmly of the same idea?

Revision history for this message

Assaf Muller (amuller) wrote on 2016-03-14:

#14

Every time I've brainstormed about this bug I concluded that it should not be fixed within Neutron (And the solution is Pacemaker fencing the node). If anyone can come up with an elegant solution in Neutron I'd love to hear it.

Revision history for this message

Lubosz Kosnik (diltram) wrote on 2016-03-14: Re: [Bug 1375625] Re: Problem in l3-agent tenant-network interface would cause split-brain in HA router

#15

Like Assaf wrote it maybe is very specific situation but in my opinion that
L3 HA
implementation has huge amount of things which are working only with
Peacemaker.
There is lack of checks what is going one with connectivity to GW,
VRRP is not working completely - when router is rescheduled the FIP's are
left untouched.
In my opinion there is a huge amount of fixes to prepare to independent
that solution from
Peacemaker and this is one of the first steps to prepare L3 HA as a
production ready solution
without so huge dependency on Pacemaker.

On Mon, Mar 14, 2016 at 6:01 AM Assaf Muller <email address hidden> wrote:

> > not being able to rely on a robust HA solution
>
> I disagree.
>
> What's being described here is a very specific scenario. If the NIC to
> the tenant networks goes down, all of the routers lose connection to
> their VMs. Router replicas on other nodes will be become active, however
> the originals are still active, and still have connections to the
> external network via another NIC, duplicating FIPs. So this is a
> scenario of one NIC failing on one node, but not the other NIC on the
> same node. This scenario is not covered by L3 HA, while other scenarios
> are. Luckily, in most HA solutions (And any HA solution I've encountered
> myself) a tool like Pacemaker is used. In the RDO HA architecture, for
> example, Pacemaker is configured to fence a node with a dead NIC, which
> would resolve this specific error scenario. I don't see this as a high
> priority bug.
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1375625
>
> Title:
> Problem in l3-agent tenant-network interface would cause split-brain
> in HA router
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/neutron/+bug/1375625/+subscriptions
>

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2016-03-14:

#16

It doesn't look like you reviewed [1]. Do you mind looking into it? I trust your expertise on the matter.

[1] https://review.openstack.org/#/c/273546/

Revision history for this message

Assaf Muller (amuller) wrote on 2016-03-14:

#17

"It doesn't look like you reviewed [1]. Do you mind looking into it? I trust your expertise on the matter."

I will.

"prepare L3 HA as a production ready solution without so huge dependency on Pacemaker."

I've never seen an OpenStack HA architecture that doesn't use Pacemaker or equivalent solution. What do you do about fencing? Do you propose that OpenStack itself should take care of that? There will always be requirements from an HA solution that the different OpenStack projects should not deal with (Out of scope), that's where something like Pacemaker comes in. Therefor, I've always assumed it's there, and never considered pursuing a solution that is completely independent from Pacemaker.

Revision history for this message

Assaf Muller (amuller) wrote on 2016-03-14:

#18

About the health checks: I think that makes sense and is within the scope of Neutron, because what happens for example if the node loses connectivity to its gateway? Pacemaker is not usually configured to check something like that (Only that the nodes can talk to one another, NICs are up, processes are up, etc). There's information that Neutron is better positioned to deal with, and I think keepalived health checks in the data plane is one of those situations.

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2016-03-14:

#19

I tend to agree with Assaf that, if one is really serious about HA, he/she may end up going beyond simply relying on built-in mechanisms coming from the platform (this case Neutron). That said, I wonder how much of a low-hanging fruit there is in the context of this bug report to improve the existing L3-HA solution.

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2016-03-14:

#20

So this boils down to: let's see if diltram's patch is worth pursuing and if so, we'd better nail it down, and then we can document that other failure modes are left to out-of-band detection mechanisms. Sounds like a plan?

Revision history for this message

Lubosz Kosnik (diltram) wrote on 2016-03-14:

#21

Assaf - I'm not talking about completely removing the Peacemaker but to do as much as possible in Neutron - that checks and validations should be done internally and like you said Pacemaker is not validation is things are working on that level so because of that I would like to implement that fixes to provided that functionality.

Revision history for this message

Assaf Muller (amuller) wrote on 2016-03-14:

#22

There's a difference between validating connectivity to the external network, and validating connectivity to the internal network (What this bug is about).

Revision history for this message

Assaf Muller (amuller) wrote on 2016-03-14:

#23

To validate behavior to the internal network, you need quorum protocol (Like Corosync & Pacemaker), keepalived is not well suited for this.

Revision history for this message

Lubosz Kosnik (diltram) wrote on 2016-03-14:

#24

It's true. I miss read that description. But what about addind Tooz to implement that functionality?
it's OpenStack Big Tent project so I would prefer to implement solution using it.

Revision history for this message

Assaf Muller (amuller) wrote on 2016-03-14:

#25

Ah ha, that's a whole different discussion. There's currently an RFE bug floating around for Neutron adopting Tooz for its locking API (https://bugs.launchpad.net/neutron/+bug/1552680). There's critical questions yet resolved in the wider OpenStack context: Can a core project like Neutron assume that a Tooz backend is available in any OpenStack deployment? How does packaging / downstream consumption look like for Tooz backends? Does the API sufficiently hide the differences between Tooz's backends (It doesn't) so that users (Neutron) can be agnostic of redis/etcd/zookeeper implementation details / bugs / workarounds?

I'm trying to say that I'm not convinced in our ability to adopt Tooz in the short term in the OpenStack context, but we will see :)

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2016-03-15:

#26

@Lubosz: do you intend to keep on working on to achieve a solution for this issue?

Revision history for this message

Lubosz Kosnik (diltram) wrote on 2016-03-15:

#27

Yes. I would like to do that
On Mon, 14 Mar 2016 at 10:56 PM, Armando Migliaccio <
<email address hidden>> wrote:

> @Lubosz: do you intend to keep on working on to achieve a solution for
> this issue?
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1375625
>
> Title:
> Problem in l3-agent tenant-network interface would cause split-brain
> in HA router
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/neutron/+bug/1375625/+subscriptions
>

Revision history for this message

Alok Kumar Maurya (alok-kumar-maurya) wrote on 2016-03-15:

#28

A thought , can we try to solve the problem by trying to add multiple HA interfaces , each having a VIP , so that keppalived sends advertisements on both interfaces (does keepalived support that ?)

We can let user define in how many physnets , user wants to have HA VIP , for example physnet1,physnet2

physnet1 could be physnet of tenant datanetwork physical network
physnet2 could be physnet of tenant datanetwork physical network

so Router will have two HA interfaces , one from each phsynet

Revision history for this message

Lubosz Kosnik (diltram) wrote on 2016-03-15:

#29

Matt Kassawara already proposed that solution but unfortunately it'll not
solve that issue. In previous messages is my explanation.
My idea is to implement quorum mechanism into Neutron and additionally
increase the stability of this solution thru implementing VRRP load
balancing feature. Multiple routers are active in same subnet but only for
a specific hosts. These will increase throughput of the network in SNAT
situation and additionally on routers failure only part of machines will
loose connectivity for around 5-8 seconds.

On Tue, Mar 15, 2016 at 12:55 AM Alok Kumar Maurya <email address hidden>
wrote:

> A thought , can we try to solve the problem by trying to
> add multiple HA interfaces , each having a VIP , so that
> keppalived sends advertisements on both interfaces (does
> keepalived support that ?)
>
> We can let user define in how many physnets , user wants to
> have HA VIP , for example physnet1,physnet2
>
> physnet1 could be physnet of tenant datanetwork physical network
> physnet2 could be physnet of tenant datanetwork physical network
>
> so Router will have two HA interfaces , one from each phsynet
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1375625
>
> Title:
> Problem in l3-agent tenant-network interface would cause split-brain
> in HA router
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/neutron/+bug/1375625/+subscriptions
>

Armando Migliaccio (armando-migliaccio) on 2016-03-15

tags:	added: mitaka-rc-potential
tags:	removed: mitaka-rc-potential

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2016-03-31:

#30

If we are unable to solve this within the context of Neutron we should at least document this limitation in the networking guide under the HA scenario section.

Changed in neutron:
status:	In Progress → New
status:	New → Confirmed

OpenStack Infra (hudson-openstack) on 2016-04-01

Changed in neutron:
status:	Confirmed → In Progress

Armando Migliaccio (armando-migliaccio) on 2016-04-12

Changed in openstack-manuals:
status:	New → Confirmed
Changed in neutron:
assignee:	Lubosz Kosnik (diltram) → nobody
status:	In Progress → Confirmed

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2016-04-12:

#31

Download full text (3.4 KiB)

After some consideration, and brainstorming, Assaf and I have reached a reasonable consensus on a path to partially resolving this issue. The proposal is two-pronged:

The first part is properly documenting L3 HA, and how to overcome its current/future limitations. For instance,
it is my understanding that the system environment conditions where this issue can be reproduced are such that the hardware configuration does not provide any form of redundancy, for instance by means of NIC teaming (i.e. link aggregation). Now, under this premise, networking failures may indeed lead to Neutron control plane failures. As for L3, when HA configuration is desired, current Neutron support is such that even a temporary connection loss in the data plane may lead to an unrecoverable invalid state of the VRRP group associated to the HA router where multiple replicas are marked as master. There is currently no way to rectify the situation, if not manually. One possible corrective measure to this problem would be the implementation of a STONITH solution. Pursuing this approach in Neutron is obviously a non-starter because of the great deal of development and maintenance complexity that will result over time. Using something like Pacemaker/corosync is more appropriate. Alternatively, where the aforementioned corrective measure is not deemed feasible or undesirable, preventive measures can be put in place so that the occurrence of a broken VRRP state is made less likely, for instance by relying on hardware redundancy (e.g. NIC link aggregation). Documenting this would go at great length to set the right expectations when using L3 HA.

The second part to solving this issue is also a preventive measure that however requires enhancing the existing Neutron HA framework in order to implement a more elaborate error detection mechanism, so that the chance of multiple master replicas in a group is indeed reduced. Since it is still tricky to do this within Neutron itself, a solution can be to leverage keepalived’s check script and piggyback fix for bug 1365461 where a user supplied check script is used to determine whether a keepalived replica is healthy or not depending on user provided logic. If a failure is detected, a fail over should occur. This script will have to be invoked with a number of parameters (interface names, IP addresses, router ID’s etc) to augment the existing fault detection strategy. Neutron's sole job is to generate a keepalived configuration that allows the user supplied script to be invoked. It's up to the user to ensure correctness of the logic implemented. Neutron itself can also be extended to emit a log error (warning) at any time more than one agent associated to a HA router is both in master and in alive (dead) state. To the administrator who does monitor logs, this should help provide an alert to initiate manual corrective actions, in case the enhanced detection mechanism proves itself ineffective.

It is noteworthy that the approach of making Neutron more resilient to hardware failures by extending its detection/reporting capabilities has itself potential limitations (that should also be documented):

* it may introduce a potential secu...

After some consideration, and brainstorming, Assaf and I have reached a reasonable consensus on a path to partially resolving this issue. The proposal is two-pronged:

The first part is properly documenting L3 HA, and how to overcome its current/future limitations. For instance, 
it is my understanding that the system environment conditions where this issue can be reproduced are such that the hardware configuration does not provide any form of redundancy, for instance by means of NIC teaming (i.e. link aggregation). Now, under this premise, networking failures may indeed lead to Neutron control plane failures. As for L3, when HA configuration is desired, current Neutron support is such that even a temporary connection loss in the data plane may lead to an unrecoverable invalid state of the VRRP group associated to the HA router where multiple replicas are marked as master. There is currently no way to rectify the situation, if not manually. One possible corrective measure to this problem would be the implementation of a STONITH solution. Pursuing this approach in Neutron is obviously a non-starter because of the great deal of development and maintenance complexity that will result over time. Using something like Pacemaker/corosync is more appropriate. Alternatively, where the aforementioned corrective measure is not deemed feasible or undesirable, preventive measures can be put in place so that the occurrence of a broken VRRP state is made less likely, for instance by relying on hardware redundancy (e.g. NIC link aggregation). Documenting this would go at great length to set the right expectations when using L3 HA.

The second part to solving this issue is also a preventive measure that however requires enhancing the existing Neutron HA framework in order to implement a more elaborate error detection mechanism, so that the chance of multiple master replicas in a group is indeed reduced. Since it is still tricky to do this within Neutron itself, a solution can be to leverage keepalived’s check script and piggyback fix for bug 1365461 where a user supplied check script is used to determine whether a keepalived replica is healthy or not depending on user provided logic. If a failure is detected, a fail over should occur. This script will have to be invoked with a number of parameters (interface names, IP addresses, router ID’s etc) to augment the existing fault detection strategy. Neutron's sole job is to generate a keepalived configuration that allows the user supplied script to be invoked. It's up to the user to ensure correctness of the logic implemented. Neutron itself can also be extended to emit a log error (warning) at any time more than one agent associated to a HA router is both in master and in alive (dead) state. To the administrator who does monitor logs, this should help provide an alert to initiate manual corrective actions, in case the enhanced detection mechanism proves itself ineffective.

It is noteworthy that the approach of making Neutron more resilient to hardware failures by extending its detection/reporting capabilities has itself potential limitations (that should also be documented):

* it may introduce a potential security vulnerability: check scripts are given root access when executed. This is a risk to the inexperienced operator.
 * an overly aggressive detection mechanism may lead to false positives that cause fail over when not necessary.
 * a faulty check script may lead to a state where no master can be elected in the VRRP cluster.

OpenStack Infra (hudson-openstack) on 2016-04-13

Changed in neutron:
assignee:	nobody → Lubosz Kosnik (diltram)
status:	Confirmed → In Progress

Armando Migliaccio (armando-migliaccio) on 2016-06-03

Changed in neutron:
milestone:	newton-1 → newton-2

Lubosz Kosnik (diltram) on 2016-06-13

Changed in neutron:
assignee:	Lubosz Kosnik (diltram) → nobody

Armando Migliaccio (armando-migliaccio) on 2016-07-15

Changed in neutron:
milestone:	newton-2 → newton-3

Armando Migliaccio (armando-migliaccio) on 2016-09-01

Changed in neutron:
milestone:	newton-3 → newton-rc1

Armando Migliaccio (armando-migliaccio) on 2016-09-09

Changed in neutron:
milestone:	newton-rc1 → ocata-1

Armando Migliaccio (armando-migliaccio) on 2016-11-16

Changed in neutron:
milestone:	ocata-1 → ocata-2

Revision history for this message

Adam Spiers (adam.spiers) wrote on 2017-01-03:

#32

Thank you Armando for comment #31 which was an extremely helpful summary of the status quo! Has anything significant changed since you wrote that in April?

I will be at the Atlanta PTG and I'm very keen to meet up to discuss how we can make progress on this. Until now SUSE has been using a completely different approach to neutron HA which has the capability for fencing to avoid the split brain scenario, but has its own drawbacks. I'd be keen to see us converge on an upstream best-of-breed solution.

After some brief chats with Rossella, I got the impression that a reasonable approach might be to introduce a driver-based architecture to the HA code, so that there is one driver for keepalived, and then another one could be added for Pacemaker which could harness Pacemaker's STONITH capabilities. Does this sound plausible?

Revision history for this message

Assaf Muller (amuller) wrote on 2017-01-03:

#33

@Adam, we have a ton of RDO users using the keepalived based solution for what its worth. It's turned on by default and has been for many releases. Fixing this bug has never been a priority and was never requested by an actual user thus far.

Revision history for this message

Adam Spiers (adam.spiers) wrote on 2017-01-04:

#34

Hey Assaf :-) Thanks a lot for the quick reply. Yes, I remember thinking the keepalived solution had a lot of really nice characteristics when you presented it in the Tokyo talk we did together, and I'm not surprised to hear that it's been working well since then.

Hope you don't mind if I check my understanding of the status quo. I notice that this bug also references openstack-manuals, and together with Andrew Beekhof I'm supposed to be helping to ensure that the upstream HA guide documents all this stuff correctly :-)

IIUC, the main failure scenario which could cause this multiple master split brain issue is a loss of connectivity on the data plane where the VRRP traffic is supposed to flow. This could be caused by a dead NIC, or a failure somewhere on the path in between two NICs (e.g. a switch dying, or more likely, getting misconfigured). And IIUC there are two ways to mitigate these failures:

1. As you noted in comment #10, configuring Pacemaker to monitor NICs and fence nodes with failing NICs takes care of this first failure case at least. Depending on exactly how this monitoring is configured, I guess it could also detect failures on the network path. How are you performing the monitoring in RDO - with something like ocf:pacemaker:ping as described here? http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/_moving_resources_due_to_connectivity_changes.html

2. As Armando observed in comment #31, another way to mitigate these failure cases is with the use of redundant hardware, although this would require both NIC teaming and multiple paths through separate switches in order to avoid single points of failure, which might be too expensive for some users' tastes.

Presumably we should document both of these techniques in the HA guide, right?

Finally, you said that "fixing this bug [...] was never requested by an actual user thus far" - do you mean RDO users, or any neutron users in general? I'm trying to understand why this would not be a more common problem. Is it because

- users are typically deploying one or both of the two techniques listed above?
- connectivity failures simply don't happen often?
- if this split brain happens, it doesn't tend to cause problems?
- some other reason(s) I missed?

Thanks a lot!

Hey Assaf :-)  Thanks a lot for the quick reply.  Yes, I remember thinking the keepalived solution had a lot of really nice characteristics when you presented it in the Tokyo talk we did together, and I'm not surprised to hear that it's been working well since then.

Hope you don't mind if I check my understanding of the status quo.  I notice that this bug also references openstack-manuals, and together with Andrew Beekhof I'm supposed to be helping to ensure that the upstream HA guide documents all this stuff correctly :-)

IIUC, the main failure scenario which could cause this multiple master split brain issue is a loss of connectivity on the data plane where the VRRP traffic is supposed to flow.  This could be caused by a dead NIC, or a failure somewhere on the path in between two NICs (e.g. a switch dying, or more likely, getting misconfigured).  And IIUC there are two ways to mitigate these failures:

1. As you noted in comment #10, configuring Pacemaker to monitor NICs and fence nodes with failing NICs takes care of this first failure case at least.  Depending on exactly how this monitoring is configured, I guess it could also detect failures on the network path.  How are you performing the monitoring in RDO - with something like ocf:pacemaker:ping as described here?  http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/_moving_resources_due_to_connectivity_changes.html

2. As Armando observed in comment #31, another way to mitigate these failure cases is with the use of redundant hardware, although this would require both NIC teaming and multiple paths through separate switches in order to avoid single points of failure, which might be too expensive for some users' tastes.

Presumably we should document both of these techniques in the HA guide, right?

Finally, you said that "fixing this bug [...] was never requested by an actual user thus far" - do you mean RDO users, or any neutron users in general?  I'm trying to understand why this would not be a more common problem.  Is it because

- users are typically deploying one or both of the two techniques listed above?
- connectivity failures simply don't happen often?
- if this split brain happens, it doesn't tend to cause problems?
- some other reason(s) I missed?

Thanks a lot!

Armando Migliaccio (armando-migliaccio) on 2017-01-06

Changed in neutron:
milestone:	ocata-2 → ocata-3

Revision history for this message

Alexandra Settle (alexandra-settle) wrote on 2017-01-24:

#35

Adam and Assaf - what's the update here?

Marking as incomplete until we have an understanding for docs and an appropriate outline.

Changed in openstack-manuals:
status:	Confirmed → Incomplete
importance:	Undecided → High
tags:	added: ha-guide

Armando Migliaccio (armando-migliaccio) on 2017-01-26

Changed in neutron:
milestone:	ocata-3 → ocata-rc1

Revision history for this message

Adam Spiers (adam.spiers) wrote on 2017-02-04:

#36

@Alex https://review.openstack.org/#/c/273546/ has now been merged, but I haven't had a chance to look at it and grok the impact on this bug. That'll probably happen in Atlanta, unless someone more knowledgeable is kind enough to explain it before then. But currently the questions I asked in #34 are still open (for me, at least).

Armando Migliaccio (armando-migliaccio) on 2017-02-07

Changed in neutron:
milestone:	ocata-rc1 → pike-1

Alexandra Settle (alexandra-settle) on 2017-02-08

Changed in openstack-manuals:
status:	Incomplete → Confirmed
importance:	High → Low

Revision history for this message

Miguel Lavalle (minsel) wrote on 2017-04-28:

#37

Code fix for this bug was committed here: https://review.openstack.org/#/c/273546/

Changed in neutron:
status:	In Progress → Fix Committed

Ihar Hrachyshka (ihar-hrachyshka) on 2017-07-12

Changed in neutron:
status:	Fix Committed → Fix Released

Alexandra Settle (alexandra-settle) on 2017-09-18

tags:

removed: l3-ha

Revision history for this message

Adam Spiers (adam.spiers) wrote on 2017-09-18:

#38

@Alex Shouldn't this bug keep the l3-ha tag?

Also, I just noticed that apparently I never updated this bug with the results from the Atlanta PTG, but they are summarised here:

http://lists.openstack.org/pipermail/openstack-dev/2017-February/112868.html

In particular, this ethercalc summarises how the various failure modes are or aren't covered depending on how L3 HA is set up:

https://ethercalc.openstack.org/Pike-Neutron-L3-HA

Having said that, rereading the comments here makes me wonder if the spreadsheet was 100% accurate - at very least it was missing the possibility that Pacemaker can monitor networks other than the corosync network, and I have a vague memory of Assaf *maybe* mentioning that RH has it set up to do that (although I see no evidence of that in the https://github.com/beekhof/osp-ha-deploy repo).

Revision history for this message

Frank Kloeker (f-kloeker) wrote on 2019-05-21:

#39

Won't track this issue further more here. Please open a new one on Storyboard for HA Guide if required.

Changed in openstack-manuals:
status:	Confirmed → Won't Fix

Affects		Status	Importance	Assigned to	Milestone
	neutron	Fix Released	High	Unassigned	neutron pike-1 "p-1"
	openstack-manuals	Won't Fix	Low	Unassigned

openstack-manuals

Problem in l3-agent tenant-network interface would cause split-brain in HA router

Bug Description

Duplicates of this bug

Other bug subscribers

Remote bug watches