Comment 34 for bug 1375625

Adam Spiers (adam.spiers) wrote:

Hey Assaf :-) Thanks a lot for the quick reply. Yes, I remember thinking the keepalived solution had a lot of really nice characteristics when you presented it in the Tokyo talk we did together, and I'm not surprised to hear that it's been working well since then.

Hope you don't mind if I check my understanding of the status quo. I notice that this bug also references openstack-manuals, and together with Andrew Beekhof I'm supposed to be helping to ensure that the upstream HA guide documents all this stuff correctly :-)

IIUC, the main failure scenario which could cause this multiple-master split-brain issue is a loss of connectivity on the data plane where the VRRP traffic is supposed to flow. This could be caused by a dead NIC, or by a failure somewhere on the path between two NICs (e.g. a switch dying or, more likely, getting misconfigured). And IIUC there are two ways to mitigate these failures:

1. As you noted in comment #10, configuring Pacemaker to monitor NICs and fence nodes with failing NICs takes care of at least the first failure case. Depending on exactly how this monitoring is configured, I guess it could also detect failures on the network path. How are you performing the monitoring in RDO - with something like ocf:pacemaker:ping, as described at http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/_moving_resources_due_to_connectivity_changes.html ? (I've sketched roughly what I mean just below this list.)

2. As Armando observed in comment #31, another way to mitigate these failure cases is to use redundant hardware, although this would require both NIC teaming and multiple paths through separate switches in order to avoid single points of failure, which might be too expensive for some users' tastes.
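
To make question 1 concrete, here is roughly the kind of configuration I had in mind - just a sketch, assuming the pcs-style syntax from the page linked above, with the monitored address and the protected resource name as placeholders I made up:

    # Run a ping resource clone on every node, pinging a data-plane address
    # (192.0.2.1 is just a placeholder for whatever makes sense to ping)
    pcs resource create ping ocf:pacemaker:ping \
        dampen=5s multiplier=1000 host_list=192.0.2.1 --clone

    # Keep a resource off nodes that lose data-plane connectivity
    # ("my-resource" is a placeholder resource name)
    pcs constraint location my-resource rule score=-INFINITY \
        pingd lt 1 or not_defined pingd

Of course that only moves resources away from a node with bad connectivity rather than fencing it, so I'm curious whether you're doing something along these lines or actually fencing the node as you described.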

Presumably we should document both of these techniques in the HA guide, right?

Finally, you said that "fixing this bug [...] was never requested by an actual user thus far" - do you mean RDO users, or any neutron users in general? I'm trying to understand why this would not be a more common problem. Is it because

- users are typically deploying one or both of the two techniques listed above?
- connectivity failures simply don't happen often?
- if this split-brain happens, it doesn't tend to cause problems?
- some other reason(s) I missed?

Thanks a lot!