Comment 31 for bug 1375625

Armando Migliaccio (armando-migliaccio) wrote:

After some consideration and brainstorming, Assaf and I have reached a reasonable consensus on a path to partially resolving this issue. The proposal is two-pronged:

The first part is properly documenting L3 HA and how to overcome its current and future limitations. For instance, it is my understanding that this issue can be reproduced in environments where the hardware configuration does not provide any form of redundancy, such as NIC teaming (i.e. link aggregation). Under this premise, networking failures may indeed lead to Neutron control plane failures. As for L3, when HA configuration is desired, current Neutron support is such that even a temporary connection loss in the data plane may leave the VRRP group associated with the HA router in an unrecoverable invalid state where multiple replicas are marked as master. There is currently no way to rectify the situation other than manually.

One possible corrective measure would be the implementation of a STONITH solution. Pursuing this approach in Neutron is obviously a non-starter because of the great deal of development and maintenance complexity that would result over time; using something like Pacemaker/Corosync is more appropriate. Alternatively, where that corrective measure is deemed infeasible or undesirable, preventive measures can be put in place to make the occurrence of a broken VRRP state less likely, for instance by relying on hardware redundancy (e.g. NIC link aggregation). Documenting this would go a long way toward setting the right expectations when using L3 HA.
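To make the failure mode concrete, here is a simplified sketch of the kind of keepalived configuration Neutron generates for each HA router replica; the interface name, virtual router ID, and addresses are illustrative, not taken from this bug:

    vrrp_instance VR_1 {
        state BACKUP                 # every replica starts as backup; VRRP elects one master
        interface ha-3a4b5c6d        # dedicated HA interface (illustrative name)
        virtual_router_id 1
        priority 50
        nopreempt                    # do not flap back when a failed master recovers
        advert_int 2                 # backups promote themselves after missed advertisements
        virtual_ipaddress {
            169.254.0.1/24 dev ha-3a4b5c6d
        }
    }

Master election depends entirely on VRRP advertisements flowing over the HA network; if that traffic is interrupted, each backup stops hearing the master and promotes itself, which is precisely how the multiple-master state described above arises.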

The second part to solving this issue is also a preventive measure, but one that requires enhancing the existing Neutron HA framework with a more elaborate error detection mechanism, so that the chance of multiple master replicas in a group is indeed reduced. Since it is still tricky to do this within Neutron itself, one solution is to leverage keepalived's check script facility and piggyback on the fix for bug 1365461, where a user-supplied check script determines whether a keepalived replica is healthy or not based on user-provided logic. If a failure is detected, a failover should occur. This script will have to be invoked with a number of parameters (interface names, IP addresses, router IDs, etc.) to augment the existing fault detection strategy; a sketch of what this could look like follows this paragraph. Neutron's sole job is to generate a keepalived configuration that allows the user-supplied script to be invoked; it is up to the user to ensure the correctness of the logic implemented. Neutron itself can also be extended to emit an error (or warning) in the logs whenever more than one agent associated with an HA router reports as master while alive (or dead). For the administrator who does monitor logs, this should help provide an alert to initiate manual corrective actions, in case the enhanced detection mechanism proves itself ineffective.
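As a rough sketch (the exact mechanism was still under discussion at the time, and the script path and arguments below are hypothetical placeholders for whatever Neutron would pass in), keepalived's vrrp_script/track_script facility would let the generated configuration invoke the user-supplied check:

    vrrp_script ha_health_check {
        script "/var/lib/neutron/ha_check.sh eth1 192.168.1.1"   # hypothetical script and arguments
        interval 5                                               # run the check every 5 seconds
        fall 2                                                   # unhealthy after 2 consecutive failures
        rise 2                                                   # healthy again after 2 consecutive successes
    }

    vrrp_instance VR_1 {
        # ... other directives as in the instance sketch above ...
        track_script {
            ha_health_check    # without a weight, a failing check moves the replica to FAULT state
        }
    }

A user-supplied check could be as simple as probing data plane connectivity; keepalived treats a non-zero exit status as a failure:

    #!/bin/sh
    # Hypothetical check script: report failure when the given gateway is unreachable.
    IFACE="$1"                    # e.g. passed in by the generated configuration
    GATEWAY="$2"
    ping -c 1 -W 1 "$GATEWAY" > /dev/null 2>&1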

It is noteworthy that the approach of making Neutron more resilient to hardware failures by extending its detection/reporting capabilities itself has potential limitations (which should also be documented):

 * it may introduce a potential security vulnerability: check scripts are given root access when executed, which is a risk for the inexperienced operator.
 * an overly aggressive detection mechanism may lead to false positives that trigger failover when it is not necessary.
 * a faulty check script may lead to a state where no master can be elected in the VRRP cluster.