Comment 4 for bug 1520517

Tore Anderson (toreanderson) wrote :

Hello Assaf, I'm happy to help out with any further information you might need:

The issue does indeed impact all address assignment modes. This is as expected, as address assignment and default route assignment are two orthogonal mechanisms in IPv6, and it is the default route that is continuously removed and re-added. The instance's assigned address is stable and is not impacted.

It does break IPv6 entirely, to the point where the instance's external network connectivity is unusable for any sort of production purpose. The only thing that keeps working reliably is internal traffic between instances on the same subnet. The way I see it, the issue is without question critical - an OpenStack infrastructure where the instances do not have working network connectivity is only marginally more useful than having no OpenStack infrastructure at all.

My one-line patch to Keepalived got quickly applied upstream, see https://github.com/acassen/keepalived/commit/be69d87151325b7d906ade988b519ca35fbb25cf. However, as you rightly point out, this does not help in the short or probably even medium term as the fix is not included in any stable release of Keepalived, much less any binary distro packages.

One could reasonably argue that the bug here is really in Keepalived, and not in Neutron. However, before commit 5d38dc5 it had virtually no impact, as the problematic (default-route-removing) unsolicited Neighbour Advertisements were sent only for a few seconds after a Neutron router failover event. This prolonged the network downtime/convergence time following such a failure, but only by a few seconds. That was probably not even noticeable amidst all the other transient connectivity issues an instance would likely experience during a router failover. After commit 5d38dc5, however, the network became perpetually broken. Therefore I do think that this urgently needs to be fixed/reverted in Neutron.
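To illustrate how an unsolicited Neighbour Advertisement can remove a default route, here is a minimal sketch. It assumes the mechanism is the one in RFC 4861 section 7.2.5, where a host drops a neighbour from its Default Router List once an advertisement from it arrives with the Router (R) flag clear. The function names and the simplified host model are hypothetical and not taken from Keepalived or Neutron.

```python
import struct

# Flag bits in the 32-bit word that follows the ICMPv6 header of a
# Neighbour Advertisement (RFC 4861, section 4.4).
NA_FLAG_ROUTER = 0x80000000
NA_FLAG_SOLICITED = 0x40000000
NA_FLAG_OVERRIDE = 0x20000000


def na_flags(message: bytes) -> dict:
    """Decode the R/S/O flags of an ICMPv6 Neighbour Advertisement (type 136)."""
    msg_type, _code, _checksum, flags = struct.unpack("!BBHI", message[:8])
    if msg_type != 136:
        raise ValueError("not a Neighbour Advertisement")
    return {
        "router": bool(flags & NA_FLAG_ROUTER),
        "solicited": bool(flags & NA_FLAG_SOLICITED),
        "override": bool(flags & NA_FLAG_OVERRIDE),
    }


def update_default_routers(default_routers: set, source: str, is_router: bool) -> set:
    """Hypothetical, simplified host model of RFC 4861 section 7.2.5.

    An advertisement with the R flag clear removes the sender from the
    Default Router List. An advertisement with the R flag set does not add
    anything; only Router Advertisements do that.
    """
    if not is_router:
        return default_routers - {source}
    return set(default_routers)
```

In this model, a router that keeps sending advertisements with the R flag clear keeps getting removed, and the next Router Advertisement adds it back, which would match the remove/re-add cycle described above.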

A final comment: The solution for bug #1453855 in commit 5d38dc5 is far from ideal. Even if it didn't cause the issues described in this report, I'd still suggest you reconsider that approach. The reason is that it causes a constant and perpetual 0.5pps stream of multicast IPv6 traffic per subnet (not including the IPv4 gratuitous ARP packets that also occur). I believe most operators are interested in limiting the amount of so-called "BUM" traffic (Broadcast, Unknown Unicast, and Multicast) in their networks because this traffic is much more expensive/difficult to deliver than regular unicast traffic. Commit 5d38dc5 goes the exact opposite way - endlessly spamming the network with tons of BUM traffic in order to solve a transient and short-lived service ordering/dependency issue occurring (as I understand it) only when a network node running the Neutron L3 agent boots up. It's an extremely blunt approach.
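To put the 0.5pps figure in perspective, here is a rough back-of-the-envelope sketch. The rate per subnet is the one quoted above; the subnet count of 1000 is a purely hypothetical deployment size, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def multicast_packets(rate_pps: float, subnets: int, seconds: int) -> float:
    """Total multicast packets sent at a fixed per-subnet rate over a time span."""
    return rate_pps * subnets * seconds


# One subnet at 0.5pps, over one day.
per_subnet_per_day = multicast_packets(0.5, 1, SECONDS_PER_DAY)

# Hypothetical deployment with 1000 subnets: aggregate packets per second.
aggregate_pps = multicast_packets(0.5, 1000, 1)

print(per_subnet_per_day)  # 43200.0
print(aggregate_pps)       # 500.0
```

This counts only the IPv6 multicast stream; the IPv4 gratuitous ARP packets mentioned above would come on top of it.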

Anyway, please do not hesitate to ask again if you need any further information from me - I'm very interested in helping out any way I can in getting this issue promptly resolved.