ha router duplicated routes
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Fix Released | Medium | Unassigned |
Bug Description
In our OpenStack Stein installation (neutron 14.4.2), we upgraded keepalived from 1.3.9 to 2.2.4.
After that, when restarting the neutron-l3-agent, we saw that the state of routers with external gateways could no longer be updated. We ended up with only standby routers, even though keepalived was working fine and we could see one keepalived instance in master state.
After debugging a bit, we found the following traceback:
```
Traceback (most recent call last):
File "/usr/lib/
timer()
File "/usr/lib/
cb(*args, **kw)
File "/usr/lib/
ri.
File "/usr/lib/
ns_name, preserve_ips)
File "/usr/lib/
clean_
File "/usr/lib/
preserve_ips)
File "/usr/lib/
device.
File "/usr/lib/
self.
File "/usr/lib/
self.
File "/usr/lib/
raise exceptions.
File "/usr/lib/
self.
File "/usr/lib/
six.
File "/usr/local/
raise value
File "/usr/lib/
return self._as_
File "/usr/lib/
use_
File "/usr/lib/
namespace=
File "/usr/lib/
log_
File "/usr/lib/
returncode=
neutron_
```
This traceback is triggered because the routers have duplicated routes. Here is an example router:
```
ip netns exec qrouter-
default via x.x.244.67 dev qg-6c2ee5e0-ad proto 18
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad proto 18 scope link
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link
```
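The duplicated entries in the listing above differ only in their `proto` field. A minimal sketch of how such duplicates could be spotted in captured `ip route` output follows. The function name `find_duplicate_routes` and the `SAMPLE` constant are made up for illustration and are not part of neutron or keepalived; the sample text is copied from the listing above.

```python
import re
from collections import defaultdict

# `ip route` output copied from the example router above.
SAMPLE = """\
default via x.x.244.67 dev qg-6c2ee5e0-ad proto 18
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad proto 18 scope link
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link
"""

# Matches the " proto <value>" token pair so it can be ignored when comparing.
_PROTO_RE = re.compile(r"\s+proto\s+\S+")


def find_duplicate_routes(ip_route_output):
    """Group routes that are identical once the proto field is ignored.

    Returns a dict mapping the proto-less route text to the original
    lines, keeping only groups that contain more than one line.
    """
    groups = defaultdict(list)
    for raw in ip_route_output.splitlines():
        line = " ".join(raw.split())  # normalise whitespace
        if not line:
            continue
        key = _PROTO_RE.sub("", line)
        groups[key].append(line)
    return {key: lines for key, lines in groups.items() if len(lines) > 1}


if __name__ == "__main__":
    for key, lines in find_duplicate_routes(SAMPLE).items():
        print(key)
        for line in lines:
            print("    " + line)
```

On the sample, this reports two groups: the default route and the `x.x.244.128/25` route, each present once with `proto 18` and once without a `proto` field.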
At first I thought we had hit a keepalived bug, which I filed here: https:/
From the discussion with pqarmitage on that issue, I understand that newer keepalived versions set `proto 18/keepalived` on the routes they install, and I think this is what breaks neutron.
My assumption about what is happening here is the following.
On a "fresh" router or after a failover, the qg- interface of the router is down, so keepalived is not able to set the virtual routes. neutron then creates the gateway routes through the set_external_
Here is an example of a router getting into this state:
Backup router:
```
ip netns exec qrouter-
1: lo: <LOOPBACK,
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
1755: ha-f64d319f-ed: <BROADCAST,
link/ether fa:16:3e:56:3f:59 brd ff:ff:ff:ff:ff:ff
inet 169.254.197.12/18 brd 169.254.255.255 scope global ha-f64d319f-ed
valid_lft forever preferred_lft forever
inet6 fe80::f816:
valid_lft forever preferred_lft forever
1756: qr-15d63a29-8e: <BROADCAST,
link/ether fa:16:3e:a9:cc:49 brd ff:ff:ff:ff:ff:ff
1757: qg-6c2ee5e0-ad: <BROADCAST,
link/ether fa:16:3e:d2:6c:c7 brd ff:ff:ff:ff:ff:ff
ip netns exec qrouter-
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
```
After a failover to the backup node, where I assume neutron sets the gateway routes:
```
ip netns exec qrouter-
1: lo: <LOOPBACK,
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
1755: ha-f64d319f-ed: <BROADCAST,
link/ether fa:16:3e:56:3f:59 brd ff:ff:ff:ff:ff:ff
inet 169.254.197.12/18 brd 169.254.255.255 scope global ha-f64d319f-ed
valid_lft forever preferred_lft forever
inet 169.254.0.13/24 scope global ha-f64d319f-ed
valid_lft forever preferred_lft forever
inet6 fe80::f816:
valid_lft forever preferred_lft forever
1756: qr-15d63a29-8e: <BROADCAST,
link/ether fa:16:3e:a9:cc:49 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.10/24 scope global qr-15d63a29-8e
valid_lft forever preferred_lft forever
inet6 fe80::f816:
valid_lft forever preferred_lft forever
1757: qg-6c2ee5e0-ad: <BROADCAST,
link/ether fa:16:3e:d2:6c:c7 brd ff:ff:ff:ff:ff:ff
inet x.x.244.116/26 scope global qg-6c2ee5e0-ad
valid_lft forever preferred_lft forever
inet6 x.x:1003::22b/64 scope global nodad
valid_lft forever preferred_lft forever
inet6 fe80::f816:
valid_lft forever preferred_lft forever
ip netns exec qrouter-
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link
```
And then after a neutron-l3-agent restart, which triggers a keepalived reload:
```
ip netns exec qrouter-
default via x.x.244.67 dev qg-6c2ee5e0-ad proto 18
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad proto 18 scope link
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link
```
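To make the effect of the restart explicit, the two route listings (after the failover and after the agent restart) can be compared. The sketch below is only an illustration: `added_routes`, `BEFORE_RESTART` and `AFTER_RESTART` are hypothetical names, and the route text is copied from the two listings above.

```python
# Routes after the failover, before the neutron-l3-agent restart.
BEFORE_RESTART = """\
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link
"""

# Routes after the neutron-l3-agent restart (keepalived reload).
AFTER_RESTART = """\
default via x.x.244.67 dev qg-6c2ee5e0-ad proto 18
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad proto 18 scope link
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link
"""


def _normalised_lines(text):
    """Return the non-empty lines of text with whitespace collapsed."""
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]


def added_routes(before, after):
    """Return the routes present in `after` but not in `before`, in order."""
    seen = set(_normalised_lines(before))
    return [line for line in _normalised_lines(after) if line not in seen]


if __name__ == "__main__":
    for route in added_routes(BEFORE_RESTART, AFTER_RESTART):
        print(route)
```

The only new entries are the two routes carrying `proto 18`, which is consistent with the assumption that the keepalived reload adds its own copies next to the routes that already exist.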
Changed in neutron: status: New → Incomplete
Changed in neutron: importance: Low → Medium
As a workaround, I tested setting `proto 0` on all virtual_routes inside keepalived.conf by adding `output += ' proto 0'` to the build_config function of KeepalivedVirtualRoute (https://opendev.org/openstack/neutron/src/branch/master/neutron/agent/linux/keepalived.py#L140).
This works fine as a workaround and fixes the issue for me, but it does not feel like the right solution.
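For readers who want to see the shape of that change, here is a minimal sketch. `VirtualRouteSketch` is a hypothetical, simplified stand-in written for this illustration; it is not the actual neutron class, and its constructor arguments and the `0.0.0.0/0` destination are assumptions. Only the `output += ' proto 0'` line comes from the workaround described above.

```python
class VirtualRouteSketch:
    """Hypothetical stand-in for a keepalived virtual_routes entry builder."""

    def __init__(self, destination, nexthop=None, interface_name=None,
                 scope=None):
        self.destination = destination
        self.nexthop = nexthop
        self.interface_name = interface_name
        self.scope = scope

    def build_config(self):
        """Return one line for the virtual_routes block of keepalived.conf."""
        output = self.destination
        if self.nexthop:
            output += ' via %s' % self.nexthop
        if self.interface_name:
            output += ' dev %s' % self.interface_name
        if self.scope:
            output += ' scope %s' % self.scope
        # Workaround from this report: always append an explicit proto 0.
        output += ' proto 0'
        return output


if __name__ == "__main__":
    # Illustrative values taken from the example router in this report.
    default_route = VirtualRouteSketch(
        '0.0.0.0/0', nexthop='x.x.244.67', interface_name='qg-6c2ee5e0-ad')
    onlink_route = VirtualRouteSketch(
        'x.x.244.128/25', interface_name='qg-6c2ee5e0-ad', scope='link')
    print(default_route.build_config())
    print(onlink_route.build_config())
```

With this change every generated route line ends in `proto 0`, regardless of the keepalived version in use, which is part of why it reads as a workaround rather than a proper fix.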