ha router duplicated routes

Bug #1956846 reported by Maximilian Stinsky
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

In our openstack stein installation (neutron 14.4.2) we upgraded keepalived from 1.3.9 to 2.2.4.
After that, when restarting the neutron-l3-agent, we saw that the state of routers with external gateways could no longer be updated. We ended up with only standby routers, even though keepalived was working fine and we could see one keepalived instance in master state.

After debugging a bit we found the following traceback:
```
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 461, in fire_timers
    timer()
  File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
    cb(*args, **kw)
  File "/usr/lib/python3/dist-packages/neutron/agent/l3/ha.py", line 166, in _enqueue_state_change
    ri.set_external_gw_port_link_status(link_up=True, set_gw=True)
  File "/usr/lib/python3/dist-packages/neutron/agent/l3/ha_router.py", line 547, in set_external_gw_port_link_status
    ns_name, preserve_ips)
  File "/usr/lib/python3/dist-packages/neutron/agent/l3/router_info.py", line 750, in _external_gateway_settings
    clean_connections=True)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/interface.py", line 179, in init_router_port
    preserve_ips)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/interface.py", line 203, in set_onlink_routes
    device.route.add_onlink_route(route)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 662, in add_onlink_route
    self.add_route(cidr, scope='link')
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 712, in add_route
    self._run_as_root_detect_device_not_found([ip_version], args)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 615, in _run_as_root_detect_device_not_found
    raise exceptions.DeviceNotFoundError(device_name=self.name)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 610, in _run_as_root_detect_device_not_found
    return self._as_root(options, tuple(args))
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 407, in _as_root
    use_root_namespace=use_root_namespace)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 121, in _as_root
    namespace=namespace)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py", line 129, in _execute
    log_fail_as_error=self.log_fail_as_error)
  File "/usr/lib/python3/dist-packages/neutron/agent/linux/utils.py", line 147, in execute
    returncode=returncode)
neutron_lib.exceptions.ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: File exists
```

This traceback is triggered because the routers have duplicated routes. Here is an example router:
```
ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r
default via x.x.244.67 dev qg-6c2ee5e0-ad proto 18
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad proto 18 scope link
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link
```

First I thought that we hit a keepalived bug which I filed here: https://github.com/acassen/keepalived/issues/2076

What I learned from the discussion with pqarmitage in that issue is that newer keepalived versions set `proto 18` (`proto keepalived`) on the routes they add, and I think this is what breaks neutron.

So what I assume is happening here is the following.
On a "fresh" router or after a failover, the qg- interface of the router is down, so keepalived is not able to set the virtual routes. Neutron then creates the gateway routes through the set_external_gw_port_link_status function (https://opendev.org/openstack/neutron/src/tag/14.4.2/neutron/agent/l3/ha_router.py#L528) after it brings up the qg- interface. When I now restart the neutron-l3-agent, it reloads keepalived, which triggers keepalived to recreate the virtual_routes it could not create while the qg- interface was down. Because of the new behaviour, keepalived creates the same routes with the addition of `proto 18`, and we end up with duplicated routes. After that, neutron fails on the `ip route replace` command with the `RTNETLINK answers: File exists` error.
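
The duplicated state described above can be checked mechanically. The following is a minimal Python sketch, not neutron code: it parses plain `ip route` output and groups routes by destination, next hop and device while ignoring the protocol. The function names are made up for illustration, and the parser assumes that every token after the destination comes as a key/value pair (flags such as `onlink` would need extra handling).

```python
from collections import defaultdict


def parse_route(line):
    """Split one line of `ip route` output into (destination, attributes)."""
    tokens = line.split()
    destination, rest = tokens[0], tokens[1:]
    # Assumption: the remaining tokens are key/value pairs (via, dev, proto, ...).
    return destination, dict(zip(rest[0::2], rest[1::2]))


def find_duplicate_routes(lines):
    """Return routes that appear more than once when the protocol is ignored."""
    seen = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        destination, attrs = parse_route(line)
        key = (destination, attrs.get("via"), attrs.get("dev"))
        # None means `ip route` printed no proto for this entry.
        seen[key].append(attrs.get("proto"))
    return {key: protos for key, protos in seen.items() if len(protos) > 1}


# Routes taken from the example router in this report.
ROUTES = """\
default via x.x.244.67 dev qg-6c2ee5e0-ad proto 18
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
x.x.244.128/25 dev qg-6c2ee5e0-ad proto 18 scope link
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link
""".splitlines()

for key, protos in find_duplicate_routes(ROUTES).items():
    print(key, protos)
```

Run against the routing table of an affected namespace, this should list the default route and the on-link route as the two entries that exist once with `proto 18` and once without a printed protocol.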

Here is an example of a router getting into this state.
Backup router:
```
ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
1755: ha-f64d319f-ed: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:56:3f:59 brd ff:ff:ff:ff:ff:ff
    inet 169.254.197.12/18 brd 169.254.255.255 scope global ha-f64d319f-ed
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe56:3f59/64 scope link
       valid_lft forever preferred_lft forever
1756: qr-15d63a29-8e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:a9:cc:49 brd ff:ff:ff:ff:ff:ff
1757: qg-6c2ee5e0-ad: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether fa:16:3e:d2:6c:c7 brd ff:ff:ff:ff:ff:ff

ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
```

After a failover to the backup node, where I assume neutron sets the gateway routes:
```
ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
1755: ha-f64d319f-ed: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:56:3f:59 brd ff:ff:ff:ff:ff:ff
    inet 169.254.197.12/18 brd 169.254.255.255 scope global ha-f64d319f-ed
       valid_lft forever preferred_lft forever
    inet 169.254.0.13/24 scope global ha-f64d319f-ed
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe56:3f59/64 scope link
       valid_lft forever preferred_lft forever
1756: qr-15d63a29-8e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:a9:cc:49 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.10/24 scope global qr-15d63a29-8e
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fea9:cc49/64 scope link nodad
       valid_lft forever preferred_lft forever
1757: qg-6c2ee5e0-ad: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:d2:6c:c7 brd ff:ff:ff:ff:ff:ff
    inet x.x.244.116/26 scope global qg-6c2ee5e0-ad
       valid_lft forever preferred_lft forever
    inet6 x.x:1003::22b/64 scope global nodad
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fed2:6cc7/64 scope link nodad
       valid_lft forever preferred_lft forever

ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link
```

And then after a neutron-l3-agent restart, which triggers a keepalived reload:
```
ip netns exec qrouter-76f69b0d-c9ac-4a98-851a-f74b23b2de49 ip r
default via x.x.244.67 dev qg-6c2ee5e0-ad proto 18
default via x.x.244.67 dev qg-6c2ee5e0-ad
10.0.0.0/24 dev qr-15d63a29-8e proto kernel scope link src 10.0.0.10
169.254.0.0/24 dev ha-f64d319f-ed proto kernel scope link src 169.254.0.13
169.254.192.0/18 dev ha-f64d319f-ed proto kernel scope link src 169.254.197.12
x.x.244.64/26 dev qg-6c2ee5e0-ad proto kernel scope link src x.x.244.116
x.x.244.128/25 dev qg-6c2ee5e0-ad proto 18 scope link
x.x.244.128/25 dev qg-6c2ee5e0-ad scope link
```

Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

As a workaround I tested setting `proto 0` on all virtual_routes in keepalived.conf by adding `output += ' proto 0'` to the build_config function of KeepalivedVirtualRoute. (https://opendev.org/openstack/neutron/src/branch/master/neutron/agent/linux/keepalived.py#L140)

This works fine as a workaround and fixes the issue for me, but it does not feel like the right solution.
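
To make the shape of this workaround concrete, here is a minimal sketch in Python. `VirtualRoute` is a hypothetical, simplified stand-in written for this illustration; it is not neutron's `KeepalivedVirtualRoute`, and its attribute names and the optional `proto` argument are assumptions. The `proto` keyword is spelled as in the comment above.

```python
class VirtualRoute:
    """Hypothetical, simplified stand-in for a keepalived virtual_routes entry."""

    def __init__(self, destination, nexthop=None, interface_name=None,
                 scope=None, proto=None):
        self.destination = destination
        self.nexthop = nexthop
        self.interface_name = interface_name
        self.scope = scope
        self.proto = proto

    def build_config(self):
        """Render one line for the virtual_routes block."""
        output = self.destination
        if self.nexthop:
            output += " via %s" % self.nexthop
        if self.interface_name:
            output += " dev %s" % self.interface_name
        if self.scope:
            output += " scope %s" % self.scope
        if self.proto is not None:
            # The workaround: tag the route with an explicit protocol.
            output += " proto %s" % self.proto
        return output


# Example values only; the addresses and device name are placeholders.
route = VirtualRoute("0.0.0.0/0", nexthop="192.0.2.1",
                     interface_name="qg-example", proto="0")
print(route.build_config())
```

Whether a given protocol value actually avoids duplicates depends on which protocol the routes created by neutron itself carry, since the two entries only collapse into one when both sides use the same value.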

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Maximilian:

Since [1] (available in the Train release), we define the default protocol "static" for all routes. When adding (or replacing) a route, the protocol acts as a filter when two otherwise identical routes exist, as in the case you are reporting.

E.g.:
default via 192.168.20.10 dev qg-e07e01ef-c5 proto 112
default via 192.168.20.10 dev qg-e07e01ef-c5 proto static
10.10.0.0/26 dev qr-76e31f4e-72 proto kernel scope link src 10.10.0.1
169.254.0.0/24 dev ha-12c1f503-78 proto kernel scope link src 169.254.0.97
169.254.192.0/18 dev ha-12c1f503-78 proto kernel scope link src 169.254.194.195
192.168.20.0/24 dev qg-e07e01ef-c5 proto kernel scope link src 192.168.20.235

Although this should be considered when using HA and "keepalived", the shared code between DVR, HA and legacy routers is now the same [2]. In "set_onlink_routes" we should consider the HA router case, to avoid defining routes that will be set by "keepalived". However, this does not interfere with the new code [1].

What I don't understand is the protocol number. "keepalived" will set protocol 112 (VRRP), not 18. Can you confirm that?

Regards.

[1]https://review.opendev.org/c/openstack/neutron/+/661981
[2]https://review.opendev.org/c/openstack/neutron/+/622449/5/neutron/agent/linux/interface.py

Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

Hi Rodolfo,

Thanks for the answer. So my problem will more or less be fixed by an upgrade to Train. Good to know; I will live with the workaround until we are able to upgrade.

keepalived was allocated a reserved protocol number, 18, in Linux 5.8.
keepalived >= 2.1.4 uses this number (https://github.com/acassen/keepalived/commit/f34f8777d121513837dc784bceddb4859f0d9780).

Changed in neutron:
status: New → Incomplete
Revision history for this message
Patrick Quentin Armitage (pqa) wrote : Re: [Bug 1956846] Re: ha router duplicated routes

Rodolfo,

When searching to try and understand what was causing this, I came
across https://wiki.openstack.org/wiki/Neutron/L3_High_Availability_VRRP, and in
particular the template configuration.

There are a couple of issues with the template:
1. Do not include the track_interface block. keepalived logs a configuration
error, since the interface is already tracked due to it being the interface of
the vrrp instance.
2. 'state SLAVE' is wrong, it should be 'state BACKUP'. However, it is better
not to specify 'state ...' at all, and leave keepalived to sort it out.

Since I am not an OpenStack user I didn't know where to report it. Is it
something you can resolve, or report in the correct place?

The reason for the change from proto 112 to proto 18 is, as Maximilian said,
proto 18 was allocated for keepalived in Linux 5.8. Prior to that I had to make
up a number to use, and I used 112 since that is the VRRP protocol number. We
use the proto field so that, if keepalived crashes and restarts, it can
determine which routes it previously added, and then it removes them when
tidying up the residual configuration.
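
The idea of using the proto field as an ownership tag can be illustrated with a short Python sketch. This is not keepalived's code; the function name and the dictionary layout of a route are assumptions made for the example. Only the two protocol numbers come from this thread.

```python
# Route protocol numbers named in this thread: 18 is the number allocated to
# keepalived in Linux 5.8; 112 is the value keepalived used before that.
KEEPALIVED_PROTOS = {18, 112}


def split_by_owner(routes):
    """Partition routes into those tagged by keepalived and all others."""
    owned, foreign = [], []
    for route in routes:
        if route.get("proto") in KEEPALIVED_PROTOS:
            owned.append(route)
        else:
            foreign.append(route)
    return owned, foreign


# Illustrative data: 4 is the kernel's "static" protocol, 2 is "kernel".
routes = [
    {"dst": "default", "dev": "qg-example", "proto": 18},
    {"dst": "default", "dev": "qg-example", "proto": 4},
    {"dst": "10.0.0.0/24", "dev": "qr-example", "proto": 2},
]

owned, foreign = split_by_owner(routes)
print("would be cleaned up after a restart:", owned)
print("left untouched:", foreign)
```

A route that another program added with a different protocol is therefore invisible to this kind of cleanup, which is why the same destination can end up in the table twice.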


Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

The workaround we tested, as mentioned in comment #1, is sadly not working: `protocol 0` in keepalived.conf is still a protocol, so we end up with duplicated routes anyway.
e.g.:
default via x.x.244.67 dev qg-6c2ee5e0-ad proto unspec
default via x.x.244.67 dev qg-6c2ee5e0-ad

The question for us now is whether there is an easy workaround to get neutron Stein working with keepalived >= 2.0.1.
@Rodolfo we were thinking about removing the virtual_route part from the keepalived config until we can upgrade to Train. Our assumption is that neutron is already managing all the routes when failing over. Is there any particular reason the virtual_routes are needed inside keepalived?

Revision history for this message
Patrick Quentin Armitage (pqa) wrote :

Maximilian,

I have looked at this further, and I think if you specify 'proto boot' or 'proto
3' it should do what you want.

If you execute 'ip -d route show' you should see that the default proto is boot.
See keepalived commit edd8326 to see what the old code did.


Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

That seems to work like a charm. I'll test it a bit more in our lab environment, but setting `proto boot` in the keepalived template for the virtual_routes seems to be the easiest solution.

Thanks again @Patrick Quentin Armitage !

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

Please, upgrade to Train. As commented, since this version any route without a defined protocol will be created/updated with protocol "static". That will skip the issue you are facing without any workaround or specific modification.

@Patrick, L3 agent does include the "track_interface" section in the keepalived config [1]. Valid states for the "state" parameter are ['MASTER', 'BACKUP'] [2]. I don't know where you found the "SLAVE" value, but if it is present, we should change it.

I'll close this bug. Please feel free to reopen with new information if needed.

Regards.

[1]https://github.com/openstack/neutron/blob/9a90af915dade2c7bbb4fbbc98902c954f6a00c3/neutron/agent/linux/keepalived.py#L252-L256
[2]https://github.com/openstack/neutron/blob/9a90af915dade2c7bbb4fbbc98902c954f6a00c3/neutron/agent/linux/keepalived.py#L33

Changed in neutron:
status: Incomplete → Invalid
Revision history for this message
Keepalived (keepalived-project) wrote :

Rodolfo,

The keepalived configuration that I was referring to is in the Appendix of https://wiki.openstack.org/wiki/Neutron/L3_High_Availability_VRRP

It has:
interface ${L3_AGENT.get_ha_device_name(TRACK_PORT_ID)}
    virtual_router_id ${VR_ID}
    priority ${PRIORITY}
    track_interface {
        ${L3_AGENT.get_ha_device_name(TRACK_PORT_ID)}
    }

What is wrong here is specifying the same interface in both 'interface' and in the 'track_interface' block. By virtue of the 'interface' specification, the configured interface will be tracked, and so keepalived issues a warning when the same interface is specified in the 'track_interface' block.

In the same appendix, it has
    % if TYPE == 'MASTER':
    state MASTER
    % else:
    state SLAVE
    % endif

It is in fact completely unnecessary to specify the initial state; unless the priority of the VRRP instance is 255 the VRRP instance will always start in backup state since it has to ensure that there is no other instance in master state with the same or a higher priority.
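
The start-up rule described in the previous paragraph can be written down as a tiny Python sketch. This is an illustration of the rule as stated in this comment, not keepalived code; the function name is made up, and the accepted priority range of 1 to 255 is the range VRRP defines for a configured router.

```python
def initial_vrrp_state(priority):
    """Return the state a VRRP instance starts in, per the rule stated above."""
    if not 1 <= priority <= 255:
        raise ValueError("VRRP priority must be between 1 and 255")
    # Only priority 255 starts as master; every other instance starts in
    # backup state and waits to see whether a better master already exists.
    return "MASTER" if priority == 255 else "BACKUP"


for priority in (50, 254, 255):
    print(priority, initial_vrrp_state(priority))
```

Under this rule a configured `state` line cannot change the outcome for any priority below 255, which is why leaving it out of the template loses nothing.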

Revision history for this message
Keepalived (keepalived-project) wrote :

Rodolfo, Maximilian,

It isn't clear to me that the change in Neutron to adding routes with protocol "static" rather than the old (default) protocol "boot" will make any difference for Maximilian; he will still end up with duplicate routes which seems to be the basic problem.

Rodolfo, you state 'upgrade to Train. As commented, since this version any route without a defined protocol will be created/updated with protocol "static".' Does this imply that if Maximilian specifies the routes to be "proto keepalived" or "proto 18", Neutron will create the routes with that protocol? A quick look at the code you have linked to doesn't appear to support a protocol being configured.

I presume that when Maximilian upgrades to Train, if he wants to use the same workaround as he is using now by specifying "proto boot", he will need to change that to "proto static".

Rodolfo, as you mention above, I think the real cause of the problem is that Neutron is adding the routes which keepalived is managing. I cannot see any reason for Neutron doing that, and it means that the virtual_routes functionality in keepalived cannot be used properly: because Neutron has created them, the routes will exist even when the VRRP instance is not in master state. The purpose of the virtual_routes, virtual_rules and virtual_ipaddresses is that they are only configured when the VRRP instance is in master state.

It isn't clear to me in the code where Neutron is creating the routes, but does it also create the virtual IP addresses and virtual IP rules that keepalived manages?

Revision history for this message
Keepalived (keepalived-project) wrote :

Rodolfo,

When I was looking at the keepalived code in Neutron, I noticed a reference to keepalived bugs [1]. If there are any bugs in keepalived that are detrimental to Neutron, it would be helpful if they could be reported at https://github.com/acassen/keepalived/issues so that we can endeavour to fix them.

What is described at [1] is very strange, and the workaround seems quite problematic to me. I am not aware what keepalived bugs are being referred to, and I am not aware of a problem if the "primary VIP" is changed on a reload, but if there is a problem we will fix it. It seems quite strange to me, and problematic, that Neutron should make up and configure an additional address, which could very easily be used by another device on the network, and then allow keepalived to configure that address on the system, potentially thereby causing a duplicate address on the network.

Can we work together to try and sort out what is happening here? It doesn't feel right to me. I will happily work to resolve any bugs identified in keepalived.

[1]https://github.com/openstack/neutron/blob/master/neutron/agent/linux/keepalived.py#L273-L278

Changed in neutron:
status: Invalid → Incomplete
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

This documentation [1] is outdated. We should document how keepalived is configured now.

This is an example of configuration [2]. We don't create an "interface" but a "vrrp_instance" section, named with a random ID (in this case, VR_252).

According to the documentation [3], the "state" config knob is the initial state of this instance of the virtual router. Of course, as you comment, keepalived will vote to actually decide which one is the active one.

As commented before, the method initializing the router port [4] is the same for any router, independently of the type (legacy, DVR, HA, DVR-HA). Since Train, the routes written in the router namespace have the protocol defined ("static" by default). That works like the workaround Maximilian described.

If you want to push a patch to handle this scenario, you are more than welcome. Any help is always accepted in OpenStack.

Regards.

[1]https://wiki.openstack.org/wiki/Neutron/L3_High_Availability_VRRP
[2]https://paste.opendev.org/show/812099/
[3]https://manpages.debian.org/unstable/keepalived/keepalived.conf.5.en.html
[4]https://github.com/openstack/neutron/blob/5730eae0e96ea68a700bb7afd280cdd4284532ba/neutron/agent/linux/interface.py#L154-L177

Changed in neutron:
status: Incomplete → New
importance: Undecided → Low
Revision history for this message
Maximilian Stinsky (mstinsky) wrote :

Hi Rodolfo,

We upgraded our test environment to Ussuri and can verify that we now have duplicated routes with proto 18 and proto static:
ip netns exec qrouter-ae1b03de-1e17-42f4-9314-be99e29654e2 ip r
default via x.x.x.67 dev qg-9ca358b2-a2 proto 18
default via x.x.x.67 dev qg-9ca358b2-a2 proto static
10.2.0.0/24 dev qr-4ad7e7be-65 proto kernel scope link src 10.2.0.1
169.254.0.0/24 dev ha-783c9ba6-72 proto kernel scope link src 169.254.0.40
169.254.192.0/18 dev ha-783c9ba6-72 proto kernel scope link src 169.254.193.156
x.x.x.64/26 dev qg-9ca358b2-a2 proto kernel scope link src x.x.x.89
x.x.x.128/25 dev qg-9ca358b2-a2 proto 18 scope link
x.x.x.128/25 dev qg-9ca358b2-a2 proto static scope link

With that, the initial problem we described here is still present. If we now restart the l3 agent, it fails to update its state from standby to master, because the ip route replace [1] that is executed on state change fails with an `RTNETLINK answers: File exists` error.

Neutron writes the following stacktrace for the error:
Traceback (most recent call last):
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 461, in fire_timers
    timer()
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/hubs/timer.py", line 59, in __call__
    cb(*args, **kw)
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/eventlet/semaphore.py", line 147, in _do_acquire
    waiter.switch()
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/l3/ha.py", line 168, in _enqueue_state_change
    ri.set_external_gw_port_link_status(link_up=True, set_gw=True)
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/l3/ha_router.py", line 559, in set_external_gw_port_link_status
    ns_name, preserve_ips)
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/l3/router_info.py", line 799, in _external_gateway_settings
    device.route.add_gateway(ip)
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/linux/ip_lib.py", line 607, in add_gateway
    scope=scope)
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/linux/ip_lib.py", line 645, in add_route
    table=table, metric=metric, scope=scope, **kwargs)
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/neutron/agent/linux/ip_lib.py", line 1508, in add_ip_route
    metric=metric, scope=scope, **kwargs)
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 247, in _wrap
    return self.channel.remote_call(name, args, kwargs)
  File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 224, in remote_call
    raise exc_type(*result[2])
pyroute2.netlink.exceptions.NetlinkError: (17, 'File exists')

We will now test our old workaround again and patch the keepalived config to set proto static on all of its routes.
But the HA router implementation seems to have a general problem with duplicated routes that we would need to think about.

[1] https://github.com/openstack/neutron/blob/16.4.2/neutron/privileged/agent/linux/ip_lib.py#L732

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/865525

Changed in neutron:
status: New → In Progress
Changed in neutron:
importance: Low → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/865525
Committed: https://opendev.org/openstack/neutron/commit/c813b658d0e5e0c00093f90f849fcf67ddca16cf
Submitter: "Zuul (22348)"
Branch: master

commit c813b658d0e5e0c00093f90f849fcf67ddca16cf
Author: Maximilian Stinsky <email address hidden>
Date: Thu Nov 24 11:40:04 2022 +0100

    Fix duplicated routes exceptions

    Since the train release neutron adds routes with protocol static.
    Keepalived also adds the same routes with different protocols depending
    on the keepalived version. This can result in duplicated routes inside
    network namespaces. On l3 agent restarts those duplicate routes
    then prevent the l3 agent from updating its router state
    because it runs into 'RTNETLINK answers: File exists expections'
    when it tries to execute 'ip route' commands.

    This patch adds the same protocol static to each virtual route of
    keepalived's configuration so network namespaces do not run into
    duplicated routes anymore.

    Closes-Bug: #1956846
    Change-Id: Ic35b5d4b9110b832c10345c45ec62c0923237cfd

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/neutron/+/865893

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/865894

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/865895

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/865896

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/865893
Committed: https://opendev.org/openstack/neutron/commit/9e02a10bed4a3b5a41468fa8ff14b613c798a06a
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 9e02a10bed4a3b5a41468fa8ff14b613c798a06a
Author: Maximilian Stinsky <email address hidden>
Date: Thu Nov 24 11:40:04 2022 +0100

    Fix duplicated routes exceptions

    Since the train release neutron adds routes with protocol static.
    Keepalived also adds the same routes with different protocols depending
    on the keepalived version. This can result in duplicated routes inside
    network namespaces. On l3 agent restarts those duplicate routes
    then prevent the l3 agent from updating its router state
    because it runs into 'RTNETLINK answers: File exists expections'
    when it tries to execute 'ip route' commands.

    This patch adds the same protocol static to each virtual route of
    keepalived's configuration so network namespaces do not run into
    duplicated routes anymore.

    Closes-Bug: #1956846
    Change-Id: Ic35b5d4b9110b832c10345c45ec62c0923237cfd
    (cherry picked from commit c813b658d0e5e0c00093f90f849fcf67ddca16cf)

tags: added: in-stable-zed
tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/865894
Committed: https://opendev.org/openstack/neutron/commit/0e427ecc495404e3c2dcb1e32479b89acb8a6b76
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 0e427ecc495404e3c2dcb1e32479b89acb8a6b76
Author: Maximilian Stinsky <email address hidden>
Date: Thu Nov 24 11:40:04 2022 +0100

    Fix duplicated routes exceptions

    Since the train release neutron adds routes with protocol static.
    Keepalived also adds the same routes with different protocols depending
    on the keepalived version. This can result in duplicated routes inside
    network namespaces. On l3 agent restarts those duplicate routes
    then prevent the l3 agent from updating its router state
    because it runs into 'RTNETLINK answers: File exists expections'
    when it tries to execute 'ip route' commands.

    This patch adds the same protocol static to each virtual route of
    keepalived's configuration so network namespaces do not run into
    duplicated routes anymore.

    Closes-Bug: #1956846
    Change-Id: Ic35b5d4b9110b832c10345c45ec62c0923237cfd
    (cherry picked from commit c813b658d0e5e0c00093f90f849fcf67ddca16cf)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/865895
Committed: https://opendev.org/openstack/neutron/commit/7559a2ea827b6f36ece8029cb3fceff0308fee03
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 7559a2ea827b6f36ece8029cb3fceff0308fee03
Author: Maximilian Stinsky <email address hidden>
Date: Thu Nov 24 11:40:04 2022 +0100

    Fix duplicated routes exceptions

    Since the train release neutron adds routes with protocol static.
    Keepalived also adds the same routes with different protocols depending
    on the keepalived version. This can result in duplicated routes inside
    network namespaces. On l3 agent restarts those duplicate routes
    then prevent the l3 agent from updating its router state
    because it runs into 'RTNETLINK answers: File exists expections'
    when it tries to execute 'ip route' commands.

    This patch adds the same protocol static to each virtual route of
    keepalived's configuration so network namespaces do not run into
    duplicated routes anymore.

    Closes-Bug: #1956846
    Change-Id: Ic35b5d4b9110b832c10345c45ec62c0923237cfd
    (cherry picked from commit c813b658d0e5e0c00093f90f849fcf67ddca16cf)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/865896
Committed: https://opendev.org/openstack/neutron/commit/5efb87d38ae64e4f4e24ca1e131dfe949ac3d718
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 5efb87d38ae64e4f4e24ca1e131dfe949ac3d718
Author: Maximilian Stinsky <email address hidden>
Date: Thu Nov 24 11:40:04 2022 +0100

    Fix duplicated routes exceptions

    Since the train release neutron adds routes with protocol static.
    Keepalived also adds the same routes with different protocols depending
    on the keepalived version. This can result in duplicated routes inside
    network namespaces. On l3 agent restarts those duplicate routes
    then prevent the l3 agent from updating its router state
    because it runs into 'RTNETLINK answers: File exists expections'
    when it tries to execute 'ip route' commands.

    This patch adds the same protocol static to each virtual route of
    keepalived's configuration so network namespaces do not run into
    duplicated routes anymore.

    Closes-Bug: #1956846
    Change-Id: Ic35b5d4b9110b832c10345c45ec62c0923237cfd
    (cherry picked from commit c813b658d0e5e0c00093f90f849fcf67ddca16cf)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.5.0

This issue was fixed in the openstack/neutron 19.5.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 22.0.0.0rc1

This issue was fixed in the openstack/neutron 22.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.3.0

This issue was fixed in the openstack/neutron 20.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 21.1.0

This issue was fixed in the openstack/neutron 21.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron wallaby-eom

This issue was fixed in the openstack/neutron wallaby-eom release.
