Activity log for bug #2003359

Date Who What changed Old value New value Message
2023-01-19 16:38:56 Anton Kurbatov bug added bug
2023-01-19 16:51:01 Anton Kurbatov description We found the issue when a created HA DVR router gets stuck in the backup state and does not go into primary state. Preconditions: 1) there is no router with a specific external network yet 2) the router needs to go through a quick creation->deletion, and then the next creation of the router can get stuck in the backup state The reason for such behavior is not removed fip-ns on the agent while the floatingip_agent_gateway port was removed. Further is a demo with the help of which I managed to reproduce this behavior on a single node devstack setup with. Сreate a router and quickly delete it while the l3 agent processes the external GW adding: [root@devstack ~]# r_id=$(openstack router create r1 --distributed --ha -c id -f value); sleep 30 # give time to process [root@devstack ~]# count_fip_requests() { journalctl -u devstack@q-l3.service | grep 'FloatingIP agent gateway port received' | wc -l; } [root@devstack ~]# # add an external gateway and then delete the router while the agent processes gw [root@devstack ~]# fip_requests=$(count_fip_requests); openstack router set $r_id --external-gateway public; while :; do [[ $fip_requests == $(count_fip_requests) ]] && { echo "waiting before deletion..."; sleep 1; } || break; done; openstack router delete $r_id waiting before deletion... waiting before deletion... [root@devstack ~]# As a result fip-ns is not deleted even though the floatingip_agent_gateway port was removed: [root@devstack ~]# ip netns fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 1) [root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long <empty> [root@devstack ~]# Re-create the router together with external gw now: [root@devstack ~]# openstack router create r1 --ha --distributed --external-gateway public In the logs, one can see a traceback that the creation of this router failed initially, followed by a successful creation: ERROR neutron.agent.l3.dvr_fip_ns Traceback (most recent call last): ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 152, in create_or_update_gateway_port ERROR neutron.agent.l3.dvr_fip_ns self._update_gateway_port( ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 323, in _update_gateway_port ERROR neutron.agent.l3.dvr_fip_ns self.driver.set_onlink_routes( ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 193, in set_onlink_routes ERROR neutron.agent.l3.dvr_fip_ns onlink = device.route.list_onlink_routes(constants.IP_VERSION_4) ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 633, in list_onlink_routes ERROR neutron.agent.l3.dvr_fip_ns routes = self.list_routes(ip_version, scope='link') ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 629, in list_routes ERROR neutron.agent.l3.dvr_fip_ns return list_ip_routes(self._parent.namespace, ip_version, scope=scope, ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 1585, in list_ip_routes ERROR neutron.agent.l3.dvr_fip_ns routes = privileged.list_ip_routes(namespace, ip_version, device=device, ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in wrapped_f ERROR neutron.agent.l3.dvr_fip_ns return self(f, *args, **kw) ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in __call__ ERROR neutron.agent.l3.dvr_fip_ns do = self.iter(retry_state=retry_state) ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 360, in iter ERROR neutron.agent.l3.dvr_fip_ns return fut.result() ERROR neutron.agent.l3.dvr_fip_ns File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result ERROR neutron.agent.l3.dvr_fip_ns return self.__get_result() ERROR neutron.agent.l3.dvr_fip_ns File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result ERROR neutron.agent.l3.dvr_fip_ns raise self._exception ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in __call__ ERROR neutron.agent.l3.dvr_fip_ns result = fn(*args, **kwargs) ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap ERROR neutron.agent.l3.dvr_fip_ns return self.channel.remote_call(name, args, kwargs, ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/oslo_privsep/daemon.py", line 215, in remote_call ERROR neutron.agent.l3.dvr_fip_ns raise exc_type(*result[2]) ERROR neutron.agent.l3.dvr_fip_ns neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound: Network interface fg-b489f216-35not found in namespace fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f. The result is the following state: [root@devstack ~]# ip netns fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 2) qrouter-1f384e52-533c-49ed-b809-71f6358a2e5b snat-1f384e52-533c-49ed-b809-71f6358a2e5b (id: 1) [root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long +--------------------------------------+----------------------------------+--------+ | ID | Device Owner | Status | +--------------------------------------+----------------------------------+--------+ | 17679644-d775-4182-b5b3-f2035e6483d9 | network:router_gateway | DOWN | | b489f216-356a-456a-82ab-849e43a3226d | network:floatingip_agent_gateway | ACTIVE | +--------------------------------------+----------------------------------+--------+ [root@devstack ~]# [root@devstack ~]# cat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/state backup [root@devstack ~]# stat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/neutron-keepalived-state-change.log ... Access: 2023-01-19 11:10:10.715245690 -0500 Modify: 2023-01-19 11:10:18.976208238 -0500 Change: 2023-01-19 11:10:18.976208238 -0500 Birth: 2023-01-19 11:10:10.715245690 -0500 [root@devstack ~]# stat /var/run/netns/fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f ... Access: 2023-01-19 11:10:19.533205713 -0500 Modify: 2023-01-19 11:10:19.533205713 -0500 Change: 2023-01-19 11:10:19.533205713 -0500 Birth: - [root@devstack ~]# By timestamp we can see that a keepalived monitoring started to work before the fip-ns was re-created after unsuccessful first attempt to create a router. So keepalived monitoring is still bound to the FIP-ns that was created on the previously stuck namespace. Adding an external gw and removing a router has a race condition and it's not always possible to get 100% reproduction. To achieve 100% reproduction, just add a small sleep with the following patch: [root@devstack neutron]# git diff diff --git a/neutron/agent/l3/dvr_local_router.py b/neutron/agent/l3/dvr_local_router.py index 6e37c09511..d01eb0de9b 100644 --- a/neutron/agent/l3/dvr_local_router.py +++ b/neutron/agent/l3/dvr_local_router.py @@ -837,6 +837,8 @@ class DvrLocalRouter(dvr_router_base.DvrRouterBase): self.agent.context, ex_gw_port['network_id']) LOG.debug("FloatingIP agent gateway port received from the " "plugin: %s", fip_agent_port) + import time + time.sleep(5) self.fip_ns.create_or_update_gateway_port(fip_agent_port) def update_routing_table(self, operation, route): [root@devstack neutron]# We found the issue when a created HA DVR router gets stuck in the backup state and does not go into primary state. Preconditions: 1) there is no router with a specific external network yet 2) the router needs to go through a quick creation->deletion, and then the next creation of the router can get stuck in the backup state The reason for such behavior is not removed fip-ns on the agent while the floatingip_agent_gateway port was removed. Further is a demo with the help of which I managed to reproduce this behavior on a single node devstack setup with. Сreate a router and quickly delete it while the l3 agent processes the external GW adding: [root@devstack ~]# r_id=$(openstack router create r1 --distributed --ha -c id -f value); sleep 30 # give time to process [root@devstack ~]# count_fip_requests() { journalctl -u devstack@q-l3.service | grep 'FloatingIP agent gateway port received' | wc -l; } [root@devstack ~]# # add an external gateway and then delete the router while the agent processes gw [root@devstack ~]# fip_requests=$(count_fip_requests); openstack router set $r_id --external-gateway public; while :; do [[ $fip_requests == $(count_fip_requests) ]] && { echo "waiting before deletion..."; sleep 1; } || break; done; openstack router delete $r_id waiting before deletion... waiting before deletion... [root@devstack ~]# As a result fip-ns is not deleted even though the floatingip_agent_gateway port was removed: [root@devstack ~]# ip netns fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 1) [root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long <empty> [root@devstack ~]# Re-create the router together with external gw now: [root@devstack ~]# openstack router create r1 --ha --distributed --external-gateway public In the logs, one can see a traceback that the creation of this router failed initially, followed by a successful creation: ERROR neutron.agent.l3.dvr_fip_ns Traceback (most recent call last): ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 152, in create_or_update_gateway_port ERROR neutron.agent.l3.dvr_fip_ns self._update_gateway_port( ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 323, in _update_gateway_port ERROR neutron.agent.l3.dvr_fip_ns self.driver.set_onlink_routes( ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 193, in set_onlink_routes ERROR neutron.agent.l3.dvr_fip_ns onlink = device.route.list_onlink_routes(constants.IP_VERSION_4) ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 633, in list_onlink_routes ERROR neutron.agent.l3.dvr_fip_ns routes = self.list_routes(ip_version, scope='link') ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 629, in list_routes ERROR neutron.agent.l3.dvr_fip_ns return list_ip_routes(self._parent.namespace, ip_version, scope=scope, ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 1585, in list_ip_routes ERROR neutron.agent.l3.dvr_fip_ns routes = privileged.list_ip_routes(namespace, ip_version, device=device, ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in wrapped_f ERROR neutron.agent.l3.dvr_fip_ns return self(f, *args, **kw) ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in __call__ ERROR neutron.agent.l3.dvr_fip_ns do = self.iter(retry_state=retry_state) ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 360, in iter ERROR neutron.agent.l3.dvr_fip_ns return fut.result() ERROR neutron.agent.l3.dvr_fip_ns File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result ERROR neutron.agent.l3.dvr_fip_ns return self.__get_result() ERROR neutron.agent.l3.dvr_fip_ns File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result ERROR neutron.agent.l3.dvr_fip_ns raise self._exception ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in __call__ ERROR neutron.agent.l3.dvr_fip_ns result = fn(*args, **kwargs) ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap ERROR neutron.agent.l3.dvr_fip_ns return self.channel.remote_call(name, args, kwargs, ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/oslo_privsep/daemon.py", line 215, in remote_call ERROR neutron.agent.l3.dvr_fip_ns raise exc_type(*result[2]) ERROR neutron.agent.l3.dvr_fip_ns neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound: Network interface fg-b489f216-35not found in namespace fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f. The result is the following state: [root@devstack ~]# ip netns fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 2) qrouter-1f384e52-533c-49ed-b809-71f6358a2e5b snat-1f384e52-533c-49ed-b809-71f6358a2e5b (id: 1) [root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long +--------------------------------------+----------------------------------+--------+ | ID | Device Owner | Status | +--------------------------------------+----------------------------------+--------+ | 17679644-d775-4182-b5b3-f2035e6483d9 | network:router_gateway | DOWN | | b489f216-356a-456a-82ab-849e43a3226d | network:floatingip_agent_gateway | ACTIVE | +--------------------------------------+----------------------------------+--------+ [root@devstack ~]# [root@devstack ~]# cat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/state backup [root@devstack ~]# stat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/neutron-keepalived-state-change.log ... Access: 2023-01-19 11:10:10.715245690 -0500 Modify: 2023-01-19 11:10:18.976208238 -0500 Change: 2023-01-19 11:10:18.976208238 -0500 Birth: 2023-01-19 11:10:10.715245690 -0500 [root@devstack ~]# stat /var/run/netns/snat-1f384e52-533c-49ed-b809-71f6358a2e5b ... Access: 2023-01-19 11:10:18.499210400 -0500 Modify: 2023-01-19 11:10:18.499210400 -0500 Change: 2023-01-19 11:10:18.499210400 -0500 Birth: - [root@devstack ~]# By timestamp we can see that a keepalived monitoring started to work before the snat-ns was re-created after unsuccessful first attempt to create a router. So, it looks like the keepalived monitoring is still bound to the deleted snat-ns that was created on the previously unsuccessful attempt to create the router. Adding an external gw and removing a router has a race condition and it's not always possible to get 100% reproduction. To achieve 100% reproduction, just add a small sleep with the following patch: [root@devstack neutron]# git diff diff --git a/neutron/agent/l3/dvr_local_router.py b/neutron/agent/l3/dvr_local_router.py index 6e37c09511..d01eb0de9b 100644 --- a/neutron/agent/l3/dvr_local_router.py +++ b/neutron/agent/l3/dvr_local_router.py @@ -837,6 +837,8 @@ class DvrLocalRouter(dvr_router_base.DvrRouterBase): self.agent.context, ex_gw_port['network_id']) LOG.debug("FloatingIP agent gateway port received from the " "plugin: %s", fip_agent_port) + import time + time.sleep(5) self.fip_ns.create_or_update_gateway_port(fip_agent_port) def update_routing_table(self, operation, route): [root@devstack neutron]#
2023-01-23 15:39:08 Jakub Libosvar tags l3-ha
2023-01-23 15:39:21 Jakub Libosvar neutron: importance Undecided Medium
2023-01-24 14:08:58 Brian Haley bug added subscriber Brian Haley