DVR HA router gets stuck in backup state

Bug #2003359 reported by Anton Kurbatov
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
New
Medium
Unassigned

Bug Description

We found the issue when a created HA DVR router gets stuck in the backup state and does not go into primary state.
Preconditions:
1) there is no router with a specific external network yet
2) the router needs to go through a quick creation->deletion, and then the next creation of the router can get stuck in the backup state

The reason for such behavior is not removed fip-ns on the agent while the floatingip_agent_gateway port was removed.
Further is a demo with the help of which I managed to reproduce this behavior on a single node devstack setup with.

Сreate a router and quickly delete it while the l3 agent processes the external GW adding:

[root@devstack ~]# r_id=$(openstack router create r1 --distributed --ha -c id -f value); sleep 30 # give time to process
[root@devstack ~]# count_fip_requests() { journalctl -u <email address hidden> | grep 'FloatingIP agent gateway port received' | wc -l; }
[root@devstack ~]# # add an external gateway and then delete the router while the agent processes gw
[root@devstack ~]# fip_requests=$(count_fip_requests); openstack router set $r_id --external-gateway public; while :; do [[ $fip_requests == $(count_fip_requests) ]] && { echo "waiting before deletion..."; sleep 1; } || break; done; openstack router delete $r_id
waiting before deletion...
waiting before deletion...
[root@devstack ~]#

As a result fip-ns is not deleted even though the floatingip_agent_gateway port was removed:

[root@devstack ~]# ip netns
fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 1)
[root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
<empty>
[root@devstack ~]#

Re-create the router together with external gw now:

[root@devstack ~]# openstack router create r1 --ha --distributed --external-gateway public

In the logs, one can see a traceback that the creation of this router failed initially, followed by a successful creation:

ERROR neutron.agent.l3.dvr_fip_ns Traceback (most recent call last):
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 152, in create_or_update_gateway_port
ERROR neutron.agent.l3.dvr_fip_ns self._update_gateway_port(
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 323, in _update_gateway_port
ERROR neutron.agent.l3.dvr_fip_ns self.driver.set_onlink_routes(
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 193, in set_onlink_routes
ERROR neutron.agent.l3.dvr_fip_ns onlink = device.route.list_onlink_routes(constants.IP_VERSION_4)
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 633, in list_onlink_routes
ERROR neutron.agent.l3.dvr_fip_ns routes = self.list_routes(ip_version, scope='link')
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 629, in list_routes
ERROR neutron.agent.l3.dvr_fip_ns return list_ip_routes(self._parent.namespace, ip_version, scope=scope,
ERROR neutron.agent.l3.dvr_fip_ns File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 1585, in list_ip_routes
ERROR neutron.agent.l3.dvr_fip_ns routes = privileged.list_ip_routes(namespace, ip_version, device=device,
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in wrapped_f
ERROR neutron.agent.l3.dvr_fip_ns return self(f, *args, **kw)
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in __call__
ERROR neutron.agent.l3.dvr_fip_ns do = self.iter(retry_state=retry_state)
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 360, in iter
ERROR neutron.agent.l3.dvr_fip_ns return fut.result()
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result
ERROR neutron.agent.l3.dvr_fip_ns return self.__get_result()
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
ERROR neutron.agent.l3.dvr_fip_ns raise self._exception
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in __call__
ERROR neutron.agent.l3.dvr_fip_ns result = fn(*args, **kwargs)
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap
ERROR neutron.agent.l3.dvr_fip_ns return self.channel.remote_call(name, args, kwargs,
ERROR neutron.agent.l3.dvr_fip_ns File "/usr/local/lib/python3.9/site-packages/oslo_privsep/daemon.py", line 215, in remote_call
ERROR neutron.agent.l3.dvr_fip_ns raise exc_type(*result[2])
ERROR neutron.agent.l3.dvr_fip_ns neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound: Network interface fg-b489f216-35not found in namespace fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f.

The result is the following state:

[root@devstack ~]# ip netns
fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 2)
qrouter-1f384e52-533c-49ed-b809-71f6358a2e5b
snat-1f384e52-533c-49ed-b809-71f6358a2e5b (id: 1)
[root@devstack ~]# openstack port list --network public -c ID -c device_owner -c status --long
+--------------------------------------+----------------------------------+--------+
| ID | Device Owner | Status |
+--------------------------------------+----------------------------------+--------+
| 17679644-d775-4182-b5b3-f2035e6483d9 | network:router_gateway | DOWN |
| b489f216-356a-456a-82ab-849e43a3226d | network:floatingip_agent_gateway | ACTIVE |
+--------------------------------------+----------------------------------+--------+
[root@devstack ~]#
[root@devstack ~]# cat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/state
backup
[root@devstack ~]# stat /opt/stack/data/neutron/ha_confs/1f384e52-533c-49ed-b809-71f6358a2e5b/neutron-keepalived-state-change.log
...
Access: 2023-01-19 11:10:10.715245690 -0500
Modify: 2023-01-19 11:10:18.976208238 -0500
Change: 2023-01-19 11:10:18.976208238 -0500
 Birth: 2023-01-19 11:10:10.715245690 -0500
[root@devstack ~]# stat /var/run/netns/snat-1f384e52-533c-49ed-b809-71f6358a2e5b
...
Access: 2023-01-19 11:10:18.499210400 -0500
Modify: 2023-01-19 11:10:18.499210400 -0500
Change: 2023-01-19 11:10:18.499210400 -0500
 Birth: -
[root@devstack ~]#

By timestamp we can see that a keepalived monitoring started to work before the snat-ns was re-created after unsuccessful first attempt to create a router.
So, it looks like the keepalived monitoring is still bound to the deleted snat-ns that was created on the previously unsuccessful attempt to create the router.

Adding an external gw and removing a router has a race condition and it's not always possible to get 100% reproduction. To achieve 100% reproduction, just add a small sleep with the following patch:

[root@devstack neutron]# git diff
diff --git a/neutron/agent/l3/dvr_local_router.py b/neutron/agent/l3/dvr_local_router.py
index 6e37c09511..d01eb0de9b 100644
--- a/neutron/agent/l3/dvr_local_router.py
+++ b/neutron/agent/l3/dvr_local_router.py
@@ -837,6 +837,8 @@ class DvrLocalRouter(dvr_router_base.DvrRouterBase):
                 self.agent.context, ex_gw_port['network_id'])
             LOG.debug("FloatingIP agent gateway port received from the "
                       "plugin: %s", fip_agent_port)
+ import time
+ time.sleep(5)
         self.fip_ns.create_or_update_gateway_port(fip_agent_port)

     def update_routing_table(self, operation, route):
[root@devstack neutron]#

Tags: l3-ha
description: updated
tags: added: l3-ha
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
Lajos Katona (lajos-katona) wrote :

Hi, thanks for the detailed reproduction and explanation.
I tried to reproduce the issue with fresh devstack (I tried it several times) but for me everything was ok, could you give some more details?

Revision history for this message
Anton Kurbatov (akurbatov) wrote :
Download full text (4.2 KiB)

Hello Lajos,
I just double-checked on a fresh devstack installation (single node stand).
I'm running devstack on CentOS-Stream-GenericCloud-9, but looks like it's not so important.
My devstack local.conf:

[root@devstack0 ~]# cat ~cloud-user/devstack/local.conf
[[local|localrc]]
ADMIN_PASSWORD=<hidden>
DATABASE_PASSWORD=${ADMIN_PASSWORD}
RABBIT_PASSWORD=${ADMIN_PASSWORD}
SERVICE_PASSWORD=${ADMIN_PASSWORD}

ENABLED_SERVICES=key
ENABLED_SERVICES+=,n-api,n-cond,n-sch,n-api-meta,n-cpu
ENABLED_SERVICES+=,placement-api,placement-client
ENABLED_SERVICES+=,q-svc,q-agt,q-l3,q-meta,q-dhcp
ENABLED_SERVICES+=,c-sch,c-api,c-vol
ENABLED_SERVICES+=,g-api
ENABLED_SERVICES+=,rabbit,mysql

Q_DVR_MODE=dvr_snat
Q_ML2_TENANT_NETWORK_TYPE=vxlan
Q_ML2_PLUGIN_MECHANISM_DRIVERS=openvswitch
Q_AGENT=openvswitch
PUBLIC_INTERFACE=eth1
NEUTRON_CREATE_INITIAL_NETWORKS=False

LOGFILE=${DEST}/logs/stack.sh.log
[root@devstack0 ~]#

Then, to avoid guessing at what point to remove the router, I made a delay at this point:

[root@devstack0 ~]# cd /opt/stack/neutron; git log --oneline | head -1; git diff; cd -
e6eeb700c8 Merge "Honor debug mode in keepalived-state-change script logs"
diff --git a/neutron/agent/l3/dvr_local_router.py b/neutron/agent/l3/dvr_local_router.py
index e12cc1a0ab..822d9d6a86 100644
--- a/neutron/agent/l3/dvr_local_router.py
+++ b/neutron/agent/l3/dvr_local_router.py
@@ -836,6 +836,8 @@ class DvrLocalRouter(dvr_router_base.DvrRouterBase):
                 self.agent.context, ex_gw_port['network_id'])
             LOG.debug("FloatingIP agent gateway port received from the "
                       "plugin: %s", fip_agent_port)
+ import time
+ time.sleep(5)
         self.fip_ns.create_or_update_gateway_port(fip_agent_port)

     def update_routing_table(self, operation, route):
/root
[root@devstack0 ~]# systemctl restart <email address hidden>
[root@devstack0 ~]#

and repeated the commands from the description:

[root@devstack0 ~]# source ~cloud-user/devstack/openrc admin
[root@devstack0 ~]# openstack network create public --provider-network-type flat --provider-physical-network public --external
[root@devstack0 ~]# openstack subnet create public --network public --subnet-range 10.136.16.0/20 --no-dhcp
[root@devstack0 ~]# r_id=$(openstack router create r1 --distributed --ha -c id -f value);
[root@devstack0 ~]# count_fip_requests() { journalctl -u <email address hidden> | grep 'FloatingIP agent gateway port received' | wc -l; }
[root@devstack0 ~]# fip_requests=$(count_fip_requests); openstack router set $r_id --external-gateway public; while :; do [[ $fip_requests == $(count_fip_requests) ]] && { echo "waiting before deletion..."; sleep 1; } || break; done; openstack router delete $r_id
waiting before deletion...
waiting before deletion...
[root@devstack0 ~]#
[root@devstack0 ~]# ip netns
fip-aea344ef-1a3b-4256-8b74-c04cc9603b9d (id: 1)
[root@devstack0 ~]# openstack port list --network public

[root@devstack0 ~]# openstack router create r1 --ha --distributed --external-gateway public
[root@devstack0 ~]# neutron l3-agent-list-hosting-router r1
neutron CLI is deprecated and will be removed in the Z cycle. Use openstack CLI instead.
The python ...

Read more...

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.