DVR HA Router Update Error

Bug #1957189 reported by Yusuf Güngör
This bug affects 2 people
Affects: neutron
Status: New
Importance: Medium
Assigned to: Unassigned

Bug Description

Hi, we are getting the error below when removing a tenant network port from a router.

 - pyroute2.netlink.exceptions.NetlinkError: (3, 'No such process')
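
For context (this is our own reading, not something from the agent logs themselves): errno 3 is ESRCH, "No such process", which is what the kernel returns when you try to delete a route or rule that does not actually exist in the namespace. For example, with documentation addresses:

 $ ip route del 203.0.113.0/24 via 192.0.2.1
 RTNETLINK answers: No such process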

This situation happens with the following scenario:

  1. Create a subnet from a "subnet pool" with a custom CIDR prefix.
  2. Add an interface from this subnet to the router. (After adding, both the router_interface_distributed and the router_centralized_snat port are created on the router.)
  3. If the network is not used anymore, we try to delete it. (First we delete the instances.)
  4. Remove the added interface for this subnet from the router.
  5. Remove the subnet.
  6. Now try to create a subnet from the same "subnet pool" with the same CIDR. We get back the same CIDR as the subnet which was deleted before.
  7. Add an interface from this subnet to the router. (After adding, only the router_interface_distributed port is created on the router.)
  8. Create an instance on a network which uses only that tenant subnet. Instance DNS queries are not working! (This is the step at which we recognized that something was wrong.)
  9. Delete the created instances. (The instance create/delete step is optional.)
  10. Now try removing the added interface for this subnet from the router. We now get the "pyroute2.netlink.exceptions.NetlinkError" error on every "router update" on the L3 agent until the router update retry limit is hit.

  Until we restart the Neutron L3 agent on all controller nodes:
    - We always get that error when adding an interface from that subnet to the router. When we get this error, the port is shown as DOWN on the router page and router_centralized_snat does not exist.
    - We always get the same error when deleting the interface which we already added.

  11. Restart the Neutron L3 agents and everything is OK again.

Do you have any idea about this situation? We are using OVS and DVR.
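
For reference, a rough CLI sketch of steps 4-7 above (router, network, subnet and pool names are just placeholders, not our exact values):

 # Step 4: remove the interface for the tenant subnet from the router
 openstack router remove subnet test-router-01 tenant-subnet-01
 # Step 5: remove the subnet
 openstack subnet delete tenant-subnet-01
 # Step 6: recreate from the same pool; the same CIDR is allocated again
 openstack subnet create tenant-subnet-01 --network tenant-net-01 \
   --subnet-pool test-tenant-subnet-pool-01 --prefix-length 26
 # Step 7: re-add the interface; only router_interface_distributed appears
 openstack router add subnet test-router-01 tenant-subnet-01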

After adding a subnet to the router, why do we sometimes see only one of these ports?
  - network:router_interface_distributed (always exists after attaching the subnet to the router)
  - network:router_centralized_snat (only sometimes exists after attaching the subnet to the router)
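
(To check which of these two ports actually exist we list the router's ports; the router name below is just an example. With --long the Device Owner column is shown, and the second command filters on it directly:)

 $ openstack port list --router test-router-01 --long
 $ openstack port list --device-owner network:router_centralized_snat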

Environment Details:
 OpenStack Victoria cluster installed via kolla-ansible on Ubuntu 20.04.2 LTS hosts. (Kernel: 5.4.0-80-generic)
 There are 5 controller+network nodes.
 "neutron-openvswitch-agent", "neutron-l3-agent" and "neutron-server" version is "17.2.2.dev46".
 Open vSwitch is used in DVR mode with router HA configured. (l3_ha = true)
 We are using a single centralized neutron router to connect all tenant networks to the provider network.
 We are using bgp_dragent to announce unique tenant networks.
 Tenant network type: vxlan
 External network type: vlan

Revision history for this message
Yusuf Güngör (yusuf2) wrote :
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

I deployed an environment with 3 HA controllers and 2 compute nodes. The L3 agent mode of the controllers was "dvr_snat" and the compute nodes "dvr".

I executed the following commands: https://paste.opendev.org/show/812228/

When the subnet was added to the router, the routes in the router namespace on the controllers were:
[root@controller-0 ~]# ip netns exec $router ip r
10.10.0.0/26 dev qr-b66d5313-71 proto kernel scope link src 10.10.0.1
169.254.107.94/31 dev rfp-72922eaf-3 proto kernel scope link src 169.254.107.94

When the subnet was deleted:
[root@controller-0 ~]# ip netns exec $router ip r
169.254.107.94/31 dev rfp-72922eaf-3 proto kernel scope link src 169.254.107.94

The route for the subnet CIDR was correctly added and removed and no error was found in the L3 agent logs, thus I was not able to reproduce the error you reported. Did I miss something in my deployment?

BTW, I was using Wallaby.

Regards.

Revision history for this message
Yusuf Güngör (yusuf2) wrote (last edit ):

Hi Rodolfo, sorry for the late response.

Thank you for your test.

We have observed something interesting.

Yes, the L3 agent mode of the controllers is "dvr_snat" and of the compute nodes "dvr". We also upgraded to Wallaby, but the issue still persists.

Can you try it with a vlan provider network attached as the GW to the router? The provider network subnet pool and the tenant network subnet pool also reside in the same address scope, in order to use BGP. Details: https://docs.openstack.org/neutron/wallaby/admin/config-bgp-dynamic-routing.html

You are right, in our case the qrouter netns has the routes, but it does not have the ip rules ("$ ip rule").

Also, the fip netns does not have the routes at all!

Assume there are 5 network nodes, max_l3_agents_per_router: 3, and a DVR HA router running on network nodes 01, 02 and 03. The router has a vlan provider network as GW. The provider network and the tenant network have subnets from two different subnet pools which are in the same address scope.

When a vxlan tenant network is attached to the router, the tenant network CIDR route is updated in the fip netns on the network nodes which do not host the router (network nodes 04 and 05), but it is not updated on network nodes 01, 02 and 03 (which do host the router). We have seen the "Starting router update for..." and "Finished a router update..." logs on the L3 agents of all network nodes, but somehow the L3 agent skips adding the qrouter netns ip rules and the fip netns ip routes for the attached tenant network.

! The first network attachment to the router after router creation may succeed, but subsequent attach operations run into this situation.

Restarting the L3 agent fixes all of these issues!

Also, when detaching the network, we get errors because the L3 agent tries to remove routes which were never added.

We are using DVR, address scopes and BGP. Only the DNS requests of instances are NAT'ted over the controller nodes. If all of the DHCP agents reside on network nodes which do not have those fip netns routes, then our instances' DNS queries fail.
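
The checks we run on each controller/network node to compare the namespaces look roughly like this (router ID, external network ID and CIDR are just examples):

 # ip netns exec qrouter-<router-id> ip rule        <- should contain "from <tenant-cidr> lookup <table>"
 # ip netns exec fip-<ext-net-id> ip route          <- should contain "<tenant-cidr> via 169.254.x.x dev fpr-<router-id>"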

Scenario

BGP Provider Network
  10.10.10.0/24

BGP Tenant Network
  10.10.12.0/24

# openstack address scope create --ip-version 4 test-scope
# openstack subnet pool create --pool-prefix 10.10.10.0/24 --address-scope test-scope test-provider-subnet-pool-01
# openstack subnet pool create --pool-prefix 10.10.12.0/24 --address-scope test-scope test-tenant-subnet-pool-01

# Create Provider Network and Subnet
# openstack network create \
--external \
--provider-network-type vlan \
--provider-physical-network physnet1 \
--provider-segment 118 test-provider-net-01

# openstack subnet create test-provider-subnet-01 \
--network test-provider-net-01 \
--subnet-pool test-provider-subnet-pool-01 \
--prefix-length 24

===> We also change the allocation pool and the default GW after subnet creation, because some of these IPs are used by routers (for BGP):
    --allocation-pool start=10.10.10.20,end=10.10.10.253
    GW: 10.10.10.12

# Create vxlan tenant network and subnet
# openstack network create tenant-vxlan-net
# openstack subnet create tenant-vxlan-subnet-01 \
--network tenant-vxlan-net \
--subnet-pool test-tenant-subnet-pool-01 \
--prefix-length 26

# Create a router and set external gw as provider network. Also attach test-vxlan ...
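
(Roughly, assuming the neutron.conf defaults make the router distributed and HA; the router name is just an example:)

# openstack router create test-router-01
# openstack router set --external-gateway test-provider-net-01 test-router-01
# openstack router add subnet test-router-01 tenant-vxlan-subnet-01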


Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Hi,

As a workaround, we can add the ip rule to the qrouter netns and the ip route to the fip netns manually.

qrouter-netns# ip rule add pref 80000 from 10.216.12.112/28 table 16

fip-netns# ip route add 10.216.12.112/28 via 169.254.88.134 dev fpr-9ebc2eb1-c proto static

Then our instances can resolve domains and there is no more exception when detaching networks from the router.

Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Hi again,

If we do not use address scopes, the ip rules are added to the qrouter netns. The ip routes are still not added to the fip netns, but that is probably OK.

Yusuf Güngör (yusuf2)
summary: - DVR Router Update Error
+ DVR HA Router Update Error
Revision history for this message
Yusuf Güngör (yusuf2) wrote :

The Red Hat OSP 16.0 "Configuring distributed virtual routing (DVR)" documentation states that:

- DVR is not supported in conjunction with L3 HA. If you use DVR with Red Hat OpenStack Platform 16.0 director, L3 HA is disabled. This means that routers are still scheduled on the Network nodes (and load-shared between the L3 agents), but if one agent fails, all routers hosted by this agent fail as well. This affects only SNAT traffic. The allow_automatic_l3agent_failover feature is recommended in such cases, so that if one network node fails, the routers are rescheduled to a different node.

Also, at the end of the doc there is a "Migrating centralized routers to distributed routing" section which requires disabling HA before enabling DVR.

Doc Link: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html/networking_guide/sec-dvr

There is also an official OpenStack doc: Open vSwitch: High availability using DVR (https://docs.openstack.org/neutron/wallaby/admin/deploy-ovs-ha-dvr.html)

It is confusing: is dvr+l3_ha supported or not?

Also, is it possible that fast-exit is not supported for dvr+l3_ha? The commits linked to https://bugs.launchpad.net/neutron/+bug/1577488 [RFE] "Fast exit" show changes to the "agent/l3/dvr_local_router.py" file.

When we disable HA for the DVR router, it works as expected! (ip rules are added to the qrouter netns, ip routes are added to the fip netns, etc.)
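
(For reference, this is roughly how we switch an existing router out of HA; the router must be administratively down while the flag is changed, and the router name is just an example:)

 # openstack router set --disable test-router-01
 # openstack router set --no-ha test-router-01
 # openstack router set --enable test-router-01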
