Comment 0 for bug 2035332

Graeme Moss (gramimoss) wrote : VLAN networks for North / South Traffic Broken

## Environment

### Deployment

- Ubuntu 22.04 LTS
- OpenStack release: Zed
- Kolla-ansible - stable/zed repo
- Kolla - stable/zed repo
- Containers built with Ubuntu 22.04 LTS
- Containers built on 2023-08-23
- OVN + DVR + VLAN tenant networks
- We have three controllers: occ00001, occ00002, occ00003
- Neutron version neutron-21.1.3.dev34 commit d6ee668cc32725cb7d15d2e08fdb50a761f91fe4
- ovn-nbctl 22.09.1
- Open vSwitch Library 3.0.3
- DB Schema 6.3.0

1. New provider network deployed into OpenStack, on VLAN 504.
2. Router connected to this provider network.
3. Instance connected to the provider network, with no FIP.

## Issues

Attempting to send north/south traffic (`ping 8.8.8.8`) results in the following symptoms: the first two pings succeed and all other pings fail. Once the ping is cancelled and a couple of minutes pass, two pings succeed again, then it goes back to failing.

New routers with VLAN networks attached do not have all three ports created on the controllers.

Even after fixing the localnet ports on the router so that there are three (by changing the priority), N/S traffic is still limited to two pings when a FIP is attached.

Only after setting `reside-on-redirect-chassis` to `true` can we get the VLAN network to work with FIPs and have baremetal nodes use FIPs.

## Diagnostics

After looking at the ovn-controller logs on the control nodes, we can see that it tries to claim the port on `occ00001`, which matches the gateway chassis on the router's LRP.

```
2023-09-06T14:13:32.454Z|00718|binding|INFO|Claiming lport cr-lrp-1a089d8f-d7a3-4116-a496-94cb87abe57f for this chassis.
2023-09-06T14:13:32.454Z|00719|binding|INFO|cr-lrp-1a089d8f-d7a3-4116-a496-94cb87abe57f: Claiming fa:16:3e:fc:ba:cf 1xx.xx.xxx.xxx/25
```

Gateway chassis entries for the LRP:

```
ovn-nbctl list Gateway_Chassis | grep -A2 -B4 lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1

_uuid : cf26be06-206d-443c-b224-25cc06ef2094
chassis_name : occ00002
external_ids : {}
name : lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1_occ00002
options : {}
priority : 2
--

_uuid : 1d9e8314-ed00-4694-8974-0328b78d34e1
chassis_name : occ00001
external_ids : {}
name : lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1_occ00001
options : {}
priority : 3
--

_uuid : b1e41ceb-ca2d-42eb-a896-b3551ea1b32f
chassis_name : occ00003
external_ids : {}
name : lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1_occ00003
options : {}
priority : 1
```
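To check which chassis is expected to hold the gateway, the listing above can be reduced to the entry with the highest priority. This is a minimal sketch, assuming the `ovn-nbctl list Gateway_Chassis` output layout shown above (with `chassis_name` printed before `priority` in each record); `highest_priority_chassis` is a hypothetical helper name, not part of OVN.

```shell
# Hypothetical helper: read `ovn-nbctl list Gateway_Chassis`-style output
# on stdin and print the chassis_name of the record with the highest priority.
# Example use (same grep as in the report):
#   ovn-nbctl list Gateway_Chassis | grep -A2 -B4 lrp-<port-id> | highest_priority_chassis
highest_priority_chassis() {
    awk -F' *: *' '
        $1 == "chassis_name" { name = $2 }          # remember the current record
        $1 == "priority" {
            p = $2 + 0
            if (!seen || p > best) { best = p; winner = name; seen = 1 }
        }
        END { if (seen) print winner }
    '
}
```

For the three records above, this prints `occ00001` (priority 3).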

We see nothing about `occ00002` or `occ00003` trying to claim the LRP. However, when we change the priorities around to try to resolve this, we can see that the port is not on `occ00001` but on `occ00002`.
We set `occ00001` to priority 1 and `occ00003` to priority 3, which means `occ00003` becomes the highest-priority gateway.

```
ovn-nbctl set gateway_chassis 1d9e8314-ed00-4694-8974-0328b78d34e1 priority=1
ovn-nbctl set gateway_chassis b1e41ceb-ca2d-42eb-a896-b3551ea1b32f priority=3
```

The logs show the following.

occ00001

```
2023-09-06T14:10:06.134Z|00667|binding|INFO|Releasing lport cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1 from this chassis (sb_readonly=0)
2023-09-06T14:10:06.134Z|00668|if_status|WARN|Trying to release unknown interface cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1
```

occ00002

```
2023-09-06T14:10:14.883Z|00444|binding|INFO|Releasing lport cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1 from this chassis (sb_readonly=0)
2023-09-06T14:10:14.883Z|00445|if_status|WARN|Trying to release unknown interface cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1
```

occ00003

```
2023-09-06T14:10:14.789Z|00459|binding|INFO|Changing chassis for lport cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1 from occ00002 to occ00003.
2023-09-06T14:10:14.789Z|00460|binding|INFO|cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1: Claiming fa:16:3e:71:df:71 1xx.xx.xxx.xxx/25
```

On `occ00003` we can see that `occ00002` had the gateway and not `occ00001`, which should have had it. This happens when creating new routers on the VLAN provider network. All routers that existed before the upgrade are working, and they have the same priority.
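The handover between chassis can be pulled out of the ovn-controller logs mechanically. This is a minimal sketch, assuming the `Changing chassis for lport ... from ... to ...` message format seen in the `occ00003` log above; `chassis_moves` is a hypothetical helper name.

```shell
# Hypothetical helper: read ovn-controller log lines on stdin and print
# "<lport> <old-chassis> <new-chassis>" for every chassis change message.
chassis_moves() {
    sed -n 's/.*Changing chassis for lport \([^ ]*\) from \([^ ]*\) to \(.*\)\.$/\1 \2 \3/p'
}
```

Run against the log lines above, this reports that `cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1` moved from `occ00002` to `occ00003`, i.e. the previous holder was not `occ00001`.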

## Second diagnostics

Looking at each Logical Router, we can see that when a router is first created, only two of the three ports are created.

Broken router:

```
_uuid : 773bb527-f193-4b47-8685-e62c9325dd7b
copp : []
enabled : true
external_ids : {"neutron:availability_zone_hints"="", "neutron:gw_network_id"="c9d130bc-301d-45c0-9328-a6964af65579", "neutron:gw_port_id"="1a089d8f-d7a3-4116-a496-94cb87abe57f", "neutron:revision_number"="4", "neutron:router_name"=new-r1-test}
load_balancer : []
load_balancer_group : []
name : neutron-2b51e12e-5505-477e-9720-e5db31a05790
nat : [f22e6004-ad69-4b12-9445-7006a03495f5]
options : {always_learn_from_arp_request="false", dynamic_neigh_routers="true"}
policies : []
ports : [c59b5f9e-707e-43eb-912a-ea2679f1f723, c8f8ba72-64b4-4209-8209-128c93b157bc]
static_routes : [36ad39c0-c3f0-4842-b9b8-b4e986147624]
```

The working router has all three ports after we make the priority change, which suggests that the change forces the ports to be created.

Working router:

```
_uuid : 8734ea01-21e7-4e69-8649-b05b125ce36e
copp : []
enabled : true
external_ids : {"neutron:availability_zone_hints"="", "neutron:gw_network_id"="c9d130bc-301d-45c0-9328-a6964af65579", "neutron:gw_port_id"="dbe08713-97e1-4bea-880b-70910e05180d", "neutron:revision_number"="16", "neutron:router_name"=R2-test-demo2}
load_balancer : []
load_balancer_group : []
name : neutron-cbabcf4c-08a3-4e31-9485-a456237ef427
nat : [4bba0f50-6937-47cc-8771-2caef2aee7e6, 51f7f8fc-3b07-4a75-8dc3-32b0e2c4e02a, 663f6c59-4cc1-4802-b0ff-5ae34e83210e]
options : {always_learn_from_arp_request="false", dynamic_neigh_routers="true"}
policies : []
ports : [a9590024-feb2-4724-be7a-8bdb5fe3f9af, c1b94349-d320-4573-a2d5-2b1d3e91f679, ccae3d63-7203-4e39-8960-1e17df22fb31]
static_routes : [8e89f98e-cf75-4ae4-bbb6-e459e6ae9a6c]
```
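The difference between the two routers comes down to the number of entries in the `ports` column. This is a minimal sketch that counts them, assuming `ovn-nbctl list Logical_Router`-style output on stdin (with `name` printed before `ports`, as in the listings above); `router_port_counts` is a hypothetical helper name.

```shell
# Hypothetical helper: read `ovn-nbctl list Logical_Router`-style output
# on stdin and print "<router-name> <number-of-ports>" per router.
router_port_counts() {
    awk -F' *: *' '
        $1 == "name"  { name = $2 }
        $1 == "ports" {
            list = $2
            gsub(/\[|\]| /, "", list)                 # strip brackets and spaces
            n = (list == "") ? 0 : split(list, parts, ",")
            print name, n
        }
    '
}
```

For the two routers above, this gives 2 for `neutron-2b51e12e-5505-477e-9720-e5db31a05790` and 3 for `neutron-cbabcf4c-08a3-4e31-9485-a456237ef427`.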

## Resolution

When we look at the Logical Router Port of the internal interface (the one attached to the VLAN), we can see that `options` contains the following.

```
name : lrp-d6e063e5-d209-43ec-9da2-4ac9f9e8ccbc
networks : ["192.168.0.1/24"]
options : {reside-on-redirect-chassis="false"}
```

And on the external LRP we have the following.

```
mac : "fa:16:3e:fc:ba:cf"
name : lrp-1a089d8f-d7a3-4116-a496-94cb87abe57f
networks : ["1xx.xx.2xx.2xx/25"]
options : {redirect-type=bridged, reside-on-redirect-chassis="false"}
```

My understanding is that `reside-on-redirect-chassis` forces traffic through the gateway chassis rather than using DVR. It should be `true` here, as VLAN networks need to go through the gateway chassis for everything, whereas Geneve networks can leave it as `false` to allow DVR.

When I change this to true on the VLAN LRP with `ovn-nbctl set logical_router_port lrp-d6e063e5-d209-43ec-9da2-4ac9f9e8ccbc options:reside-on-redirect-chassis=true`, packets flow through the chassis and I can ping outwards. FIPs can now be attached to the VLAN network and we can connect with no problem.
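If several VLAN-facing LRPs need the same workaround, the command can be generated per port and reviewed before it is run. This is a minimal sketch; `print_reside_fix` is a hypothetical helper that only prints the `ovn-nbctl set` commands in the form used above and does not execute anything.

```shell
# Hypothetical helper: for each LRP name given as an argument, print the
# ovn-nbctl command that sets reside-on-redirect-chassis=true on it.
# Nothing is executed; pipe the output to `sh` only after reviewing it.
print_reside_fix() {
    for lrp in "$@"; do
        printf 'ovn-nbctl set logical_router_port %s options:reside-on-redirect-chassis=true\n' "$lrp"
    done
}
```

For example, `print_reside_fix lrp-d6e063e5-d209-43ec-9da2-4ac9f9e8ccbc` prints the exact command used in this report.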

Looking at the merged fix https://review.opendev.org/c/openstack/neutron/+/879296, I don't understand what is meant to happen, but the VLAN LRP is not being set to true, which causes problems. The external LRP is being set correctly, but VLANs need to be centralised.