Activity log for bug #2035332

Date Who What changed Old value New value Message
2023-09-13 09:22:49 Graeme Moss bug added bug
2023-09-13 09:27:38 Graeme Moss description
Old value: an earlier revision of the description below, in which "Containers built on 2023-08-23" was misspelled "Conainters".
New value:

## Environment

### Deployment

- Ubuntu 22.04 LTS
- OpenStack release Zed
- Kolla-ansible - stable/zed repo
- Kolla - stable/zed repo
- Containers built with Ubuntu 22.04 LTS
- Containers built on 2023-08-23
- OVN+DVR+VLAN tenant networks.
- We have three controllers: occ00001, occ00002, occ00003
- Neutron version neutron-21.1.3.dev34 commit d6ee668cc32725cb7d15d2e08fdb50a761f91fe4
- ovn-nbctl 22.09.1
- Open vSwitch Library 3.0.3
- DB Schema 6.3.0

1. New provider network deployed into OpenStack, on VLAN 504.
2. Router connected to this provider network.
3. Instance connected to provider network, no FIP.

## Issues

Attempting to send north/south traffic (ping 8.8.8.8) results in the following symptoms: two pings are successful and all other pings fail. Once the ping is cancelled and a couple of minutes pass, two pings will be successful again, then it goes back to failing.

New routers with VLAN networks attached don't create all three ports on the controllers.

Even after fixing the localnet ports on the router so that there are three (by changing the priority), N/S traffic is still limited to two pings when a FIP is attached.

Only when setting `reside-on-redirect-chassis` to `True` can we get the VLAN to work with FIPs and have baremetal nodes use FIPs.

## Diagnostics

After looking at the ovn-controller logs on the control nodes, we can see that occ00001 tries to claim the port, which matches the gateway chassis on the router's LRP port.

```
2023-09-06T14:13:32.454Z|00718|binding|INFO|Claiming lport cr-lrp-1a089d8f-d7a3-4116-a496-94cb87abe57f for this chassis.
2023-09-06T14:13:32.454Z|00719|binding|INFO|cr-lrp-1a089d8f-d7a3-4116-a496-94cb87abe57f: Claiming fa:16:3e:fc:ba:cf 1xx.xx.xxx.xxx/25
```

Gateway chassis of the LRP port:
```
ovn-nbctl list Gateway_Chassis | grep -A2 -B4 lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1
_uuid               : cf26be06-206d-443c-b224-25cc06ef2094
chassis_name        : occ00002
external_ids        : {}
name                : lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1_occ00002
options             : {}
priority            : 2
--
_uuid               : 1d9e8314-ed00-4694-8974-0328b78d34e1
chassis_name        : occ00001
external_ids        : {}
name                : lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1_occ00001
options             : {}
priority            : 3
--
_uuid               : b1e41ceb-ca2d-42eb-a896-b3551ea1b32f
chassis_name        : occ00003
external_ids        : {}
name                : lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1_occ00003
options             : {}
priority            : 1
```

We see nothing about `occ00002` or `occ00003` trying to claim the LRP port, but we found that when you change the priorities around to try to resolve this, the logs show that the port is not on `occ00001` but on `occ00002`. We change `occ00001` to priority 1 and `occ00003` to priority 3, which means `occ00003` will become the highest-priority gateway.

```
ovn-nbctl set gateway_chassis 1d9e8314-ed00-4694-8974-0328b78d34e1 priority=1
ovn-nbctl set gateway_chassis b1e41ceb-ca2d-42eb-a896-b3551ea1b32f priority=3
```

The logs show the following.

occ00001

```
2023-09-06T14:10:06.134Z|00667|binding|INFO|Releasing lport cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1 from this chassis (sb_readonly=0)
2023-09-06T14:10:06.134Z|00668|if_status|WARN|Trying to release unknown interface cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1
```

occ00002

```
2023-09-06T14:10:14.883Z|00444|binding|INFO|Releasing lport cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1 from this chassis (sb_readonly=0)
2023-09-06T14:10:14.883Z|00445|if_status|WARN|Trying to release unknown interface cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1
```

occ00003

```
2023-09-06T14:10:14.789Z|00459|binding|INFO|Changing chassis for lport cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1 from occ00002 to occ00003.
2023-09-06T14:10:14.789Z|00460|binding|INFO|cr-lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1: Claiming fa:16:3e:71:df:71 1xx.xx.xxx.xxx/25
```

On `occ00003` we can see that `occ00002` had the gateway and not `occ00001`, which should have had it. This happens when creating new routers on the VLAN provider network. All routers that existed before the upgrade are working, and they have the same priority.

## Second diagnostics

Looking at each Logical Router, we can see that when the router is first created, only two of the three ports are created.

Broken router:

```
_uuid               : 773bb527-f193-4b47-8685-e62c9325dd7b
copp                : []
enabled             : true
external_ids        : {"neutron:availability_zone_hints"="", "neutron:gw_network_id"="c9d130bc-301d-45c0-9328-a6964af65579", "neutron:gw_port_id"="1a089d8f-d7a3-4116-a496-94cb87abe57f", "neutron:revision_number"="4", "neutron:router_name"=new-r1-test}
load_balancer       : []
load_balancer_group : []
name                : neutron-2b51e12e-5505-477e-9720-e5db31a05790
nat                 : [f22e6004-ad69-4b12-9445-7006a03495f5]
options             : {always_learn_from_arp_request="false", dynamic_neigh_routers="true"}
policies            : []
ports               : [c59b5f9e-707e-43eb-912a-ea2679f1f723, c8f8ba72-64b4-4209-8209-128c93b157bc]
static_routes       : [36ad39c0-c3f0-4842-b9b8-b4e986147624]
```

The working router has all three ports after we make the priority change, which means that the change forces the ports to be created.

Working router:

```
_uuid               : 8734ea01-21e7-4e69-8649-b05b125ce36e
copp                : []
enabled             : true
external_ids        : {"neutron:availability_zone_hints"="", "neutron:gw_network_id"="c9d130bc-301d-45c0-9328-a6964af65579", "neutron:gw_port_id"="dbe08713-97e1-4bea-880b-70910e05180d", "neutron:revision_number"="16", "neutron:router_name"=R2-test-demo2}
load_balancer       : []
load_balancer_group : []
name                : neutron-cbabcf4c-08a3-4e31-9485-a456237ef427
nat                 : [4bba0f50-6937-47cc-8771-2caef2aee7e6, 51f7f8fc-3b07-4a75-8dc3-32b0e2c4e02a, 663f6c59-4cc1-4802-b0ff-5ae34e83210e]
options             : {always_learn_from_arp_request="false", dynamic_neigh_routers="true"}
policies            : []
ports               : [a9590024-feb2-4724-be7a-8bdb5fe3f9af, c1b94349-d320-4573-a2d5-2b1d3e91f679, ccae3d63-7203-4e39-8960-1e17df22fb31]
static_routes       : [8e89f98e-cf75-4ae4-bbb6-e459e6ae9a6c]
```

## Resolution

When we look at the Logical Router Port of the internal interface (the one attached to the VLAN), we can see that options has the following.

```
name     : lrp-d6e063e5-d209-43ec-9da2-4ac9f9e8ccbc
networks : ["192.168.0.1/24"]
options  : {reside-on-redirect-chassis="false"}
```

And on the external LRP we have the following.

```
mac      : "fa:16:3e:fc:ba:cf"
name     : lrp-1a089d8f-d7a3-4116-a496-94cb87abe57f
networks : ["1xx.xx.2xx.2xx/25"]
options  : {redirect-type=bridged, reside-on-redirect-chassis="false"}
```

My understanding is that `reside-on-redirect-chassis` forces traffic through the gateway chassis rather than DVR. It should be `True` here, as VLAN networks need to go through the gateway chassis for everything, whereas Geneve networks can have it set to false to allow DVR.

When I change this to true on the VLAN LRP with `ovn-nbctl set logical_router_port lrp-d6e063e5-d209-43ec-9da2-4ac9f9e8ccbc options:reside-on-redirect-chassis=true`, packets flow through the chassis and I can ping outwards. FIPs can now be attached to the VLAN network and we can connect with no problem.

Looking at the merged fix https://review.opendev.org/c/openstack/neutron/+/879296, I don't understand what is meant to happen, but the VLAN LRP is not being set to true, which causes problems. The external LRP is being set correctly, but VLANs need to be centralised.
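
The comparison between gateway chassis priorities and the chassis that actually claims the port can be checked mechanically. The following is a minimal Python sketch, not part of the original report. It assumes the `ovn-nbctl list Gateway_Chassis` output format shown in the description, and that the chassis with the highest priority is the one expected to hold the `cr-lrp` port (as the priority swap in the description implies). The function names `parse_records` and `expected_active_chassis` are hypothetical helpers introduced for illustration.

```python
# Sample input copied from the Gateway_Chassis listing in the description.
SAMPLE = """\
_uuid               : cf26be06-206d-443c-b224-25cc06ef2094
chassis_name        : occ00002
external_ids        : {}
name                : lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1_occ00002
options             : {}
priority            : 2
--
_uuid               : 1d9e8314-ed00-4694-8974-0328b78d34e1
chassis_name        : occ00001
external_ids        : {}
name                : lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1_occ00001
options             : {}
priority            : 3
--
_uuid               : b1e41ceb-ca2d-42eb-a896-b3551ea1b32f
chassis_name        : occ00003
external_ids        : {}
name                : lrp-71cf7286-de37-4d86-b362-eb7ba689d2d1_occ00003
options             : {}
priority            : 1
"""


def parse_records(text):
    """Split a 'key : value' listing into dicts.

    Records are separated by blank lines or by the '--' lines that grep emits.
    """
    records, current = [], {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "--":
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so values containing ':' stay intact.
        key, sep, value = stripped.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def expected_active_chassis(records):
    """Map each LRP name to the chassis with the highest priority."""
    best = {}
    for rec in records:
        chassis = rec["chassis_name"]
        name = rec["name"]
        # In the listing above, names look like '<lrp-name>_<chassis-name>'.
        suffix = "_" + chassis
        lrp = name[: -len(suffix)] if name.endswith(suffix) else name
        priority = int(rec["priority"])
        if lrp not in best or priority > best[lrp][1]:
            best[lrp] = (chassis, priority)
    return {lrp: chassis for lrp, (chassis, _) in best.items()}


if __name__ == "__main__":
    for lrp, chassis in expected_active_chassis(parse_records(SAMPLE)).items():
        print(f"{lrp}: expected active gateway chassis is {chassis}")
```

The result can then be compared by hand against the chassis named in the ovn-controller `binding` log lines. In the report, the logs show the port moving "from occ00002 to occ00003", while the priorities before the swap point to `occ00001`.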
2023-09-13 09:41:40 Graeme Moss summary VLAN networks for North / South Traffic Broken [OVN] VLAN networks for North / South Traffic Broken
2023-09-13 09:41:40 Bartosz Bezak bug added subscriber Bartosz Bezak
2023-09-14 20:08:10 Miro Tomaska tags ovn
2023-12-18 08:32:25 yatin bug watch added https://bugzilla.redhat.com/show_bug.cgi?id=2007120
2023-12-18 08:38:39 yatin bug added subscriber yatin
2024-01-16 13:36:59 Sven Kieske bug added subscriber Sven Kieske
2024-01-16 16:22:54 Dr. Jens Harbott bug added subscriber Dr. Jens Harbott
2024-02-06 00:56:11 Brian Haley bug added subscriber Brian Haley