Just adding more information to issue number [1] that Gaëtan described in the initial comment.
These are the flows listed for a VXLAN network with segmentation ID 97 (0x61 in hexadecimal) on one of our compute nodes:
# dex -u0 openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x61
 cookie=0xde6f920d0d405dbc, duration=500977.143s, table=4, n_packets=427, n_bytes=41999, priority=1,tun_id=0x61 actions=mod_vlan_vid:8,resubmit(,9)
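A plain `grep 0x61` can also match unrelated values that merely start with the same digits. As a minimal sketch (not something we ran in production), the filter below derives the hexadecimal tunnel ID from the segmentation ID and keeps only flows that match on or load exactly that ID. The second sample flow, with `tun_id=0x610`, is hypothetical and only there to show the difference.

```python
import re


def tunnel_id_pattern(segmentation_id):
    """Build a regex for flows that match on, or load, this tunnel ID."""
    hex_id = format(segmentation_id, "#x")  # 97 -> "0x61"
    return re.compile(
        rf"(?:tun_id={hex_id}\b|load:{hex_id}->NXM_NX_TUN_ID\[\])"
    )


def flows_for_segment(dump, segmentation_id):
    """Return the flow lines of an ovs-ofctl dump that belong to the segment."""
    pattern = tunnel_id_pattern(segmentation_id)
    return [line.strip() for line in dump.splitlines() if pattern.search(line)]


# First line is taken from the output above; the second one (tun_id=0x610)
# is a made-up flow used only to illustrate a false positive of plain grep.
SAMPLE = """\
 cookie=0xde6f920d0d405dbc, duration=500977.143s, table=4, n_packets=427, n_bytes=41999, priority=1,tun_id=0x61 actions=mod_vlan_vid:8,resubmit(,9)
 cookie=0xde6f920d0d405dbc, duration=10.000s, table=4, n_packets=0, n_bytes=0, priority=1,tun_id=0x610 actions=mod_vlan_vid:9,resubmit(,9)
"""

if __name__ == "__main__":
    for flow in flows_for_segment(SAMPLE, 97):
        print(flow)
```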
As you can see, there is only one flow, in table 4. Two instances from that network were already running on the compute node, one created a few days ago and another a few minutes ago. The latter never got an IP from DHCP because there were no flows to reach the DHCP namespaces running on three different controllers. To fix this we have found two workarounds:
1. Add the flows manually
2. Modify the network so that it triggers a flow addition across the computes running instances from that network.
We tried the second option by adding a second subnet to the network, and we could see log events about update_fdb_entries for the DHCP agents. The flows now look like this, even after deleting the second subnet:
# dex -u0 openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x61
 cookie=0xde6f920d0d405dbc, duration=501247.247s, table=4, n_packets=436, n_bytes=44143, priority=1,tun_id=0x61 actions=mod_vlan_vid:8,resubmit(,9)
 cookie=0xde6f920d0d405dbc, duration=15.405s, table=20, n_packets=1, n_bytes=42, priority=2,dl_vlan=8,dl_dst=fa:16:3e:4c:67:28 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083e"
 cookie=0xde6f920d0d405dbc, duration=14.697s, table=20, n_packets=2, n_bytes=84, priority=2,dl_vlan=8,dl_dst=fa:16:3e:5b:63:e3 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d"
 cookie=0xde6f920d0d405dbc, duration=14.711s, table=22, n_packets=9, n_bytes=1818, priority=1,dl_vlan=8 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d",output:"vxlan-0a83083e"
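To tell which hosts these tunnel ports point to, the port names can be decoded. This is a small sketch that assumes the usual naming convention where the suffix after `vxlan-` is the remote tunnel endpoint's IPv4 address written in hexadecimal; the helper name `tunnel_port_to_ip` is ours, not part of any Neutron or OVS tool.

```python
import ipaddress


def tunnel_port_to_ip(port_name):
    """Decode a port name such as 'vxlan-0a83083e' into a dotted IPv4 address.

    Assumes the suffix is the remote endpoint IP as 8 hexadecimal digits.
    """
    _prefix, _sep, hex_part = port_name.partition("-")
    if len(hex_part) != 8:
        raise ValueError(f"unexpected tunnel port name: {port_name!r}")
    return str(ipaddress.IPv4Address(int(hex_part, 16)))


if __name__ == "__main__":
    # Port names copied from the flow dump.
    for name in ("vxlan-0a83083d", "vxlan-0a83083e"):
        print(name, "->", tunnel_port_to_ip(name))
```

Under that assumption the two ports map to 10.131.8.61 and 10.131.8.62, which can then be compared against the tunnel IPs of the controllers hosting the DHCP agents.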
From the flows above we identify two things:
- They only cover communication with the DHCP agents. In other words, none of the outputs to the VXLAN tunnels of other computes running instances in that network are added in table 22.
- We are still missing a flow for one of the DHCP agents: only two of the three appear in table 20.
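The two observations can be checked mechanically. The sketch below parses the four flows from the dump, counts them per table, and compares the unicast entries in table 20 and the flood outputs in table 22 against the three DHCP agents we expect. It assumes the usual br-tun layout (table 4 for incoming VXLAN traffic, table 20 for unicast to tunnels, table 22 for flooding to tunnels); the function and constant names are our own.

```python
import re
from collections import defaultdict

# The four flows from the dump, trimmed to the fields the parser needs.
FLOWS = """\
 table=4, priority=1,tun_id=0x61 actions=mod_vlan_vid:8,resubmit(,9)
 table=20, priority=2,dl_vlan=8,dl_dst=fa:16:3e:4c:67:28 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083e"
 table=20, priority=2,dl_vlan=8,dl_dst=fa:16:3e:5b:63:e3 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d"
 table=22, priority=1,dl_vlan=8 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d",output:"vxlan-0a83083e"
"""

EXPECTED_DHCP_AGENTS = 3  # DHCP namespaces run on three controllers


def summarize(dump):
    """Return (flows per table, table-20 MAC -> port map, table-22 ports)."""
    per_table = defaultdict(int)
    unicast = {}
    flood_ports = set()
    for line in dump.splitlines():
        table_match = re.search(r"table=(\d+)", line)
        if not table_match:
            continue
        table = int(table_match.group(1))
        per_table[table] += 1
        outputs = re.findall(r'output:"?([\w-]+)"?', line)
        if table == 20:
            mac = re.search(r"dl_dst=([0-9a-f:]{17})", line)
            if mac and outputs:
                unicast[mac.group(1)] = outputs[0]
        elif table == 22:
            flood_ports.update(outputs)
    return dict(per_table), unicast, flood_ports


if __name__ == "__main__":
    per_table, unicast, flood_ports = summarize(FLOWS)
    print("flows per table:", per_table)
    print("unicast entries:", unicast)
    print("flood ports:", sorted(flood_ports))
    print("missing DHCP unicast flows:", EXPECTED_DHCP_AGENTS - len(unicast))
```

Running the same parser on the real `ovs-ofctl dump-flows br-tun` output after each agent restart would make it easy to see when the table 20 and table 22 entries disappear again.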
If we restart the instance that didn't get a DHCP lease, it now gets an IP. However, once we restart the neutron openvswitch agent on the compute, we lose the flows in table 20 and table 22. What we don't understand is why it always returns to the same unhealthy state, with only one flow in table 4.