#### Summary

On a pre-production OpenStack deployment, we observed the following during a restart of neutron-openvswitch-agent: some active flows that the OVS firewall was letting through based on SG rules before the restart became marked as ct_mark(CT_MARK_INVALID); their traffic is then dropped for a period of time that extends beyond the restart. The clearing of rules carrying the previous cookie does not resolve the issue.

Digging into this issue has led me to consider the hypothesis that during a restart, where the neutron OVS agent adds new rules with a new cookie and ultimately removes the rules from the previous run that do not carry the newer cookie, the assumption that the new rules do not interfere with the old ones is broken.

Looking at how conjunction IDs are used has led me to see that:

A) the code offers no guarantee that, on a restart, a conjunction ID used for some SG rule in the previous run does not end up being used for some other SG rule on the next run

B) in the case of an unfortunate collision (the same conj_id used for two different SGs across a restart), the way OVS rules are replaced leaves room for race conditions resulting in either legitimate traffic being dropped or illegitimate traffic being accepted

Case (B), with legitimate traffic being dropped, matches the issue as we saw it on the deployment, as do the restricted conditions under which (B) would occur.

This bug report first provides details on the operational issue, but independently of the analysis of this specific case, the design issue in the neutron agent described in the second part is what this bug report really is about.

Slawek and Rodolfo have already been exposed to the details explained here.

## Details on the issue observed on our deployment

# Context:

- Queens (RH OSP13 containers)
- we focus on two compute nodes where VMs run to form a cluster (one VM per compute)
- SG rules are in place to allow traffic to said VMs (more on this below)
- TCP traffic

# Reproduction attempt

neutron OVS agent restarted at 11:41:35 on hyp1001
traffic is impacted (cluster healthcheck failure in the application that runs in the VM)
(We hadn't taken many traces for this step, we were only checking that reloading the OVS agent with debug logs was working as intended)

neutron OVS agent restarted at 12:28:35 on hyp1001
no impact on cluster traffic

neutron OVS agent restarted at 12:34:48 on hyp12003
VM impacted starting from 12:35:12 (24s after start of the new agent)

What follows is about this second occurrence where traffic was impacted.

Extract from VM logs (redacted):

2021-04-28 12:35:22,769 WARN messages lost for 10.1s
2021-04-28 12:35:32,775 WARN messages lost for 20.0s

When the neutron OVS agent restarts, it is supposed to (a) redefine all the OVS rules, and then (b) remove the ones that existed before its startup. The time at which the issue starts (12:35:12) is between (a) and (b).
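As a side note, here is a minimal sketch of that (a)/(b) sequence. It is not the actual agent code: the `bridge` object, `compute_flows_for_port()` and `dump_all_flows()` are hypothetical helpers used only to show the shape of the restart logic described above.

```python
import uuid


def restart_flows(bridge, ports):
    """Illustrative only: re-create flows under a new cookie, then purge stale ones."""
    new_cookie = uuid.uuid4().int & ((1 << 64) - 1)  # fresh 64-bit cookie

    # (a) re-install every rule tagged with the new cookie; the rules installed
    # by the previous run (old cookie) are still present at this point.
    for port in ports:
        for flow in compute_flows_for_port(port):    # hypothetical helper
            bridge.add_flow(cookie=new_cookie, **flow)

    # ... the window between (a) and (b): old and new rules co-exist ...

    # (b) only afterwards, delete every flow whose cookie differs from the new
    # one ("Cleaning stale ... flows" in the agent log).
    for flow in bridge.dump_all_flows():             # hypothetical helper
        if flow.cookie != new_cookie:
            bridge.delete_flow(flow)
```

The issue described in this report happens inside the window between (a) and (b), while rules from both runs co-exist in the same tables.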
Time of (b) events for the different OVS bridges:

2021-04-28T12:35:29 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Cleaning stale br-int flows
2021-04-28T12:35:32 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Cleaning stale br-ovs-cp-bond2 flows
2021-04-28T12:35:32 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Cleaning stale br-tun flows

Here is the ovs-dpctl dump-flows trace where we see the flows being marked as CT_MARK_INVALID:

> ---------- dump_flows_i1-hyp12003.mgmt.oaas_2021-04-28T12:35:16+0000_dpctl.dropped-flows ------------
> --- dump_flows_i1-hyp12003.mgmt.oaas_2021-04-28T12:35:11+0000_dpctl.dropped-flows 2021-05-04 11:08:21.451676203 +0200
> +++ dump_flows_i1-hyp12003.mgmt.oaas_2021-04-28T12:35:16+0000_dpctl.dropped-flows 2021-05-04 11:08:21.455676231 +0200
> @@ -4,6 +4,7 @@
> recirc_id(0),tunnel(tun_id=0x40,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x43,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x44,src=172.11.156.20,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x4e,src=172.11.156.14,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x50,src=172.11.156.20,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x52,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> @@ -18,3 +19,5 @@
> recirc_id(0),tunnel(tun_id=0xb,src=172.11.156.20,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0x6d330),tunnel(tun_id=0x73,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),ct_state(+new-est-rel-rpl-inv+trk),ct_zone(0x1),ct_mark(0),eth_type(0x0800),ipv4(src=10.92.0.0/255.254.0.0,proto=6,frag=no),tcp(dst=32768/0x8000), flags:S, actions:drop
> recirc_id(0x6d330),tunnel(tun_id=0x73,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),ct_state(-new-est-rel-rpl+inv+trk),eth_type(0x0800),ipv4(frag=no), flags:R., actions:drop
> +recirc_id(0xa5030),in_port(4),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x1),eth_type(0x0800),ipv4(frag=no), flags:P., actions:drop
> +recirc_id(0xa503e),in_port(5),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x1),eth_type(0x0800),ipv4(frag=no), flags:P., actions:drop
>
>
> See the two last dpctl dropped flows that appear between 12:35:11 and 12:35:16 ^^
> Details:
> grep a503[0e] dump_flows_i1-hyp12003.mgmt._2021-04-28T12:35:32+0000_dpctl
> recirc_id(0),in_port(5),ct_state(-trk),eth(src=44:ec:ce:33:3f:46,dst=fa:16:3e:ba:2b:4e),eth_type(0x8100),vlan(vid=3103,pcp=0),encap(eth_type(0x0800),ipv4(proto=6,frag=no)), packets:13, bytes:1923, used:7.456s, flags:P., actions:pop_vlan,ct(zone=5),recirc(0xa503e)
> recirc_id(0),in_port(4),ct_state(-trk),eth(src=fa:16:3e:53:6e:1c,dst=fa:16:3e:ba:2b:4e),eth_type(0x8100),vlan(vid=3103,pcp=0),encap(eth_type(0x0800),ipv4(proto=6,frag=no)), packets:5886, bytes:724376, used:7.112s, flags:P., actions:pop_vlan,ct(zone=5),recirc(0xa5030)
> recirc_id(0xa5030),in_port(4),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x1),eth_type(0x0800),ipv4(frag=no), packets:14, bytes:6045, used:7.112s, flags:P., actions:drop
I have dug a bit more, and it appears that the conjunction used to match the source IP against our remote SG of interest changes precisely at the time where the packet drops happen.

**Before the restart**, our remote group uses conjunction ID 38 (this example focuses on the egress direction):

R1: table=72, priority=73,ct_state=+est-rel-rpl,ip,reg6=0x5,nw_dst=10.94.128.196 actions=conjunction(38,1/2) cookie=0x3321e58c437bf4aa
(same for many other 10.94.x.y IPs)

This conjunction 38 is used to accept traffic for the port:

R2: table=72, priority=73,conj_id=38,ct_state=+est-rel-rpl,ip,reg5=0x8 actions=load:0x26->NXM_NX_REG7[],resubmit(,73) cookie=0x3321e58c437bf4aa

(file is dump_flows_i1-hyp12003.mgmt._2021-04-28T12:35:11+0000_ofctl_br-int)

**After the restart**, around the time where the drops start, we see the above rule replaced by an action to set conjunction 86:

R3: table=72, priority=73,ct_state=+est-rel-rpl,ip,reg6=0x5,nw_dst=10.94.128.196 actions=conjunction(86,1/2) cookie=0xecd4cf28852563cf

The rule to accept traffic on the port is also present:

R4: table=72, priority=73,conj_id=86,ct_state=+est-rel-rpl,ip,reg5=0x8 actions=load:0x56->NXM_NX_REG7[],resubmit(,73) cookie=0xecd4cf28852563cf

(file is dump_flows_i1-hyp12003.mgmt._2021-04-28T12:35:16+0000_ofctl_br-int)

## Comments on this operational issue

The setup we have uses a lot of conjunction rules for SGs, due to the use of a large range of TCP ports in a security group rule (the range is broken into multiple masks in the OVS flow rules, which show up as many OVS conjunctions).

Given that the traffic is dropped based on ct_mark(0x1), I speculate that, because the transition of the rules from conjunction 38 to conjunction 86 is not atomic (if I'm right, nothing in the OVS firewall design guarantees that, and I see batches of 100 packets mentioned in ovs-vswitchd.log), there could be a time window where R3 has replaced R1, but R4 is not in place yet. Any packet of our flow hitting the pipeline at this point would not meet R2, and would then hit the rule at the bottom of table 72 that marks the whole flow (including future packets) as invalid:

table=72, priority=40,ct_state=+est,ip,reg5=0x8 actions=ct(commit,zone=NXM_NX_REG6[0..15],exec(load:0x1->NXM_NX_CT_MARK[])) cookie=0xecd4cf28852563cf

From then on, of course, even if R4 is set up later, traffic is not accepted anymore.

The fact that the traffic interruption isn't observed on a second restart is compatible with the idea that a conjunction ID collision could be the cause of our issue: indeed, on a second restart done a short time after the first one, the conjunction ID for a given SG flow is likely to be the same one as in the previous run.

## Rough analysis of the openvswitch firewall design and the conditions under which it may break

Here are my personal notes from a quick analysis of what races may exist in the code in relation with conjunction ID reuse across restarts. My apologies for the quality of the write-up, I opted for posting these notes rather than postponing this to an unknown future date.

Note that I haven't dug into the code to see which of the possible issues can actually happen, which depends on the order in which the openvswitch agent code updates the flow rules. This analysis really isn't thorough and may be wrong in some places. Hopefully the existence of any mistake will serve as an illustration that the reuse of conjunction IDs across restarts raises difficult questions... ;)
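Before the notes themselves, here is a toy allocator that makes the reuse concrete. It is a deliberately simplified model, not the actual ConjIdMap code: it only illustrates that, since nothing is persisted across restarts and the order in which SG flows are processed may differ, the same conjunction ID can end up designating different (sg_id, remote_sg_id, direction, ethertype) tuples before and after a restart.

```python
class ToyConjIdAllocator:
    """Hands out IDs in the order SG flows are first seen (simplified model)."""

    def __init__(self):
        self._ids = {}
        self._next_id = 0

    def get(self, sg_flow):
        # sg_flow = (sg_id, remote_sg_id, direction, ethertype)
        if sg_flow not in self._ids:
            self._ids[sg_flow] = self._next_id
            self._next_id += 1
        return self._ids[sg_flow]


# Run before the restart: SG flows happen to be processed in one order...
before = ToyConjIdAllocator()
assert before.get(("SGA", "SGB", "ingress", "IPv4")) == 0
assert before.get(("SGC", "SGD", "ingress", "IPv4")) == 1

# ...run after the restart: nothing persists the old mapping, and the
# iteration order over ports/SGs is not guaranteed to be the same.
after = ToyConjIdAllocator()
assert after.get(("SGC", "SGD", "ingress", "IPv4")) == 0
assert after.get(("SGA", "SGB", "ingress", "IPv4")) == 1  # same ID as the old
                                                          # (SGC, SGD) flow
```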
# Notations used:

SGFlow = traffic allowed between two SGs in a given direction (on a given ethertype)
  (sg_id, remote_sg_id, direction, ethertype) (tuples used by ConjIdMap)

SGFlow1 = (SGA, SGB, ingress, icmp)
SGFlow2 = (SGC, SGD, ingress, icmp)

Looking at a single compute node...

The compute node has a local port N, with MAC = MAC(N), on network Net(N)
N belongs to SGA
conj_id X is allocated by ConjIdMap to SGFlow1 (SGA, SGB, ingress, icmp)

The OVS rules for this SGFlow1 are R1b-1, R1b-2 and R1b-3 (R1 for SGFlow1, 'b' as in 'before restart'):

- R1b-1: reg_net = Net(N), ip_src in SGB => set conjunction(X, 1/2)
- R1b-2: reg_port = to_port(N), icmp => set conjunction(X, 2/2)
- R1b-3: conj(X), to_port(N) => CT + output to N

For instance (here conj_id X == 11):

R1b-1: cookie=0x5403be965377d469, duration=9682.923s, table=82, n_packets=0, n_bytes=0, idle_age=9797, priority=71,ct_state=+new-est,ip,reg6=0x1,nw_src=10.1.1.236 actions=conjunction(11,1/2)
R1b-2: cookie=0x5403be965377d469, duration=9662.260s, table=82, n_packets=0, n_bytes=0, idle_age=9797, priority=71,ct_state=+new-est,icmp,reg5=0x6 actions=conjunction(11,2/2)
R1b-3: cookie=0x5403be965377d469, duration=9662.260s, table=82, n_packets=2, n_bytes=196, idle_age=9612, priority=71,conj_id=11,ct_state=+new-est,ip,reg5=0x6 actions=load:0xb->NXM_NX_REG7[],ct(commit,zone=NXM_NX_REG6[0..15]),output:6,resubmit(,92)

The compute node also has a local port K with MAC = MAC(K), on network Net(K)
K belongs to SGC
conj_id Y is allocated by ConjIdMap to SGFlow2 (SGC, SGD, ingress, icmp)

OVS rules:

- R2b-1: reg_net = Net(K), ip_src in SGD => set conjunction(Y, 1/2)
- R2b-2: reg_port = to_port(K), icmp => set conjunction(Y, 2/2)
- R2b-3: conj(Y), to_port(K) => CT + output to K

** Let's now assume that the OVS agent is restarted **

[Scenario 1]

We assume that nothing has been done about port K and SGFlow2 at this point.

SGFlow1 processing starts for port N: conj_id *Y* is allocated by ConjIdMap to SGFlow1 (the case of a reuse of a given conjunction ID for two different flows: here SGFlow1 gets assigned the conjunction ID that was the conj_id of SGFlow2 before the restart).

New OVS rules ('a' as in 'after restart'):

- R1a-1: reg_net = Net(N), ip_src in SGB => set conjunction(Y, 1/2)
- R1a-2: reg_port = to_port(N), icmp => set conjunction(Y, 2/2)
- R1a-3: conj(Y), to_port(N) => CT + output to N

If R1a-1/2 are set up before R1a-3 (this ordering is modelled in the sketch below), we have:

* R1a-1/R1a-2 mapping SGFlow1 to conjunction Y, _replacing_ R1b-1/R1b-2
* R2b-3 mapping conj(Y), to_port(K) to K (i.e. the old, wrong port) is still here
* hence no flow maps SGFlow1 to port N anymore -> any packet processed now would be dropped and the flow marked invalid (unless port K == port N)
* after R1a-3 is set up: SGFlow1 -> to port N, and traffic can flow again (well, unless the whole connection has already been marked invalid)

If R1a-3 is set up before R1a-1/2 are, then until R1a-1/2 are set up we would have SGFlow2 mapped to Y (old state prior to the restart), and rules in place so that traffic matching "to_port(N) + conj Y" is sent to port N. In the general case, traffic matching SGFlow2 will not also match the to_port(N) conditions, but this is messy to analyze further to be sure that no illegitimate traffic can go through.
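Here is a toy model of the first ordering above (R1a-1/2 installed before R1a-3). It is purely illustrative: OVS matching semantics are reduced to the single property this scenario relies on, namely that re-adding a flow with the same match replaces its actions; matches and actions are just tuples and strings.

```python
# Table 82 is modelled as a dict keyed by match: re-adding a rule with the same
# match simply replaces its actions, which is how R1a-1/2 overwrite R1b-1/2.
table_82 = {
    ("net(N)", "src in SGB"):     "conjunction(X, 1/2)",   # R1b-1
    ("to_port(N)", "icmp"):       "conjunction(X, 2/2)",   # R1b-2
    ("conj_id=X", "to_port(N)"):  "commit CT + output N",  # R1b-3
    ("net(K)", "src in SGD"):     "conjunction(Y, 1/2)",   # R2b-1
    ("to_port(K)", "icmp"):       "conjunction(Y, 2/2)",   # R2b-2
    ("conj_id=Y", "to_port(K)"):  "commit CT + output K",  # R2b-3
}

# The restarted agent re-processes SGFlow1 first, now under conj_id Y, and
# installs R1a-1/2 before R1a-3:
table_82[("net(N)", "src in SGB")] = "conjunction(Y, 1/2)"  # R1a-1
table_82[("to_port(N)", "icmp")]   = "conjunction(Y, 2/2)"  # R1a-2

# Intermediate state: SGFlow1 packets now yield conj_id=Y, but the only
# conj_id=Y rule still present is R2b-3 (to_port(K)). Nothing outputs SGFlow1
# traffic to port N, so such a packet falls through to the low-priority rule
# and the whole connection gets ct_mark=0x1, which outlives this window.
assert ("conj_id=Y", "to_port(N)") not in table_82

# Only once R1a-3 lands does an accept path for SGFlow1 exist again:
table_82[("conj_id=Y", "to_port(N)")] = "commit CT + output N"  # R1a-3
```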
After these potential races internal to the processing of SGFlow1, let's look at the intermediate state we have before the agent starts to add rules for SGFlow2 (R1a-* + R2b-* + R1b-3, which hasn't been replaced). Do the matches of the rules overlap?

- R..-1: the destination ports would have to be on the same net (Net(N) == Net(K)) + the same IP in the remote SG => possible if remote SGB and SGD overlap => in that case R1a-1 has overwritten R2b-1 and conj(Y) 1/2 is computed based on SGFlow1
- R..-2: same port (MAC) and same ethertype => unlikely but not impossible to have the same MAC
- R..-3: no collision seems possible unless K == N

... what is the effect of a collision of R1a-1 and R2b-1? The SGFlow1 remote SG matching has replaced the SGFlow2 remote SG matching in the computation of Y. But the mapping of SGFlow2 to Y still exists for 2/2, so SGFlow1 remote IP (1/2) + SGFlow2 destination port (K) + protocol match still works. Traffic that was selected thanks to R2b-1 (remote IP in SGD) is not accepted anymore (for any remote IP present in SGD but not in SGB), if these flows apply to the same Neutron network. Here my analysis really is incomplete, due to the fact that when a remote SG IP is in multiple SGs, it is mapped to multiple conjunctions.

Now, even with no overlap... The new rules for conj(Y) select traffic for SGFlow1, and based on this R1a-3 forwards to N (OK) and R2b-3 forwards to K, which would be a firewall violation, although it seems that to leverage this to inject traffic the destination MAC would have to be tailored.

The last questions, which I did not dig into at all, are:

- does something interesting happen when flow rules for SGFlow2 are added?
- are inconsistencies possible when flow rules for both SGFlow1 and SGFlow2 have been added, but before the old flow rules with old cookies have been purged?

## Conclusion

My impression is that the behavior of the OVS firewall in the neutron agent during a restart is very hard to predict, because nothing prevents a given conjunction ID from being reused for different SGs, creating interference between the old rules and the new rules during the window where they co-exist.

The inconsistency may perhaps exist only for a short duration, but due to how connection tracking is done, the effects of a packet hitting the pipeline at a moment where rules are inconsistent can persist beyond the short time where the rules are inconsistent.

It is also worth noting that we've hit a case where the duration of the restart (30s) and the number of flows (6k) are large enough to have created these conditions.