inconsistencies in OVS firewall on an agent restart

Bug #1934917 reported by Thomas Morin
Affects                       Status        Importance  Assigned to  Milestone
OpenStack Security Advisory   Won't Fix     Undecided   Unassigned
neutron                       Fix Released  High        Unassigned

Bug Description

#### Summary

On a pre-production OpenStack deployment, we observed the following during a restart of neutron-openvswitch-agent: some active flows that the OVS firewall was letting through based on SG rules before the restart became marked as CT_MARK(CT_MARK_INVALID); their traffic is then dropped for a period of time that extends beyond the restart. The clearing of rules with the previous cookie does not resolve the issue.

Digging into this issue led me to consider the hypothesis that during a restart, where the neutron OVS agent adds new rules with a new cookie and ultimately removes the rules from the previous run not marked with the newer cookie, the assumption that the new rules do not interfere with the old ones is broken.

Looking at how conjunction IDs are used has led me to see that:
A) the code offers no guarantee that, on a restart, a conjunction ID used for some SG rule in the previous run does not end up being used for some other SG rule on the next run
B) in a case where there is an unfortunate collision (same conj_id used for two different SGs over a restart), the way OVS rules are replaced leaves room for race conditions resulting in either legitimate traffic being dropped or illegitimate traffic being accepted

(B), in its "legitimate traffic dropped" form, matches the issue as we saw it on the deployment, as do the restricted conditions under which (B) would occur.

This bug report first provides details on the operational issue, but, independently of the analysis of that particular case,
the design issue in the neutron agent described in the second part is what this bug report is really about.

Slawek and Rodolfo have already been exposed to the details explained here.

## Details on the issue observed on our deployment

# Context:

- Queens (RH OSP13 containers)
- we focus on two compute nodes where VMs run to form a cluster (one VM per compute)
- SG rules are in place to allow traffic to said VMs (more on this below)
- TCP traffic

# Reproduction attempt

neutron OVS agent restarted at 11:41:35 on hyp1001

    traffic is impacted (cluster healthcheck failure in the application that runs in the VM)
    (We hadn't taken many traces for this step; we were only checking that reloading the OVS agent with debug logs was working as intended.)

neutron OVS agent restarted at 12:28:35 on hyp1001

    no impact on cluster traffic

neutron OVS agent restarted at 12:34:48 on hyp12003

    VM impacted starting from "12:35:12" (24s after start of new agent)

What follows is about this second occurrence where traffic was impacted.

extract from VM logs (redacted):

2021-04-28 12:35:22,769 WARN messages lost for 10.1s
2021-04-28 12:35:32,775 WARN messages lost for 20.0s

When the neutron OVS agent restarts, it is supposed to (a) redefine all the OVS rules, and then (b) remove the ones that existed before its startup.
The time at which the issue starts (12:35:12) is between (a) and (b).
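
To make (a)/(b) concrete: each agent run stamps the flows it installs with a per-run cookie, and the stale-flow cleanup deletes whatever still carries an older cookie. Below is a minimal, self-contained toy of that lifecycle; this is only an illustration, not the neutron code, and the flow matches and cookie values are made up:

    # Toy model of the restart sequence: (a) install the new generation of
    # flows stamped with a new cookie, then (b) purge flows still stamped
    # with an older cookie.  Between (a) and (b) both generations coexist,
    # which is the window where the observed issue starts.
    def restart(flow_table, new_flows, new_cookie):
        # (a) redefine the rules; an identical match overwrites the old flow
        for match, action in new_flows.items():
            flow_table[match] = (new_cookie, action)
        # (b) remove everything left over from the previous run
        for match, (cookie, _) in list(flow_table.items()):
            if cookie != new_cookie:
                del flow_table[match]

    flows = {"conj_id=38,reg5=0x8": ("old-cookie", "accept"),
             "nw_dst=10.94.128.196": ("old-cookie", "conjunction(38,1/2)")}
    restart(flows,
            {"conj_id=86,reg5=0x8": "accept",
             "nw_dst=10.94.128.196": "conjunction(86,1/2)"},
            "new-cookie")
    print(flows)  # only flows stamped with "new-cookie" remain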

Time of (b) events for the different OVS bridges:

2021-04-28T12:35:29 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Cleaning stale br-int flows
2021-04-28T12:35:32 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Cleaning stale br-ovs-cp-bond2 flows
2021-04-28T12:35:32 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Cleaning stale br-tun flows

Here is the ovs-dpctl dump-flows trace where we see the flows being marked as CT_MARK_INVALID

> ---------- dump_flows_i1-hyp12003.mgmt.oaas_2021-04-28T12:35:16+0000_dpctl.dropped-flows ------------
> --- dump_flows_i1-hyp12003.mgmt.oaas_2021-04-28T12:35:11+0000_dpctl.dropped-flows 2021-05-04 11:08:21.451676203 +0200
> +++ dump_flows_i1-hyp12003.mgmt.oaas_2021-04-28T12:35:16+0000_dpctl.dropped-flows 2021-05-04 11:08:21.455676231 +0200
> @@ -4,6 +4,7 @@
> recirc_id(0),tunnel(tun_id=0x40,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x43,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x44,src=172.11.156.20,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x4e,src=172.11.156.14,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x50,src=172.11.156.20,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0),tunnel(tun_id=0x52,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> @@ -18,3 +19,5 @@
> recirc_id(0),tunnel(tun_id=0xb,src=172.11.156.20,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),eth_type(0x0800),ipv4(frag=no), actions:drop
> recirc_id(0x6d330),tunnel(tun_id=0x73,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),ct_state(+new-est-rel-rpl-inv+trk),ct_zone(0x1),ct_mark(0),eth_type(0x0800),ipv4(src=10.92.0.0/255.254.0.0,proto=6,frag=no),tcp(dst=32768/0x8000), flags:S, actions:drop
> recirc_id(0x6d330),tunnel(tun_id=0x73,src=172.11.156.21,dst=172.11.156.24,flags(-df-csum+key)),in_port(2),ct_state(-new-est-rel-rpl+inv+trk),eth_type(0x0800),ipv4(frag=no), flags:R., actions:drop
> +recirc_id(0xa5030),in_port(4),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x1),eth_type(0x0800),ipv4(frag=no), flags:P., actions:drop
> +recirc_id(0xa503e),in_port(5),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x1),eth_type(0x0800),ipv4(frag=no), flags:P., actions:drop
>
>
> See the last two dpctl dropped flows that appear between 12:35:11 and 12:35:16 ^^

> Details:
> grep a503[0e] dump_flows_i1-hyp12003.mgmt._2021-04-28T12:35:32+0000_dpctl
> recirc_id(0),in_port(5),ct_state(-trk),eth(src=44:ec:ce:33:3f:46,dst=fa:16:3e:ba:2b:4e),eth_type(0x8100),vlan(vid=3103,pcp=0),encap(eth_type(0x0800),ipv4(proto=6,frag=no)), packets:13, bytes:1923, used:7.456s, flags:P., actions:pop_vlan,ct(zone=5),recirc(0xa503e)
> recirc_id(0),in_port(4),ct_state(-trk),eth(src=fa:16:3e:53:6e:1c,dst=fa:16:3e:ba:2b:4e),eth_type(0x8100),vlan(vid=3103,pcp=0),encap(eth_type(0x0800),ipv4(proto=6,frag=no)), packets:5886, bytes:724376, used:7.112s, flags:P., actions:pop_vlan,ct(zone=5),recirc(0xa5030)
> recirc_id(0xa5030),in_port(4),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x1),eth_type(0x0800),ipv4(frag=no), packets:14, bytes:6045, used:7.112s, flags:P., actions:drop

I have dug a bit more, and it appears that the conjunction used to match the source IP against our remote SG of interest changes precisely at the time when the packet drops happen.

**Before restart**, our remote group uses conjunction id 38 (this example focuses on egress direction):

    R1: table=72, priority=73,ct_state=+est-rel-rpl,ip,reg6=0x5,nw_dst=10.94.128.196 actions=conjunction(38,1/2) cookie=0x3321e58c437bf4aa

(same for many other 10.94.x.y IPs)

This conjunction 38 is used to accept traffic for the port:

    R2: table=72, priority=73,conj_id=38,ct_state=+est-rel-rpl,ip,reg5=0x8 actions=load:0x26->NXM_NX_REG7[],resubmit(,73) cookie=0x3321e58c437bf4aa

(file is dump_flows_i1-hyp12003.mgmt._2021-04-28T12:35:11+0000_ofctl_br-int)

**After the restart**, around the time when the drops start, we see the above rule replaced by one setting conjunction 86:

    R3: table=72, priority=73,ct_state=+est-rel-rpl,ip,reg6=0x5,nw_dst=10.94.128.196 actions=conjunction(86,1/2) cookie=0xecd4cf28852563cf

The action to accept traffic on the port is also present:

    R4: table=72, priority=73,conj_id=86,ct_state=+est-rel-rpl,ip,reg5=0x8 actions=load:0x56->NXM_NX_REG7[],resubmit(,73) cookie=0xecd4cf28852563cf

(file is dump_flows_i1-hyp12003.mgmt._2021-04-28T12:35:16+0000_ofctl_br-int)

## Comments on this operational issue

The setup we have uses a lot of conjunction rules for SGs, due to the use of a large range of TCP ports in a security group rule (such a range is broken into multiple port/mask pairs in the OVS flow rules, which show up as many OVS conjunction flows).
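
As an aside, the dpctl dump above already shows such a match, tcp(dst=32768/0x8000): one port/mask pair covering a whole power-of-two range. Ranges with less convenient boundaries decompose into several pairs, each becoming its own flow, which is how a single SG rule with a wide port range inflates the flow count. A quick check (illustration only) of what one value/mask pair covers:

    # Count which destination ports a single value/mask pair matches.
    value, mask = 0x8000, 0x8000          # the tcp(dst=32768/0x8000) match above
    covered = [port for port in range(65536) if (port & mask) == value]
    print(min(covered), max(covered), len(covered))   # 32768 65535 32768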

Given that the traffic is dropped based on ct_mark(0x1), I speculate that, because the transition of the rules from conjunction 38 to conjunction 86 is not atomic (if I'm right, nothing in the OVS firewall design guarantees that, and I see batches of 100 packets mentioned in ovs-vswitchd.log), there could be a time window where R3 has replaced R1 but R4 is not in place yet. Any packet of our flow hitting the pipeline at this point would not match R2, and would then hit the rule at the bottom of table 72 that marks the whole flow (including future packets) as invalid:

     table=72, priority=40,ct_state=+est,ip,reg5=0x8 actions=ct(commit,zone=NXM_NX_REG6[0..15],exec(load:0x1->NXM_NX_CT_MARK[])) cookie=0xecd4cf28852563cf

From then on, of course, even if R4 is set up later, the traffic is not accepted anymore.
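
To make the hypothesized window concrete, here is a small self-contained toy (not the real OVS pipeline) of table 72 during the transition: with R3 installed but R4 not yet, an established packet matches neither accept pair and falls through to the bottom rule that sets ct_mark=1; once that happens, installing R4 later does not help, because every subsequent packet of the connection is dropped on the ct_mark(0x1) check.

    # Toy model of the table 72 race window (illustration only).
    conntrack = {}                       # connection -> ct_mark

    def process(conn, rules):
        """Fate of one established packet of 'conn' given the installed rules."""
        if conntrack.get(conn) == 1:
            return "dropped (ct_mark=1)"
        if {"R1", "R2"} <= rules or {"R3", "R4"} <= rules:
            return "accepted"            # a complete conjunction pair matched
        conntrack[conn] = 1              # bottom-of-table rule: commit with ct_mark=1
        return "dropped and marked invalid"

    print(process("vm-flow", {"R1", "R2"}))   # before restart: accepted
    print(process("vm-flow", {"R3", "R2"}))   # R3 replaced R1, R4 missing: marked invalid
    print(process("vm-flow", {"R3", "R4"}))   # R4 installed, too late: dropped (ct_mark=1)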

The fact that the traffic interruption isn't observed on a second restart is compatible with the idea that a "conjunction ID collision" could be the cause of our issue: indeed, on a second restart done a short time after the first one, the conjunction ID for a given SG flow is likely to be the same as in the previous run.

## Rough analysis of the openvswitch firewall design and the conditions under which it may break

What follows are my personal notes from a quick analysis of the races that may exist in the code in relation to conjunction ID reuse across restarts.
My apologies for the quality of the write-up; I opted to post these notes rather than postpone this to an unknown future date.

Note that I haven't dug into the code to see which of the possible issues above can actually happen; that depends on the order in which the openvswitch agent code updates the flow rules.

This analysis really isn't thorough and may be wrong in some places. Hopefully the existence of any mistake will serve as an illustration that the reuse of conjunction IDs across restarts raises difficult questions... ;)

# Notations used:

SGFlow = traffic allowed between two SGs in a given direction (on a given ethertype),
i.e. the (sg_id, remote_sg_id, direction, ethertype) tuples used by ConjIdMap

SGFlow1 = (SGA, SGB, ingress, icmp)
SGFlow2 = (SGC, SGD, ingress, icmp)

Looking at a single compute node...

Compute has a local port N, with MAC = MAC(N), on network Net(N)
N belongs to SGA

conj_id X is allocated by ConjIdMap to SGFlow1 (SGA, SGB, ingress, icmp)

OVS rules for this SGFlow1 are R1b-1, R1b-2 and R1b-3 (R1 for SGFlow1, 'b' as in 'before restart'):
- R1b-1: reg_net = Net(N), ip_src in SGB => set conjunction(X, 1/2)
- R1b-2: reg_port = to_port(N), icmp => set conjunction(X, 2/2)
- R1b-3: conj(X), to_port(N) => CT + output to N

For instance (here conj_id X==11):
R1b-1: cookie=0x5403be965377d469, duration=9682.923s, table=82, n_packets=0, n_bytes=0, idle_age=9797, priority=71,ct_state=+new-est,ip,reg6=0x1,nw_src=10.1.1.236 actions=conjunction(11,1/2)
R1b-2: cookie=0x5403be965377d469, duration=9662.260s, table=82, n_packets=0, n_bytes=0, idle_age=9797, priority=71,ct_state=+new-est,icmp,reg5=0x6 actions=conjunction(11,2/2)
R1b-3: cookie=0x5403be965377d469, duration=9662.260s, table=82, n_packets=2, n_bytes=196, idle_age=9612, priority=71,conj_id=11,ct_state=+new-est,ip,reg5=0x6 actions=load:0xb->NXM_NX_REG7[],ct(commit,zone=NXM_NX_REG6[0..15]),output:6,resubmit(,92)

Compute also has a local port K with MAC = MAC(K), on network Net(K)
K belongs to SGC

conj_id Y allocated by ConjIdMap to SGFlow2 (SGC, SGD, ingress, icmp)

OVS rules:
R2b-1: reg_net = Net(K), ip_src in SGD => set conjunction(Y, 1/2)
R2b-2: reg_port = to_port(K), icmp => set conjunction(Y, 2/2)
R2b-3: conj(Y), to_port(K) => CT + output to K

** Let's now assume that the OVS agent is restarted **

[Scenario 1]
We assume that nothing has been done about port K and SGFlow2 at this point.

SGFlow1 processing starts for port N:
conj_id *Y* allocated by ConjIdMap to SGFlow1
(case of a reuse of a given conjunction id for two different flows, here SGFlow1 gets assigned the conjunction id that was the conj_id of SGFlow2 before the restart)
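
This reuse is possible because the allocator state is purely in-memory and rebuilt from scratch on every run, so the ID a given SGFlow gets depends only on processing order. A deliberately simplified allocator (not the actual ConjIdMap code, just an illustration of the property) shows it:

    # Simplified stand-in for a conjunction-ID allocator: sequential IDs,
    # nothing persisted across restarts (illustration only).
    class ToyConjIdAllocator:
        def __init__(self):
            self._next = 0
            self._ids = {}

        def get(self, sg_flow):   # sg_flow = (sg_id, remote_sg_id, direction, ethertype)
            if sg_flow not in self._ids:
                self._ids[sg_flow] = self._next
                self._next += 1
            return self._ids[sg_flow]

    SGFlow1 = ("SGA", "SGB", "ingress", "icmp")
    SGFlow2 = ("SGC", "SGD", "ingress", "icmp")

    before = ToyConjIdAllocator()
    x = before.get(SGFlow1)              # conj_id X
    y = before.get(SGFlow2)              # conj_id Y

    after = ToyConjIdAllocator()         # fresh instance after the agent restart
    after.get(("SGE", "SGF", "ingress", "icmp"))   # some other flow processed first
    assert after.get(SGFlow1) == y       # SGFlow1 now gets SGFlow2's old conj_id Y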

new OVS rules ('a' as in 'after restart'):
R1a-1: reg_net = Net(N), ip_src (in SGB) => set conjunction(Y, 1/2)
R1a-2: reg_port = to_port(N), icmp => set conjunction(Y, 2/2)
R1a-3: conj(Y), to_port(N) => CT + output to N

if R1a-1/2 are set up before R1a-3, we have:
* R1a-1/R1a-2 mapping SGFlow1 to conjunction Y _replacing_ R1b-1/R1b-2
* R2b-3, mapping conj(Y), to_port(K) to K (i.e. the old, wrong port), is still here
* Hence no flow maps SGFlow1 to port N anymore -> any packet processed now would be dropped and the flow marked invalid (unless port K == port N)
* after R1a-3 is set up: SGFlow1 -> port N, and traffic can flow again (well, unless the whole connection has already been marked invalid)

If R1a-3 is set up before R1a-1/2 are, then until R1a-1/2 are set up we would have SGFlow2 mapped to Y (old state prior to the restart), and rules in place so that traffic matching "to_port(N) + conj Y" is sent to port N.
In the general case, traffic matching SGFlow2 will not also match the to_port(N) conditions, but this is messy to analyze further and be sure that no illegitimate traffic can go through.

After these potential races internal to the processing of SGFlow1, let's look at the intermediate state we have before the agent starts to add rules for SGFlow2.

Do the matches of the rules overlap? (R1a-*, R2b-*, plus R1b-3 which hasn't been replaced)
- R..-1: the destination ports would have to be on the same net (net(N) = net(K)) + same IP in the remote SG => possible if remote SGB and SGD overlap
   => in that case R1a-1 has overwritten R2b-1 and conj(Y) 1/2 is computed based on SGFlow1
- R..-2: same port (MAC) and same ethertype => unlikely but not impossible to have the same MAC
- R..-3: no collision seems possible unless K == N

... what is the effect of a collision between R1a-1 and R2b-1?

SGFlow1's remote SG match has replaced SGFlow2's remote SG match in the computation of Y.
But the mapping of SGFlow2 to Y still exists for 2/2, so SGFlow1 remote IPs (1/2) + SGFlow2's dest port (K) and protocol match (2/2) still work together.
Traffic that was selected thanks to R2b-1 (remote IP in SGD) is not accepted anymore (for any remote IP present in SGD but not in SGB), if these flows apply to the same Neutron network.

Here my analysis really is incomplete, because when a remote SG IP is in multiple SGs, it is mapped to multiple conjunctions.

Now, even with no overlap...

New rules for conj(Y) select traffic for SGFlow1, and based on this R1a-3 forwards to N (ok) and R2b-3 forwards to K, which would be a firewall violation, although it seems that to leverage this to inject traffic the destination MAC would have to be tailored.

The last questions, which I did not dig into at all, are:
• does something interesting happen when the flow rules for SGFlow2 are added?
• are inconsistencies possible when the flow rules for both SGFlow1 and SGFlow2 have been added, but before the old flow rules with old cookies have been purged?

## Conclusion

My impression is that the behavior of the OVS firewall in the neutron agent during a restart is very hard to predict, because nothing prevents a given conjunction ID from being reused for a different SG, creating interference between the old rules and the new rules during the window where they co-exist. The inconsistency may exist only for a short duration, but due to how connection tracking is done, the effects of a packet hitting the pipeline at a moment when the rules are inconsistent can persist beyond that short window. It is also worth noting that we hit a case where the duration of the restart (30s) and the number of flows (6k) were large enough to create these conditions.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Since this report concerns a possible security risk, an incomplete
security advisory task has been added while the core security
reviewers for the affected project or projects confirm the bug and
discuss the scope of any vulnerability along with potential
solutions.

description: updated
Changed in ossa:
status: New → Incomplete
Revision history for this message
Jeremy Stanley (fungi) wrote :

What I'm missing here to be able to determine whether this represents an exploitable security vulnerability is some indication of how this would be caused or predicted by an attacker. All the scenarios you describe seem to result in additional traffic being blocked or established connections getting dropped, so at most this could represent a denial of service condition, I suppose, assuming an unprivileged or otherwise unintended party can force it to occur.

If this doesn't seem to be directly exploitable, we should switch the report to public at the earliest opportunity.

Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

I won't try to do what would be a half-baked assessment.

Looking at how the code behaves and at the lack of structural guarantees around the reuse of conjunction IDs, in addition to the possibility of legitimate traffic being dropped (which I would say is established given what we saw on this deployment), the code is written in a way that seems to leave open the possibility of illegitimate traffic being accepted.

In the rough and incomplete analysis above, I mention cases where I think this may happen:

> If R1a-3 is set up before R1a-1/2 are, then until R1a-1/2 are set up we would have SGFlow2 mapped to Y (old state prior to the restart), and rules in place so that traffic matching "to_port(N) + conj Y" is sent to port N.
In the general case, traffic matching SGFlow2 will not also match the to_port(N) conditions, but this is messy to analyze further and be sure that no illegitimate traffic can go through.

> SGFlow1's remote SG match has replaced SGFlow2's remote SG match in the computation of Y.

> New rules for conj(Y) select traffic for SGFlow1, and based on this R1a-3 forwards to N (ok) and R2b-3 forwards to K, which would be a firewall violation, although it seems that to leverage this to inject traffic the destination MAC would have to be tailored.

Again, my analysis is incomplete; digging further is needed and may lead to discovering more cases where firewall violations may happen, or to concluding that, despite the lack of structural guarantees, the code avoids the problematic cases.

(e.g. I didn't dig into the possible intermediate state where traffic would hit the flow rules after R1a-1 has been set up but before R1a-2 has been set up - this combination would lead to traffic being accepted based on a wrong source IP classification, quite an obvious case of accepting illegitimate traffic -- but maybe the code does the rule insertion in an order that prevents this from happening.)

There are additional conditions that would certainly need to be met for an attacker to exploit this:
* be sending traffic when the OVS agent restart occurs (always sending traffic makes it possible to meet this condition)
* hit a race condition (an attacker may "encourage" this by using lots of SGs, or SGs with large port ranges, to increase the number of OVS flow rules and increase the restart time)

So, while none of what I saw leads me to conclude that this is easy to exploit, I would say that without a deeper analysis it's not easy to determine in which cases an attacker could manage to have illegitimate traffic accepted.

Revision history for this message
Miguel Lavalle (minsel) wrote :

I went over this bug report. I agree with Thomas that we need to take a closer look. I am adding Rodolfo Alonso to the bug; he is the best person to deal with the firewall.

Miguel Lavalle (minsel)
Changed in neutron:
importance: Undecided → High
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

First, a bit of context. When the OVS agent is restarted, the conjunction IDs are not stored (same as, for example, the internal VLAN IDs for the different networks); these are internal variables.

When a port SG is updated (for example during the OVS agent restart), the former rules are overwritten. This is because OVS cannot have two rules with the same condition. That means we can change the conjunction ID of a port rule during the OVS agent restart. E.g.:

(conj_id=1008)
 cookie=0x1, duration=6299.950s, table=82, n_packets=0, n_bytes=0, idle_age=6434, priority=70,conj_id=1008,ct_state=+est-rel-rpl,ip,reg5=0xc actions=load:0x3f0->NXM_NX_REG7[],output:12

will be overwritten by:

(conj_id=2008)
 cookie=0x2, duration=7000.000s, table=82, n_packets=0, n_bytes=0, idle_age=6434, priority=70,conj_id=2008,ct_state=+est-rel-rpl,ip,reg5=0xc actions=load:0x3f0->NXM_NX_REG7[],output:12

However, as exposed by Thomas, if a port SG rule receives a new conjunction ID matching a previous one and (1) the match rules (those matching the traffic parameters) are set and (2) the action rules (those matching the conjunction ID) are not, then we'll have the issue presented in this bug. In other words: we'll match for traffic A and we'll apply a rule for traffic B.

We can do two actions (both independently will help):
1) When (re)starting the OVS agent, dump the flows of br-int, parse all rules looking for the biggest conjunction ID, and limit the minimum new conjunction ID provided by this OVS agent instance to a larger number (a rough sketch of this idea is given right after this list). Of course, we should consider the maximum conjunction ID number (unsigned 32 bit [1]). The method generating the conjunction IDs should take care of it, same as the method calculating the minimum conjunction ID of the instance (based on the current OF rules).

2) As Thomas commented, the flows are written in batches [2]. The maximum number of flows written at once is 100. If the traffic match flows and the action flows (the ones sending the traffic to the correct port) are written in different batches, we can end up in the same scenario. All flows related to one port should be written in the same batch, regardless of the number of flows.
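
A rough sketch of idea (1), just to illustrate the approach; the flow source passed in is an assumed callable returning flow strings, and this is not the patch itself:

    import re

    CONJ_ID_MAX = 2 ** 32 - 1    # conjunction IDs are unsigned 32 bit

    def initial_conj_id(dump_flows):
        """Pick a starting conjunction ID above anything already installed.

        'dump_flows' is an assumed callable returning the current flow
        strings of br-int (illustration only, not the neutron code).
        """
        max_seen = 0
        for flow in dump_flows("br-int"):
            found = re.search(r"conj_id=(\d+)", flow)
            if found:
                max_seen = max(max_seen, int(found.group(1)))
        return min(max_seen + 1, CONJ_ID_MAX)

    # Example with the two conj_id values seen in this report:
    flows = ["table=72, priority=73,conj_id=38,ct_state=+est-rel-rpl,ip,reg5=0x8 ...",
             "table=72, priority=73,conj_id=86,ct_state=+est-rel-rpl,ip,reg5=0x8 ..."]
    print(initial_conj_id(lambda bridge: flows))   # -> 87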

About the embargo of this bug: it is very unlikely that a VM user could use this behaviour to inject traffic to another destination. The user cannot know when the OVS agent is restarted nor, if the issue is hit, what the destination is. I suggest making this bug public.

Regards.

[1]https://www.openvswitch.org/support/dist-docs-2.5/ovs-ofctl.8.html
[2]https://github.com/openstack/neutron/blob/0ccfed0ae13182f820e6a8c11a2fa801506f3a3a/neutron/agent/common/ovs_lib.py#L471

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

I've pushed the first patch (without referring to this bug): https://review.opendev.org/c/openstack/neutron/+/804236

Next one will be a bit more complex.

BTW, again I suggest making this bug public.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Rodolfo: Thanks for working on this one, and for the details. Based on your analysis, I'll also suggest that this probably does not meet the risk level necessary to warrant publishing a security advisory. I propose that we consider it class C1 (an impractical vulnerability) per our report taxonomy: https://security.openstack.org/vmt-process.html#report-taxonomy

Accordingly, I'm switching it to a regular Public bug, but we can continue to discuss publicly whether this might warrant an advisory once (and if) fixes can be backported to maintained stable branches (so at least as far back as stable/ussuri currently).

description: updated
information type: Private Security → Public
tags: added: security
Changed in ossa:
status: Incomplete → Won't Fix
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/806246

Changed in neutron:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/804236
Committed: https://opendev.org/openstack/neutron/commit/0634dcc6d0f08c18c69a2c360a2c5c0581ec7bb6
Submitter: "Zuul (22348)"
Branch: master

commit 0634dcc6d0f08c18c69a2c360a2c5c0581ec7bb6
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Aug 11 12:55:40 2021 +0000

    [OVS][FW] Initialize ConjIdMap._max_id depending on the current OFs

    The OVS Firewall has a singleton class that manages the conjuction IDs
    to used in the OF rules. Those conjuntions are used to group rules
    related to remote security group IDs.

    Now each time the OVS agent is started, the OVS Firewall initial
    conjunction ID is calculated based on the present OF rules. This value
    and the next one used won't clash with any present rule in the
    integration bridge during the initial transient period.

    Related-Bug: #1934917

    Change-Id: Ie2e4441f766947a2164dec2d1555c7049428903f

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/819439

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/819440

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/819441

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/819442

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/neutron/+/819443

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/819439
Committed: https://opendev.org/openstack/neutron/commit/d49797b1f4fd4b032403debfc6eb04801e78a5be
Submitter: "Zuul (22348)"
Branch: stable/xena

commit d49797b1f4fd4b032403debfc6eb04801e78a5be
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Aug 11 12:55:40 2021 +0000

    [OVS][FW] Initialize ConjIdMap._max_id depending on the current OFs

    The OVS Firewall has a singleton class that manages the conjuction IDs
    to used in the OF rules. Those conjuntions are used to group rules
    related to remote security group IDs.

    Now each time the OVS agent is started, the OVS Firewall initial
    conjunction ID is calculated based on the present OF rules. This value
    and the next one used won't clash with any present rule in the
    integration bridge during the initial transient period.

    Related-Bug: #1934917

    Change-Id: Ie2e4441f766947a2164dec2d1555c7049428903f
    (cherry picked from commit 0634dcc6d0f08c18c69a2c360a2c5c0581ec7bb6)

tags: added: in-stable-xena
tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/819440
Committed: https://opendev.org/openstack/neutron/commit/4baa225e75ba2ac543f893204477c897d21ae390
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 4baa225e75ba2ac543f893204477c897d21ae390
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Aug 11 12:55:40 2021 +0000

    [OVS][FW] Initialize ConjIdMap._max_id depending on the current OFs

    The OVS Firewall has a singleton class that manages the conjuction IDs
    to used in the OF rules. Those conjuntions are used to group rules
    related to remote security group IDs.

    Now each time the OVS agent is started, the OVS Firewall initial
    conjunction ID is calculated based on the present OF rules. This value
    and the next one used won't clash with any present rule in the
    integration bridge during the initial transient period.

    Related-Bug: #1934917

    Change-Id: Ie2e4441f766947a2164dec2d1555c7049428903f
    (cherry picked from commit 0634dcc6d0f08c18c69a2c360a2c5c0581ec7bb6)

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/819441
Committed: https://opendev.org/openstack/neutron/commit/e7d4b41cf3d3495443afb4e7c4804a9b5732c297
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit e7d4b41cf3d3495443afb4e7c4804a9b5732c297
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Aug 11 12:55:40 2021 +0000

    [OVS][FW] Initialize ConjIdMap._max_id depending on the current OFs

    The OVS Firewall has a singleton class that manages the conjuction IDs
    to used in the OF rules. Those conjuntions are used to group rules
    related to remote security group IDs.

    Now each time the OVS agent is started, the OVS Firewall initial
    conjunction ID is calculated based on the present OF rules. This value
    and the next one used won't clash with any present rule in the
    integration bridge during the initial transient period.

    Related-Bug: #1934917

    Change-Id: Ie2e4441f766947a2164dec2d1555c7049428903f
    (cherry picked from commit 0634dcc6d0f08c18c69a2c360a2c5c0581ec7bb6)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/819442
Committed: https://opendev.org/openstack/neutron/commit/cfc77635060b8b09944df02c7039dc4a708a76f4
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit cfc77635060b8b09944df02c7039dc4a708a76f4
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Aug 11 12:55:40 2021 +0000

    [OVS][FW] Initialize ConjIdMap._max_id depending on the current OFs

    The OVS Firewall has a singleton class that manages the conjuction IDs
    to used in the OF rules. Those conjuntions are used to group rules
    related to remote security group IDs.

    Now each time the OVS agent is started, the OVS Firewall initial
    conjunction ID is calculated based on the present OF rules. This value
    and the next one used won't clash with any present rule in the
    integration bridge during the initial transient period.

    Related-Bug: #1934917

    Change-Id: Ie2e4441f766947a2164dec2d1555c7049428903f
    (cherry picked from commit 0634dcc6d0f08c18c69a2c360a2c5c0581ec7bb6)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/train)

Change abandoned by "Rodolfo Alonso <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/neutron/+/819443

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/837726

Revision history for this message
Krzysztof Tomaszewski (labedz) wrote (last edit ):

In parallel to Rodolfo's fix https://review.opendev.org/c/openstack/neutron/+/806246,
maybe we can try to fix the race condition between applying deferred flows and
removing conntrack entries?

That would fix the issue of having bad conntrack entries for
invalid connections created while the OpenFlow batches are applied.

Something like: https://review.opendev.org/c/openstack/neutron/+/837726

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/819443
Committed: https://opendev.org/openstack/neutron/commit/57b63954c6fab419bdb5cf493b2d9ce63e9227e6
Submitter: "Zuul (22348)"
Branch: stable/train

commit 57b63954c6fab419bdb5cf493b2d9ce63e9227e6
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Aug 11 12:55:40 2021 +0000

    [OVS][FW] Initialize ConjIdMap._max_id depending on the current OFs

    The OVS Firewall has a singleton class that manages the conjuction IDs
    to used in the OF rules. Those conjuntions are used to group rules
    related to remote security group IDs.

    Now each time the OVS agent is started, the OVS Firewall initial
    conjunction ID is calculated based on the present OF rules. This value
    and the next one used won't clash with any present rule in the
    integration bridge during the initial transient period.

    Related-Bug: #1934917

    Change-Id: Ie2e4441f766947a2164dec2d1555c7049428903f
    (cherry picked from commit 0634dcc6d0f08c18c69a2c360a2c5c0581ec7bb6)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-tempest-plugin (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/840319

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/840442

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/806246
Committed: https://opendev.org/openstack/neutron/commit/ab84b7fb2b6febc9dfd9b0767be90fcb3277c192
Submitter: "Zuul (22348)"
Branch: master

commit ab84b7fb2b6febc9dfd9b0767be90fcb3277c192
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Thu Aug 26 16:54:13 2021 +0000

    Allow to process FW OF rules belonging to a port in a single operation

    This patch adds a new configuration variable to control the OVS
    OpenFlow rule processing operations:

    * ``openflow_processed_per_port``: by default "False". If enabled,
      all OpenFlow rules associated to a port will be processed at once,
      in one single transaction. If disabled, the flows will be processed
      in batches of "AGENT_RES_PROCESSING_STEP=100" number of OpenFlow
      rules.

    With ``openflow_processed_per_port`` enabled, all Firewall
    OpenFlow rules related to a port are processed in one transaction
    (executed in one single command). That ensures the rules are written
    atomically and apply all of them at the same time.

    That means all needed rules to handle the ingress and egress traffic
    of a port using the Open vSwitch Firewall, are committed in the OVS
    DB at the same time. That will prevent from partially applied OpenFlow
    sets in the Firewall and inconsistencies when applying new SG rules or
    during the OVS agent restart.

    That will override, if needed, the hard limit of
    "AGENT_RES_PROCESSING_STEP=100" OpenFlow rules that could be
    processed in OVS at once.

    If the default configuration values are not modified, the behaviour of
    the OVS library does not change.

    Closes-Bug: #1934917

    Change-Id: If4984dece266a789d607725f8497f1aac3d73d23
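
For operators, enabling the new behaviour would look roughly like this in the OVS agent configuration; the [OVS] group is the one named for this option elsewhere in this report, while the exact file (commonly openvswitch_agent.ini) depends on the deployment:

    [OVS]
    # Process all OpenFlow rules belonging to a port in a single transaction
    # instead of batches of 100 (default is False).
    openflow_processed_per_port = True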

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-tempest-plugin (master)

Reviewed: https://review.opendev.org/c/openstack/neutron-tempest-plugin/+/840319
Committed: https://opendev.org/openstack/neutron-tempest-plugin/commit/68596bed942646b8c5d78bee08c08687d9b76bb9
Submitter: "Zuul (22348)"
Branch: master

commit 68596bed942646b8c5d78bee08c08687d9b76bb9
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Tue May 3 14:51:54 2022 +0000

    [OVS] Add "openflow_processed_per_port=True" to OVS scenario

    Add "openflow_processed_per_port=True" option to
    "neutron-tempest-plugin-scenario-openvswitch" job definition. This
    option will process all OpenFlow rules related to a single port
    in a single transaction.

    Related-Bug: #1934917

    Depends-On: https://review.opendev.org/c/openstack/neutron/+/806246

    Change-Id: Ib43c66971bb0148e8c32cd5de6d85da26bc8ff9f

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/840442
Committed: https://opendev.org/openstack/neutron/commit/430c220d05308da29d9caf991f7e82005410001e
Submitter: "Zuul (22348)"
Branch: master

commit 430c220d05308da29d9caf991f7e82005410001e
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Sat Apr 23 23:03:52 2022 +0000

    [OVS] Add "Open Flow processing considerations" doc section

    Added "Open Flow processing considerations" in the OVS firewall
    driver page. This section describes the [OVS] configuration option
    "openflow_processed_per_port". A script to meassure the Open Flow
    rule processing time is provided.

    Related-Bug: #1934917

    Change-Id: Id4936b6d36abb6edc250aaf1cbc7e9af3fd0fae5

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/837726
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 21.0.0.0rc1

This issue was fixed in the openstack/neutron 21.0.0.0rc1 release candidate.
