On highly loaded systems the conntrack table can fill up with OVN connections

Bug #1978806 reported by Nikita Koltsov
Affects                               Status        Importance  Assigned to
OpenStack Neutron Open vSwitch Charm  New           Undecided   Unassigned
charm-ovn-chassis                     Fix Released  Undecided   Mustafa Kemal Gilor
charm-ovn-dedicated-chassis           Fix Released  Undecided   Mustafa Kemal Gilor

Bug Description

When loaded, the conntrack kernel module tracks state for all connections on the node.
On highly loaded OVN networks this can fill the conntrack table and lead to packet loss.

It would be good to add an option to stop tracking OVN tunnel connections.
This could be achieved by adding the following iptables rules:
iptables -t raw -A PREROUTING -p udp --dport 6081 -j NOTRACK
iptables -t raw -A OUTPUT -p udp --dport 6081 -j NOTRACK
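
To verify that the rules take effect (a quick check, assuming the conntrack CLI from the conntrack-tools package is available), the rule counters should grow while no new GENEVE entries appear in the table:

iptables -t raw -L -n -v                  # packet counters on the NOTRACK rules should increase
conntrack -L -p udp --dport 6081          # should show no new entries for GENEVE traffic
sysctl net.netfilter.nf_conntrack_count   # overall table usage should fall as old entries expire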

Tags: sts
Revision history for this message
Trent Lloyd (lathiat) wrote :

The problem here is that because OVN loads the nf_conntrack module, the kernel starts tracking connections for all interfaces on the host even though there are zero iptables/nftables rules. These connections thus take up space in the table that is needed by OVN.
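
For reference, the table usage on an affected hypervisor can be inspected directly, e.g.:

lsmod | grep nf_conntrack                 # module pulled in by OVS/OVN
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
conntrack -L -p udp 2>/dev/null | head    # sample of the tracked UDP flows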

For normal host connections this is usually not a big problem; however, both GENEVE and VXLAN encapsulation set the SOURCE port of every outer packet to a varying port in the range 1-65534.

It assigns a static port to each "tunneled flow" by hashing the tunneled packet's (src_ip, dst_ip, src_port, dst_port) tuple, with the intent that the GENEVE/VXLAN packets will get distributed over the multiple links of an LACP or bonded link, while the same tunneled flow always gets the same source port and thus takes the same path, preserving network packet ordering. If the same source/destination port were used for everything, then all traffic between the same 2 HVs would not get distributed over LACP; and if the port were random, network packet order would not be preserved, as packets of the same connection might go over multiple links.
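
As a toy illustration of this idea (not OVS's actual hash function, and with a made-up inner 5-tuple), hashing the flow tuple and mapping the digest into the source-port range yields the same outer port for the same flow every time:

# Hypothetical inner flow tuple; gawk's strtonum() parses the hex digest.
echo "10.0.0.1 10.0.0.2 tcp 34567 443" | md5sum | \
  gawk '{ print (strtonum("0x" substr($1, 1, 4)) % 65534) + 1 }'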

Reference: https://datatracker.ietf.org/doc/html/draft-gross-geneve-01/#section-3.3

Thus when you have a very busy cloud, each hypervisor potentially has (NUMBER_OF_HYPERVISORS - 1) * 65534 conntrack entries (one per peer hypervisor per possible tunnel source port). For 40 hypervisors that is about 2.5 million entries. The default nf_conntrack_max in ovn-chassis is 1,000,000.

In practice, in production we normally saw hundreds of thousands of conntrack entries, and in excess of 1 million in some cases, leading to dropped connections once nf_conntrack_max was exceeded. This is notably worse for a public internet-facing application that talks to many different remote Internet IPs.
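
As a stopgap until tracking is disabled, the limit can be raised at the cost of extra kernel memory (the values below are only examples):

sysctl -w net.netfilter.nf_conntrack_max=2097152
# Optionally resize the hash table as well (the entries-to-buckets ratio varies by kernel):
echo 524288 > /sys/module/nf_conntrack/parameters/hashsize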

There is precedent for this change in other projects:
https://bugs.launchpad.net/tripleo/+bug/1885551
https://github.com/antrea-io/antrea/issues/1133
https://bugzilla.redhat.com/show_bug.cgi?id=1985336

Changed in charm-ovn-chassis:
status: New → Confirmed
tags: added: sts
Changed in charm-ovn-chassis:
assignee: nobody → Mustafa Kemal Gilor (mustafakemalgilor)
Changed in charm-ovn-chassis:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ovn-chassis (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/x/charm-ovn-chassis/+/854607

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ovn-chassis (master)

Change abandoned by "Mustafa Kemal Gilor <email address hidden>" on branch: master
Review: https://review.opendev.org/c/x/charm-ovn-chassis/+/854607
Reason: Will re-target the changes to base charm & submit a simpler patch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Mustafa Kemal Gilor <email address hidden>" on branch: master
Review: https://review.opendev.org/c/x/charm-ovn-chassis/+/854607
Reason: Superseded, see: https://github.com/openstack-charmers/charm-layer-ovn/pull/67

Changed in charm-ovn-dedicated-chassis:
status: New → In Progress
assignee: nobody → Mustafa Kemal Gilor (mustafakemalgilor)
Frode Nordahl (fnordahl)
Changed in charm-ovn-chassis:
status: In Progress → Fix Committed
Changed in charm-ovn-dedicated-chassis:
status: In Progress → Fix Committed
Revision history for this message
Frode Nordahl (fnordahl) wrote :

The fix is now available in the 22.03/stable channel.

Changed in charm-ovn-chassis:
status: Fix Committed → Fix Released
Changed in charm-ovn-dedicated-chassis:
status: Fix Committed → Fix Released
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Also affects the neutron-openvswitch charm for non-OVN deployments (vxlan/gre).

Revision history for this message
Brian Haley (brian-haley) wrote :

Just wanted to put a link to the change that was merged to fix this for OVN.

https://github.com/openstack-charmers/charm-layer-ovn/pull/67
