flowtable: fix TCP flow teardown

Bug #1975649 reported by Bodong Wang
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux-bluefield (Ubuntu)
Fix Released
Undecided
Unassigned

Bug Description

* Explain the feature

This patch addresses three possible problems:

1. ct gc may race to undo the timeout adjustment of the packet path, leaving
   the conntrack entry in place with the internal offload timeout (one day).

2. ct gc removes the ct because the IPS_OFFLOAD_BIT is not set and the CLOSE
   timeout is reached before the flow offload del.

3. tcp ct is always set to ESTABLISHED with a very long timeout
   in flow offload teardown/delete even though the state might be already
   CLOSED. Also as a remark we cannot assume that the FIN or RST packet
   is hitting flow table teardown as the packet might get bumped to the
   slow path in nftables.

This patch resets IPS_OFFLOAD_BIT from flow_offload_teardown(), so
conntrack handles the tcp rst/fin packet which triggers the CLOSE/FIN
state transition.

Moreover, return the connection's ownership to conntrack upon teardown
by clearing the offload flag and fixing the established timeout value.
The flow table GC thread will asynchonrnously free the flow table and
hardware offload entries.

Before this patch, the IPS_OFFLOAD_BIT remained set for expired flows on
which is also misleading since the flow is back to classic conntrack
path.

If nf_ct_delete() removes the entry from the conntrack table, then it
calls nf_ct_put() which decrements the refcnt. This is not a problem
because the flowtable holds a reference to the conntrack object from
flow_offload_alloc() path which is released via flow_offload_free().

This patch also updates nft_flow_offload to skip packets in SYN_RECV
state. Since we might miss or bump packets to slow path, we do not know
what will happen there while we are still in SYN_RECV, this patch
postpones offload up to the next packet which also aligns to the
existing behaviour in tc-ct.

flow_offload_teardown() does not reset the existing tcp state from
flow_offload_fixup_tcp() to ESTABLISHED anymore, packets bump to slow
path might have already update the state to CLOSE/FIN.

* How to test
Adding the following flows to the OVS bridge in DPU OS:
# ovs-ofctl add-flow ovsbr1 "table=0, ip,ct_state=-trk, actions=ct(table=1)"
# ovs-ofctl add-flow ovsbr1 "table=1, ip,ct_state=+new, actions=ct(commit),normal"
# ovs-ofctl add-flow ovsbr1 "table=1, ip,ct_state=-new, actions=normal"

Start netserver on SUT:
# netserver -p 5007

Start multiple TCP_CRR tests on peer:
# count=1;while [ $count -lt 10 ]; do screen -d -m netperf -t TCP_CRR -H 11.0.0.2 -l 360 -- -r 1 -O " MIN_LAETENCY, MAX_LATENCY, MEAN_LATENCY, P90_LATENCY, P99_LATENCY ,P999_LATENCY,P9999_LATENCY,STDDEV_LATENCY ,THROUGHPUT ,THROUGHPUT_UNITS "; count=`expr $count + 1`; done
A huge number of connections will be established and tear down. After the tests, some of them are not aged out:
# From /proc/net/nf_conntrack in DPU OS
ipv4 2 tcp 6 86354 LAST_ACK src=11.0.0.1 dst=11.0.0.2 sport=35862 dport=46797 src=11.0.0.2 dst=11.0.0.1 sport=46797 dport=35862 [ASSURED] mark=0 zone=0 use=2
ipv4 2 tcp 6 86354 LAST_ACK src=11.0.0.1 dst=11.0.0.2 sport=35862 dport=46797 src=11.0.0.2 dst=11.0.0.1 sport=46797 dport=35862 [ASSURED] mark=0 zone=0 use=2
ipv4 2 tcp 6 86354 LAST_ACK src=11.0.0.1 dst=11.0.0.2 sport=35862 dport=46797 src=11.0.0.2 dst=11.0.0.1 sport=46797 dport=35862 [ASSURED] mark=0 zone=0 use=2
The issue is usually reproduced after running the for several times.

* What it could break.
N/A

CVE References

Changed in linux-bluefield (Ubuntu):
status: New → Fix Committed
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-bluefield/5.4.0-1042.47 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (14.8 KiB)

This bug was fixed in the package linux-bluefield - 5.4.0-1042.47

---------------
linux-bluefield (5.4.0-1042.47) focal; urgency=medium

  * focal/linux-bluefield: 5.4.0-1042.47 -proposed tracker (LP: #1979463)

  * Focal update: upstream stable patchset v5.4.192 (LP: #1979014)
    - [Config] bluefield: update configs for NVM

  * flowtable: fix TCP flow teardown (LP: #1975649)
    - netfilter: flowtable: fix TCP flow teardown

  * remove offload_pickup sysctl again (LP: #1975820)
    - netfilter: conntrack: remove offload_pickup sysctl again

  * mlxbf_gige: remove driver-managed interrupt counts (LP: #1975749)
    - mlxbf_gige: remove driver-managed interrupt counts

  [ Ubuntu: 5.4.0-122.138 ]

  * focal/linux: 5.4.0-122.138 -proposed tracker (LP: #1979489)
  * Remove SAUCE patches from test_vxlan_under_vrf.sh in net of
    ubuntu_kernel_selftests (LP: #1975691)
    - Revert "UBUNTU: SAUCE: selftests: net: Don't fail test_vxlan_under_vrf on
      xfail"
    - Revert "UBUNTU: SAUCE: selftests: net: Make test for VXLAN underlay in non-
      default VRF an expected failure"
  * Enable Asus USB-BT500 Bluetooth dongle(0b05:190e) (LP: #1976613)
    - Bluetooth: btusb: Add flag to define wideband speech capability
    - Bluetooth: btrtl: Add support for RTL8761B
    - Bluetooth: btusb: Add 0x0b05:0x190e Realtek 8761BU (ASUS BT500) device.
  * [UBUNTU 20.04] rcu stalls with many storage key guests (LP: #1975582)
    - s390/gmap: voluntarily schedule during key setting
    - s390/mm: use non-quiescing sske for KVM switch to keyed guest
  * Ubuntu 5.4.0-117.132-generic 5.4.189 has BUG: kernel NULL pointer
    dereference, address: 0000000000000034 (LP: #1978719)
    - mm: rmap: explicitly reset vma->anon_vma in unlink_anon_vmas()
  * Focal update: upstream stable patchset v5.4.192 (LP: #1979014)
    - floppy: disable FDRAWCMD by default
    - [Config] updateconfigs for BLK_DEV_FD_RAWCMD
    - hamradio: defer 6pack kfree after unregister_netdev
    - hamradio: remove needs_free_netdev to avoid UAF
    - lightnvm: disable the subsystem
    - [Config] updateconfigs for NVM, NVM_PBLK
    - usb: mtu3: fix USB 3.0 dual-role-switch from device to host
    - USB: quirks: add a Realtek card reader
    - USB: quirks: add STRING quirk for VCOM device
    - USB: serial: whiteheat: fix heap overflow in WHITEHEAT_GET_DTR_RTS
    - USB: serial: cp210x: add PIDs for Kamstrup USB Meter Reader
    - USB: serial: option: add support for Cinterion MV32-WA/MV32-WB
    - USB: serial: option: add Telit 0x1057, 0x1058, 0x1075 compositions
    - xhci: stop polling roothubs after shutdown
    - xhci: increase usb U3 -> U0 link resume timeout from 100ms to 500ms
    - iio: dac: ad5592r: Fix the missing return value.
    - iio: dac: ad5446: Fix read_raw not returning set value
    - iio: magnetometer: ak8975: Fix the error handling in ak8975_power_on()
    - usb: misc: fix improper handling of refcount in uss720_probe()
    - usb: typec: ucsi: Fix role swapping
    - usb: gadget: uvc: Fix crash when encoding data for usb request
    - usb: gadget: configfs: clear deactivation flag in
      configfs_composite_unbind()
    - usb: dwc3: core: Fix tx/rx threshold settings
    - ...

Changed in linux-bluefield (Ubuntu):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.