Fix nf_conntrack races when dealing with same origin requests in NAT environments

Bug #1836816 reported by Matthew Ruffell on 2019-07-16
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Undecided
Unassigned
Bionic
Medium
Matthew Ruffell
Cosmic
Medium
Unassigned

Bug Description

BugLink: https://bugs.launchpad.net/bugs/1836816

[Impact]

There are a number of races in nf_conntrack when multiple packets are sent from the same origin to the same destination at the same time, where there is no pre-existing confirmed conntrack entry, in a NAT environment.

This is most frequently seen in users with kubernetes workloads, where dns
requests are sent from the same socket at the same time, from different threads. One request is for a A record, the other AAAA. A conntrack race happens, which leads to one request being dropped, which leads to 5 second dns timeouts.

This problem is not specific to kubernetes, as any multi-threaded process
which sends UDP packets from the same socket at the same time from different
threads is affected.

Note that since UDP is connectionless, no packet is sent when connect() is
called, so no conntrack entries are created until a packet is first sent.

In the scenario where two UDP packets are sent at the same time from the same
socket on different threads, there are the following possible races:

1.) Neither of the packets find a confirmed conntrack entry. This leads to two conntrack entries being created with the same tuples.
2.) Same as 1), but a conntrack entry is confirmed for one of the packets before the other calls get_unique_tuple(). The other packet gets a different reply tuple with the source port changed.

The outcomes of the races are the same, one of the packets is dropped when it
comes time to confirm conntrack entries with __nf_conntrack_confirm().

For more information, read the technical blog post:
https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts

[Fix]

The fix to race 1) was included in 4.19 upstream by this commit:

netfilter: nf_conntrack: resolve clash for matching conntracks
commit ed07d9a021df6da53456663a76999189badc432a
Author: Martynas Pumputis <email address hidden>
Date: Mon Jul 2 16:52:14 2018 +0200

This commit adds an extra check to see if the two entries are the same and to
merge the entries if they are.

The fix to race 2) was included in 5.0 upstream by this commit:

netfilter: nf_nat: skip nat clash resolution for same-origin entries
commit 4e35c1cb9460240e983a01745b5f29fe3a4d8e39
Author: Martynas Pumputis <email address hidden>
Date: Tue Jan 29 15:51:42 2019 +0100

This commit ensures that a new source port is not allocated to a duplicated
entry, and forwards the duplicated entries to be resolved later on.

Note that 4e35c1cb9460240e983a01745b5f29fe3a4d8e39 was also backported to stable releases 4.9.163, 4.14.106, 4.19.29, 4.20.16.

Both commits cherry-pick to 4.15 bionic cleanly. Please cherry-pick both commits to all bionic kernels.

[Testcase]

The reproducer has been provided by the author of the fixes, and is best viewed here:

https://github.com/brb/conntrack-race

The dmesg logs and packet traces in the above github repo are best viewed while reading upstream discussion:

https://patchwork.ozlabs.org/patch/937963/

I have built a test kernel for xenial HWE, which can be found here:
https://launchpad.net/~mruffell/+archive/ubuntu/sf00227747-test-kernel

The test kernel has been tested with a kubernetes workload and aided with
resolution of kubernetes problems.

[Regression Potential]

As noted previously, 4e35c1cb9460240e983a01745b5f29fe3a4d8e39 has been
backported to stable releases 4.9.163, 4.14.106, 4.19.29, 4.20.16 and has no
changes to the primary commit. This commit is well tested and considered stable by the community.

Both commits have also been present in the -azure kernels since 4.15.0-1030.31 and 4.18.0-1004.4, as per LP #1795493, http://bugs.launchpad.net/bugs/1795493
They have received wide testing, and no problems have been reported.

The changes are specifically limited to resolving clashes between duplicate
conntrack entries, and any regressions will be limited to that scenario. The
overall risk of regression is very low.

[Notes]
The commits in the -azure kernels will need to be reverted before importing the generic bionic updates, since they are the pre-upstream commits, and have had several revisions since they were imported. Replacing them with the
actual upstream commits will make things easier in the future to maintain.

http://bugs.launchpad.net/bugs/1795493

For race 1):
Azure: https://patchwork.ozlabs.org/patch/937963/
Upstream: ed07d9a021df6da53456663a76999189badc432a
Commits are the same.

For race 2):
Azure: https://patchwork.ozlabs.org/patch/952939/
Upstream: 4e35c1cb9460240e983a01745b5f29fe3a4d8e39
Note the difference, old commit must be reverted first.

Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu Cosmic):
status: New → Won't Fix
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
Changed in linux (Ubuntu Cosmic):
importance: Undecided → Medium
Changed in linux (Ubuntu Bionic):
assignee: nobody → Matthew Ruffell (mruffell)
description: updated
tags: added: sts

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1836816

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Matthew Ruffell (mruffell) wrote :

I installed 4.15.0-56-generic #62~16.04.1-Ubuntu xenial HWE kernel, and
I followed the reproducer instructions at https://github.com/brb/conntrack-race,
specifically loading in the NAT iptables rules, enabling debug output of the
conntrack file and running the programs server and client.

Looking at dmesg output, I see that conntrack collisions are found and resolved,
and duplicate conntrack entries are de-allocated and returned to the slab.

This kernel is also being tested in a kubernetes test cluster and I will update
this bug if any problems arise. At the moment there isn't any.

Since one of the patches are from upstream -stable, and I have spent some time
validating, I am happy to mark this as verified.

tags: added: verification-done-bionic
removed: verification-needed-bionic

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Matthew Ruffell (mruffell) wrote :

I installed linux-hwe 4.15.0-57 #63~16.04.1-Ubuntu on xenial.

I have run the reproducers on both bionic and xenial, and both have positive results. This has been tested in the kubernetes environment which had issues, and is performing well.

I am happy to mark this as verified for xenial.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :
Download full text (171.3 KiB)

This bug was fixed in the package linux - 4.15.0-58.64

---------------
linux (4.15.0-58.64) bionic; urgency=medium

  * unable to handle kernel NULL pointer dereference at 000000000000002c (IP:
    iget5_locked+0x9e/0x1f0) (LP: #1838982)
    - Revert "ovl: set I_CREATING on inode being created"
    - Revert "new primitive: discard_new_inode()"

linux (4.15.0-57.63) bionic; urgency=medium

  * CVE-2019-1125
    - x86/cpufeatures: Carve out CQM features retrieval
    - x86/cpufeatures: Combine word 11 and 12 into a new scattered features word
    - x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations
    - x86/speculation: Enable Spectre v1 swapgs mitigations
    - x86/entry/64: Use JMP instead of JMPQ
    - x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS

  * Packaging resync (LP: #1786013)
    - update dkms package versions

linux (4.15.0-56.62) bionic; urgency=medium

  * bionic/linux: 4.15.0-56.62 -proposed tracker (LP: #1837626)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log
    - [Packaging] update helper scripts

  * CVE-2019-2101
    - media: uvcvideo: Fix 'type' check leading to overflow

  * hibmc-drm Causes Unreadable Display for Huawei amd64 Servers (LP: #1762940)
    - [Config] Set CONFIG_DRM_HISI_HIBMC to arm64 only
    - SAUCE: Make CONFIG_DRM_HISI_HIBMC depend on ARM64

  * Bionic: support for Solarflare X2542 network adapter (sfc driver)
    (LP: #1836635)
    - sfc: make mem_bar a function rather than a constant
    - sfc: support VI strides other than 8k
    - sfc: add Medford2 (SFC9250) PCI Device IDs
    - sfc: improve PTP error reporting
    - sfc: update EF10 register definitions
    - sfc: populate the timer reload field
    - sfc: update MCDI protocol headers
    - sfc: support variable number of MAC stats
    - sfc: expose FEC stats on Medford2
    - sfc: expose CTPIO stats on NICs that support them
    - sfc: basic MCDI mapping of 25/50/100G link speeds
    - sfc: support the ethtool ksettings API properly so that 25/50/100G works
    - sfc: add bits for 25/50/100G supported/advertised speeds
    - sfc: remove tx and MCDI handling from NAPI budget consideration
    - sfc: handle TX timestamps in the normal data path
    - sfc: add function to determine which TX timestamping method to use
    - sfc: use main datapath for HW timestamps if available
    - sfc: only enable TX timestamping if the adapter is licensed for it
    - sfc: MAC TX timestamp handling on the 8000 series
    - sfc: on 8000 series use TX queues for TX timestamps
    - sfc: only advertise TX timestamping if we have the license for it
    - sfc: simplify RX datapath timestamping
    - sfc: support separate PTP and general timestamping
    - sfc: support second + quarter ns time format for receive datapath
    - sfc: support Medford2 frequency adjustment format
    - sfc: add suffix to large constant in ptp
    - sfc: mark some unexported symbols as static
    - sfc: update MCDI protocol headers
    - sfc: support FEC configuration through ethtool
    - sfc: remove ctpio_dmabuf_start from stats
    - sfc: stop the TX queue before pushing new buffers

  * [18.04 FEAT] zKVM: Add hardwar...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers