qede: Kubernetes Internal DNS Failure due to QL41xxx NIC not supporting IPIP tx csum offload

Bug #1909062 reported by Manish Chopra
Affects           Status         Importance  Assigned to      Milestone
linux (Ubuntu)    Fix Committed  Medium      Matthew Ruffell
  Focal           Fix Released   Medium      Matthew Ruffell
  Groovy          Fix Released   Medium      Matthew Ruffell
  Hirsute         Won't Fix      Medium      Matthew Ruffell

Bug Description

BugLink: https://bugs.launchpad.net/bugs/1909062

[Impact]

For users with QLogic QL41xxx series NICs, such as the FastLinQ QL41000 Series 10/25/40/50GbE Controller, Kubernetes internal DNS requests fail after upgrading from the 4.15 kernel to the 5.4 kernel, because the DNS packets are corrupted on transmit.

Kubernetes uses IPIP tunnelled packets for internal DNS resolution. This packet type is not supported for hardware tx checksum offload, so the packets end up corrupted when the qede driver attempts to checksum them.

Only internal Kubernetes DNS is affected. Regular DNS lookups of external domains still succeed, because they do not use IPIP packets.

[Fix]

Marvell has developed a fix for the qede driver that checks the packet type and, if it is IPPROTO_IPIP, disables csum offloads for that socket buffer.

commit 5d5647dad259bb416fd5d3d87012760386d97530
Author: Manish Chopra <email address hidden>
Date: Mon Dec 21 06:55:30 2020 -0800
Subject: qede: fix offload for IPIP tunnel packets
Link: https://github.com/torvalds/linux/commit/5d5647dad259bb416fd5d3d87012760386d97530

This commit landed in mainline in 5.11-rc3. The commit was accepted into upstream stable 4.14.215, 4.19.167, 5.4.89 and 5.10.7.
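
For kernels built directly from upstream stable, the version numbers above give a quick way to tell whether the fix is present. The sketch below is a hypothetical helper (the name `stable_has_fix` is not from this bug report) and assumes GNU `sort -V` is available. It does not apply to Ubuntu package versions such as 5.4.0-66, which follow a different numbering scheme.

```shell
# Return 0 if an upstream stable version already carries the qede IPIP fix,
# 1 if it predates the fix, and 2 if the series is not one listed above.
stable_has_fix() {
    ver="$1"
    case "$ver" in
        4.14.*) min=4.14.215 ;;
        4.19.*) min=4.19.167 ;;
        5.4.*)  min=5.4.89 ;;
        5.10.*) min=5.10.7 ;;
        *) return 2 ;;
    esac
    # Version-sort the pair; the fix is present when the minimum sorts first.
    lowest=$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n 1)
    [ "$lowest" = "$min" ]
}
```

For example, `stable_has_fix 5.4.88` returns non-zero, while `stable_has_fix 5.4.89` returns zero.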

Note that this SRU does not target the Bionic 4.15 kernel: tx csum offload support only landed in 5.0 and onward, so the 4.15 kernel works even without this patch. Because of this, Bionic can pick the patch up naturally from upstream stable.

[Testcase]

The system must have a QLogic QL41xxx series NIC fitted, and needs to be a part of a Kubernetes cluster.
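
To find which interfaces are driven by qede, one option is to read the driver symlink that sysfs exposes for each network device. This is a minimal sketch, assuming the standard `/sys/class/net/<dev>/device/driver` layout; the function name and its optional second argument (an alternative sysfs root, useful for dry runs) are hypothetical.

```shell
# List network interfaces bound to a given driver (default: qede).
# $2 optionally overrides the sysfs root so the function can run on a fake tree.
list_ifaces_by_driver() {
    driver="${1:-qede}"
    root="${2:-/sys/class/net}"
    for path in "$root"/*; do
        # Virtual devices (lo, tunnels) have no device/driver link; skip them.
        [ -L "$path/device/driver" ] || continue
        bound=$(basename "$(readlink "$path/device/driver")")
        [ "$bound" = "$driver" ] && basename "$path"
    done
    return 0
}
```

Running `list_ifaces_by_driver qede` on an affected system should print the QLogic interface names, one per line.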

Firstly, get a list of all devices in the system:

$ sudo ifconfig

Next, set all devices down with:

$ sudo ifconfig <device> down

Next, bring up the QLogic QL41xxx device:

$ sudo ifconfig <qlogic nic device> up
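
The interface steps above can also be expressed with `ip link` from iproute2 instead of `ifconfig`. The sketch below only prints the commands rather than running them, so they can be reviewed first. The function name is hypothetical, and skipping the loopback device is an assumption not stated in the test case.

```shell
# Print the commands that isolate one NIC: every other interface down, target up.
# Nothing is executed here; pipe the output to `sudo sh` once it looks right.
isolate_nic_cmds() {
    target="$1"; shift
    for dev in "$@"; do
        [ "$dev" = "$target" ] && continue
        [ "$dev" = "lo" ] && continue      # assumption: leave loopback alone
        echo "ip link set dev $dev down"
    done
    echo "ip link set dev $target up"
}
```

A possible invocation is `isolate_nic_cmds enp94s0f0 $(ls /sys/class/net)`, where `enp94s0f0` stands in for the QLogic device name.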

Then, attempt to lookup an internal Kubernetes domain:

$ nslookup <internal kubernetes domain address>

Without the patch, the connection will time out:

;; connection timed out; no servers could be reached

Packet traces taken with tcpdump show that the request leaves the source but never arrives at the destination.
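
To narrow such a capture to the tunnelled traffic, the pcap filter `ip proto 4` matches IPIP, which is IP protocol number 4. The helper below is a hypothetical sketch that only builds the command line; the interface name is supplied by the caller.

```shell
# Build (but do not run) a tcpdump command that captures only IPIP traffic.
# -n disables name resolution, which matters when DNS itself is broken.
ipip_capture_cmd() {
    iface="$1"
    echo "tcpdump -n -i $iface 'ip proto 4'"
}
```

Running the printed command with sudo on both the source and destination hosts allows the two traces to be compared.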

There is a test kernel available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf297772-test

If you install it, then Kubernetes internal DNS lookups will succeed.

[Where problems could occur]

If a regression were to occur, then users of the qede driver would be affected. This is limited to those with QLogic QL41xxx series NICs. The patch explicitly checks for IPIP type packets, so only those particular packets would be affected.

Because most packets are not IPIP tunnelled, a regression would not cause a total outage. It could, however, cause problems for users who frequently handle VPN or Kubernetes internal DNS traffic.

A workaround would be to use ethtool to disable tx csum offload for all packet types, or to revert to an older kernel.
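
The ethtool workaround can be sketched as follows. `ethtool -K <dev> tx off` disables tx checksum offload and `ethtool -k <dev>` reports the current state; the two function names are hypothetical, and the first only prints the command so it can be reviewed before being run with sudo.

```shell
# Print (not run) the ethtool command that turns off tx checksum offload.
disable_tx_csum_cmd() {
    echo "ethtool -K $1 tx off"
}

# Read `ethtool -k <dev>` output on stdin and print the tx-checksumming state.
tx_csum_state() {
    awk -F': ' '/^tx-checksumming:/ { print $2 }'
}
```

For instance, `ethtool -k enp94s0f0 | tx_csum_state` should print `off` once the workaround is applied. Note that this setting does not persist across reboots unless it is added to the network configuration.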

Chris Guiver (guiverc)
affects: ubuntu → linux (Ubuntu)
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1909062

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: focal
Manish Chopra (chopramanish1988) wrote : Re: Ubuntu kernel 5.x QL41xxx NIC (qede driver) Kubernetes internal DNS failure

We don't have any logs to post. We have attached the wire traces we got from the customer, as those were all we were interested in. Having said that, this bug is resolved by the above fix posted upstream, so I guess we don't need any further logs.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
description: updated
Changed in linux (Ubuntu Focal):
status: New → In Progress
Changed in linux (Ubuntu Groovy):
status: New → In Progress
Changed in linux (Ubuntu Focal):
importance: Undecided → Medium
Changed in linux (Ubuntu Groovy):
importance: Undecided → Medium
Changed in linux (Ubuntu Focal):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in linux (Ubuntu Groovy):
assignee: nobody → Matthew Ruffell (mruffell)
summary: - Ubuntu kernel 5.x QL41xxx NIC (qede driver) Kubernetes internal DNS
- failure
+ qede: Kubernetes Internal DNS Failure due to QL41xxx NIC not supporting
+ IPIP tx csum offload
Matthew Ruffell (mruffell) wrote :

Thanks Manish! I am building a test kernel now, and I will let you know once it is ready to test.

If we get good test results, I will submit the patch for SRU to the Ubuntu kernels once the patch has hit mainline.

description: updated
tags: added: sts
description: updated
description: updated
Changed in linux (Ubuntu Hirsute):
status: Confirmed → In Progress
importance: Undecided → Medium
assignee: nobody → Matthew Ruffell (mruffell)
description: updated
Changed in linux (Ubuntu Groovy):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Hirsute):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-groovy' to 'verification-done-groovy'. If the problem still exists, change the tag 'verification-needed-groovy' to 'verification-failed-groovy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-groovy
Manish Chopra (chopramanish1988) wrote :

Hi Matthew,

We have verified the fix with proposed kernel.
I hope that I have corrected the "tags" appropriately.

Thanks,
Manish

tags: added: verification-done-groovy
removed: verification-needed-groovy
Matthew Ruffell (mruffell) wrote :

Hi Manish,

Thanks for testing the Groovy 5.8 kernel in -proposed!

The Focal 5.4 / Bionic 5.4 HWE kernel will be in -proposed next week sometime, if you are looking to verify that as well.

I will also ask our customer in common to verify the Bionic HWE kernel when it becomes available.

Thanks for making the patch!
Matthew

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Matthew Ruffell (mruffell) wrote :

Performing verification for Focal.

The affected user enabled -proposed and installed 5.4.0-66-generic to the system with a QLogic FastLinQ QL41000 Series 10/25/40/50GbE Controller.

They then set all interfaces down, and brought up the QLogic NIC only.

# uname -rv
5.4.0-66-generic #74~18.04.2-Ubuntu SMP Fri Feb 5 11:17:31 UTC 2021

# nslookup internal.kubernetes.domain.example 10.1.0.10
Server: 10.1.0.10
Address: 10.1.0.10#53
Name: internal.kubernetes.domain.example
Address: 10.48.24.11

# ethtool -k eno1 | grep tx-checksumming
tx-checksumming: on
# ethtool -k enp94s0f0 | grep tx-checksumming
tx-checksumming: on

DNS lookups of an internal Kubernetes domain, which use IPIP packets, work as intended with tx checksumming enabled.

The kernel in -proposed fixes the issue, marking as verified.

tags: added: verification-done-focal
removed: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-66.74

---------------
linux (5.4.0-66.74) focal; urgency=medium

  * focal/linux: 5.4.0-66.74 -proposed tracker (LP: #1913152)

  * Add support for selective build of special drivers (LP: #1912789)
    - [Packaging] Add support for ODM drivers
    - [Packaging] Turn on ODM support for amd64

  * Packaging resync (LP: #1786013)
    - update dkms package versions
    - update dkms package versions

  * Introduce the new NVIDIA 460-server series and update the 460 series
    (LP: #1913200)
    - [Config] dkms-versions -- drop NVIDIA 435 455 and 440-server
    - [Config] dkms-versions -- add the 460-server nvidia driver

  * Enable mute and micmute LED on HP EliteBook 850 G7 (LP: #1910102)
    - ALSA: hda/realtek: Enable mute and micmute LED on HP EliteBook 850 G7

  * SYNA30B4:00 06CB:CE09 Mouse on HP EliteBook 850 G7 not working at all
    (LP: #1908992)
    - HID: multitouch: Enable multi-input for Synaptics pointstick/touchpad device

  * HD Audio Device PCI ID for the Intel Cometlake-R platform (LP: #1912427)
    - SAUCE: ALSA: hda: Add Cometlake-R PCI ID

  * switch to an autogenerated nvidia series based core via dkms-versions
    (LP: #1912803)
    - [Packaging] nvidia -- use dkms-versions to define versions built
    - [Packaging] update-version-dkms -- maintain flags fields
    - [Config] dkms-versions -- add transitional/skip information for nvidia
      packages

  * udpgro.sh in net from ubuntu_kernel_selftests seems not reflecting sub-test
    result (LP: #1908499)
    - selftests: fix the return value for UDP GRO test

  * qede: Kubernetes Internal DNS Failure due to QL41xxx NIC not supporting IPIP
    tx csum offload (LP: #1909062)
    - qede: fix offload for IPIP tunnel packets

  * Use DCPD to control HP DreamColor panel (LP: #1911001)
    - SAUCE: drm/dp: Another HP DreamColor panel brigntness fix

  * kvm: Windows 2k19 with Hyper-v role gets stuck on pending hypervisor
    requests on cascadelake based kvm hosts (LP: #1911848)
    - KVM: x86: Set KVM_REQ_EVENT if run is canceled with req_immediate_exit set

  * Ubuntu 20.10 four needed fixes to 'Add driver for Mellanox Connect-IB
    adapters' (LP: #1905574)
    - net/mlx5: Fix a race when moving command interface to polling mode

  * Fix right sounds and mute/micmute LEDs for HP ZBook Fury 15/17 G7 Mobile
    Workstation (LP: #1910561)
    - ALSA: hda/realtek: fix right sounds and mute/micmute LEDs for HP machines

  * Ubuntu 20.04 - multicast counter is not increased in ip -s (LP: #1901842)
    - net/mlx5e: Fix multicast counter not up-to-date in "ip -s"

  * eeh-basic.sh in powerpc from ubuntu_kernel_selftests timeout with 5.4 P8 /
    P9 (LP: #1882503)
    - selftests/powerpc/eeh: disable kselftest timeout setting for eeh-basic

  * DMI entry syntax fix for Pegatron / ByteSpeed C15B (LP: #1910639)
    - Input: i8042 - unbreak Pegatron C15B

  * CVE-2020-29372
    - mm: check that mm is still valid in madvise()

  * update ENA driver, incl. new ethtool stats (LP: #1910291)
    - net: ena: Change WARN_ON expression in ena_del_napi_in_range()
    - net: ena: ethtool: convert stat_offset to 64 bit resolution
    - net: ena: eth...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.8.0-44.50

---------------
linux (5.8.0-44.50) groovy; urgency=medium

  * groovy/linux: 5.8.0-44.50 -proposed tracker (LP: #1914805)

  * Packaging resync (LP: #1786013)
    - update dkms package versions
    - update dkms package versions

  * Introduce the new NVIDIA 460-server series and update the 460 series
    (LP: #1913200)
    - [Config] dkms-versions -- drop NVIDIA 435 455 and 440-server
    - [Config] dkms-versions -- add the 460-server nvidia driver

  * [SRU][G/H/U/OEM-5.10] re-enable s0ix of e1000e (LP: #1910541)
    - Revert "UBUNTU: SAUCE: e1000e: bump up timeout to wait when ME un-configure
      ULP mode"
    - e1000e: Only run S0ix flows if shutdown succeeded
    - Revert "e1000e: disable s0ix entry and exit flows for ME systems"
    - e1000e: Export S0ix flags to ethtool

  * suspend only works once on ThinkPad X1 Carbon gen 7 (LP: #1865570) //
    [SRU][G/H/U/OEM-5.10] re-enable s0ix of e1000e (LP: #1910541)
    - e1000e: bump up timeout to wait when ME un-configures ULP mode

  * Cannot probe sata disk on sata controller behind VMD: ata1.00: failed to
    IDENTIFY (I/O error, err_mask=0x4) (LP: #1894778)
    - PCI: vmd: Offset Client VMD MSI-X vectors

  * Enable mute and micmute LED on HP EliteBook 850 G7 (LP: #1910102)
    - ALSA: hda/realtek: Enable mute and micmute LED on HP EliteBook 850 G7

  * SYNA30B4:00 06CB:CE09 Mouse on HP EliteBook 850 G7 not working at all
    (LP: #1908992)
    - HID: multitouch: Enable multi-input for Synaptics pointstick/touchpad device

  * HD Audio Device PCI ID for the Intel Cometlake-R platform (LP: #1912427)
    - SAUCE: ALSA: hda: Add Cometlake-R PCI ID

  * switch to an autogenerated nvidia series based core via dkms-versions
    (LP: #1912803)
    - [Packaging] nvidia -- use dkms-versions to define versions built
    - [Packaging] update-version-dkms -- maintain flags fields
    - [Config] dkms-versions -- add transitional/skip information for nvidia
      packages

  * udpgro.sh in net from ubuntu_kernel_selftests seems not reflecting sub-test
    result (LP: #1908499)
    - selftests: fix the return value for UDP GRO test

  * [UBUNTU 21.04] vfio: pass DMA availability information to userspace
    (LP: #1907421)
    - vfio/type1: Refactor vfio_iommu_type1_ioctl()
    - vfio iommu: Add dma available capability

  * qede: Kubernetes Internal DNS Failure due to QL41xxx NIC not supporting IPIP
    tx csum offload (LP: #1909062)
    - qede: fix offload for IPIP tunnel packets

  * Use DCPD to control HP DreamColor panel (LP: #1911001)
    - SAUCE: drm/dp: Another HP DreamColor panel brigntness fix

  * Fix right sounds and mute/micmute LEDs for HP ZBook Fury 15/17 G7 Mobile
    Workstation (LP: #1910561)
    - ALSA: hda/realtek: fix right sounds and mute/micmute LEDs for HP machines

  * Ubuntu 20.04 - multicast counter is not increased in ip -s (LP: #1901842)
    - net/mlx5e: Fix multicast counter not up-to-date in "ip -s"

  * eeh-basic.sh in powerpc from ubuntu_kernel_selftests timeout with 5.4 P8 /
    P9 (LP: #1882503)
    - selftests/powerpc/eeh: disable kselftest timeout setting for eeh-basic

  * DMI entry syntax fix for Pegatron /...

Changed in linux (Ubuntu Groovy):
status: Fix Committed → Fix Released
Brian Murray (brian-murray) wrote :

The Hirsute Hippo has reached End of Life, so this bug will not be fixed for that release.

Changed in linux (Ubuntu Hirsute):
status: Fix Committed → Won't Fix