[linux-azure] IP forwarding issue in netvsc

Bug #1902531 reported by Joseph Salisbury
This bug affects 1 person
Affects                      Status        Importance  Assigned to    Milestone
linux-azure (Ubuntu)         Fix Released  Undecided   Marcelo Cerri
  Bionic                     Invalid       Undecided   Unassigned
  Focal                      Fix Released  Undecided   Marcelo Cerri
linux-azure-4.15 (Ubuntu)    New           Undecided   Unassigned
  Bionic                     Fix Released  Undecided   Marcelo Cerri
  Focal                      Invalid       Undecided   Unassigned

Bug Description

[Impact]

We identified an issue with the Linux netvsc driver when used in IP forwarding mode. The RSS hash value is not propagated to the outgoing packet, so such packets all go out on channel 0. This produces an imbalance across outgoing channels and can overload the single host-side CPU that processes channel 0. The problem does not occur when Accelerated Networking is used, because the packets go out through the Mellanox driver. Because it is tied to IP forwarding, the problem is most likely to be visible in a virtual appliance that does network load balancing or other kinds of packet filtering and redirection.
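
A toy model of the imbalance described above, offered only as a sketch and not as the driver's actual queue-selection code. It assumes, purely for illustration, that the outgoing channel is the packet's hash modulo the number of channels and that a packet with no recorded hash carries the value 0. The function name channel_spread, the stand-in hash (the flow index) and the flow and channel counts are all hypothetical.

```shell
# channel_spread NFLOWS NCHANNELS MODE
#   MODE=hash   : every flow carries a nonzero hash (here simply its index,
#                 a stand-in for the RSS hash supplied by the host)
#   MODE=nohash : the hash is missing (0) for every flow
# Prints how many flows land on each outgoing channel under the
# simplified "hash modulo channel count" assumption.
channel_spread() {
  awk -v n="$1" -v c="$2" -v mode="$3" 'BEGIN {
    for (i = 1; i <= n; i++) {
      h = (mode == "hash") ? i : 0
      q = h % c
      count[q]++
    }
    for (q = 0; q < c; q++)
      printf "channel %d: %d flows\n", q, count[q] + 0
  }'
}

channel_spread 512 4 nohash   # forwarded packets without a hash
channel_spread 512 4 hash     # forwarded packets that keep their hash
```

In this toy setup all 512 flows land on channel 0 when the hash is missing, and they spread evenly (128 per channel) when each flow keeps a hash.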

We would like to request fixes for this issue in 16.04, 18.04 and 20.04.

The two fixes are already upstream in v5.5+, so they are already in 5.8.0-1011.11.

For 5.4.0-1031.32, the 2 fixes apply cleanly:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1fac7ca4e63bf935780cc632ccb6ba8de5f22321
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f3aeb1ba05d41320e6cf9a60f698d9c4e44348e

For 5.0.0-1036.38, we need 1 more patch applied first, so the list is:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b441f79532ec13dc82d05c55badc4da1f62a6141
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1fac7ca4e63bf935780cc632ccb6ba8de5f22321
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f3aeb1ba05d41320e6cf9a60f698d9c4e44348e

For 4.15.0-1098.109~16.04.1, the 2 patches do not apply cleanly, so Dexuan backported them here:
https://github.com/dcui/linux/commit/4ed58762a56cccfd006e633fac63311176508795
https://github.com/dcui/linux/commit/40ad7849a6365a5a485f05453e10e3541025e25a
(The 2 patches are on the branch https://github.com/dcui/linux/commits/decui/ubuntu_16.04/linux-azure/Ubuntu-azure-4.15.0-1098.109_16.04.1)

[Test Case]

As described in https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1902531/comments/6

[Where problems could occur]

A potential regression would affect Azure instances using netvsc without accelerated networking.

Marcelo Cerri (mhcerri) wrote :

Thanks for reporting the issue and for providing the backported fixes.

Do you have any instructions that we can use to reproduce and test the problem?

Marcelo Cerri (mhcerri) wrote :

The 5.0 linux-azure kernel is not maintained anymore. Do you have any scenario that is still using this kernel?

Dexuan Cui (decui) wrote :

Since the 5.0 linux-azure kernel is not maintained anymore, IMO we don't have to fix this bug for it.

Dexuan Cui (decui) wrote :

I'll provide the instructions to reproduce the bug on Azure.

Joseph Salisbury (jsalisbury) wrote :

It would be great if we can get the fixes into the next SRU cycle, which is the last one for 2020.

Dexuan Cui (decui) wrote :

Here is how I reproduce the bug:

Create 3 Ubuntu 16.04 VMs (VM-1, VM-gateway, VM-2) on Azure in the same Resource Group. The kernel should be the linux-azure kernel 4.15.0-1098.109~16.04.1 (or newer). I use a Gen1 VM, but Gen2 should also have the same issue; I use the "East US 2" region, but the issue should reproduce in any region.

Note: none of the VMs use Accelerated Networking, i.e. all of the 3 VMs use the software NIC "NetVSC".

In my setup, VM-1 and VM-2 are "Standard D8s v3 (8 vcpus, 32 GiB memory)", and VM-gateway is "Standard DS3 v2 (4 vcpus, 14 GiB memory)". I happen to name the gateway VM "decui-dpdk", but DPDK is not actually used here at all (I do intend to use this setup for DPDK in the future).

The gateway VM has 3 NICs:
 The main NIC (10.0.0.4) is not used in ip-forwarding.
 NIC-1's IP is 192.168.80.5
 NIC-2's IP is 192.168.81.5.
 The gateway VM receives packets from VM-1 (192.168.80.4) and forwards them to VM-2 (192.168.81.4).
 No firewall rule is used.

The client VM (VM-1, 192.168.80.4) has 1 NIC. It's running iperf2 client.
The server VM (VM-2, 192.168.81.4) has 1 NIC. It's running iperf2 server: "nohup iperf -s &"

The client VM is sending traffic, through the gateway VM (192.168.80.5, 192.168.81.5), to the server VM.
Note: all 3 subnets here are in the same VNET (Virtual Network), and 2 Azure UDR (User Defined Routing) rules must be used to force the traffic to go through the gateway VM. IP forwarding must be enabled for the gateway VM's NIC-1 and NIC-2 in the Azure portal (the setting can only be changed when the VM is "stopped"), and IP forwarding must also be enabled inside the gateway VM (i.e. echo 1 > /proc/sys/net/ipv4/ip_forward). I'll attach some screenshots showing the network topology and the configuration.
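
A minimal sketch of checking the in-guest half of that setup before starting a test run. It only covers the procfs switch quoted above; the Azure portal setting and the UDR rules still have to be verified separately. The function name ip_forward_enabled and its optional file argument are hypothetical conveniences, not part of the original instructions.

```shell
# ip_forward_enabled [FILE]
# Succeeds (exit status 0) when the IPv4 forwarding switch reads "1".
# FILE defaults to the procfs path used in the instructions; the optional
# argument exists only so the check can be tried against a scratch file.
ip_forward_enabled() {
  f="${1:-/proc/sys/net/ipv4/ip_forward}"
  [ -r "$f" ] && [ "$(cat "$f")" = "1" ]
}

# Usage on the gateway VM (as root), enabling forwarding only if it is off:
#   ip_forward_enabled || echo 1 > /proc/sys/net/ipv4/ip_forward
```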

iperf2 uses 512 TCP connections and I limit the bandwidth used by iperf to <=70% of the per-VM bandwidth limit. (Note: if the VM uses >70% of the limit, even with the 2 patches, the ping latency between VM-1 and VM-2 can still easily go very high, e.g. >200ms; we'll try to investigate that further.)

It looks like the per-VM bandwidth limit of the gateway VM (DS3_v2) is 2.6Gbps, so 70% of it is 1.8Gbps.

In the client VM, run something like:
    iperf -c 192.168.81.4 -b 3.5m -t 120 -P512
    (-b sets the per-TCP-connection limit; -P512 means 512 connections, so the total throughput should be around 3.5*512 = 1792 Mbps; "-t 120" means the test lasts for 2 minutes. We can abort the test at any time with Ctrl+C.)
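
The arithmetic behind those numbers can be sketched as follows, for anyone adapting the test to a different VM size. The helper names total_mbps and budget_mbps are hypothetical; the 2600 Mbps figure is the approximate DS3_v2 limit mentioned in this comment, not an authoritative value.

```shell
# total_mbps PER_CONN_MBPS CONNECTIONS -> aggregate offered load in Mbps
total_mbps() {
  awk -v r="$1" -v n="$2" 'BEGIN { printf "%.0f\n", r * n }'
}

# budget_mbps LIMIT_MBPS PERCENT -> that share of the per-VM limit in Mbps
budget_mbps() {
  awk -v l="$1" -v p="$2" 'BEGIN { printf "%.0f\n", l * p / 100 }'
}

total_mbps 3.5 512     # load offered by: iperf -b 3.5m -P512
budget_mbps 2600 70    # 70% of the roughly 2.6 Gbps gateway limit
```

The first call prints 1792 and the second 1820, so the iperf settings above stay just under the 70% budget.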

In the server VM, run:
    nohup iperf -s &
    ping 192.168.80.4 (we can terminate the program by Ctrl+C), and observe the latency.

In the gateway VM, run "nload" to check the current throughput (if the current device is not the NIC we want to check, press Right Arrow or Left Arrow), and run "top" to check the CPU utilization (with 512 connections, the utilization should still be low, e.g. <25%).
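
As an optional extra check on the gateway VM, the per-queue transmit counters can show whether outgoing traffic is concentrated on one queue. The sketch below is not part of the original reproduction steps. It assumes that "ethtool -S" on the netvsc interface reports counters named tx_queue_<n>_packets; the function name tx_queue_share and the interface name eth1 are hypothetical.

```shell
# tx_queue_share: read `ethtool -S <nic>` output on stdin and print each TX
# queue's share of the transmitted packets.
# Assumption: the counters are named tx_queue_<n>_packets.
tx_queue_share() {
  awk '
    $1 ~ /^tx_queue_[0-9]+_packets:$/ {
      name = $1
      sub(/^tx_queue_/, "", name)     # strip the prefix ...
      sub(/_packets:$/, "", name)     # ... and suffix, leaving the queue index
      pkts[name] = $2
      total += $2
      if (name + 0 > max) max = name + 0
    }
    END {
      if (total == 0) exit
      for (q = 0; q <= max; q++)
        printf "queue %d: %.1f%%\n", q, 100 * pkts[q] / total
    }'
}

# Usage on the gateway VM while iperf is running:
#   ethtool -S eth1 | tx_queue_share
```

A share close to 100% on queue 0 would be consistent with the imbalance this bug describes.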

Without the patches, when the iperf2 test is running, the ping latency between VM-1 and VM-2 can easily exceed 100ms or even 300ms; with the 2 patches applied, the latency should typically be <20ms.
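
To compare runs more easily, the average latency can be pulled out of ping's summary line and checked against the 20ms figure above. This is a sketch only: the helper names ping_avg_ms and latency_ok are hypothetical, and the parsing assumes the usual "rtt min/avg/max/mdev = a/b/c/d ms" summary line.

```shell
# ping_avg_ms: read ping output on stdin and print the average round-trip
# time in milliseconds, taken from the summary line
#   rtt min/avg/max/mdev = a/b/c/d ms
ping_avg_ms() {
  awk -F'/' '/min\/avg\/max/ { print $5 }'
}

# latency_ok AVG_MS LIMIT_MS: succeed when the average is below the limit.
latency_ok() {
  awk -v a="$1" -v l="$2" 'BEGIN { exit !(a + 0 < l + 0) }'
}

# Usage on the server VM while iperf is running:
#   avg=$(ping -c 20 192.168.80.4 | ping_avg_ms)
#   latency_ok "$avg" 20 && echo "latency looks fine" || echo "latency too high"
```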

Dexuan Cui (decui) wrote :

This is the network config. Let me know if you need more info.

Joseph Salisbury (jsalisbury) wrote :

If possible, can we also get a test kernel for this bug when available? Folks would like to confirm that the two patches fix the issue prior to release of the image.

Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu):
assignee: nobody → Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu Bionic):
status: New → Invalid
Changed in linux-azure-4.15 (Ubuntu Focal):
status: New → Invalid
Changed in linux-azure (Ubuntu Focal):
status: New → In Progress
Changed in linux-azure-4.15 (Ubuntu Bionic):
status: New → In Progress
assignee: nobody → Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu Focal):
assignee: nobody → Marcelo Cerri (mhcerri)
Marcelo Cerri (mhcerri) wrote :

Hi, Joseph.

I've built test kernels for xenial:linux-azure, bionic:linux-azure-4.15 and focal:linux-azure. You can find them at:

https://kernel.ubuntu.com/~mhcerri/azure/lp1902531/

Inside each directory, I included all the kernel .deb files and also a tarball containing the same .debs. The patches used for each kernel version are also inside each directory.

Let me know if they work as expected.

Thank you.

Dexuan Cui (decui) wrote :

Thanks, Marcelo! I tested all the 3 kernels and they worked as we expected.

Joseph Salisbury (jsalisbury) wrote :

Marcelo, will these fixes be available in the last SRU cycle of 2020?

Marcelo Cerri (mhcerri) wrote :

Following the regular SRU cycle, I would expect it to be included in the next cycle, with a release date planned for the week of Jan 25. Please let me know today, if you can, whether this date works for you. Thanks!

Marcelo Cerri (mhcerri)
description: updated
Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in linux-azure-4.15 (Ubuntu Bionic):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.4.0-1035.36

---------------
linux-azure (5.4.0-1035.36) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1035.36 -proposed tracker (LP: #1907589)

  * [linux-azure] IP forwarding issue in netvsc (LP: #1902531)
    - hv_netvsc: record hardware hash in skb
    - hv_netvsc: make recording RSS hash depend on feature flag

  [ Ubuntu: 5.4.0-59.65 ]

  * focal/linux: 5.4.0-59.65 -proposed tracker (LP: #1907604)
  * focal: selftests/bpf build broken: test_map_init.skel.h: No such file or
    directory (LP: #1906866)
    - SAUCE: Revert selftests/ "bpf: Zero-fill re-used per-cpu map element"
  * Packaging resync (LP: #1786013)
    - update dkms package versions
  * memory is leaked when tasks are moved to net_prio (LP: #1886859)
    - netprio_cgroup: Fix unlimited memory leak of v2 cgroups
  * Focal update: v5.4.78 upstream stable release (LP: #1905618)
    - drm/i915/gem: Flush coherency domains on first set-domain-ioctl
    - time: Prevent undefined behaviour in timespec64_to_ns()
    - nbd: don't update block size after device is started
    - KVM: arm64: Force PTE mapping on fault resulting in a device mapping
    - PCI: qcom: Make sure PCIe is reset before init for rev 2.1.0
    - usb: dwc3: gadget: Continue to process pending requests
    - usb: dwc3: gadget: Reclaim extra TRBs after request completion
    - btrfs: tracepoints: output proper root owner for trace_find_free_extent()
    - btrfs: sysfs: init devices outside of the chunk_mutex
    - btrfs: reschedule when cloning lots of extents
    - ASoC: Intel: kbl_rt5663_max98927: Fix kabylake_ssp_fixup function
    - genirq: Let GENERIC_IRQ_IPI select IRQ_DOMAIN_HIERARCHY
    - hv_balloon: disable warning when floor reached
    - net: xfrm: fix a race condition during allocing spi
    - ASoC: codecs: wcd9335: Set digital gain range correctly
    - xfs: set xefi_discard when creating a deferred agfl free log intent item
    - netfilter: use actual socket sk rather than skb sk when routing harder
    - netfilter: nf_tables: missing validation from the abort path
    - netfilter: ipset: Update byte and packet counters regardless of whether they
      match
    - powerpc/eeh_cache: Fix a possible debugfs deadlock
    - perf trace: Fix segfault when trying to trace events by cgroup
    - perf tools: Add missing swap for ino_generation
    - ALSA: hda: prevent undefined shift in snd_hdac_ext_bus_get_link()
    - iommu/vt-d: Fix a bug for PDP check in prq_event_thread
    - afs: Fix warning due to unadvanced marshalling pointer
    - can: rx-offload: don't call kfree_skb() from IRQ context
    - can: dev: can_get_echo_skb(): prevent call to kfree_skb() in hard IRQ
      context
    - can: dev: __can_get_echo_skb(): fix real payload length return value for RTR
      frames
    - can: can_create_echo_skb(): fix echo skb generation: always use skb_clone()
    - can: j1939: swap addr and pgn in the send example
    - can: j1939: j1939_sk_bind(): return failure if netdev is down
    - can: ti_hecc: ti_hecc_probe(): add missed clk_disable_unprepare() in error
      path
    - can: xilinx_can: handle failure cases of pm_runtime_get_sync
    - can: peak_u...

Changed in linux-azure (Ubuntu Focal):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure-4.15 - 4.15.0-1103.114

---------------
linux-azure-4.15 (4.15.0-1103.114) bionic; urgency=medium

  * bionic/linux-azure-4.15: 4.15.0-1103.114 -proposed tracker (LP: #1907621)

  * [linux-azure] IP forwarding issue in netvsc (LP: #1902531)
    - hv_netvsc: record hardware hash in skb
    - hv_netvsc: make recording RSS hash depend on feature flag

  [ Ubuntu: 4.15.0-129.132 ]

  * bionic/linux: 4.15.0-129.132 -proposed tracker (LP: #1907635)
  * Packaging resync (LP: #1786013)
    - update dkms package versions
  * Ubuntu 18.04- call trace in kernel buffer when unloading ib_ipoib module
    (LP: #1904848)
    - SAUCE: net/mlx5e: IPoIB, initialize update_stat_work for ipoib devices
  * memory is leaked when tasks are moved to net_prio (LP: #1886859)
    - netprio_cgroup: Fix unlimited memory leak of v2 cgroups
  * s390: dbginfo.sh triggers kernel panic, reading from
    /sys/kernel/mm/page_idle/bitmap (LP: #1904884)
    - mm/page_idle.c: skip offline pages
  * Bionic update: upstream stable patchset 2020-11-23 (LP: #1905333)
    - drm/i915: Break up error capture compression loops with cond_resched()
    - tipc: fix use-after-free in tipc_bcast_get_mode
    - gianfar: Replace skb_realloc_headroom with skb_cow_head for PTP
    - gianfar: Account for Tx PTP timestamp in the skb headroom
    - net: usb: qmi_wwan: add Telit LE910Cx 0x1230 composition
    - sctp: Fix COMM_LOST/CANT_STR_ASSOC err reporting on big-endian platforms
    - sfp: Fix error handing in sfp_probe()
    - Blktrace: bail out early if block debugfs is not configured
    - i40e: Fix of memory leak and integer truncation in i40e_virtchnl.c
    - Fonts: Replace discarded const qualifier
    - ALSA: usb-audio: Add implicit feedback quirk for Qu-16
    - lib/crc32test: remove extra local_irq_disable/enable
    - kthread_worker: prevent queuing delayed work from timer_fn when it is being
      canceled
    - mm: always have io_remap_pfn_range() set pgprot_decrypted()
    - gfs2: Wake up when sd_glock_disposal becomes zero
    - ftrace: Fix recursion check for NMI test
    - ftrace: Handle tracing when switching between context
    - tracing: Fix out of bounds write in get_trace_buf
    - futex: Handle transient "ownerless" rtmutex state correctly
    - ARM: dts: sun4i-a10: fix cpu_alert temperature
    - x86/kexec: Use up-to-dated screen_info copy to fill boot params
    - of: Fix reserved-memory overlap detection
    - blk-cgroup: Fix memleak on error path
    - blk-cgroup: Pre-allocate tree node on blkg_conf_prep
    - scsi: core: Don't start concurrent async scan on same host
    - vsock: use ns_capable_noaudit() on socket create
    - drm/vc4: drv: Add error handding for bind
    - ACPI: NFIT: Fix comparison to '-ENXIO'
    - vt: Disable KD_FONT_OP_COPY
    - fork: fix copy_process(CLONE_PARENT) race with the exiting ->real_parent
    - serial: 8250_mtk: Fix uart_get_baud_rate warning
    - serial: txx9: add missing platform_driver_unregister() on error in
      serial_txx9_init
    - USB: serial: cyberjack: fix write-URB completion race
    - USB: serial: option: add Quectel EC200T module support
    - USB: serial: option: add LE910...

Changed in linux-azure-4.15 (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 4.15.0-1103.114~16.04.1

---------------
linux-azure (4.15.0-1103.114~16.04.1) xenial; urgency=medium

  * xenial/linux-azure: 4.15.0-1103.114~16.04.1 -proposed tracker (LP: #1907619)

  [ Ubuntu: 4.15.0-1103.114 ]

  * bionic/linux-azure-4.15: 4.15.0-1103.114 -proposed tracker (LP: #1907621)
  * [linux-azure] IP forwarding issue in netvsc (LP: #1902531)
    - hv_netvsc: record hardware hash in skb
    - hv_netvsc: make recording RSS hash depend on feature flag
  * bionic/linux: 4.15.0-129.132 -proposed tracker (LP: #1907635)
  * Packaging resync (LP: #1786013)
    - update dkms package versions
  * Ubuntu 18.04- call trace in kernel buffer when unloading ib_ipoib module
    (LP: #1904848)
    - SAUCE: net/mlx5e: IPoIB, initialize update_stat_work for ipoib devices
  * memory is leaked when tasks are moved to net_prio (LP: #1886859)
    - netprio_cgroup: Fix unlimited memory leak of v2 cgroups
  * s390: dbginfo.sh triggers kernel panic, reading from
    /sys/kernel/mm/page_idle/bitmap (LP: #1904884)
    - mm/page_idle.c: skip offline pages
  * Bionic update: upstream stable patchset 2020-11-23 (LP: #1905333)
    - drm/i915: Break up error capture compression loops with cond_resched()
    - tipc: fix use-after-free in tipc_bcast_get_mode
    - gianfar: Replace skb_realloc_headroom with skb_cow_head for PTP
    - gianfar: Account for Tx PTP timestamp in the skb headroom
    - net: usb: qmi_wwan: add Telit LE910Cx 0x1230 composition
    - sctp: Fix COMM_LOST/CANT_STR_ASSOC err reporting on big-endian platforms
    - sfp: Fix error handing in sfp_probe()
    - Blktrace: bail out early if block debugfs is not configured
    - i40e: Fix of memory leak and integer truncation in i40e_virtchnl.c
    - Fonts: Replace discarded const qualifier
    - ALSA: usb-audio: Add implicit feedback quirk for Qu-16
    - lib/crc32test: remove extra local_irq_disable/enable
    - kthread_worker: prevent queuing delayed work from timer_fn when it is being
      canceled
    - mm: always have io_remap_pfn_range() set pgprot_decrypted()
    - gfs2: Wake up when sd_glock_disposal becomes zero
    - ftrace: Fix recursion check for NMI test
    - ftrace: Handle tracing when switching between context
    - tracing: Fix out of bounds write in get_trace_buf
    - futex: Handle transient "ownerless" rtmutex state correctly
    - ARM: dts: sun4i-a10: fix cpu_alert temperature
    - x86/kexec: Use up-to-dated screen_info copy to fill boot params
    - of: Fix reserved-memory overlap detection
    - blk-cgroup: Fix memleak on error path
    - blk-cgroup: Pre-allocate tree node on blkg_conf_prep
    - scsi: core: Don't start concurrent async scan on same host
    - vsock: use ns_capable_noaudit() on socket create
    - drm/vc4: drv: Add error handding for bind
    - ACPI: NFIT: Fix comparison to '-ENXIO'
    - vt: Disable KD_FONT_OP_COPY
    - fork: fix copy_process(CLONE_PARENT) race with the exiting ->real_parent
    - serial: 8250_mtk: Fix uart_get_baud_rate warning
    - serial: txx9: add missing platform_driver_unregister() on error in
      serial_txx9_init
    - USB: serial: cyberjack: fix write-URB completion race
    - USB:...

Changed in linux-azure (Ubuntu):
status: New → Fix Released
Mihai (mihaime) wrote :

Hi everyone,

I noticed a behavior and I'm not sure whether it is expected after this patch.

SRC VM --> Router VM (with accelerated networking / Mellanox driver) --> 2 tunnel links ECMP --> destination VM

It seems that within a single flow:
- SYN from SRC VM gets hashed on tunnel link one on Router VM
- ACK (part of the same flow/tcp handshake) gets hashed on tunnel link two on Router VM

RX hash (RSS hash) is on.

I would expect hashing and equal load balancing of traffic to happen per flow, not roughly per packet.
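
For anyone comparing setups, a small sketch of how the RX hash setting mentioned above can be read back. It parses the "receive-hashing" line printed by "ethtool -k"; the function name rxhash_state and the interface name eth0 are hypothetical.

```shell
# rxhash_state: read `ethtool -k <nic>` output on stdin and print the
# receive-hashing setting ("on" or "off").
rxhash_state() {
  awk '$1 == "receive-hashing:" { print $2 }'
}

# Usage on the router VM:
#   ethtool -k eth0 | rxhash_state
```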

Dexuan Cui (decui) wrote :

I guess the SYN/ACK will get hashed on the same tunnel link if you disable accelerated networking.

When accelerated networking is enabled on Azure, all incoming TCP SYN packets are still received through the software NIC netvsc and get their RX hash values from the Hyper-V virtual switch. The ACK packets, however, are received through the Mellanox NIC (and then passed on to the hv_netvsc driver), so they get hash values from the Mellanox NIC, and those values may differ from the ones assigned to the SYN packets.

In the future (maybe it's already available on some nodes), when accelerated networking is enabled, the incoming SYN packets will also be received through hardware NIC rather than netvsc.
