[linux-azure] IP forwarding issue in netvsc
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux-azure (Ubuntu) | | Undecided | Marcelo Cerri |
Bionic | | Undecided | Unassigned |
Focal | | Undecided | Marcelo Cerri |
linux-azure-4.15 (Ubuntu) | | Undecided | Unassigned |
Bionic | | Undecided | Marcelo Cerri |
Focal | | Undecided | Unassigned |
Bug Description
[Impact]
We identified an issue with the Linux netvsc driver when used in IP forwarding mode. The problem is that the RSS hash value is not propagated to the outgoing packet, and so such packets go out on channel 0. This produces an imbalance across outgoing channels, and a possible overload on the single host-side CPU that is processing channel 0. The problem does not occur when Accelerated Networking is used because the packets go out through the Mellanox driver. Because it is tied to IP forwarding, the problem is presumably most likely to be visible in a virtual appliance device that is doing network load balancing or other kinds of packet filtering and redirection.
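The imbalance described above can be illustrated with a toy model. This is only a sketch, not the driver's code: the channel count, the `pick_tx_channel` helper and the modulo mapping are assumptions made for illustration. The one behaviour taken from the report is that a forwarded packet with no recorded hash goes out on channel 0.

```python
import random
from collections import Counter

NUM_CHANNELS = 4  # assumed channel count, for illustration only


def pick_tx_channel(rss_hash, num_channels=NUM_CHANNELS):
    """Toy transmit-channel selection (hypothetical helper).

    A packet with no recorded hash (modelled as None) falls back to
    channel 0; otherwise the hash is spread across the channels.
    """
    if rss_hash is None:
        return 0
    return rss_hash % num_channels


def simulate(num_flows, hash_recorded, seed=0):
    """Count how many forwarded flows land on each outgoing channel."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_flows):
        flow_hash = rng.getrandbits(32)
        # Without the fix, the receive-side hash is not carried over
        # to the outgoing packet.
        outgoing_hash = flow_hash if hash_recorded else None
        counts[pick_tx_channel(outgoing_hash)] += 1
    return counts


without_fix = simulate(512, hash_recorded=False)
with_fix = simulate(512, hash_recorded=True)
print("hash not propagated:", dict(without_fix))
print("hash propagated:    ", dict(sorted(with_fix.items())))
```

In this model every flow lands on channel 0 when the hash is missing, while flows are spread over all channels once the hash is available.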
We would like to request fixes to this issue in 16.04, 18.04 and 20.04.
Both fixes are already upstream in v5.5 and later, so 5.8.0-1011.11 already includes them.
For 5.4.0-1031.32, the two fixes apply cleanly:
https:/
https:/
For 5.0.0-1036.38, we need 1 more patch applied first, so the list is:
https:/
https:/
https:/
For 4.15.0-
https:/
https:/
(The 2 patches are on the branch https:/
[Test Case]
As described in https:/
[Where problems could occur]
A potential regression would affect Azure instances that use netvsc without Accelerated Networking.
Marcelo Cerri (mhcerri) wrote : | #1 |
Marcelo Cerri (mhcerri) wrote : | #2 |
The 5.0 linux-azure kernel is not maintained anymore. Do you have any scenario that is still using this kernel?
Dexuan Cui (decui) wrote : | #3 |
Since the 5.0 linux-azure kernel is not maintained anymore, IMO we don't have to fix this bug for it.
Dexuan Cui (decui) wrote : | #4 |
I'll provide the instructions to reproduce the bug on Azure.
Joseph Salisbury (jsalisbury) wrote : | #5 |
It would be great if we can get the fixes into the next SRU cycle, which is the last one for 2020.
Dexuan Cui (decui) wrote : | #6 |
Here is how I reproduce the bug:
Create 3 Ubuntu 16.04 VMs (VM-1, VM gateway, VM-2) on Azure in the same Resource Group. The kernel should be the linux-azure kernel 4.15.0-
Note: none of the VMs uses Accelerated Networking, i.e. all 3 VMs use the software NIC "NetVSC".
In my setup, VM-1 and VM-2 are "Standard D8s v3 (8 vcpus, 32 GiB memory)", and VM-gateway is "Standard DS3 v2 (4 vcpus, 14 GiB memory)". I happen to have named the gateway VM "decui-dpdk", but DPDK is not actually used here (I do intend to use this setup for DPDK in the future).
The gateway VM has 3 NICs:
The main NIC (10.0.0.4) is not used in ip-forwarding.
NIC-1's IP is 192.168.80.5.
NIC-2's IP is 192.168.81.5.
The gateway VM receives packets from VM-1 (192.168.80.4) and forwards them to VM-2 (192.168.81.4).
No firewall rule is used.
The client VM (VM-1, 192.168.80.4) has 1 NIC. It's running iperf2 client.
The server VM (VM-2, 192.168.81.4) has 1 NIC. It's running iperf2 server: "nohup iperf -s &"
The client VM is sending traffic, through the gateway VM (192.168.80.5, 192.168.81.5), to the server VM.
Note: all 3 subnets here are in the same VNET (Virtual Network), and 2 Azure UDR (User Defined Routing) rules must be used to force the traffic to go through the gateway VM. IP forwarding for the gateway VM's NIC-1 and NIC-2 must be enabled from the Azure portal (the setting can only be changed when the VM is "stopped"), and IP forwarding must also be enabled inside the gateway VM (i.e. echo 1 > /proc/sys/
iperf2 uses 512 TCP connections, and I limit the bandwidth used by iperf to <=70% of the per-VM bandwidth limit (Note: if the VM uses >70% of the limit, even with the 2 patches, the ping latency between VM-1 and VM-2 can still easily go very high, e.g. >200ms -- we'll try to investigate that further).
It looks like the per-VM bandwidth limit of the gateway VM (DS3_v2) is 2.6Gbps, so 70% of it is about 1.8Gbps.
In the client VM, run something like:
iperf -c 192.168.81.4 -b 3.5m -t 120 -P512
(-b sets the per-TCP-connection bandwidth limit; -P512 means 512 connections, so the total throughput should be around 3.5*512 = 1792 Mbps; "-t 120" means the test lasts for 2 minutes. We can abort the test at any time with Ctrl+C.)
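As a quick check of the numbers above, the following sketch recomputes the aggregate iperf rate and compares it with 70% of the gateway's limit. The 2.6 Gbps figure is the approximate limit quoted earlier in this comment, not an official value, and the variable names are only illustrative.

```python
# Values taken from the iperf command line and the comment above.
PER_CONNECTION_MBPS = 3.5   # iperf -b 3.5m
CONNECTIONS = 512           # iperf -P512
VM_LIMIT_MBPS = 2600        # approximate DS3_v2 limit quoted above (2.6 Gbps)
TARGET_PERCENT = 70         # stay at or below 70% of the per-VM limit

total_mbps = PER_CONNECTION_MBPS * CONNECTIONS
budget_mbps = VM_LIMIT_MBPS * TARGET_PERCENT / 100

print(f"aggregate iperf rate: {total_mbps:.0f} Mbps")
print(f"70% of VM limit:      {budget_mbps:.0f} Mbps")
print("within budget" if total_mbps <= budget_mbps else "over budget")
```

With these inputs the aggregate rate is 1792 Mbps against a budget of 1820 Mbps, so the chosen -b value keeps the test just under the 70% target.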
In the server VM, run:
nohup iperf -s &
ping 192.168.80.4 (we can terminate the program by Ctrl+C), and observe the latency.
In the gateway VM, run "nload" to check the current throughput (if the device shown is not the NIC we want to check, press the Right Arrow or Left Arrow key), and run "top" to check the CPU utilization (with 512 connections, the utilization should still be low, e.g. <25%).
While the iperf2 test is running, the ping latency between VM-1 and VM-2 can easily exceed 100ms, or even 300ms, but with the 2 patches applied, the latency should typically be <20ms.
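One way to compare ping runs against the thresholds mentioned above is to extract the per-reply round-trip times from saved ping output. The sketch below assumes the usual Linux ping reply format ("time=... ms"); the `SAMPLE` lines are invented purely to exercise the parser and are not measurements from this setup, and the helper names are hypothetical.

```python
import re

# Matches the round-trip time in a Linux ping reply line, e.g. "time=12.5 ms".
RTT_RE = re.compile(r"time=([\d.]+) ms")


def rtts_ms(ping_output):
    """Return every round-trip time (in ms) found in the ping output."""
    return [float(value) for value in RTT_RE.findall(ping_output)]


def fraction_over(ping_output, threshold_ms):
    """Return the fraction of replies slower than threshold_ms."""
    samples = rtts_ms(ping_output)
    if not samples:
        raise ValueError("no RTT samples found in ping output")
    return sum(1 for s in samples if s > threshold_ms) / len(samples)


# Invented sample lines, for illustration only (not real measurements).
SAMPLE = """\
64 bytes from 192.168.80.4: icmp_seq=1 ttl=63 time=12.5 ms
64 bytes from 192.168.80.4: icmp_seq=2 ttl=63 time=150 ms
64 bytes from 192.168.80.4: icmp_seq=3 ttl=63 time=8.0 ms
64 bytes from 192.168.80.4: icmp_seq=4 ttl=63 time=310 ms
"""

print(rtts_ms(SAMPLE))
print(fraction_over(SAMPLE, 100))
```

Running the same check on output captured with and without the patches gives a simple way to see whether replies above 100ms still occur.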
Dexuan Cui (decui) wrote : | #7 |
To use Azure UDR, I referred to this page: https:/
Dexuan Cui (decui) wrote : | #8 |
This is the network config. Let me know if you need more info.
Joseph Salisbury (jsalisbury) wrote : | #9 |
If possible, can we also get a test kernel for this bug when available? Folks would like to confirm that the two patches fix the issue prior to the release of the image.
Changed in linux-azure (Ubuntu): | |
assignee: | nobody → Marcelo Cerri (mhcerri) |
Changed in linux-azure (Ubuntu Bionic): | |
status: | New → Invalid |
Changed in linux-azure-4.15 (Ubuntu Focal): | |
status: | New → Invalid |
Changed in linux-azure (Ubuntu Focal): | |
status: | New → In Progress |
Changed in linux-azure-4.15 (Ubuntu Bionic): | |
status: | New → In Progress |
assignee: | nobody → Marcelo Cerri (mhcerri) |
Changed in linux-azure (Ubuntu Focal): | |
assignee: | nobody → Marcelo Cerri (mhcerri) |
Marcelo Cerri (mhcerri) wrote : | #10 |
Hi, Joseph.
I've built test kernels for xenial:linux-azure, bionic:
https:/
Inside each directory, I included all the kernel .deb files and also a tarball containing the same .debs. The patches used for each kernel version are also inside each directory.
Let me know if they work as expected.
Thank you.
Dexuan Cui (decui) wrote : | #11 |
Thanks, Marcelo! I tested all the 3 kernels and they worked as we expected.
Joseph Salisbury (jsalisbury) wrote : | #12 |
Marcelo, will these fixes be available in the last SRU cycle of 2020?
Marcelo Cerri (mhcerri) wrote : | #13 |
Following the regular SRU process, I would expect it to be included in the next cycle, with a release date planned for the week of Jan 25. If you can, please let me know today whether this date works for you. Thanks!
description: | updated |
Changed in linux-azure (Ubuntu Focal): | |
status: | In Progress → Fix Committed |
Changed in linux-azure-4.15 (Ubuntu Bionic): | |
status: | In Progress → Fix Committed |
Launchpad Janitor (janitor) wrote : | #14 |
This bug was fixed in the package linux-azure - 5.4.0-1035.36
---------------
linux-azure (5.4.0-1035.36) focal; urgency=medium
* focal/linux-azure: 5.4.0-1035.36 -proposed tracker (LP: #1907589)
* [linux-azure] IP forwarding issue in netvsc (LP: #1902531)
- hv_netvsc: record hardware hash in skb
- hv_netvsc: make recording RSS hash depend on feature flag
[ Ubuntu: 5.4.0-59.65 ]
* focal/linux: 5.4.0-59.65 -proposed tracker (LP: #1907604)
* focal: selftests/bpf build broken: test_map_
directory (LP: #1906866)
- SAUCE: Revert selftests/ "bpf: Zero-fill re-used per-cpu map element"
* Packaging resync (LP: #1786013)
- update dkms package versions
* memory is leaked when tasks are moved to net_prio (LP: #1886859)
- netprio_cgroup: Fix unlimited memory leak of v2 cgroups
* Focal update: v5.4.78 upstream stable release (LP: #1905618)
- drm/i915/gem: Flush coherency domains on first set-domain-ioctl
- time: Prevent undefined behaviour in timespec64_to_ns()
- nbd: don't update block size after device is started
- KVM: arm64: Force PTE mapping on fault resulting in a device mapping
- PCI: qcom: Make sure PCIe is reset before init for rev 2.1.0
- usb: dwc3: gadget: Continue to process pending requests
- usb: dwc3: gadget: Reclaim extra TRBs after request completion
- btrfs: tracepoints: output proper root owner for trace_find_
- btrfs: sysfs: init devices outside of the chunk_mutex
- btrfs: reschedule when cloning lots of extents
- ASoC: Intel: kbl_rt5663_
- genirq: Let GENERIC_IRQ_IPI select IRQ_DOMAIN_
- hv_balloon: disable warning when floor reached
- net: xfrm: fix a race condition during allocing spi
- ASoC: codecs: wcd9335: Set digital gain range correctly
- xfs: set xefi_discard when creating a deferred agfl free log intent item
- netfilter: use actual socket sk rather than skb sk when routing harder
- netfilter: nf_tables: missing validation from the abort path
- netfilter: ipset: Update byte and packet counters regardless of whether they
match
- powerpc/eeh_cache: Fix a possible debugfs deadlock
- perf trace: Fix segfault when trying to trace events by cgroup
- perf tools: Add missing swap for ino_generation
- ALSA: hda: prevent undefined shift in snd_hdac_
- iommu/vt-d: Fix a bug for PDP check in prq_event_thread
- afs: Fix warning due to unadvanced marshalling pointer
- can: rx-offload: don't call kfree_skb() from IRQ context
- can: dev: can_get_echo_skb(): prevent call to kfree_skb() in hard IRQ
context
- can: dev: __can_get_
frames
- can: can_create_
- can: j1939: swap addr and pgn in the send example
- can: j1939: j1939_sk_bind(): return failure if netdev is down
- can: ti_hecc: ti_hecc_probe(): add missed clk_disable_
path
- can: xilinx_can: handle failure cases of pm_runtime_get_sync
- can: peak_u...
Changed in linux-azure (Ubuntu Focal): | |
status: | Fix Committed → Fix Released |
Launchpad Janitor (janitor) wrote : | #15 |
This bug was fixed in the package linux-azure-4.15 - 4.15.0-1103.114
---------------
linux-azure-4.15 (4.15.0-1103.114) bionic; urgency=medium
* bionic/
* [linux-azure] IP forwarding issue in netvsc (LP: #1902531)
- hv_netvsc: record hardware hash in skb
- hv_netvsc: make recording RSS hash depend on feature flag
[ Ubuntu: 4.15.0-129.132 ]
* bionic/linux: 4.15.0-129.132 -proposed tracker (LP: #1907635)
* Packaging resync (LP: #1786013)
- update dkms package versions
* Ubuntu 18.04- call trace in kernel buffer when unloading ib_ipoib module
(LP: #1904848)
- SAUCE: net/mlx5e: IPoIB, initialize update_stat_work for ipoib devices
* memory is leaked when tasks are moved to net_prio (LP: #1886859)
- netprio_cgroup: Fix unlimited memory leak of v2 cgroups
* s390: dbginfo.sh triggers kernel panic, reading from
/sys/
- mm/page_idle.c: skip offline pages
* Bionic update: upstream stable patchset 2020-11-23 (LP: #1905333)
- drm/i915: Break up error capture compression loops with cond_resched()
- tipc: fix use-after-free in tipc_bcast_get_mode
- gianfar: Replace skb_realloc_
- gianfar: Account for Tx PTP timestamp in the skb headroom
- net: usb: qmi_wwan: add Telit LE910Cx 0x1230 composition
- sctp: Fix COMM_LOST/
- sfp: Fix error handing in sfp_probe()
- Blktrace: bail out early if block debugfs is not configured
- i40e: Fix of memory leak and integer truncation in i40e_virtchnl.c
- Fonts: Replace discarded const qualifier
- ALSA: usb-audio: Add implicit feedback quirk for Qu-16
- lib/crc32test: remove extra local_irq_
- kthread_worker: prevent queuing delayed work from timer_fn when it is being
canceled
- mm: always have io_remap_
- gfs2: Wake up when sd_glock_disposal becomes zero
- ftrace: Fix recursion check for NMI test
- ftrace: Handle tracing when switching between context
- tracing: Fix out of bounds write in get_trace_buf
- futex: Handle transient "ownerless" rtmutex state correctly
- ARM: dts: sun4i-a10: fix cpu_alert temperature
- x86/kexec: Use up-to-dated screen_info copy to fill boot params
- of: Fix reserved-memory overlap detection
- blk-cgroup: Fix memleak on error path
- blk-cgroup: Pre-allocate tree node on blkg_conf_prep
- scsi: core: Don't start concurrent async scan on same host
- vsock: use ns_capable_
- drm/vc4: drv: Add error handding for bind
- ACPI: NFIT: Fix comparison to '-ENXIO'
- vt: Disable KD_FONT_OP_COPY
- fork: fix copy_process(
- serial: 8250_mtk: Fix uart_get_baud_rate warning
- serial: txx9: add missing platform_
serial_
- USB: serial: cyberjack: fix write-URB completion race
- USB: serial: option: add Quectel EC200T module support
- USB: serial: option: add LE910...
Changed in linux-azure-4.15 (Ubuntu Bionic): | |
status: | Fix Committed → Fix Released |
Launchpad Janitor (janitor) wrote : | #16 |
This bug was fixed in the package linux-azure - 4.15.0-
---------------
linux-azure (4.15.0-
* xenial/linux-azure: 4.15.0-
[ Ubuntu: 4.15.0-1103.114 ]
* bionic/
* [linux-azure] IP forwarding issue in netvsc (LP: #1902531)
- hv_netvsc: record hardware hash in skb
- hv_netvsc: make recording RSS hash depend on feature flag
* bionic/linux: 4.15.0-129.132 -proposed tracker (LP: #1907635)
* Packaging resync (LP: #1786013)
- update dkms package versions
* Ubuntu 18.04- call trace in kernel buffer when unloading ib_ipoib module
(LP: #1904848)
- SAUCE: net/mlx5e: IPoIB, initialize update_stat_work for ipoib devices
* memory is leaked when tasks are moved to net_prio (LP: #1886859)
- netprio_cgroup: Fix unlimited memory leak of v2 cgroups
* s390: dbginfo.sh triggers kernel panic, reading from
/sys/
- mm/page_idle.c: skip offline pages
* Bionic update: upstream stable patchset 2020-11-23 (LP: #1905333)
- drm/i915: Break up error capture compression loops with cond_resched()
- tipc: fix use-after-free in tipc_bcast_get_mode
- gianfar: Replace skb_realloc_
- gianfar: Account for Tx PTP timestamp in the skb headroom
- net: usb: qmi_wwan: add Telit LE910Cx 0x1230 composition
- sctp: Fix COMM_LOST/
- sfp: Fix error handing in sfp_probe()
- Blktrace: bail out early if block debugfs is not configured
- i40e: Fix of memory leak and integer truncation in i40e_virtchnl.c
- Fonts: Replace discarded const qualifier
- ALSA: usb-audio: Add implicit feedback quirk for Qu-16
- lib/crc32test: remove extra local_irq_
- kthread_worker: prevent queuing delayed work from timer_fn when it is being
canceled
- mm: always have io_remap_
- gfs2: Wake up when sd_glock_disposal becomes zero
- ftrace: Fix recursion check for NMI test
- ftrace: Handle tracing when switching between context
- tracing: Fix out of bounds write in get_trace_buf
- futex: Handle transient "ownerless" rtmutex state correctly
- ARM: dts: sun4i-a10: fix cpu_alert temperature
- x86/kexec: Use up-to-dated screen_info copy to fill boot params
- of: Fix reserved-memory overlap detection
- blk-cgroup: Fix memleak on error path
- blk-cgroup: Pre-allocate tree node on blkg_conf_prep
- scsi: core: Don't start concurrent async scan on same host
- vsock: use ns_capable_
- drm/vc4: drv: Add error handding for bind
- ACPI: NFIT: Fix comparison to '-ENXIO'
- vt: Disable KD_FONT_OP_COPY
- fork: fix copy_process(
- serial: 8250_mtk: Fix uart_get_baud_rate warning
- serial: txx9: add missing platform_
serial_
- USB: serial: cyberjack: fix write-URB completion race
- USB:...
Changed in linux-azure (Ubuntu): | |
status: | New → Fix Released |
Thanks for reporting the issue and for providing the backported fixes.
Do you have any instructions that we can use to reproduce and test the problem?