[hyper-v] reloading netvsc issue on linux-azure 4.13.0-1001.1

Bug #1735546 reported by Chris Valean
This bug affects 2 people
Affects                      Status         Importance  Assigned to    Milestone
linux-azure (Ubuntu)         Fix Committed  Undecided   Marcelo Cerri
  Xenial                     Fix Released   Undecided   Marcelo Cerri
linux-azure-edge (Ubuntu)    Fix Committed  Undecided   Marcelo Cerri
  Xenial                     Fix Committed  Undecided   Marcelo Cerri

Bug Description

On the proposed linux-azure 4.13.0-1001.1 kernel, reloading the netvsc module causes the VM network to stop working on WS2016.

Manual repro is easy: start a VM with that kernel and run:
modprobe -r hv_netvsc
modprobe hv_netvsc

At this point the command will hang and after 2 minutes the hung task messages will appear.
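
Since a hung modprobe also blocks the shell that started it, the reload can be run in the background with a deadline. This is only a sketch: the helper name run_with_deadline and the 130-second limit in the usage comment are illustrative assumptions, not part of the original report.

```shell
#!/bin/sh
# Sketch (assumed helper, not from the report): run a command in the
# background and print "hung" if it is still running after a deadline.
# A task stuck in uninterruptible sleep cannot be killed, so the helper
# only polls the PID and never blocks on the stuck process.
run_with_deadline() {
    deadline=$1; shift
    "$@" &
    pid=$!
    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$deadline" ]; then
            echo "hung"
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # The command finished on its own: collect and report its exit status.
    wait "$pid"
    rc=$?
    if [ "$rc" -eq 0 ]; then echo "ok"; else echo "failed:$rc"; fi
    return "$rc"
}

# On an affected VM (as root), the reload above would become:
#   run_with_deadline 130 modprobe -r hv_netvsc
#   run_with_deadline 130 modprobe hv_netvsc
```

The deadline is chosen slightly above the 2 minutes after which the hung task messages appear, so a "hung" result and the kernel messages should show up at about the same time.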

Test env info:
- ubuntu 16.04.3
- Affected platforms:
a) WS2016 and WS2016 fall update - 1709
b) WS2012R2

4.11 series linux-azure are not showing this behavior.

Tags: hyper-v
Chris Valean (chvale) wrote :

Adding this issue here because in the past we have seen linked issues between netvsc reloads and MTU changes.

The same hung task failure occurs when changing the MTU on the interface.

Repro steps:
ip link set dev eth0 mtu 3000
ip link set dev eth0 mtu 4500

At this point the command will hang, and the hung task messages appear after 120 seconds.
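
To confirm that the stall was reported by the kernel's hung task detector, rather than inferring it from the blocked shell, the kernel log can be filtered for the detector's message. A minimal sketch, assuming the detector's standard "blocked for more than N seconds" wording; the helper name hung_task_lines is hypothetical.

```shell
#!/bin/sh
# Sketch (assumed helper): print the hung task reports found on stdin.
# Relies on the kernel hung task detector's usual message wording:
#   "INFO: task <comm>:<pid> blocked for more than <N> seconds."
hung_task_lines() {
    grep 'blocked for more than [0-9]* seconds'
}

# Typical use from a second console after the MTU change stalls
# (reading the kernel log may need root):
#   dmesg | hung_task_lines
```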

description: updated
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-azure (Ubuntu):
status: New → Confirmed
Dexuan Cui (decui) wrote :

I can't reproduce the issue with 4.13.0-1004-azure-edge (https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/commit/?h=azure-edge-next&id=21d8a99f88af972684618521cf19adafe24dc566)

It looks like this bug has been fixed by some patch between linux-azure 4.13.0-1001.1 and 4.13.0-1004-azure-edge.

Dexuan Cui (decui) wrote :

BTW, I tested "modprobe -r hv_netvsc; modprobe hv_netvsc" and "ip link set dev eth0 mtu 3000; ip link set dev eth0 mtu 4500" with 4.13.0-1004-azure-edge on WS 2016 (Version 1607, OS build 14393:1943).

Leann Ogasawara (leannogasawara) wrote :

After a chat with Josh Poulson @Microsoft, we believe the pull request in bug https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1736283 will likely resolve the issue seen here. We will post a test kernel in bug 1736283.

Dexuan Cui (decui) wrote :

@leann Actually, I'm not sure the pull request in bug 1736283 can fix this bug. It looks like 4.13.0-1004-azure-edge has already fixed this bug somehow.

Leann Ogasawara (leannogasawara) wrote :

Hrm, interesting. Thanks for the note, I'll circle around with the team here to see what the discrepancy might be.

Leann Ogasawara (leannogasawara) wrote :

This is indeed very odd. Examining both code bases, we don't see any significant difference. The source code is the same, as well as the configs and the module inclusion lists. We're checking if we can reproduce bug 1736283.

@jpoulson probably knows the answer, but were both linux-azure and linux-azure-edge tested in the same manner?

Marcelo Cerri (mhcerri) wrote :

Dexuan,

Are you using gen1 or gen2 VMs?

Marcelo Cerri (mhcerri) wrote :

Also, which specific builds of linux-azure and linux-azure-edge are you using? Are you using any versions from http://kernel.ubuntu.com/~mhcerri/azure/?

Joshua R. Poulson (jrp) wrote :

Right now only Generation 1 VMs are used in Azure.

Marcelo Cerri (mhcerri) wrote :

I couldn't properly verify the problem on Azure. If I execute:

sudo sh -c 'modprobe -r hv_netvsc; modprobe hv_netvsc'

on an Azure instance, I lose the SSH connection, but I don't see the hung task messages in the boot diagnostics window. The problem is probably happening and I'm just not getting the messages.
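
Because the SSH session is lost as soon as the driver is unloaded, one option is to start the reload detached from the session and keep its output in a file that can be read later. A sketch only, assuming the instance can be reached again afterwards (for example after a restart); the helper name run_detached, the log path and the 150-second wait are hypothetical.

```shell
#!/bin/sh
# Sketch (assumed helper): run a command line detached from the SSH
# session and keep its output in a log file for later inspection.
run_detached() {
    log=$1; shift
    # nohup keeps the job alive when the session drops; stdin is taken
    # from /dev/null so nothing depends on the terminal.
    nohup sh -c "$*" >"$log" 2>&1 </dev/null &
}

# On an Azure instance (as root), for example:
#   run_detached /var/tmp/netvsc-reload.log \
#     '(modprobe -r hv_netvsc; modprobe hv_netvsc) & sleep 150; dmesg | tail -n 100; sync'
# The reload runs in its own background subshell so that the kernel log
# is still captured after 150 seconds even if modprobe never returns.
```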

On a local Hyper-V gen2 VM, I can trigger the problem with the following sequence:

modprobe -r hv_netvsc
modprobe hv_netvsc
modprobe -r hv_netvsc

And I can reproduce the issue with linux-azure 4.13.0-1001.1, with linux-azure-edge 4.13.0-1004.4 and even with linux-azure 4.13.0-1001.1 with the pull request changes.

I will re-test it using a gen1 VM.

Dexuan Cui (decui) wrote :

@mhcerri
I only tested Gen1 VM (4.13.0-1004-azure-edge) on my local Hyper-V host (WS 2016), and couldn't repro the issue, i.e. reloading hv_netvsc and changing MTU worked fine.

As I understand it, the bug was originally reported against 4.13.0-1001.1 (I did not test this version). That's why I think the bug was somehow fixed in 4.13.0-1004-azure-edge.

Ubuntu-azure-edge-4.13.0-1001.1 is 567ef14ee13c5c4e336121106cc19733800d618e, and
Ubuntu-azure-edge-4.13.0-1004.4 is 21d8a99f88af972684618521cf19adafe24dc566.

There are non-trivial changes between them:

root@decui-1604:/opt/linux-azure# git diff 567ef14ee13c5c4e336121106cc19733800d618e 21d8a99f88af972684618521cf19adafe24dc566 -- drivers/hv/ drivers/net/hyperv/| wc -l
2353
root@decui-1604:/opt/linux-azure# git diff 567ef14ee13c5c4e336121106cc19733800d618e 21d8a99f88af972684618521cf19adafe24dc566 -- drivers/hv/ drivers/net/hyperv/| grep ^diff
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
diff --git a/drivers/hv/hv_kvp.c b/drivers/hv/hv_kvp.c
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c

Dexuan Cui (decui) wrote :

@mhcerri
FYI: I'm not using any kernel from http://kernel.ubuntu.com/~mhcerri/azure/.

Marcelo Cerri (mhcerri) wrote :

@decui

The repos are right. I'm asking because some of the kernels under http://kernel.ubuntu.com/~mhcerri/azure/ are test kernels.

linux-azure and linux-azure-edge version numbers are not equivalent. Thus, linux-azure 4.13.0-1001.1 is not equivalent to linux-azure-edge 4.13.0-1001.1. In fact, linux-azure 4.13.0-1001.1 should be equivalent to linux-azure-edge 4.13.0-1004.4.
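
Since the numbers only make sense within one flavour, it can help to report the flavour together with the version whenever a kernel is quoted. A minimal sketch, assuming release strings of the form "4.13.0-1004-azure-edge" as used earlier in this thread; the helper name describe_release is hypothetical.

```shell
#!/bin/sh
# Sketch (assumed helper): split a kernel release string such as
# "4.13.0-1004-azure-edge" into version, ABI number and flavour, so that
# ABI numbers are only ever compared within the same flavour.
describe_release() {
    rel=$1
    version=${rel%%-*}   # text before the first "-"
    rest=${rel#*-}       # text after the first "-"
    abi=${rest%%-*}      # ABI number
    flavour=${rest#*-}   # everything after the ABI number
    echo "version=$version abi=$abi flavour=$flavour"
}

# On a running system:
#   describe_release "$(uname -r)"
```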

Marcelo Cerri (mhcerri) wrote :

@decui

Are you building the kernels directly from the git repo or are you using binary packages from somewhere?

Marcelo Cerri (mhcerri) wrote :

OK. I've tested it again using gen1 VMs and got the same results. Both linux-azure and linux-azure-edge show the problem, but in both cases I need to unload, load and unload the driver again to trigger the issue.

I also tested the test kernel with the pull request and it shows the same problem. I will double check that build to make sure everything was correctly applied.

I'm using Hyper-V on a Win10Pro machine.

Dexuan Cui (decui) wrote :

@mhcerri
Thanks for the explanation! I built the kernels directly from the git repo and didn't use the binary packages (I thought they should be the same). What's your repro rate? Yesterday, when Chris tested linux-azure-edge 4.13.0-1004.4 + my pull request, the repro rate was only ~1%, meaning we have to do "unload/reload netvsc" several hundred times to repro the hang or the call trace. Previously, without the pull request, Chris mentioned it was much easier to repro the issue. So it looks to me like the bug is timing-sensitive, and somehow my pull request made it harder to repro.
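
With a repro rate that low, the unload/reload cycle has to be scripted. A sketch of such a loop, under the assumption that a failing cycle either returns a non-zero exit status or hangs; the helper name repeat_until_failure and the iteration count in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch (assumed helper): repeat a command up to MAX times and stop at
# the first non-zero exit status. The iteration number is printed before
# each run, so if a run hangs, the last line shows how many cycles it took.
repeat_until_failure() {
    max=$1; shift
    i=1
    while [ "$i" -le "$max" ]; do
        echo "iteration $i"
        "$@" || { echo "failed at iteration $i"; return 1; }
        i=$((i + 1))
    done
    echo "no failure in $max iterations"
}

# On a test VM (as root), for example:
#   repeat_until_failure 500 sh -c 'modprobe -r hv_netvsc && modprobe hv_netvsc'
```

A call trace that does not stop the command would go unnoticed by this loop, so the kernel log still needs to be checked afterwards.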

Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu):
status: Confirmed → Fix Committed
Changed in linux-azure (Ubuntu Xenial):
status: New → Fix Committed
Changed in linux-azure (Ubuntu):
assignee: nobody → Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu Xenial):
assignee: nobody → Marcelo Cerri (mhcerri)
Marcelo Cerri (mhcerri)
Changed in linux-azure-edge (Ubuntu):
assignee: nobody → Marcelo Cerri (mhcerri)
Changed in linux-azure-edge (Ubuntu Xenial):
assignee: nobody → Marcelo Cerri (mhcerri)
Changed in linux-azure-edge (Ubuntu):
status: New → Fix Committed
Changed in linux-azure-edge (Ubuntu Xenial):
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 4.13.0-1005.7

---------------
linux-azure (4.13.0-1005.7) xenial; urgency=low

  * linux-azure: 4.13.0-1005.7 -proposed tracker (LP: #1741957)

  * CVE-2017-5754
    - Revert "UBUNTU: [Config] azure: updateconfigs to enable PTI"
    - [Config] azure: Enable PTI with UNWINDER_FRAME_POINTER

  [ Ubuntu: 4.13.0-25.29 ]

  * linux: 4.13.0-25.29 -proposed tracker (LP: #1741955)
  * CVE-2017-5754
    - Revert "UBUNTU: [Config] updateconfigs to enable PTI"
    - [Config] Enable PTI with UNWINDER_FRAME_POINTER

linux-azure (4.13.0-1004.6) xenial; urgency=low

  * linux-azure: 4.13.0-1004.6 -proposed tracker (LP: #1741747)

  [ Ubuntu: 4.13.0-24.28 ]

  * linux: 4.13.0-24.28 -proposed tracker (LP: #1741745)
  * CVE-2017-5754
    - x86/cpu, x86/pti: Do not enable PTI on AMD processors

linux-azure (4.13.0-1003.5) xenial; urgency=low

  * linux-azure: 4.13.0-1003.5 -proposed tracker (LP: #1741557)

  * CVE-2017-5754
    - [Config] azure: updateconfigs to enable PTI

  [ Ubuntu: 4.13.0-23.27 ]

  * linux: 4.13.0-23.27 -proposed tracker (LP: #1741556)
  * CVE-2017-5754
    - x86/mm: Add the 'nopcid' boot option to turn off PCID
    - x86/mm: Enable CR4.PCIDE on supported systems
    - x86/mm: Document how CR4.PCIDE restore works
    - x86/entry/64: Refactor IRQ stacks and make them NMI-safe
    - x86/entry/64: Initialize the top of the IRQ stack before switching stacks
    - x86/entry/64: Add unwind hint annotations
    - xen/x86: Remove SME feature in PV guests
    - x86/xen/64: Rearrange the SYSCALL entries
    - irq: Make the irqentry text section unconditional
    - x86/xen/64: Fix the reported SS and CS in SYSCALL
    - x86/paravirt/xen: Remove xen_patch()
    - x86/traps: Simplify pagefault tracing logic
    - x86/idt: Unify gate_struct handling for 32/64-bit kernels
    - x86/asm: Replace access to desc_struct:a/b fields
    - x86/xen: Get rid of paravirt op adjust_exception_frame
    - x86/paravirt: Remove no longer used paravirt functions
    - x86/entry: Fix idtentry unwind hint
    - x86/mm/64: Initialize CR4.PCIDE early
    - objtool: Add ORC unwind table generation
    - objtool, x86: Add facility for asm code to provide unwind hints
    - x86/unwind: Add the ORC unwinder
    - x86/kconfig: Consolidate unwinders into multiple choice selection
    - objtool: Upgrade libelf-devel warning to error for CONFIG_ORC_UNWINDER
    - x86/ldt/64: Refresh DS and ES when modify_ldt changes an entry
    - x86/mm: Give each mm TLB flush generation a unique ID
    - x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
    - x86/mm: Rework lazy TLB mode and TLB freshness tracking
    - x86/mm: Implement PCID based optimization: try to preserve old TLB entries
      using PCID
    - x86/mm: Factor out CR3-building code
    - x86/mm/64: Stop using CR3.PCID == 0 in ASID-aware code
    - x86/mm: Flush more aggressively in lazy TLB mode
    - Revert "x86/mm: Stop calling leave_mm() in idle code"
    - kprobes/x86: Set up frame pointer in kprobe trampoline
    - x86/tracing: Introduce a static key for exception tracing
    - x86/boot: Add early cmdline parsing for options with arguments
    - mm, x86/mm...

Changed in linux-azure (Ubuntu Xenial):
status: Fix Committed → Fix Released