Networking hangs on azure using hv_netvsc; bisected

Bug #1508706 reported by Jay Vosburgh
This bug affects 4 people
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Fix Released  High        Unassigned
Vivid           Fix Released  High        Unassigned

Bug Description

While running Ubuntu instances on Azure, I was testing basic networking between two instances. The test configures VXLAN between the two instances, then runs iperf and an rsync of the kernel tree between them, e.g.,

ip link add vxlan0 type vxlan id 999 local 10.88.0.12 remote 10.88.0.11 dev eth0
ip l set vxlan0 up
ip addr add 242.0.0.12/8 dev vxlan0
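
The commands above show only one end of the tunnel. As a sketch of how both ends might be set up, the helper below prints the same three commands for either instance. The function name vxlan_cmds and the peer's overlay address 242.0.0.11/8 are assumptions for illustration; only the 10.88.0.12 side appears in the report.

```shell
#!/bin/sh
# vxlan_cmds LOCAL_IP REMOTE_IP OVERLAY_CIDR
# Prints (does not run) the setup commands for one end of the tunnel.
# Hypothetical helper; VNI 999, vxlan0 and eth0 are taken from the report.
vxlan_cmds() {
    echo "ip link add vxlan0 type vxlan id 999 local $1 remote $2 dev eth0"
    echo "ip link set vxlan0 up"
    echo "ip addr add $3 dev vxlan0"
}

# Instance shown in the report:
vxlan_cmds 10.88.0.12 10.88.0.11 242.0.0.12/8
# Peer instance (the overlay address is an assumption):
vxlan_cmds 10.88.0.11 10.88.0.12 242.0.0.11/8
```

After reviewing the output, the lines for each instance can be piped to `sudo sh` on that instance to apply them.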

After some time (sometimes immediately, sometimes after up to 30 minutes of activity), networking hangs. The hang takes two forms: a complete loss of connectivity (all networking, including the ssh session used to log in), or a loss of connectivity between the instances only (the ssh session remains active). In the latter case, the ssh session sometimes hangs later as well.
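
Because the hang can take up to 30 minutes to appear, it may help to record when connectivity actually drops. The following is a minimal sketch, assuming iputils ping and a peer address reachable over vxlan0; the function name wait_for_hang and its optional interval and probe-command arguments are hypothetical.

```shell
#!/bin/sh
# wait_for_hang PEER [INTERVAL_SECONDS] [PROBE_COMMAND]
# Probes PEER repeatedly and returns once a probe fails, printing a UTC
# timestamp. PROBE_COMMAND defaults to three pings with a 2 second timeout
# and is deliberately left unquoted below so that it splits into words.
wait_for_hang() {
    peer=$1
    interval=${2:-5}
    probe=${3:-ping -c 3 -W 2}
    while $probe "$peer" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) lost connectivity to $peer"
}

# Example (the peer overlay address is an assumption, not from the report):
#   wait_for_hang 242.0.0.11 >> /var/tmp/hang.log
```

Since the ssh session itself may hang, it is safer to run this under nohup or screen and write the output to a file.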

This first appeared when testing with the Ubuntu 3.19 kernel, and I subsequently bisected this to:

commit effa2012d207f78cbc5a8360e62d420a8860b7e9
Author: KY Srinivasan <email address hidden>
Date: Mon May 11 15:39:46 2015 -0700

    hv_netvsc: Use the xmit_more skb flag to optimize signaling the host

    BugLink: http://bugs.launchpad.net/bugs/1454892

    Based on the information given to this driver (via the xmit_more skb flag),
    we can defer signaling the host if more packets are on the way. This will help
    make the host more efficient since it can potentially process a larger batch of
    packets. Implement this optimization.

    Signed-off-by: K. Y. Srinivasan <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>
    Acked-by: Tim Gardner <email address hidden>
    Acked-by: Brad Figg <email address hidden>
    Signed-off-by: Brad Figg <email address hidden>

I also tested the mainline kernel (net-next); it fails with the equivalent commit:

commit 82fa3c776e5abba7ed6e4b4f4983d14731c37d6a
Author: KY Srinivasan <email address hidden>
Date: Mon May 11 15:39:46 2015 -0700

    hv_netvsc: Use the xmit_more skb flag to optimize signaling the host

For both kernel trees, I also tested the prior commit, and it did not
exhibit the failure after many hours of testing. For Ubuntu, this was

commit a4aeb290bd75af5e16a6144a418291476ac6140c
Author: K. Y. Srinivasan <email address hidden>
Date: Wed Mar 18 12:29:29 2015 -0700

    Drivers: hv: vmbus: Export the vmbus_sendpacket_pagebuffer_ctl()

and for mainline it was

commit 9eea92226407e7a117ef1ceef45380ebd000a0e2
Author: Alexei Starovoitov <email address hidden>
Date: Mon May 11 15:19:48 2015 -0700

    pktgen: fix packet generation
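
For anyone repeating the bisection, the general git bisect workflow looks roughly like the following. This is a sketch, not a record of the commands actually used for this report; start_bisect is a hypothetical helper, and the good and bad endpoints are whichever kernels have already been tested.

```shell
#!/bin/sh
# start_bisect GOOD BAD
# Begins a bisect between a known-good and a known-bad commit. git checks
# out a commit roughly halfway between them for the next build and test.
start_bisect() {
    git bisect start "$2" "$1"
}

# After each build, boot and soak test on the instances, report the result:
#   git bisect good    # no hang after a long run
#   git bisect bad     # networking hung
# When git prints "<sha> is the first bad commit", finish with:
#   git bisect log     # record of the session
#   git bisect reset   # return to the original checkout
```

Because the hang is intermittent, a commit should only be marked good after a run much longer than the longest time the hang has taken to appear.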

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1508706

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
penalvch (penalvch)
tags: added: bisect-done
tags: added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key kernel-hyper-v
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Joshua R. Poulson (jrp) wrote :

We are investigating this report and will update the bug when we know more.

Alex Ng (alexng-v) wrote :

Hi Jay,

Does this issue occur even when VXLAN is not configured between the two instances?

Jay Vosburgh (jvosburgh) wrote :

Yes, it did, although it seemed to be easier to reproduce with vxlan configured.

Stephen A. Zarkos (stevez) wrote :

KY has pushed a patch to LKML that resolves this issue: https://lkml.org/lkml/2015/11/18/690

This patch needs to be applied to the Vivid kernel and beyond. This is a critical patch; can you please take it as SAUCE for now so we can get it out as quickly as possible?

Jay Vosburgh (jvosburgh) wrote :

We are testing this patch immediately (overnight US time) and will report our results as soon as they are available.

Jay Vosburgh (jvosburgh) wrote :

I have tested the patch referenced in comment #5 and it appears to resolve the network hang.

I first built and tested the Ubuntu LTS 3.19.0-31.36~14.04.1 kernel and reproduced the issue using the methodology described in the original bug description. This is commit

commit 15e42c329445b4e0f0aecefc39e205c44755c2ba
Author: Luis Henriques <email address hidden>
Date: Thu Oct 8 10:26:57 2015 +0100

    UBUNTU: Ubuntu-lts-3.19.0-31.36~14.04.1

in the lts-backport-vivid branch of git://kernel.ubuntu.com/ubuntu/ubuntu-trusty.git

I then applied the referenced patch and tested again and was unable to reproduce the issue after roughly an hour of testing.

Jay Vosburgh (jvosburgh) wrote :

SRU Justification:

Impact:

 Bug causes an easily reproducible freeze of networking on affected
systems under moderate to high network load. Ordinary benchmark
tools such as iperf induce the problem without difficulty. Affected
systems are virtual machine instances running on Azure that use the
hv_netvsc network device driver.

Fix:

 Fix is to apply patch provided by Microsoft:

http://marc.info/?l=linux-kernel&m=144787522532687&w=2

Testcase:

 Tested as described in Bug Description.

K Y Srinivasan (kys) wrote :

Yes, I submitted a v2 version of the patch yesterday. The two versions are functionally equivalent, and obviously I would prefer that you pick up the second version. However, if the kernel is already built and ready to go, I would go with the first version. Andy is right that there will be one more patch that uses the functionality exposed here, but that will be in the netvsc code and is not a correctness issue. I want this patch to be applied to stable, and the patch could not be more than 100 lines. Once Greg commits this patch, I will submit the follow-on patch in netvsc.

Bryan Quigley (bryanquigley) wrote :

Got another confirmation that the patch (first one) fixes the issue.

Kamal Mostafa (kamalmostafa) wrote :

The v1 version of KY's patch has been applied to Vivid[0] and is building now as (3.19.0-37.42). We intend to replace it with the mainline version in the future, once the dust settles.

[0] http://kernel.ubuntu.com/git/ubuntu/ubuntu-vivid.git/commit/?h=master-next&id=0b599c6174684f18f8bd635cb94f483c7682c4f8

Luis Henriques (henrix)
Changed in linux (Ubuntu Vivid):
status: New → Fix Committed
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-vivid' to 'verification-done-vivid'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-vivid
Ovidiu Rusu (orusu) wrote :

I've tested this version of the kernel from proposed, 3.19.0-37-generic, and everything works well.

tags: added: verification-done-vivid
removed: verification-needed-vivid
Joshua R. Poulson (jrp) wrote :

We've done broad testing on this kernel and it looks good, thanks!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.19.0-37.42

---------------
linux (3.19.0-37.42) vivid; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1518406

  [ K. Y. Srinivasan ]

  * SAUCE: Drivers: hv: vmbus: Fix a Host signaling bug
    - LP: #1508706

 -- Kamal Mostafa <email address hidden> Fri, 20 Nov 2015 09:49:10 -0800
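
One quick way to check whether a running Vivid-series kernel is at or past the fixed upload is to compare the ABI number in `uname -r` against 37. This is only a sketch: the function name kernel_has_fix is hypothetical, and it assumes that later 3.19.0 uploads retain the fix and that the release string has the usual 3.19.0-ABI-flavour shape (for example 3.19.0-37-generic).

```shell
#!/bin/sh
# kernel_has_fix [RELEASE]
# Returns 0 if RELEASE (default: the running kernel) is a 3.19.0 kernel
# with ABI number 37 or higher, 1 if the ABI number is lower, and 2 if the
# release string is not a 3.19.0 kernel at all.
kernel_has_fix() {
    rel=${1:-$(uname -r)}
    case $rel in
        3.19.0-[0-9]*) ;;
        *) return 2 ;;
    esac
    abi=${rel#3.19.0-}   # e.g. "37-generic"
    abi=${abi%%-*}       # e.g. "37"
    [ "$abi" -ge 37 ]
}

# Example:
#   kernel_has_fix && echo "running kernel includes the 3.19.0-37.42 fix"
```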

Changed in linux (Ubuntu Vivid):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Triaged → Fix Released
Changed in linux (Ubuntu Vivid):
importance: Undecided → High