[Hyper-V] do not lose pending heartbeat vmbus packets

Bug #1632786 reported by Joshua R. Poulson on 2016-10-12
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Medium
Joseph Salisbury
Precise
Medium
Joseph Salisbury
Trusty
Medium
Joseph Salisbury
Vivid
Medium
Joseph Salisbury
Xenial
Medium
Joseph Salisbury
Yakkety
Medium
Joseph Salisbury

Bug Description

Hyper-V hosts can continue sending heartbeat packets to guests independent of whether earlier packets have responses, which led to a potential issue of these packets being dropped when responses took too long to process. Lost heartbeats will lead to the host diagnosing that the guest is dead and should be shut down and restarted.

The following patch was submitted upstream but has not yet been accepted. I will add the upstream commit ID once the patch goes into linux-next:

From: Long Li <email address hidden>

The host keeps sending heartbeat packets independent of the
guest responding to them. Even though we respond to the heartbeat messages at
interrupt level, we can have situations where there maybe multiple heartbeat
messages pending that have not been responded to. For instance this occurs when the
VM is paused and the host continues to send the heartbeat messages.
Address this issue by draining and responding to all
the heartbeat messages that maybe pending.

Signed-off-by: Long Li <email address hidden>
Signed-off-by: K. Y. Srinivasan <email address hidden>
CC: Stable <email address hidden>
---
        V2: Submit the patch to stable as well - Joshua R. Poulson <email address hidden>

 drivers/hv/hv_util.c | 10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/hv/hv_util.c b/drivers/hv/hv_util.c
index 4aa3cb6..bcd0630 100644
--- a/drivers/hv/hv_util.c
+++ b/drivers/hv/hv_util.c
@@ -314,10 +314,14 @@ static void heartbeat_onchannelcallback(void *context)
        u8 *hbeat_txf_buf = util_heartbeat.recv_buffer;
        struct icmsg_negotiate *negop = NULL;

- vmbus_recvpacket(channel, hbeat_txf_buf,
- PAGE_SIZE, &recvlen, &requestid);
+ while (1) {
+
+ vmbus_recvpacket(channel, hbeat_txf_buf,
+ PAGE_SIZE, &recvlen, &requestid);
+
+ if (!recvlen)
+ break;

- if (recvlen > 0) {
                icmsghdrp = (struct icmsg_hdr *)&hbeat_txf_buf[
                                sizeof(struct vmbuspipe_hdr)];

Joshua R. Poulson (jrp) wrote :

This affects 16.10, 16.04, 14.04, and 12.04. Please apply this patch to the HWE kernels for the LTS releases.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1632786

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key kernel-hyper-v
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Changed in linux (Ubuntu Xenial):
status: New → Triaged
Changed in linux (Ubuntu Trusty):
status: New → Triaged
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
Changed in linux (Ubuntu Trusty):
importance: Undecided → Medium
Tim Gardner (timg-tpi) wrote :

Joshua - can you attach this patch in a usable form ? It appears to be white space damaged.

Joshua R. Poulson (jrp) wrote :

From: Long Li <email address hidden>

The host keeps sending heartbeat packets independent of the
guest responding to them. Even though we respond to the heartbeat messages at
interrupt level, we can have situations where there maybe multiple heartbeat
messages pending that have not been responded to. For instance this occurs when the
VM is paused and the host continues to send the heartbeat messages.
Address this issue by draining and responding to all
the heartbeat messages that maybe pending.

Signed-off-by: Long Li <email address hidden>
Signed-off-by: K. Y. Srinivasan <email address hidden>
---
 drivers/hv/hv_util.c | 10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/hv/hv_util.c b/drivers/hv/hv_util.c
index 4aa3cb6..bcd0630 100644
--- a/drivers/hv/hv_util.c
+++ b/drivers/hv/hv_util.c
@@ -314,10 +314,14 @@ static void heartbeat_onchannelcallback(void *context)
        u8 *hbeat_txf_buf = util_heartbeat.recv_buffer;
        struct icmsg_negotiate *negop = NULL;

- vmbus_recvpacket(channel, hbeat_txf_buf,
- PAGE_SIZE, &recvlen, &requestid);
+ while (1) {
+
+ vmbus_recvpacket(channel, hbeat_txf_buf,
+ PAGE_SIZE, &recvlen, &requestid);
+
+ if (!recvlen)
+ break;

- if (recvlen > 0) {
                icmsghdrp = (struct icmsg_hdr *)&hbeat_txf_buf[
                                sizeof(struct vmbuspipe_hdr)];

Changed in linux (Ubuntu Trusty):
status: Triaged → Confirmed
Changed in linux (Ubuntu Xenial):
status: Triaged → Confirmed
Changed in linux (Ubuntu Yakkety):
status: Triaged → Confirmed
Joshua R. Poulson (jrp) wrote :

Also affects: Precise

Changed in linux (Ubuntu Precise):
status: New → Confirmed
importance: Undecided → High
importance: High → Medium
Joshua R. Poulson (jrp) wrote :

Steps to Reproduce:
- Cluster setup configuration (4x2)
- Windows server 2016
- Setup array for cluster
- Installed Failover Cluster and HyperV on all nodes.
- Created VMs and IOs volumes on both CSV and non-CSV (Cluster Shared Volume).
- Started I/O on VMs.
- Started graceful node failover test, by pausing and drain all the VMs from the a node.
- All VMs transferred without any issue, the node is in paused state.
- Rebooted the paused node.
- After the node came back to optimal, resume the node, VMs started to transfer back.

Joseph Salisbury (jsalisbury) wrote :

I built a Yakkety test kernel with that patch, which can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1632786/yakkety/

Can you test this kernel and see if it resolves this bug?

I'll also build P, T and X test kernels with this patch and post links to them.

Joseph Salisbury (jsalisbury) wrote :
Joseph Salisbury (jsalisbury) wrote :

s/not/now

Joshua R. Poulson (jrp) wrote :

Chris, have these test kernels been tried?

Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Precise):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Trusty):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Joseph Salisbury (jsalisbury)
status: Confirmed → In Progress
Changed in linux (Ubuntu Xenial):
status: Confirmed → In Progress
Changed in linux (Ubuntu Trusty):
status: Confirmed → In Progress
Changed in linux (Ubuntu Precise):
status: Confirmed → In Progress
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Joshua R. Poulson (jrp) wrote :

Upstream kernel commit ID 407a3aee6ee2d2cb46d9ba3fc380bc29f35d020c

Chris Valean (cvalean) wrote :

Hi Joe,

I finished the verification of test kernels provided in comment #8 and the reported heartbeat issue is resolved with the patch included.
Please continue with marking this for proposed & release.
Thank you!

Luis Henriques (henrix) on 2016-11-15
Changed in linux (Ubuntu Precise):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Vivid):
status: New → Fix Committed
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Yakkety):
status: In Progress → Fix Committed
Joshua R. Poulson (jrp) wrote :

The test kernels were okay, should we be testing -proposed?

Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Vivid):
importance: Undecided → Medium
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

The patch is in fact in -proposed. The bug bot should come along and asked to test -proposed once we are in the SRU verification phase.

Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-precise' to 'verification-done-precise'. If the problem still exists, change the tag 'verification-needed-precise' to 'verification-failed-precise'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-precise
tags: added: verification-needed-trusty
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'. If the problem still exists, change the tag 'verification-needed-trusty' to 'verification-failed-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-vivid
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-vivid' to 'verification-done-vivid'. If the problem still exists, change the tag 'verification-needed-vivid' to 'verification-failed-vivid'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-yakkety
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Chris Valean (cvalean) on 2016-12-10
tags: added: verification-done-yakkety
removed: verification-needed-yakkety
Chris Valean (cvalean) on 2016-12-11
tags: added: verification-done-trusty
removed: verification-needed-trusty
Brad Figg (brad-figg) on 2016-12-12
tags: added: verification-done-vivid
removed: verification-needed-vivid
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Chris Valean (cvalean) on 2016-12-15
tags: added: verification-done-xenial
removed: verification-needed-xenial
Chris Valean (cvalean) on 2016-12-15
tags: added: verification-done-precise
removed: verification-needed-precise
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.2.0-119.162

---------------
linux (3.2.0-119.162) precise; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1647713

  * CVE-2016-7916
    - proc: prevent accessing /proc/<PID>/environ until it's ready

  * [Hyper-V] do not lose pending heartbeat vmbus packets (LP: #1632786)
    - hv: do not lose pending heartbeat vmbus packets

 -- Luis Henriques <email address hidden> Tue, 06 Dec 2016 13:30:22 +0000

Changed in linux (Ubuntu Precise):
status: Fix Committed → Fix Released
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-106.153

---------------
linux (3.13.0-106.153) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1647749

  * CVE-2016-7916
    - proc: prevent accessing /proc/<PID>/environ until it's ready

  * CVE-2016-6213
    - mnt: Add a per mount namespace limit on the number of mounts

  * aio completions are dropped (LP: #1641129)
    - aio: fix reqs_available handling

  * [Hyper-V] do not lose pending heartbeat vmbus packets (LP: #1632786)
    - hv: do not lose pending heartbeat vmbus packets

  * ipv6: connected routes are missing after a down/up cycle on the loopback
    (LP: #1634545)
    - ipv6: reallocate addrconf router for ipv6 address when lo device up
    - ipv6: correctly add local routes when lo goes up

  * audit: prevent a new auditd to stop an old auditd still alive (LP: #1633404)
    - audit: stop an old auditd being starved out by a new auditd

  * Setting net.ipv4.neigh.default.gc_thresh1/2/3 on 3.13.0-97.144 or later
    causes 'invalid argument' error (LP: #1634892)
    - neigh: fix setting of default gc_* values

  * move nvme driver to linux-image (LP: #1640275)
    - [Config] Add nvme to the generic inclusion list

 -- Luis Henriques <email address hidden> Tue, 06 Dec 2016 15:00:27 +0000

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.19.0-78.86

---------------
linux (3.19.0-78.86) vivid; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1647787

  * CVE-2016-7916
    - proc: prevent accessing /proc/<PID>/environ until it's ready

  * CVE-2016-6213
    - mnt: Add a per mount namespace limit on the number of mounts

  * [Hyper-V] do not lose pending heartbeat vmbus packets (LP: #1632786)
    - hv: do not lose pending heartbeat vmbus packets

  * ipv6: connected routes are missing after a down/up cycle on the loopback
    (LP: #1634545)
    - ipv6: correctly add local routes when lo goes up

 -- Luis Henriques <email address hidden> Tue, 06 Dec 2016 16:25:45 +0000

Changed in linux (Ubuntu Vivid):
status: Fix Committed → Fix Released
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :
Download full text (17.0 KiB)

This bug was fixed in the package linux - 4.4.0-57.78

---------------
linux (4.4.0-57.78) xenial; urgency=low

  * Release Tracking Bug
    - LP: #1648867

  * Miscellaneous Ubuntu changes
    - SAUCE: Do not build the xr-usb-serial driver for s390

linux (4.4.0-56.77) xenial; urgency=low

  * Release Tracking Bug
    - LP: #1648867

  * Release Tracking Bug
    - LP: #1648579

  * CONFIG_NR_CPUS=256 is too low (LP: #1579205)
    - [Config] Increase the NR_CPUS to 512 for amd64 to support systems with a
      large number of cores.

  * NVMe drives in Amazon AWS instance fail to initialize (LP: #1648449)
    - SAUCE: (no-up) NVMe: only setup MSIX once

linux (4.4.0-55.76) xenial; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1648503

  * NVMe driver accidentally reverted to use GSI instead of MSIX (LP: #1647887)
    - (fix) NVMe: restore code to always use MSI/MSI-x interrupts

linux (4.4.0-54.75) xenial; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1648017

  * Update hio driver to 2.1.0.28 (LP: #1646643)
    - SAUCE: hio: update to Huawei ES3000_V2 (2.1.0.28)

  * linux: Enable live patching for all supported architectures (LP: #1633577)
    - [Config] CONFIG_LIVEPATCH=y for s390x

  * Botched backport breaks level triggered EOIs in QEMU guests with --machine
    kernel_irqchip=split (LP: #1644394)
    - kvm/irqchip: kvm_arch_irq_routing_update renaming split

  * Xenial update to v4.4.35 stable release (LP: #1645453)
    - x86/cpu/AMD: Fix cpu_llc_id for AMD Fam17h systems
    - KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
    - KVM: Disable irq while unregistering user notifier
    - fuse: fix fuse_write_end() if zero bytes were copied
    - mfd: intel-lpss: Do not put device in reset state on suspend
    - can: bcm: fix warning in bcm_connect/proc_register
    - i2c: mux: fix up dependencies
    - kbuild: add -fno-PIE
    - scripts/has-stack-protector: add -fno-PIE
    - x86/kexec: add -fno-PIE
    - kbuild: Steal gcc's pie from the very beginning
    - ext4: sanity check the block and cluster size at mount time
    - crypto: caam - do not register AES-XTS mode on LP units
    - drm/amdgpu: Attach exclusive fence to prime exported bo's. (v5)
    - clk: mmp: pxa910: fix return value check in pxa910_clk_init()
    - clk: mmp: pxa168: fix return value check in pxa168_clk_init()
    - clk: mmp: mmp2: fix return value check in mmp2_clk_init()
    - rtc: omap: Fix selecting external osc
    - iwlwifi: pcie: fix SPLC structure parsing
    - mfd: core: Fix device reference leak in mfd_clone_cell
    - uwb: fix device reference leaks
    - PM / sleep: fix device reference leak in test_suspend
    - PM / sleep: don't suspend parent when async child suspend_{noirq, late}
      fails
    - IB/mlx4: Check gid_index return value
    - IB/mlx4: Fix create CQ error flow
    - IB/mlx5: Use cache line size to select CQE stride
    - IB/mlx5: Fix fatal error dispatching
    - IB/core: Avoid unsigned int overflow in sg_alloc_table
    - IB/uverbs: Fix leak of XRC target QPs
    - IB/cm: Mark stale CM id's whenever the mad agent was unregistered
    - netfilter: nft_dynset: fix element timeou...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :
Download full text (25.5 KiB)

This bug was fixed in the package linux - 4.8.0-32.34

---------------
linux (4.8.0-32.34) yakkety; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1649358

  * Vulnerability picked up from 4.8.10 stable kernel (LP: #1648662)
    - net: handle no dst on skb in icmp6_send

linux (4.8.0-31.33) yakkety; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1648034

  * Update hio driver to 2.1.0.28 (LP: #1646643)
    - SAUCE: hio: update to Huawei ES3000_V2 (2.1.0.28)

  * Yakkety update to v4.8.11 stable release (LP: #1645421)
    - x86/cpu/AMD: Fix cpu_llc_id for AMD Fam17h systems
    - KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
    - KVM: Disable irq while unregistering user notifier
    - arm64: KVM: pmu: Fix AArch32 cycle counter access
    - KVM: arm64: Fix the issues when guest PMCCFILTR is configured
    - ftrace: Ignore FTRACE_FL_DISABLED while walking dyn_ftrace records
    - ftrace: Add more checks for FTRACE_FL_DISABLED in processing ip records
    - genirq: Use irq type from irqdata instead of irqdesc
    - fuse: fix fuse_write_end() if zero bytes were copied
    - IB/rdmavt: rdmavt can handle non aligned page maps
    - IB/hfi1: Fix rnr_timer addition
    - mfd: intel-lpss: Do not put device in reset state on suspend
    - mfd: stmpe: Fix RESET regression on STMPE2401
    - can: bcm: fix warning in bcm_connect/proc_register
    - gpio: do not double-check direction on sleeping chips
    - ALSA: usb-audio: Fix use-after-free of usb_device at disconnect
    - ALSA: hda - add a new condition to check if it is thinkpad
    - ALSA: hda - Fix mic regression by ASRock mobo fixup
    - i2c: mux: fix up dependencies
    - i2c: i2c-mux-pca954x: fix deselect enabling for device-tree
    - Disable the __builtin_return_address() warning globally after all
    - kbuild: add -fno-PIE
    - scripts/has-stack-protector: add -fno-PIE
    - x86/kexec: add -fno-PIE
    - kbuild: Steal gcc's pie from the very beginning
    - ext4: sanity check the block and cluster size at mount time
    - ARM: dts: imx53-qsb: Fix regulator constraints
    - crypto: caam - do not register AES-XTS mode on LP units
    - powerpc/64: Fix setting of AIL in hypervisor mode
    - drm/amdgpu: Attach exclusive fence to prime exported bo's. (v5)
    - drm/i915: Refresh that status of MST capable connectors in ->detect()
    - drm/i915: Assume non-DP++ port if dvo_port is HDMI and there's no AUX ch
      specified in the VBT
    - virtio-net: drop legacy features in virtio 1 mode
    - clk: mmp: pxa910: fix return value check in pxa910_clk_init()
    - clk: mmp: pxa168: fix return value check in pxa168_clk_init()
    - clk: mmp: mmp2: fix return value check in mmp2_clk_init()
    - clk: imx: fix integer overflow in AV PLL round rate
    - rtc: omap: Fix selecting external osc
    - iwlwifi: pcie: fix SPLC structure parsing
    - iwlwifi: pcie: mark command queue lock with separate lockdep class
    - iwlwifi: mvm: fix netdetect starting/stopping for unified images
    - iwlwifi: mvm: fix d3_test with unified D0/D3 images
    - iwlwifi: mvm: wake the wait queue when the RX sync counter is zero
    - mfd: cor...

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers