[Xenial] Customer can not SSH to Linux VM due to "VSC State Unhealthy"

Bug #1826416 reported by Joseph Salisbury on 2019-04-25
This bug affects 1 person
Affects           Status  Importance  Assigned to  Milestone
linux (Ubuntu)            Undecided   Unassigned
  Xenial                  Undecided   Unassigned

Bug Description

[Impact]

A mutual customer reports that SSH is not working on a Xenial-based VM. The VM is running the 4.4-based Xenial kernel, not a custom linux-azure kernel.

After investigation, this turned out to be a known, old signaling issue. The 4.4-based Xenial kernels require the following patch:

vmbus: fix missing signaling in hv_signal_on_read() (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=13c5e97701091f9b02ded0c68809f8a6b08c747a).

The patch is not in the upstream stable 4.4 tree, because it’s not needed there.

Compared to the upstream 4.4 tree, Ubuntu 4.4.0-124-generic integrated more hv patches from the mainline kernel, e.g. "Drivers: hv: vmbus: finally fix hv_need_to_signal_on_read()", so it must also pick up the above patch.

We checked the latest Ubuntu 4.4 kernel (https://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/?h=Ubuntu-4.4.0-146.172) and the patch is also absent there.

The patch (13c5e97701091f9b02ded0c68809f8a6b08c747a) can be cleanly cherry-picked into Ubuntu-4.4.0-146.172.
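
The check and the cherry-pick described above could be reproduced roughly as follows. This is only a sketch, not the procedure the kernel team used: the clone URLs are assumed from the web URLs quoted in this report, the function names and the `xenial` directory are hypothetical, and the commit ID, subject, and tag naming come from the description.

```shell
#!/bin/sh
# Sketch only: function names, clone URLs and the "xenial" directory are
# assumptions for illustration, not taken from the bug report.

FIX_COMMIT=13c5e97701091f9b02ded0c68809f8a6b08c747a   # from the bug description
FIX_SUBJECT='vmbus: fix missing signaling in hv_signal_on_read()'

# Map an Ubuntu kernel version (e.g. 4.4.0-146.172) to its release tag name.
ubuntu_tag() {
    printf 'Ubuntu-%s\n' "$1"
}

# Succeed if a commit with the given subject is reachable from <ref>.
# A cherry-picked commit gets a new ID, so search by subject, not by hash.
has_commit_subject() {   # usage: has_commit_subject <repo-dir> <ref> <subject>
    [ -n "$(git -C "$1" log -1 --format=%H --fixed-strings --grep="$3" "$2")" ]
}

# Clone the Xenial tree at a release tag, fetch the stable tree, and try the
# cherry-pick. Both repositories are large, so this takes a while.
try_cherry_pick() {      # usage: try_cherry_pick <version>
    tag=$(ubuntu_tag "$1")
    git clone --branch "$tag" \
        https://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git xenial || return 1
    if has_commit_subject xenial HEAD "$FIX_SUBJECT"; then
        echo "fix already present in $tag"
        return 0
    fi
    git -C xenial remote add stable \
        https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git || return 1
    git -C xenial fetch stable || return 1
    # -x records the original commit ID in the new commit message.
    git -C xenial cherry-pick -x "$FIX_COMMIT"
}
```

Calling `try_cherry_pick 4.4.0-146.172` would exercise the case discussed here; a non-zero exit status from the last step would indicate a conflict rather than a clean pick.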

[Test Case]

When the issue happens, there is no error message in dmesg or syslog. The host-side NetVSP driver simply stops reading from the guest-to-host ring, and the guest network stops working. So we don't really have any logs to provide here.

[Regression Potential]

Low risk, since this is a simple patch from stable upstream that touches only Hyper-V-specific code.

summary: - [Xenial] Customer could not SSH to Linux VM due to "VSC State Unhealthy"
+ [Xenial] Customer can not SSH to Linux VM due to "VSC State Unhealthy"

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1826416

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: xenial
Dexuan Cui (decui) wrote :

When the issue happens, there is no error message in dmesg or syslog. The host-side NetVSP driver simply stops reading from the guest-to-host ring, and the guest network stops working. So we don't really have any logs to provide here.

Marcelo Cerri (mhcerri) on 2019-05-13
description: updated
Changed in linux (Ubuntu Xenial):
status: New → In Progress
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!
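
For reference, enabling -proposed on Xenial usually amounts to adding one apt source line and installing the package under test. The sketch below is an illustration, not official tooling: the helper names, the `xenial-proposed.list` file name, and the mirror URL are assumptions, and the wiki page linked above remains the authoritative procedure.

```shell
#!/bin/sh
# Sketch only: helper names, the list file name and the mirror URL are
# illustrative assumptions.

# Print the apt source line for the -proposed pocket of a given release.
proposed_source_line() {   # usage: proposed_source_line <codename>
    printf 'deb http://archive.ubuntu.com/ubuntu/ %s-proposed restricted main multiverse universe\n' "$1"
}

# Enable xenial-proposed and install one package from it (run as root).
enable_proposed_and_install() {   # usage: enable_proposed_and_install <package>
    proposed_source_line xenial \
        > /etc/apt/sources.list.d/xenial-proposed.list || return 1
    apt-get update || return 1
    apt-get install "$1"
}
```

Removing the list file and running `apt-get update` again after testing keeps later upgrades from pulling in other -proposed packages.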

tags: added: verification-needed-xenial
Dexuan Cui (decui) wrote :

I can confirm the fix is included in the kernel "4.4.0-152.179": https://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/include/linux/hyperv.h?h=Ubuntu-4.4.0-152.179

I installed the kernel, did some network tests, and it worked fine for me:

# apt-get install linux-image-4.4.0-152-generic
# reboot

# uname -a
Linux localhost 4.4.0-152-generic #179-Ubuntu SMP Thu Jun 13 10:05:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
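
A quick way to tell whether a running generic 4.4 kernel is at least as new as the one verified here might look like the sketch below. The function names are hypothetical, and the default threshold of 152 is simply the ABI number of the kernel confirmed above (4.4.0-152.179); it assumes `4.4.0-*` release strings and does not apply to kernels with a different version scheme.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; 152 is the ABI number of the
# kernel confirmed in this comment (4.4.0-152.179).

# Print the ABI number of a 4.4.0 release string such as 4.4.0-152-generic.
kernel_abi() {
    rest=${1#4.4.0-}                   # "152-generic"
    printf '%s\n' "${rest%%[!0-9]*}"   # keep the leading digits: "152"
}

# Return 0 if the release is a 4.4.0 kernel with ABI >= the minimum,
# 1 if it is older, and 2 if it is not a 4.4.0 kernel at all.
has_vmbus_fix() {   # usage: has_vmbus_fix <release> [min-abi]
    case $1 in
        4.4.0-[0-9]*) ;;
        *) return 2 ;;
    esac
    [ "$(kernel_abi "$1")" -ge "${2:-152}" ]
}
```

On a VM, `has_vmbus_fix "$(uname -r)" && echo "kernel includes the fix"` would report the result for the running kernel.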

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-157.185

---------------
linux (4.4.0-157.185) xenial; urgency=medium

  * linux: 4.4.0-157.185 -proposed tracker (LP: #1837476)

  * systemd 229-4ubuntu21.22 ADT test failure with linux 4.4.0-156.183 (storage)
    (LP: #1837235)
    - Revert "block/bio: Do not zero user pages"
    - Revert "block: Clear kernel memory before copying to user"
    - Revert "bio_copy_from_iter(): get rid of copying iov_iter"

linux (4.4.0-156.183) xenial; urgency=medium

  * linux: 4.4.0-156.183 -proposed tracker (LP: #1836880)

  * BCM43602 802.11ac Wireless regression - PCI ID 14e4:43ba (LP: #1836801)
    - brcmfmac: add eth_type_trans back for PCIe full dongle

linux (4.4.0-155.182) xenial; urgency=medium

  * linux: 4.4.0-155.182 -proposed tracker (LP: #1834918)

  * Geneve tunnels don't work when ipv6 is disabled (LP: #1794232)
    - geneve: correctly handle ipv6.disable module parameter

  * Kernel modules generated incorrectly when system is localized to a non-
    English language (LP: #1828084)
    - scripts: override locale from environment when running recordmcount.pl

  * Handle overflow in proc_get_long of sysctl (LP: #1833935)
    - sysctl: handle overflow in proc_get_long

  * Xenial update: 4.4.181 upstream stable release (LP: #1832661)
    - x86/speculation/mds: Revert CPU buffer clear on double fault exit
    - x86/speculation/mds: Improve CPU buffer clear documentation
    - ARM: exynos: Fix a leaked reference by adding missing of_node_put
    - crypto: vmx - fix copy-paste error in CTR mode
    - crypto: crct10dif-generic - fix use via crypto_shash_digest()
    - crypto: x86/crct10dif-pcl - fix use via crypto_shash_digest()
    - ALSA: usb-audio: Fix a memory leak bug
    - ALSA: hda/hdmi - Consider eld_valid when reporting jack event
    - ALSA: hda/realtek - EAPD turn on later
    - ASoC: max98090: Fix restore of DAPM Muxes
    - ASoC: RT5677-SPI: Disable 16Bit SPI Transfers
    - mm/mincore.c: make mincore() more conservative
    - ocfs2: fix ocfs2 read inode data panic in ocfs2_iget
    - mfd: da9063: Fix OTP control register names to match datasheets for
      DA9063/63L
    - tty/vt: fix write/write race in ioctl(KDSKBSENT) handler
    - ext4: actually request zeroing of inode table after grow
    - ext4: fix ext4_show_options for file systems w/o journal
    - Btrfs: do not start a transaction at iterate_extent_inodes()
    - bcache: fix a race between cache register and cacheset unregister
    - bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim()
    - ipmi:ssif: compare block number correctly for multi-part return messages
    - crypto: gcm - Fix error return code in crypto_gcm_create_common()
    - crypto: gcm - fix incompatibility between "gcm" and "gcm_base"
    - crypto: chacha20poly1305 - set cra_name correctly
    - crypto: salsa20 - don't access already-freed walk.iv
    - crypto: arm/aes-neonbs - don't access already-freed walk.iv
    - writeback: synchronize sync(2) against cgroup writeback membership switches
    - fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going
      into workqueue when umount
    - ALSA: hda/realtek - Fix for Lenovo B...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released