[Hyper-V] KVP daemon fails to start on first boot of disco VM

Bug #1820063 reported by Francis Ginther on 2019-03-14
22
This bug affects 3 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Undecided
Seth Forshee
Xenial
Undecided
Unassigned
Bionic
Undecided
Unassigned
Disco
Undecided
Unassigned

Bug Description

SRU Justification

Impact: The KVP daemon fails to start on first boot due to being started before the hv_kvp device appears.

Fix: Update the hv-kvp-daemon service file to start the daemon after device node appears.

Regression Potential: The changes are only to the hv-kvp-daemon service file and adding a udev rule, so the worst case regression would be that the service does not start. In testing the service did start as expected.

Test Case: See comment #15.

---

Launching a recent daily image of disco on azure results in a VM in which the hv-kvp-daemon.service fails to start:

$ systemctl status -o cat hv-kvp-daemon.service
● hv-kvp-daemon.service - Hyper-V KVP Protocol Daemon
   Loaded: loaded (/lib/systemd/system/hv-kvp-daemon.service; enabled; vendor pr
   Active: failed (Result: exit-code) since Thu 2019-03-14 13:07:15 UTC; 11min a
 Main PID: 219 (code=exited, status=1/FAILURE)

Started Hyper-V KVP Protocol Daemon.
KVP starting; pid is:219
open /dev/vmbus/hv_kvp failed; error: 2 No such file or directory
hv-kvp-daemon.service: Main process exited, code=exited, status=1/FAILURE
hv-kvp-daemon.service: Failed with result 'exit-code'.

The instance was created with:
$ az vm create --resource-group [redacted] --image Canonical:UbuntuServer:19.04-DAILY:19.04.201903130 --size Standard_D2_v2 --name disco-0313

As best as I can tell, the /dev/vmbus/hv_kvp isn't available when the hv-kvp-daemon.service starts, but it is available a few seconds later. Manually starting the daemon once I can ssh in works.
---
ProblemType: Bug
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.10-0ubuntu23
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
CRDA: N/A
DistroRelease: Ubuntu 19.04
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Microsoft Corporation Virtual Machine
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB: 0 hyperv_fb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.18.0-1011-azure root=PARTUUID=11894199-2ca2-4912-9c41-d28128744d57 ro console=tty1 console=ttyS0 panic=-1
ProcVersionSignature: User Name 4.18.0-1011.11-azure 4.18.20
RelatedPackageVersions:
 linux-restricted-modules-4.18.0-1011-azure N/A
 linux-backports-modules-4.18.0-1011-azure N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: disco uec-images
Uname: Linux 4.18.0-1011-azure x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm audio cdrom dialout dip floppy netdev plugdev sudo video
_MarkForUpload: True
dmi.bios.date: 06/02/2017
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 090007
dmi.board.name: Virtual Machine
dmi.board.vendor: Microsoft Corporation
dmi.board.version: 7.0
dmi.chassis.asset.tag: 7783-7084-3265-9085-8269-3286-77
dmi.chassis.type: 3
dmi.chassis.vendor: Microsoft Corporation
dmi.chassis.version: 7.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr090007:bd06/02/2017:svnMicrosoftCorporation:pnVirtualMachine:pvr7.0:rvnMicrosoftCorporation:rnVirtualMachine:rvr7.0:cvnMicrosoftCorporation:ct3:cvr7.0:
dmi.product.name: Virtual Machine
dmi.product.uuid: 3b0f2160-7fc4-a646-904c-4248f04792d4
dmi.product.version: 7.0
dmi.sys.vendor: Microsoft Corporation

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1820063

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: disco

apport information

tags: added: apport-collected uec-images
description: updated

apport information

apport information

apport information

apport information

apport information

apport information

apport information

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-hyper-v
Tobias Koch (tobijk) wrote :

This also happens in Bionic minimal images and seems to be a simple race where the daemon is launched when the device is not yet visible.

It seems Fedora had the same or a similar issue:

https://bugzilla.redhat.com/show_bug.cgi?id=1195029

From the conversation in that bug report, my understanding is that this daemon supports features that can be enabled/disabled while the system is live and that the daemon should therefore be launched/stopped dynamically when the /dev/vmbus/hv_kvp device node comes or goes.

Since this race condition happening or not happening may depend on host, hypervisor and boot speed, it could pop up also on existing releases and thus should probably be treated with a certain amount of urgency.

Tobias Koch (tobijk) on 2019-03-27
tags: added: bionic cosmic xenial
Seth Forshee (sforshee) wrote :

I agree with comment #11, that this seems like a service problem. I took a quick look at the kernel side. Normally I might suggest that we make the relevant driver built-in, but in this case it already is. It looks like the device gets created in response to a message from the host (afaict, I'm unfamiliar with hyperv), and I'm not sure how we would speed that up. In any case it seems like modifying the service to only start once its dependencies are available is a good idea.

Dimitri John Ledkov (xnox) wrote :

It's funny how long it took them to figure it out, but yes

https://bugzilla.redhat.com/show_bug.cgi?id=1195029#c22

is the right answer. Please use that patch / combination of tagging the device with tag&wants using udev rule + making the service bind to it.

Seth Forshee (sforshee) on 2019-04-03
Changed in linux (Ubuntu):
assignee: nobody → Seth Forshee (sforshee)
status: Confirmed → In Progress
Seth Forshee (sforshee) wrote :

Please try the attached package. I believe you should be able to install it alone; none of the other packages look to be locked to a specific version of linux-cloud-tools-common.

Changed in linux (Ubuntu):
status: In Progress → Incomplete
Francis Ginther (fginther) wrote :

This fixed the problem that I'm seeing on disco with a 20190402 serial. My test was to recreate the problem by removing /dev/vmbus/hv_kvp then rebooting. This simple test showed that the service came up again in a failed state with 'systemctl status hv-kvp-daemon.service'.

Then I installed this updated package, and redid the same experiment. This time the service was running with no errors.

Seth Forshee (sforshee) on 2019-04-10
description: updated
Changed in linux (Ubuntu):
status: Incomplete → In Progress
Seth Forshee (sforshee) on 2019-04-10
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :
Download full text (50.5 KiB)

This bug was fixed in the package linux - 5.0.0-11.12

---------------
linux (5.0.0-11.12) disco; urgency=medium

  * linux: 5.0.0-11.12 -proposed tracker (LP: #1824383)

  * hns3: PPU_PF_ABNORMAL_INT_ST over_8bd_no_fe found [error status=0x1]
    (LP: #1824194)
    - net: hns3: fix for not calculating tx bd num correctly

  * disco: unable to use iptables/enable ufw under -virtual kernel
    (LP: #1823862)
    - [Packaging] add bpfilter to linux-modules

  * Make shiftfs a module rather than built-in (LP: #1824354)
    - [Config] CONFIG_SHIFT_FS=m

  * shiftfs: chown sets untranslated ids in lower fs (LP: #1824350)
    - SAUCE: shiftfs: use translated ids when chaning lower fs attrs

  * [Hyper-V] KVP daemon fails to start on first boot of disco VM (LP: #1820063)
    - [Packaging] bind hv_kvp_daemon startup to hv_kvp device

linux (5.0.0-10.11) disco; urgency=medium

  * linux: 5.0.0-10.11 -proposed tracker (LP: #1823936)

  * Apparmor enforcement failure in lxc selftests (LP: #1823379)
    - SAUCE: apparmor: Restore Y/N in /sys for apparmor's "enabled"

  * systemd cause kernel trace "BUG: unable to handle kernel paging request at
    6db23a14" on Cosmic i386 (LP: #1813244)
    - openvswitch: fix flow actions reallocation

linux (5.0.0-9.10) disco; urgency=medium

  * linux: 5.0.0-9.10 -proposed tracker (LP: #1823228)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log
    - [Packaging] update helper scripts
    - [Packaging] resync retpoline extraction

  * Huawei Hi1822 NIC has poor performance (LP: #1820187)
    - net-next/hinic: replace disable_irq_nosync/enable_irq

  * Add uid shifting overlay filesystem (shiftfs) (LP: #1823186)
    - shiftfs: uid/gid shifting bind mount
    - shiftfs: rework and extend
    - shiftfs: support some btrfs ioctls
    - [Config] enable shiftfs

  * Cannot boot or install - have to use nomodeset (LP: #1821820)
    - Revert "drm/i915/fbdev: Actually configure untiled displays"

  * Disco update: v5.0.6 upstream stable release (LP: #1823060)
    - netfilter: nf_tables: fix set double-free in abort path
    - dccp: do not use ipv6 header for ipv4 flow
    - genetlink: Fix a memory leak on error path
    - gtp: change NET_UDP_TUNNEL dependency to select
    - ipv6: make ip6_create_rt_rcu return ip6_null_entry instead of NULL
    - mac8390: Fix mmio access size probe
    - mISDN: hfcpci: Test both vendor & device ID for Digium HFC4S
    - net: aquantia: fix rx checksum offload for UDP/TCP over IPv6
    - net: datagram: fix unbounded loop in __skb_try_recv_datagram()
    - net/packet: Set __GFP_NOWARN upon allocation in alloc_pg_vec
    - net: phy: meson-gxl: fix interrupt support
    - net: rose: fix a possible stack overflow
    - net: stmmac: fix memory corruption with large MTUs
    - net-sysfs: call dev_hold if kobject_init_and_add success
    - net: usb: aqc111: Extend HWID table by QNAP device
    - packets: Always register packet sk in the same order
    - rhashtable: Still do rehash when we get EEXIST
    - sctp: get sctphdr by offset in sctp_compute_cksum
    - sctp: use memdup_user instead of vmemdup_user
    - tcp: do not use ipv6 header for ipv4 flow
    - tipc: allow servic...

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Terry Rudd (terrykrudd) on 2019-12-03
Changed in linux (Ubuntu Disco):
status: New → Fix Released
Changed in linux (Ubuntu):
status: Fix Released → Incomplete
Marcelo Cerri (mhcerri) on 2019-12-11
Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Xenial):
status: New → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
tags: added: id-5c9d168ed6c5704cac05e595
Launchpad Janitor (janitor) wrote :
Download full text (42.4 KiB)

This bug was fixed in the package linux - 4.15.0-74.84

---------------
linux (4.15.0-74.84) bionic; urgency=medium

  * bionic/linux: 4.15.0-74.84 -proposed tracker (LP: #1856749)

  * [Hyper-V] KVP daemon fails to start on first boot of disco VM (LP: #1820063)
    - [Packaging] bind hv_kvp_daemon startup to hv_kvp device

  * Unrevert "arm64: Use firmware to detect CPUs that are not affected by
    Spectre-v2" (LP: #1854207)
    - arm64: Get rid of __smccc_workaround_1_hvc_*
    - arm64: Use firmware to detect CPUs that are not affected by Spectre-v2

  * Bionic kernel panic on Cavium ThunderX CN88XX (LP: #1853485)
    - SAUCE: irqchip/gic-v3-its: Add missing return value in
      its_irq_domain_activate()

linux (4.15.0-73.82) bionic; urgency=medium

  * bionic/linux: 4.15.0-73.82 -proposed tracker (LP: #1854819)

  * CVE-2019-14901
    - SAUCE: mwifiex: Fix heap overflow in mmwifiex_process_tdls_action_frame()

  * CVE-2019-14896 // CVE-2019-14897
    - SAUCE: libertas: Fix two buffer overflows at parsing bss descriptor

  * CVE-2019-14895
    - SAUCE: mwifiex: fix possible heap overflow in mwifiex_process_country_ie()

  * CVE-2019-18660: patches for Ubuntu (LP: #1853142) // CVE-2019-18660
    - powerpc/64s: support nospectre_v2 cmdline option
    - powerpc/book3s64: Fix link stack flush on context switch
    - KVM: PPC: Book3S HV: Flush link stack on guest exit to host kernel

  * Please add patch fixing RK818 ID detection (LP: #1853192)
    - SAUCE: mfd: rk808: Fix RK818 ID template

  * [SRU][B/OEM-B/OEM-OSP1/D] Enable new Elan touchpads which are not in current
    whitelist (LP: #1853246)
    - HID: quirks: Fix keyboard + touchpad on Lenovo Miix 630
    - Input: elan_i2c - export the device id whitelist
    - HID: quirks: Refactor ELAN 400 and 401 handling

  * Lenovo dock MAC Address pass through doesn't work in Ubuntu (LP: #1827961)
    - r8152: Add macpassthru support for ThinkPad Thunderbolt 3 Dock Gen 2

  * s390/dasd: reduce the default queue depth and nr of hardware queues
    (LP: #1852257)
    - s390/dasd: reduce the default queue depth and nr of hardware queues

  * External microphone can't work on some dell machines with the codec alc256
    or alc236 (LP: #1853791)
    - SAUCE: ALSA: hda/realtek - Move some alc256 pintbls to fallback table
    - SAUCE: ALSA: hda/realtek - Move some alc236 pintbls to fallback table

  * Memory leak in net/xfrm/xfrm_state.c - 8 pages per ipsec connection
    (LP: #1853197)
    - xfrm: Fix memleak on xfrm state destroy

  * CVE-2019-19083
    - drm/amd/display: memory leak

  * update ENA driver for DIMLIB dynamic interrupt moderation (LP: #1853180)
    - net: ena: add intr_moder_rx_interval to struct ena_com_dev and use it
    - net: ena: switch to dim algorithm for rx adaptive interrupt moderation
    - net: ena: reimplement set/get_coalesce()
    - net: ena: enable the interrupt_moderation in driver_supported_features
    - net: ena: remove code duplication in
      ena_com_update_nonadaptive_moderation_interval _*()
    - net: ena: remove old adaptive interrupt moderation code from ena_netdev
    - net: ena: remove ena_restore_ethtool_params() and relevant fields
    - net: ena: remov...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Francis Ginther (fginther) wrote :

I was unable to reproduce this failure on either a xenial minimal or standard image. I went ahead and ran through the testing documented in comment #15 with no issue. Each time, the device file was present as expected and the hv-kvp-daemon.service was running normally. Even though I couldn't reproduce the issue as originally discovered, I'm marking this as verification-done-xenial as the problem no longer occurs with the 4.15.0-1067-azure kernel.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :
Download full text (32.9 KiB)

This bug was fixed in the package linux - 4.4.0-173.203

---------------
linux (4.4.0-173.203) xenial; urgency=medium

  * xenial/linux: 4.4.0-173.203 -proposed tracker (LP: #1859718)

  * CVE-2019-14615
    - drm/i915/gen9: Clear residual context state on context switch

linux (4.4.0-172.202) xenial; urgency=medium

  * xenial/linux: 4.4.0-172.202 -proposed tracker (LP: #1858594)

  * tools/perf fails to build after Xenial update to 4.4.208 upstream stable
    release (LP: #1858798)
    - Revert "perf report: Add warning when libunwind not compiled in"

  * CVE-2019-18885
    - btrfs: refactor btrfs_find_device() take fs_devices as argument
    - btrfs: merge btrfs_find_device and find_device

  * Integrate Intel SGX driver into linux-azure (LP: #1844245)
    - [Packaging] Add systemd service to load intel_sgx

  * Xenial update: 4.4.208 upstream stable release (LP: #1858462)
    - btrfs: do not leak reloc root if we fail to read the fs root
    - btrfs: handle ENOENT in btrfs_uuid_tree_iterate
    - ALSA: hda/ca0132 - Keep power on during processing DSP response
    - ALSA: hda/ca0132 - Avoid endless loop
    - drm: mst: Fix query_payload ack reply struct
    - iio: light: bh1750: Resolve compiler warning and make code more readable
    - spi: Add call to spi_slave_abort() function when spidev driver is released
    - staging: rtl8188eu: fix possible null dereference
    - rtlwifi: prevent memory leak in rtl_usb_probe
    - IB/iser: bound protection_sg size by data_sg size
    - media: am437x-vpfe: Setting STD to current value is not an error
    - media: i2c: ov2659: fix s_stream return value
    - media: i2c: ov2659: Fix missing 720p register config
    - media: ov6650: Fix stored frame format not in sync with hardware
    - tools/power/cpupower: Fix initializer override in hsw_ext_cstates
    - usb: renesas_usbhs: add suspend event support in gadget mode
    - hwrng: omap3-rom - Call clk_disable_unprepare() on exit only if not idled
    - regulator: max8907: Fix the usage of uninitialized variable in
      max8907_regulator_probe()
    - media: flexcop-usb: fix NULL-ptr deref in flexcop_usb_transfer_init()
    - samples: pktgen: fix proc_cmd command result check logic
    - mwifiex: pcie: Fix memory leak in mwifiex_pcie_init_evt_ring
    - media: ti-vpe: vpe: fix a v4l2-compliance warning about invalid pixel format
    - media: ti-vpe: vpe: fix a v4l2-compliance failure about frame sequence
      number
    - media: ti-vpe: vpe: Make sure YUYV is set as default format
    - extcon: sm5502: Reset registers during initialization
    - x86/mm: Use the correct function type for native_set_fixmap()
    - perf report: Add warning when libunwind not compiled in
    - iio: adc: max1027: Reset the device at probe time
    - Bluetooth: hci_core: fix init for HCI_USER_CHANNEL
    - drm/gma500: fix memory disclosures due to uninitialized bytes
    - x86/ioapic: Prevent inconsistent state when moving an interrupt
    - arm64: psci: Reduce the waiting time for cpu_psci_cpu_kill()
    - libata: Ensure ata_port probe has completed before detach
    - pinctrl: sh-pfc: sh7734: Fix duplicate TCLK1_B
    - bnx2x: Fix PF-VF communication over multi-cos queu...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Pat Viafore (patviafore) wrote :

We are no longer seeing this issue with the latest images.

Thiago Martins (martinx) wrote :

I'm seeing this problem with Ubuntu 18.04.4 with HWE Kernel, fully upgraded.

Terry Rudd (terrykrudd) wrote :

Thiago, can you be more specific about what you're seeing and with exactly which kernel?

Thiago Martins (martinx) wrote :

Terry,

 Ubuntu 18.04.4 with Kernel 5.3.0-40-generic, running on QEMU, takes more than 1:30 minutes to boot, it stops here for a while:

---
[ *] A start job is running for sys-devi…\x21hv_kvp.device (51s / 1min 30s)
---

 I'm using the following image to build my QEMU VMs:

http://uec-images.ubuntu.com/releases/18.04/release/ubuntu-18.04-server-cloudimg-amd64.img

Cheers!
Thiago

Terry Rudd (terrykrudd) wrote :

Thank you Thiago, we will investigate further.

Thiago Martins (martinx) wrote :

This is also happening on bare-metal machines as well!

It starts to happen after installing the "linux-cloud-tools-generic-hwe-18.04" package.

Also, Ubuntu 20.04 has the same problem.

20.04 slow boot stops for 1:30 min heree:
---
A start job is running for /sys/dev…misc/vmbus!hv_kvp (13s / 1min 30s)
---

Marcelo Cerri (mhcerri) wrote :

Hi.

Do you know if the problem happens with other kernel versions such as 4.15 or if it didn't use to happen before?

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.