[Hyper-V] KVP daemon fails to start on first boot of disco VM

Bug #1820063 reported by Francis Ginther on 2019-03-14
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Undecided
Seth Forshee

Bug Description

SRU Justification

Impact: The KVP daemon fails to start on first boot due to being started before the hv_kvp device appears.

Fix: Update the hv-kvp-daemon service file to start the daemon after device node appears.

Regression Potential: The changes are only to the hv-kvp-daemon service file and adding a udev rule, so the worst case regression would be that the service does not start. In testing the service did start as expected.

Test Case: See comment #15.

---

Launching a recent daily image of disco on azure results in a VM in which the hv-kvp-daemon.service fails to start:

$ systemctl status -o cat hv-kvp-daemon.service
● hv-kvp-daemon.service - Hyper-V KVP Protocol Daemon
   Loaded: loaded (/lib/systemd/system/hv-kvp-daemon.service; enabled; vendor pr
   Active: failed (Result: exit-code) since Thu 2019-03-14 13:07:15 UTC; 11min a
 Main PID: 219 (code=exited, status=1/FAILURE)

Started Hyper-V KVP Protocol Daemon.
KVP starting; pid is:219
open /dev/vmbus/hv_kvp failed; error: 2 No such file or directory
hv-kvp-daemon.service: Main process exited, code=exited, status=1/FAILURE
hv-kvp-daemon.service: Failed with result 'exit-code'.

The instance was created with:
$ az vm create --resource-group [redacted] --image Canonical:UbuntuServer:19.04-DAILY:19.04.201903130 --size Standard_D2_v2 --name disco-0313

As best as I can tell, the /dev/vmbus/hv_kvp isn't available when the hv-kvp-daemon.service starts, but it is available a few seconds later. Manually starting the daemon once I can ssh in works.
---
ProblemType: Bug
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.10-0ubuntu23
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
CRDA: N/A
DistroRelease: Ubuntu 19.04
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Microsoft Corporation Virtual Machine
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB: 0 hyperv_fb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.18.0-1011-azure root=PARTUUID=11894199-2ca2-4912-9c41-d28128744d57 ro console=tty1 console=ttyS0 panic=-1
ProcVersionSignature: User Name 4.18.0-1011.11-azure 4.18.20
RelatedPackageVersions:
 linux-restricted-modules-4.18.0-1011-azure N/A
 linux-backports-modules-4.18.0-1011-azure N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: disco uec-images
Uname: Linux 4.18.0-1011-azure x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm audio cdrom dialout dip floppy netdev plugdev sudo video
_MarkForUpload: True
dmi.bios.date: 06/02/2017
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 090007
dmi.board.name: Virtual Machine
dmi.board.vendor: Microsoft Corporation
dmi.board.version: 7.0
dmi.chassis.asset.tag: 7783-7084-3265-9085-8269-3286-77
dmi.chassis.type: 3
dmi.chassis.vendor: Microsoft Corporation
dmi.chassis.version: 7.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr090007:bd06/02/2017:svnMicrosoftCorporation:pnVirtualMachine:pvr7.0:rvnMicrosoftCorporation:rnVirtualMachine:rvr7.0:cvnMicrosoftCorporation:ct3:cvr7.0:
dmi.product.name: Virtual Machine
dmi.product.uuid: 3b0f2160-7fc4-a646-904c-4248f04792d4
dmi.product.version: 7.0
dmi.sys.vendor: Microsoft Corporation

CVE References

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1820063

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: disco

apport information

tags: added: apport-collected uec-images
description: updated

apport information

apport information

apport information

apport information

apport information

apport information

apport information

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-hyper-v
Tobias Koch (tobijk) wrote :

This also happens in Bionic minimal images and seems to be a simple race where the daemon is launched when the device is not yet visible.

It seems Fedora had the same or a similar issue:

https://bugzilla.redhat.com/show_bug.cgi?id=1195029

From the conversation in that bug report, my understanding is that this daemon supports features that can be enabled/disabled while the system is live and that the daemon should therefore be launched/stopped dynamically when the /dev/vmbus/hv_kvp device node comes or goes.

Since this race condition happening or not happening may depend on host, hypervisor and boot speed, it could pop up also on existing releases and thus should probably be treated with a certain amount of urgency.

Tobias Koch (tobijk) on 2019-03-27
tags: added: bionic cosmic xenial
Seth Forshee (sforshee) wrote :

I agree with comment #11, that this seems like a service problem. I took a quick look at the kernel side. Normally I might suggest that we make the relevant driver built-in, but in this case it already is. It looks like the device gets created in response to a message from the host (afaict, I'm unfamiliar with hyperv), and I'm not sure how we would speed that up. In any case it seems like modifying the service to only start once its dependencies are available is a good idea.

Dimitri John Ledkov (xnox) wrote :

It's funny how long it took them to figure it out, but yes

https://bugzilla.redhat.com/show_bug.cgi?id=1195029#c22

is the right answer. Please use that patch / combination of tagging the device with tag&wants using udev rule + making the service bind to it.

Seth Forshee (sforshee) on 2019-04-03
Changed in linux (Ubuntu):
assignee: nobody → Seth Forshee (sforshee)
status: Confirmed → In Progress
Seth Forshee (sforshee) wrote :

Please try the attached package. I believe you should be able to install it alone; none of the other packages look to be locked to a specific version of linux-cloud-tools-common.

Changed in linux (Ubuntu):
status: In Progress → Incomplete
Francis Ginther (fginther) wrote :

This fixed the problem that I'm seeing on disco with a 20190402 serial. My test was to recreate the problem by removing /dev/vmbus/hv_kvp then rebooting. This simple test showed that the service came up again in a failed state with 'systemctl status hv-kvp-daemon.service'.

Then I installed this updated package, and redid the same experiment. This time the service was running with no errors.

Seth Forshee (sforshee) on 2019-04-10
description: updated
Changed in linux (Ubuntu):
status: Incomplete → In Progress
Seth Forshee (sforshee) on 2019-04-10
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :
Download full text (50.5 KiB)

This bug was fixed in the package linux - 5.0.0-11.12

---------------
linux (5.0.0-11.12) disco; urgency=medium

  * linux: 5.0.0-11.12 -proposed tracker (LP: #1824383)

  * hns3: PPU_PF_ABNORMAL_INT_ST over_8bd_no_fe found [error status=0x1]
    (LP: #1824194)
    - net: hns3: fix for not calculating tx bd num correctly

  * disco: unable to use iptables/enable ufw under -virtual kernel
    (LP: #1823862)
    - [Packaging] add bpfilter to linux-modules

  * Make shiftfs a module rather than built-in (LP: #1824354)
    - [Config] CONFIG_SHIFT_FS=m

  * shiftfs: chown sets untranslated ids in lower fs (LP: #1824350)
    - SAUCE: shiftfs: use translated ids when chaning lower fs attrs

  * [Hyper-V] KVP daemon fails to start on first boot of disco VM (LP: #1820063)
    - [Packaging] bind hv_kvp_daemon startup to hv_kvp device

linux (5.0.0-10.11) disco; urgency=medium

  * linux: 5.0.0-10.11 -proposed tracker (LP: #1823936)

  * Apparmor enforcement failure in lxc selftests (LP: #1823379)
    - SAUCE: apparmor: Restore Y/N in /sys for apparmor's "enabled"

  * systemd cause kernel trace "BUG: unable to handle kernel paging request at
    6db23a14" on Cosmic i386 (LP: #1813244)
    - openvswitch: fix flow actions reallocation

linux (5.0.0-9.10) disco; urgency=medium

  * linux: 5.0.0-9.10 -proposed tracker (LP: #1823228)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log
    - [Packaging] update helper scripts
    - [Packaging] resync retpoline extraction

  * Huawei Hi1822 NIC has poor performance (LP: #1820187)
    - net-next/hinic: replace disable_irq_nosync/enable_irq

  * Add uid shifting overlay filesystem (shiftfs) (LP: #1823186)
    - shiftfs: uid/gid shifting bind mount
    - shiftfs: rework and extend
    - shiftfs: support some btrfs ioctls
    - [Config] enable shiftfs

  * Cannot boot or install - have to use nomodeset (LP: #1821820)
    - Revert "drm/i915/fbdev: Actually configure untiled displays"

  * Disco update: v5.0.6 upstream stable release (LP: #1823060)
    - netfilter: nf_tables: fix set double-free in abort path
    - dccp: do not use ipv6 header for ipv4 flow
    - genetlink: Fix a memory leak on error path
    - gtp: change NET_UDP_TUNNEL dependency to select
    - ipv6: make ip6_create_rt_rcu return ip6_null_entry instead of NULL
    - mac8390: Fix mmio access size probe
    - mISDN: hfcpci: Test both vendor & device ID for Digium HFC4S
    - net: aquantia: fix rx checksum offload for UDP/TCP over IPv6
    - net: datagram: fix unbounded loop in __skb_try_recv_datagram()
    - net/packet: Set __GFP_NOWARN upon allocation in alloc_pg_vec
    - net: phy: meson-gxl: fix interrupt support
    - net: rose: fix a possible stack overflow
    - net: stmmac: fix memory corruption with large MTUs
    - net-sysfs: call dev_hold if kobject_init_and_add success
    - net: usb: aqc111: Extend HWID table by QNAP device
    - packets: Always register packet sk in the same order
    - rhashtable: Still do rehash when we get EEXIST
    - sctp: get sctphdr by offset in sctp_compute_cksum
    - sctp: use memdup_user instead of vmemdup_user
    - tcp: do not use ipv6 header for ipv4 flow
    - tipc: allow servic...

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.