[Hyper-V] LIS daemons fail to start after disable/re-enable VM integration services

Bug #1701222 reported by Chris Valean on 2017-06-29
This bug affects 1 person
Affects (Importance, Assigned to):
linux (Ubuntu): High, Joseph Salisbury
Xenial: High, Joseph Salisbury
Zesty: High, Joseph Salisbury
Artful: High, Joseph Salisbury
systemd (Ubuntu): Undecided, Unassigned

Bug Description

Issue description: Hyper-V daemons fail to start after disable/re-enable VM integration services.

Platform: host independent
Affected daemons - KVP, FCOPY and VSS.

Distribution name and release: Ubuntu 16.04, Ubuntu 17.04
Kernel version: 4.11.0-rc7-next-20170421 (for Ubuntu 16.04), 4.10.0-19-generic (for Ubuntu 17.04)

Repro rate: 100%

Steps to reproduce:
1. Start VM with Guest Services enabled (FCopy daemon starts automatically)
2. Go to File > Settings > Integration Services, uncheck Guest Services and apply (FCopy daemon will stop at this point)
3. Re-enable Guest Services from VM Settings (FCopy daemon is not running).
This is the issue. systemd monitors the service, and if the hook for Guest Services is present, it tries to start the daemon again.
Any attempt by systemd to start one of the LIS daemons fails, but executing the daemon binary manually starts the daemon.

Additional Info:
- the steps above can be repro'd with KVP / Data Exchange integration service as well.
- Manually starting hv_fcopy_daemon works fine.
- Other distros (RHEL) do not have this behavior; there the LIS daemons are started automatically by systemd once we re-enable the integration service.

On the upstream kernel and the upstream hv daemons, these messages are recorded in syslog, once we re-enable the Guest service:
HV_FCOPY: pread failed: Bad file descriptor
systemd[1]: hv-fcopy-daemon.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: hv-fcopy-daemon.service: Unit entered failed state.
systemd[1]: hv-fcopy-daemon.service: Failed with result 'exit-code'.

Chris Valean (cvalean) on 2017-06-29
Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Hi Chris,

Can you give the 4.12 kernel a test:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
no longer affects: linux (Ubuntu Trusty)
Changed in linux (Ubuntu Xenial):
status: New → Triaged
Changed in linux (Ubuntu Yakkety):
status: New → Triaged
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
Changed in linux (Ubuntu Yakkety):
importance: Undecided → High
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Joseph Salisbury (jsalisbury)
no longer affects: linux (Ubuntu Yakkety)
Changed in linux (Ubuntu Zesty):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

Also, do you know if this is a regression?

Chris Valean (cvalean) wrote :

I took the v4.12 kernel and compiled the LIS daemons from those sources as well.

Reloading the Data Exchange service, KVP will fail with the below message. This is the same as initially reported, so nothing changed in the behavior.
I am not sure whether this is a regression; I can only point to the differences in behavior between distributions.

KVP failure with v4.12 and
Jul 6 01:18:55 xenial KVP: read failed; error:9 Bad file descriptor
Jul 6 01:18:55 xenial systemd[1]: hv-kvp-daemon.service: Main process exited, code=exited, status=1/FAILURE
Jul 6 01:18:55 xenial systemd[1]: hv-kvp-daemon.service: Unit entered failed state.
Jul 6 01:18:55 xenial systemd[1]: hv-kvp-daemon.service: Failed with result 'exit-code'.

After this error, which occurs when systemd tries to start KVP again, manually running # systemctl start hv-kvp-daemon.service starts the daemon just fine.
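
As a rough illustration, the syslog lines above can be scanned mechanically for units that systemd has put into the failed state. This is only a sketch: the function name failed_units and the regular expression are assumptions made for this example, written against the exact message wording quoted in this report, and other systemd versions may word the message differently.

```python
import re

# Matches the systemd message quoted in this report, e.g.
# "systemd[1]: hv-kvp-daemon.service: Unit entered failed state."
# The pattern is an assumption based on that wording only.
FAILED_RE = re.compile(
    r"systemd\[\d+\]: (?P<unit>\S+\.service): Unit entered failed state\."
)


def failed_units(lines):
    """Return unit names reported as failed, in first-seen order, without duplicates."""
    seen = []
    for line in lines:
        match = FAILED_RE.search(line)
        if match and match.group("unit") not in seen:
            seen.append(match.group("unit"))
    return seen


# Sample input copied from the log excerpt in this comment.
sample = [
    "Jul 6 01:18:55 xenial KVP: read failed; error:9 Bad file descriptor",
    "Jul 6 01:18:55 xenial systemd[1]: hv-kvp-daemon.service: Main process exited, code=exited, status=1/FAILURE",
    "Jul 6 01:18:55 xenial systemd[1]: hv-kvp-daemon.service: Unit entered failed state.",
    "Jul 6 01:18:55 xenial systemd[1]: hv-kvp-daemon.service: Failed with result 'exit-code'.",
]

print(failed_units(sample))  # ['hv-kvp-daemon.service']
```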

Joshua R. Poulson (jrp) wrote :

This is a regression, we did lots of work a year ago to ensure the daemons handle bouncing integration services settings.

Joseph Salisbury (jsalisbury) wrote :

Do we know which kernel version caused the regression? If so, I can review the commits or perform a bisect if needed.

Chris Valean (cvalean) wrote :

We verified everything between 14.04.0 and 15.10. On those releases, the KVP daemon remains running on the VM even if we disable the Data Exchange integration service.

Only starting from 16.04 the KVP daemon stops when Data Exchange is disabled.
So 16.04 is the first version with this changed behavior, but then we get to the initial issue where systemd fails to start the service automatically, but it does load manually afterwards.

Joseph Salisbury (jsalisbury) wrote :

Do you know the specific kernel version when this bug started happening in 16.04? If not, we may have to test a few to narrow it down. All 16.04 kernels are available here:

https://launchpad.net/ubuntu/xenial/+source/linux

Brad Figg (brad-figg) wrote :

From what I was just told, no, they don't know when this started happening.

Joseph Salisbury (jsalisbury) wrote :

Would it be possible to test some of the early 16.04 kernels to try and figure out a last good kernel and first bad kernel?

Chris Valean (cvalean) wrote :

The same happens on 4.4.0-40 kernel, I think this goes with comment #6.

Joseph Salisbury (jsalisbury) wrote :

Can you see if the issue was happening back in 4.4.0-20:

https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/9583414

Ionut Lenghel (ilenghel) wrote :

The same happens on 4.4.0-20 kernel.

Joseph Salisbury (jsalisbury) wrote :

Let's try further back, can you test 4.4.0-10:
https://launchpad.net/ubuntu/+source/linux/4.4.0-10.25/+build/9287900

Ionut Lenghel (ilenghel) wrote :

I have checked back to 4.2.0-16. It seems that 4.4.0-2 is the last kernel where the daemons remain running even if integration services are disabled. Starting with 4.4.0-3, the daemons stop if the integration services are disabled, but do not automatically start when the integration services are enabled again.

I have also checked the content of linux-cloud-tools-common package for the above mentioned kernels, and it seems there is no difference between the two packages (as shown by the md5sum file).

Changed in linux (Ubuntu Xenial):
status: Triaged → In Progress
Changed in linux (Ubuntu Zesty):
status: Triaged → In Progress
Changed in linux (Ubuntu):
status: Triaged → In Progress
Joseph Salisbury (jsalisbury) wrote :

I started a kernel bisect between Ubuntu 4.4.0-2 and Ubuntu 4.4.0-3. The kernel bisect will require testing of about 7-10 test kernels.
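
For a sense of where an estimate like 7-10 test kernels comes from: a bisect halves the candidate range on every test, so the number of builds grows with the base-2 logarithm of the number of commits. The sketch below is illustrative only. The function name bisect_steps and the commit counts are assumptions, since the report does not state how many commits lie between 4.4.0-2 and 4.4.0-3, and the real tool's estimate can differ by a step.

```python
import math


def bisect_steps(num_commits):
    """Approximate number of test builds needed to isolate one commit
    out of num_commits candidates with a binary search."""
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    return math.ceil(math.log2(num_commits))


# Hypothetical commit counts, chosen only to bracket the 7-10 estimate.
for n in (128, 500, 1024):
    print(n, bisect_steps(n))
# 128 7
# 500 9
# 1024 10
```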

I built the first test kernel, up to the following commit:
ef754413085f5ab7688989585f9223fec0ed7870

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1701222

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ionut Lenghel (ilenghel) wrote :

I have tested the kernel provided in the link above. The daemons do not stop when the integration services are disabled.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
c6f73b9f6bb9c8bf8e5beb31bbe4e18cb8c4b6d4

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1701222

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ionut Lenghel (ilenghel) wrote :

I have tested the kernel provided in the link above. The issue still occurs.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
c0e46ba65adcd94a49c2d66bb600e66a56ace491

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1701222

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ionut Lenghel (ilenghel) wrote :

I have tested the kernel provided. The issue still occurs.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
501faf881377dc38998fe8603ee1767d61883cfb

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1701222

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ionut Lenghel (ilenghel) wrote :

I have tested the kernel provided. The issue still occurs.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
49594cb78e0decefdd60bf462a25d7f50e8d5239

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1701222

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ionut Lenghel (ilenghel) wrote :

I have tested the kernel above. The issue still occurs.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
74caeadc074b099b80abca86a19ab95003ca294e

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1701222

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ionut Lenghel (ilenghel) wrote :

I have tested the kernel above. The daemon still remains running after the integration service is being disabled.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
be636eeda2290c65f3283ee3bdc149a5d205a798

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1701222

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ionut Lenghel (ilenghel) wrote :

Tested the above kernel build. The issue still occurs.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
2d796f537d26e54a7fdf87632f9b8ee811759bc9

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1701222

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ionut Lenghel (ilenghel) wrote :

Tested the above kernel build. The issue still occurs.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
0cff5d394aaba1d6c75fe0036981b7a93a1c4b36

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1701222

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance


Ionut Lenghel (ilenghel) wrote :

I have tested the above kernel build. The issue still occurs.

Joseph Salisbury (jsalisbury) wrote :

I built the next test kernel, up to the following commit:
332e5643b07ad6f24fd86290b081fc43a2ece778

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1701222

Can you test that kernel and report back if it has the bug or not? I will build the next test kernel based on your test results.

Thanks in advance

Ionut Lenghel (ilenghel) wrote :

I have tested the above kernel build. The issue still occurs.

Joseph Salisbury (jsalisbury) wrote :

Xenial commit 332e5643b07 (mainline commit b521549a09ddfac) was reported as the first bad commit. I built a Xenial test kernel with this commit reverted.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1701222

Can you test that kernel and report back if it has the bug or not?

Ionut Lenghel (ilenghel) wrote :

I have tested the above kernel. The daemon stops if the integration service is being disabled, but does not automatically start after the service is being enabled again (same behavior as 4.4.0-3).

Joseph Salisbury (jsalisbury) wrote :

Thanks for testing. To confirm we provided the correct input to the bisect, can you test the following two kernels and confirm that they were good or bad:

Commit 332e564 (should be BAD and exhibit the bug):
http://kernel.ubuntu.com/~jsalisbury/lp1701222/332e564/

Commit 74caead (should be GOOD and not exhibit the bug):
http://kernel.ubuntu.com/~jsalisbury/lp1701222/74caead/

Ionut Lenghel (ilenghel) wrote :

I have tested both kernels provided, and both exhibit the bug. In both cases the daemon does not stop after the integration service is disabled.

Joseph Salisbury (jsalisbury) wrote :

Thanks for testing. We told the bisect commit 74caead was good per comment #27, which led us down the wrong bisect path.

I can build the next test kernel after telling the bisect commit 74caead was actually bad.

Before doing that: the last kernel we reported as good before that one was in comment #17. To confirm, can you test that kernel again and check that it is in fact good and does not have the bug? It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1701222/ef754413085f5/
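
The effect of a single wrong good/bad answer, as described above for commit 74caead, can be shown with a small simulation. This is a toy sketch, not the actual bisect tooling: the integer commit history, the position of the real change, and the names first_bad, truth and mislabelled are all hypothetical.

```python
def first_bad(commits, is_bad):
    """Binary search for the first commit where is_bad(commit) is True.

    Assumes commits are ordered oldest to newest, the oldest one is good
    and the newest one is bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


commits = list(range(16))  # hypothetical history: commit 0 (good) .. 15 (bad)


def truth(commit):
    # In this toy history the behaviour really changes at commit 5.
    return commit >= 5


def mislabelled(commit):
    # One wrong report: commit 7 is actually bad but is reported as good.
    return False if commit == 7 else truth(commit)


print(first_bad(commits, truth))        # 5
print(first_bad(commits, mislabelled))  # 8
```

With the wrong answer, the search discards the half of the history that contains the real change and converges on a later, unrelated commit, which is why re-checking the earlier good/bad reports comes before building further test kernels.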

Ionut Lenghel (ilenghel) wrote :

I have tested the kernel provided with commit ef754413085f5 and the bug persists. The daemons do not stop if the integration services are being disabled.

The kernel provided in comment #36 was not affected by the bug. That was the first good commit.

Regarding comment #27, the kernel tested was reported as bad (the daemon still remains running after the integration service is being disabled).

Just to make sure we got this right, there are 2 issues right now:

1. The daemons do not stop if the integration service is disabled; they remain running in the process list, but are not functional. This issue is present in the kernels tested up to 4.4.0-2. This issue is NOT present in kernel 4.4.0-3; there the daemon processes do stop if the integration service is disabled. Regarding this issue, the kernel provided in comment #36 is a good kernel.

2. The daemons do not automatically start after the integration service is disabled and then enabled again. This is the initial bug discovered in 16.04 and 17.04.

Kernel 4.4.0-3 does not have the first issue, but it has the second one.
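
To keep the two issues apart when reporting a test result, each kernel can be described by two separate observations instead of a single "the issue still occurs". The sketch below is only one way to write that down; the Observation type, its field names and the issues function are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    # The daemon process exits when the integration service is disabled.
    stops_on_disable: bool
    # The daemon is running and functional again after the service is
    # re-enabled, without starting it manually.
    restarts_on_enable: bool


def issues(obs):
    """Return which of the two issues listed above (1 and/or 2) a kernel shows."""
    found = []
    if not obs.stops_on_disable:
        found.append(1)
    if not obs.restarts_on_enable:
        found.append(2)
    return found


# Kernel 4.4.0-3 as described in this comment: the daemon stops, but does not come back.
print(issues(Observation(stops_on_disable=True, restarts_on_enable=False)))  # [2]
# A hypothetical kernel with neither issue.
print(issues(Observation(stops_on_disable=True, restarts_on_enable=True)))   # []
```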

Joseph Salisbury (jsalisbury) wrote :

Thanks for the clarification. Since there are two issues, we should probably focus on one at a time. The test kernel posted in comment #36 sounds like it fixes the first issue, so let's handle that one first.

I'll ping the upstream patch author for that commit and get some feedback. Before doing that, can you test 4.14-rc1 to be sure this issue is not already fixed in mainline? It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc1/

Ionut Lenghel (ilenghel) wrote :

I have tested the mainline kernel provided with the daemons compiled from linux-next. In this case, the daemons do stop when the integration service is disabled, but do not automatically start after we re-enable the integration service. Manually starting the service works fine. So the mainline kernel is still affected by the second issue.

Joseph Salisbury (jsalisbury) wrote :

Ok, I reread the bug description and more importantly, comment #15. So it sounds like the first issue has been fixed in all kernels since kernel version 4.4.0-2, but that is when the second issue was introduced. If that is the case, we should focus on the second issue, where the daemons do not automatically start after the integration service is disabled and then enabled again.

We know that version 4.4.0-2 has the first issue, but does it also have the second issue?

Ionut Lenghel (ilenghel) wrote :

I have tested 4.4.0-2.16 from here: https://launchpad.net/ubuntu/+source/linux/4.4.0-2.16/+build/8908724

In this case, the daemon remains running when the integration service is disabled. systemctl status shows the daemon as running and active, but the daemon is not actually working. Re-enabling the service does not make the daemon work properly. Manually starting a new instance of the daemon results in a working daemon, but with two different PIDs in the process list.

To sum it up, yes, it has the second issue as well.

Joseph Salisbury (jsalisbury) wrote :

Thanks for the update. So the mainline kernel still has the second issue and 4.4.0-2.16 also has the second issue. Presumably all the kernel versions in between these two also have the second issue.

To determine if the second issue is a regression, we should test further back from 4.4.0-2.

Can you test the 4.3.0-1 kernel:
https://launchpad.net/ubuntu/+source/linux/4.3.0-1.10/+build/8370648

Ionut Lenghel (ilenghel) wrote :

I have tested Xenial back to 4.2.0-16 from here: https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/ppa/+build/8099555

The second issue is still present.

Ionut Lenghel (ilenghel) wrote :

I have tested the proposed kernel, but with this version both bugs are showing up.

Could it be that the bug is not in the kernel, but in systemd?
Below is a comparison of strace output captured on RHEL 7.3 and Ubuntu 16.04.1 when trying to re-enable the Data Exchange integration service:
http://pastebin.ubuntu.com/25615867/

Since manually starting the daemon after re-enabling the integration service works fine, could it be that systemd doesn't handle the daemon starting process when it receives the call to do so?
Is there any way to see how systemd handles the call to start the daemon?

Joseph Salisbury (jsalisbury) wrote :

It is possible. I've added a systemd bug task.

The unit entered a failed state, and thus would not be automatically started until the failed state clears.
Why does bouncing of the services result in the daemon exiting with an error condition? Should that particular exit code result in a graceful shutdown of the service, such that the unit can be restarted again?
What userspace events should be causing the start of this unit? a udev event / udev rule? a .path unit monitoring sysfs or some such?

Is it possible to reproduce this on e.g. azure for me to investigate?

I'm going to remove per-series tasks from systemd until this issue is triaged. Per-series tasks on the src:systemd package are used to track fixes/patches which have been developed and are ready for inclusion in the distribution.

no longer affects: systemd (Ubuntu Artful)
no longer affects: systemd (Ubuntu Zesty)
no longer affects: systemd (Ubuntu Xenial)
Joshua R. Poulson (jrp) wrote :

Since it requires control of the host, this cannot be done in Azure. Control of the host is rather restricted there.

Changed in linux (Ubuntu Xenial):
status: In Progress → Incomplete
Changed in linux (Ubuntu Zesty):
status: In Progress → Incomplete
Changed in linux (Ubuntu Artful):
status: In Progress → Incomplete
Changed in linux (Ubuntu):
status: In Progress → Incomplete