[Hyper-V] Ubuntu 14.04.2 LTS Generation 2 SCSI Errors on VSS Based Backups

Bug #1470250 reported by Joseph Salisbury
This bug affects 27 people
Affects | Status | Importance | Assigned to | Milestone
linux (Ubuntu) | Fix Released | Critical | Joseph Salisbury |
Trusty | Won't Fix | High | Joseph Salisbury |
Xenial | Fix Released | Critical | Joseph Salisbury |
Yakkety | Fix Released | Critical | Joseph Salisbury |
Zesty | Fix Released | Critical | Joseph Salisbury |

Bug Description

Customers have reported running various versions of Ubuntu 14.04.2 LTS on Generation 2 Hyper-V hosts. At random, the file system is remounted read-only due to a "disk error" (which really isn't the case here). As a result, they must reboot the Ubuntu guest to get the file system to mount read-write again.

The errors seen are the following:
Apr 30 00:02:01 balticnetworkstraining kernel: [640153.968142] storvsc: Sense Key : Unit Attention [current]
Apr 30 00:02:01 balticnetworkstraining kernel: [640153.968145] storvsc: Add. Sense: Changed operating definition
Apr 30 00:02:01 balticnetworkstraining kernel: [640153.968161] sd 0:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.
Apr 30 01:23:26 balticnetworkstraining kernel: [645039.584164] hv_storvsc vmbus_0_4: cmd 0x2a scsi status 0x2 srb status 0x82
Apr 30 01:23:26 balticnetworkstraining kernel: [645039.584178] hv_storvsc vmbus_0_4: stor pkt ffff88006eb6c700 autosense data valid - len 18
Apr 30 01:23:26 balticnetworkstraining kernel: [645039.584180] storvsc: Sense Key : Unit Attention [current]
Apr 30 01:23:26 balticnetworkstraining kernel: [645039.584183] storvsc: Add. Sense: Changed operating definition
Apr 30 01:23:26 balticnetworkstraining kernel: [645039.584198] sd 0:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.
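
As an aid to reading the hv_storvsc lines above, here is a minimal Python sketch that decodes the three hex fields. The opcode and status tables are small subsets taken from the SCSI command set and from the SRB status values defined in the Linux storvsc driver, as far as I know them. The function name decode_storvsc and the regular expression are assumptions made for illustration, not part of any existing tool.

```python
import re

# Subset of SCSI opcodes that appear in this report.
SCSI_OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)", 0x35: "SYNCHRONIZE CACHE(10)"}
# Subset of SCSI status byte values.
SCSI_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
# SRB status: the top two bits are flags, the low six bits are the status code.
SRB_FLAGS = {0x80: "AUTOSENSE_VALID", 0x40: "QUEUE_FROZEN"}
SRB_CODES = {0x00: "PENDING", 0x01: "SUCCESS", 0x02: "ABORTED", 0x04: "ERROR"}

# Hypothetical pattern for the "cmd ... scsi status ... srb status ..." log line.
LINE_RE = re.compile(
    r"cmd (0x[0-9a-f]+) scsi status (0x[0-9a-f]+) srb status (0x[0-9a-f]+)",
    re.IGNORECASE,
)


def decode_storvsc(line):
    """Return a dict of decoded fields, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    cmd, scsi, srb = (int(x, 16) for x in m.groups())
    flags = [name for bit, name in SRB_FLAGS.items() if srb & bit]
    code = srb & 0x3F  # strip the two flag bits
    return {
        "cmd": SCSI_OPCODES.get(cmd, hex(cmd)),
        "scsi_status": SCSI_STATUS.get(scsi, hex(scsi)),
        "srb_status": SRB_CODES.get(code, hex(code)),
        "srb_flags": flags,
    }
```

Applied to the 01:23:26 line above, this reports a WRITE(10) that returned CHECK CONDITION, with SRB status ABORTED and the autosense-valid flag set.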

This relates to the VSS "Windows Server Backup" process that kicks off at midnight on the host and finishes an hour and a half later.
Yes, we do have hv_vss_daemon and hv_kvp_daemon running for the correct kernel version we have. We're currently running kernel version 3.13.0-49-generic #83 on one system and 3.16.0-34-generic #37 on the other. We see the same errors on both.
As a result, we've been hesitant to put any more Ubuntu guests on our 2012 R2 Hyper-V system. We can stop the backup process and all is good, but we need nightly backups to image all of our VMs. The Windows guests have no issues, of course. We also have some CentOS-based guests running without issues from what we've seen.

Joseph Salisbury (jsalisbury) wrote :
Changed in linux (Ubuntu):
status: New → In Progress
importance: Undecided → Critical
Frederik Bosch (f-bosch) wrote :

My latest report was that the latest builds with patches are much more stable but are not a complete fix for the problem. It is still there and occurs randomly. The error message has not changed. I have no real indication of what causes the read-only state. During the latest read-only state I noticed there was an I/O peak at that time. However, the I/O peak was just from unpacking some files from a tar.gz.

Joseph Salisbury (jsalisbury) wrote :

Error message that happens when this bug occurs:

[154272.293488] sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
[154272.293508] sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
[154272.293665] sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.
[154272.293671] blk_update_request: I/O error, dev sda, sector 201805560
[154272.293718] Aborting journal on device sda1-8.
[154272.314119] EXT4-fs error (device sda1): ext4_journal_check_start:56: Detected aborted journal
[154272.314154] EXT4-fs (sda1): Remounting filesystem read-only
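
Later comments in this thread note that the Unit Attention warning also shows up during successful backups, so it helps to separate that warning from the actual read-only remount when scanning logs. The following is a minimal Python sketch of such a scan. The function name summarize_kernel_log and both patterns are illustrative assumptions based only on the message text quoted above.

```python
import re

# Patterns modelled on the kernel messages quoted in this report.
WARNING_RE = re.compile(r"Add\. Sense: Changed operating definition")
REMOUNT_RE = re.compile(r"EXT4-fs \((\w+)\): Remounting filesystem read-only")


def summarize_kernel_log(lines):
    """Count Unit Attention warnings and list devices remounted read-only."""
    warnings = 0
    remounted = []
    for line in lines:
        if WARNING_RE.search(line):
            warnings += 1
        m = REMOUNT_RE.search(line)
        if m:
            remounted.append(m.group(1))
    return warnings, remounted
```

A run that reports warnings but an empty device list would correspond to a backup that logged the SCSI message without the filesystem going read-only.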

Joseph Salisbury (jsalisbury) wrote :

@Frederik Bosch Can you post what kernel version you are currently using?

Joshua R. Poulson (jrp) wrote :

I believe he is running 14.04.2, which means at least the HWE kernel.

Frederik Bosch (f-bosch) wrote :

@jrp @jsalisbury I am using this kernel: http://kernel.ubuntu.com/~jsalisbury/lp1445195/vivid/ on 14.04.2. So that build is much more stable but a complete fix of the problem.

Frederik Bosch (f-bosch) wrote :

@jrp @jsalisbury I am using this kernel: http://kernel.ubuntu.com/~jsalisbury/lp1445195/vivid/ on 14.04.2. So that build is much more stable but NOT a complete fix of the problem.

Joseph Salisbury (jsalisbury) wrote :

Prior comments regarding this issue can be found in bug 1445195

Changed in linux (Ubuntu Vivid):
status: New → In Progress
Changed in linux (Ubuntu Utopic):
status: New → In Progress
Changed in linux (Ubuntu Trusty):
status: New → In Progress
Changed in linux (Ubuntu Vivid):
importance: Undecided → High
Changed in linux (Ubuntu Utopic):
importance: Undecided → High
Changed in linux (Ubuntu Trusty):
importance: Undecided → High
Changed in linux (Ubuntu Wily):
importance: Critical → High
Changed in linux (Ubuntu Vivid):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Utopic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Trusty):
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

The Wily kernel has been rebased to upstream 4.1, which has all the current Hyper-V commits in mainline. Can you test this kernel to see whether it still exhibits this issue or whether it is resolved?

If it still exhibits the issue, we know that a new fix is needed. If this test kernel fixes this issue, we know it is fixed upstream and we need to identify which commit(s) fix things.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1470250/wily/

Thanks in advance!

Frederik Bosch (f-bosch) wrote :

@jsalisbury I did not find any new commits on this subject in the current kernel master (https://github.com/torvalds/linux). And I believe you already included all HV commits in the last test build from bug 1441595.

So testing this test kernel would mean I am testing whether another commit (not specifically for this issue) might have fixed this issue, right?

Joseph Salisbury (jsalisbury) wrote :

@Frederik, yes that is correct. This kernel basically has all HV related commits that are currently in mainline.

Frederik Bosch (f-bosch) wrote :

@jsalisbury I will start testing this week. However, I feel the latest reports were pretty clear: the issue is there. While I was the only one at first that still had problems, after a while more people reported (in bug 1441595) that the new build still contains the issue. In my opinion, it is now HyperV team's turn to come up with a final solution. Nevertheless, I want to contribute where possible. /cc @jrp

Frederik Bosch (f-bosch) wrote :

That should have been bug 1445195.

Dustin (dander88) wrote :

We have the same VSS issues in 14.04 LTS, but on Hyper-V (2012) Gen 1. Has this been seen before? We can reproduce the error on command.

Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.

Joseph Salisbury (jsalisbury) wrote :

@Dustin it sounds like you have a reliable way to reproduce this bug? If so, can you list those steps here for others to try?

Also, if you can reproduce this, would it be possible for you to test the kernel posted in comment #9?

Thanks!

ubuntu (h-lbuntu-2) wrote :

On the other bug (1445195) somebody reported seeing the same error on Gen 1 devices. I wanted to report that we see the identical bug on both Gen 1 and Gen 2 devices. Frequency does not appear to be any different but I don't have precise data.

Dustin (dander88) wrote :

@Joseph - There is not much to it: we use a backup program called Altaro. It will produce this error about every 10 or so backups.

We are working on getting that new kernel into some test units. I will post results when we are done. Thanks for the follow up.

Chris Valean (cvalean) wrote :

Hi Dustin,
Some questions on the topic; my apologies if these were answered before or in other threads.

1. Is this repro using Windows Server Backup directly, and not through Altaro?
2. For the VM setup, is this a standard local vhdx on SCSI controller 0 for a Gen2 VM for the OS disk? Or are there any other disks attached to the VM?
3. Backup location - is this done to a separate local disk, or where exactly?
4. VM and VM disk load (I/O) - earlier I saw that there was only an archive untar; what is the general load, or which services are running on the system at the time of the backup?

John Wilkinson (cohn) wrote :

@Joseph Salisbury Is there a specific subset of those .deb packages that need to be run, or are they all needed to patch the relevant bugs?

Dustin (dander88) wrote :

@Chris -

1 - No local backup at all; everything goes through Altaro.
2 - Standard local disk on SCSI 0.
3 - Sent to a local NAS.
4 - I don't know the exact load that they currently have. I know that it is under 80% of system resources for sure. I'm not 100% sure about the services; I know MySQL, other than that not sure.

Dustin (dander88) wrote :

After running the patches, we are still seeing the same error in the syslog. Options?

Joseph Salisbury (jsalisbury) wrote :

@Dustin, can you run "uname -a" to confirm your machine is running the latest Wily kernel built from the current mainline kernel?

Joseph Salisbury (jsalisbury) wrote :

@John Wilkinson, you should only need the linux-image and linux-image-extra .deb packages to install the latest kernel. The -headers .deb packages should not be needed.

Frederik Bosch (f-bosch) wrote :

@dander88 According to @jrp the message "Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters." is benign. He mentioned that in bug 1445195.

Are you just seeing that message or do your systems also go into read-only mode? From my point of view the SCSI message has no significance to the read-only bug. The message also pops up with successful backups.

@h-lbuntu-2 What kind of bug do you mean? Also the SCSI message? Or do you also have read-only problems?

@cohn I would install them all. Be aware you might have to ignore dependencies when you are on 14.04 and go to kernel 3.19, e.g. binutils. Run dpkg --ignore-depends=binutils -i *.deb to install them. It is fine to do that: those dependencies can be ignored without problems.

Frederik Bosch (f-bosch) wrote :

@cohn By the way, I have not installed the 4.x kernels yet. So I have no idea how many dependency problems you will run into. Could you let me know?

Dustin (dander88) wrote :

@F-Bosch we did not enter the "read-only" mode with the one test we tried. I will keep a backup schedule going multiple times a day and see if it ever goes into read only mode. I will report back in a few days to let you know the results
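
For anyone running this kind of repeated backup test, a small check for read-only mounts can be scheduled alongside it. Below is a minimal Python sketch, assuming the usual /proc/mounts field order (device, mount point, filesystem type, options). The names readonly_mounts, check_system and DISK_FS_TYPES are hypothetical, and the list of filesystem types is only an example.

```python
# Filesystem types to inspect; pseudo-filesystems are often read-only by design.
DISK_FS_TYPES = ("ext2", "ext3", "ext4", "xfs", "btrfs")


def readonly_mounts(mounts_text, fs_types=DISK_FS_TYPES):
    """Return mount points that carry the 'ro' option in /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mount_point, fs_type, options = fields[:4]
        # Compare whole options so that 'errors=remount-ro' does not match.
        if fs_type in fs_types and "ro" in options.split(","):
            found.append(mount_point)
    return found


def check_system(path="/proc/mounts"):
    """Read the live mount table and report read-only disk filesystems."""
    with open(path) as f:
        return readonly_mounts(f.read())
```

Calling check_system() from cron after each backup and alerting on a non-empty result would catch the failure without waiting for an application to notice.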

Dustin (dander88) wrote :

So far so good after the kernel update. I have been backing up a test VM 4 times a day for almost 2 weeks. It has not gone into read-only mode as of yet.

Frederik Bosch (f-bosch) wrote :

@dander88 That sounds great, but we had the same results for a test VPS machine. As indicated by @jrp before, whether the machine goes read-only depends on your I/O load. And test machines usually do not have that much I/O load.

Joseph Salisbury (jsalisbury) wrote :

@Frederik, have you tested with the Wily kernel yet, posted in comment #9? The Wily kernel has since been rebased to 4.2, so testing of the latest kernel by applying the latest Wily updates would be great.

ubuntu (h-lbuntu-2) wrote :

Bug is still present in Vivid 3.19.0-26-generic.

Is there a workaround that avoids this problem? There's considerable pressure to move off of Hyper-V and I'd rather not do it.

[57055.788468] sd 0:0:0:0: [storvsc] Sense Key : Unit Attention [current]
[57055.788561] sd 0:0:0:0: [storvsc] Add. Sense: Changed operating definition
[57055.788704] sd 0:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.
[57055.788719] blk_update_request: I/O error, dev sda, sector 207053224
[57055.788924] Aborting journal on device sda2-8.
[57055.880744] EXT4-fs error (device sda2): ext4_journal_check_start:56: Detected aborted journal
[57055.880833] EXT4-fs (sda2): Remounting filesystem read-only
[57055.885165] sd 0:0:0:1: [storvsc] Sense Key : Unit Attention [current]
[57055.885269] sd 0:0:0:1: [storvsc] Add. Sense: Changed operating definition
[57055.885342] sd 0:0:0:1: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.
[57315.373166] sd 0:0:0:2: [storvsc] Sense Key : Unit Attention [current]
[57315.373230] sd 0:0:0:2: [storvsc] Add. Sense: Changed operating definition
[57315.373379] sd 0:0:0:2: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.

Linux db003 3.19.0-26-generic #28-Ubuntu SMP Tue Aug 11 14:16:32 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

ubuntu (h-lbuntu-2) wrote :

To add to comment #30 I posted, the read-only bug continues to occur on both Generation 1 and Generation 2 VMs.

Joseph Salisbury (jsalisbury) wrote :

@ubuntu

Is it possible for you to test the latest Wily kernel? The Wily kernel has been rebased to the upstream 4.2 kernel, so it should have all the latest Hyper-V updates in mainline.

The Wily kernel can be downloaded from:
https://launchpad.net/ubuntu/+source/linux/4.2.0-7.7/+build/7856238

ubuntu (h-lbuntu-2) wrote :

@Joseph

Thank you. I upgraded the production server that most often errors out to the Wily kernel you referenced above and will report back. Please be patient; an individual VM can go several days without the error.

Frederik Bosch (f-bosch) wrote :

Unfortunately, I have not had time to test the Wily kernels. But the other 3.x stable kernels still contain the problem.

Joshua R. Poulson (jrp) wrote :

We're getting ready for a new round of storvsc fixes that correspond to LIS 4.0.11; we'll have to see if that improves things further.

Frederik Bosch (f-bosch) wrote :

@jrp Thanks for that, I am happy to hear this is still being worked on. Let me know if there is a build that I can test.
@h-lbuntu-2 How are your results with the Wily kernels?

ubuntu (h-lbuntu-2) wrote :

Just wanted to report that I installed the Wily kernel on Sept 3rd and the VMs ran without errors until yesterday. The errors are different than before but do the same thing:
blk_update_request: I/O error, dev sda, sector 206963000
Aborting journal on device sda2-8
EXT4-fs errors (device sda2): ext4_journal_check_start:56: Detected aborted journal
EXT4-fs (sda2): Remounting filesystem read-only

The Fold (stuart-luscombe) wrote :

I am experiencing this same issue when backing up a 14.04 LTS Gen 1 VM using Veeam. The error seems to occur when the VSS snapshots are being taken. The error did not occur until I had followed Microsoft's instructions on packages to install for Ubuntu (https://technet.microsoft.com/en-GB/library/dn531029.aspx).

Frederik Bosch (f-bosch) wrote :

What do you mean by "did not occur until you followed those instructions"? Do you mean that without the daemons everything was fine?

Frederik Bosch (f-bosch) wrote :

One question that still has no answer: what are the Ubuntu specifics that cause this issue? In the 10 months we have had these machines running, there have been many crashes for Ubuntu with VSS snapshots, while the CentOS machines have had no crash at all. Maybe @jsalisbury has an explanation for this?

no longer affects: linux (Ubuntu Utopic)
Changed in linux (Ubuntu Xenial):
importance: High → Critical
Changed in linux (Ubuntu Xenial):
importance: Critical → High
importance: High → Critical
tags: added: patch
293 comments hidden view all 373 comments
Joseph Salisbury (jsalisbury) wrote :

Thanks for the update. I'll rebuild the 14.04 test kernel, but this time with both patches. That is what the Xenial test kernel has.

Joseph Salisbury (jsalisbury) wrote :

I'm still working on getting a Trusty test kernel built with both patches. Alex's patch requires some prereq commits to work with Trusty. I should have a test kernel ready shortly.

Emsi (trash1-z) wrote :

Thank you for the update. I'm staying tuned :)

Brad Figg (brad-figg)
no longer affects: linux (Ubuntu Wily)
no longer affects: linux (Ubuntu Vivid)
Joseph Salisbury (jsalisbury) wrote :

There should be a Trusty test kernel soon. I've had to identify 13 prereq commits to get the two patches to apply and build properly.

For Xenial and newer, there have been positive test results from Emsi and faulpeltz with Alex's patch and the one from faulpeltz. Are we comfortable with submitting an SRU request for those two patches, or do we think more testing is required?

Alex Ng (alexng-v) wrote :

Hi @faulpeltz,

A few questions/comments about your patch:

1) Can you submit your patch to the upstream kernel?
2) Under load, were you able to measure how long the FIFREEZE operation took before it succeeded? I'm trying to see if we can increase the timeout of the kernel driver before it hits the error condition that you encountered.

Thanks,
Alex

faulpeltz (mg-h) wrote :

@Alex
1) Yes, but I might need some help with that. Which list/maintainer should I submit it to?

2) On our test machine, with both the Hyper-V host and the guest under heavy I/O load, it was a few hundred ms but with high variance and spiking (quite often) into the 2-4s range, with some extreme values being 10s and more.
I assume that this varies a lot with regards to the test hardware, especially storage.
With light load (but not idle) this was well in the <100ms range but with occasional spikes.
Another thing I noticed was that the freeze operation often took a lot longer (but not always) when triggered directly by the vss daemon through a host backup and not just from a standalone test program which just called FREEZE/THAW in a loop.

Alex Ng (alexng-v) wrote :

Thanks @faulpeltz for the info.

I sent you a private message with the list of maintainers to send the patch (trying to avoid pasting it here in case spam bots can go through this archive).

Joseph Salisbury (jsalisbury) wrote :

I was able to backport the two patches to Trusty. It required quite a few prerequisite commits. The following commits were backported to a Trusty test kernel:

f9cc88d lp1470250: patch from faulpeltz
6f3e8a2 Drivers: hv: utils: Continue to poll VSS channel after handling requests.
725a85d Drivers: hv: utils: fix a race on userspace daemons registration
a0c12c6 Drivers: hv: kvp: fix IP Failover
51fea7d Drivers: hv: utils: Invoke the poll function after handshake
42fb309 Drivers: hv: utils: run polling callback always in interrupt context
505de58 Drivers: hv: fcopy: full handshake support
2ec9789 Drivers: hv: vss: full handshake support
e12c519 Tools: hv: vss: use misc char device to communicate with kernel
1f3abc0 Drivers: hv: kvp: convert to hv_utils_transport
d789589 Drivers: hv: fcopy: convert to hv_utils_transport
51ed5da Drivers: hv: vss: convert to hv_utils_transport
edbff5d Drivers: hv: util: introduce hv_utils_transport abstraction
056fbb5 Drivers: hv: fcopy: switch to using the hvutil_device_state state machine
5eb0af4 Drivers: hv: vss: switch to using the hvutil_device_state state machine
a170e78 Drivers: hv: kvp: switch to using the hvutil_device_state state machine
a539598 Drivers: hv: fcopy: rename fcopy_work -> fcopy_timeout_work
7894b35 Drivers: hv: kvp: rename kvp_work -> kvp_timeout_work
d3fc031 Drivers: hv: kvp,vss: Fast propagation of userspace communication failure
ee85362 Drivers: hv: vss: Introduce timeout for communication with userspace
d714c5a Tools: hv: vssdaemon: ignore the EBUSY on multiple freezing the same partition
b957545 connector: add portid to unicast in addition to broadcasting

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1470250/AlexFaulpeltzPatch/trusty/

Can this kernel be tested to see if it resolves this bug?

Note that both the linux-image and linux-image-extra .deb packages need to be installed.

Benjamin Ihrig (benjamin-ihrig) wrote :

Any news on this? Recently it seemed like a fix was found, but no update for about a month now.

Thanks guys!

Alex Ng (alexng-v) wrote :

I'll let @jsalisbury comment on the status of his backported patches.

@jsalisbury, I'd also take this recently submitted patch from the upstream kernel. It ensures that the VSS driver doesn't time out long-running FREEZE operations too early. This should preclude the need for faulpeltz's patch in most cases, unless the FREEZE operation takes extraordinarily long (> 15 mins).

https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/hv?id=b357fd3908c1191f2f56e38aa77f2aecdae18bc8

faulpeltz (mg-h) wrote :

@Alex
The modified timeout should take care of the issue, but I think it's a good idea for the VSS daemon to issue a THAW before either exiting or trying to recover.

Alex Ng (alexng-v) wrote :

@faulpeltz
I agree. Your patch should also be included in case a FREEZE operation does exceed the increased timeout.

I've attached a modified version of your patch that addresses some concerns we had offline. Could you give it a try?
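
The idea discussed here, thawing whatever is already frozen when a FREEZE fails, can be sketched roughly as below. This is a Python illustration of the control flow only, not the attached patch, which modifies the C daemon. The function freeze_all and its freeze/thaw callables are hypothetical stand-ins for the real ioctl calls.

```python
def freeze_all(mount_points, freeze, thaw):
    """Freeze every mount point; on failure, thaw the ones already frozen.

    `freeze` and `thaw` are callables that take a mount point and raise
    OSError on failure. Returns True if everything is frozen, or False if
    a freeze failed and the earlier freezes were rolled back.
    """
    frozen = []
    for mount_point in mount_points:
        try:
            freeze(mount_point)
        except OSError:
            # Undo in reverse order so no filesystem is left frozen.
            for done in reversed(frozen):
                thaw(done)
            return False
        frozen.append(mount_point)
    return True
```

With this shape the caller can report the failure to the host and keep running, instead of exiting while a filesystem is still frozen.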

faulpeltz (mg-h) wrote :

@Alex
I will try as soon as I have some spare time

@jsalisbury
Unfortunately I didn't have time to test the kernel from #341.
Including the #343 upstream commit should take care of our issue, and using the patch from #345 (replacing my initial patch) should prevent any hv_vss_daemon crashes in extremely slow cases

Joseph Salisbury (jsalisbury) wrote :

@faulpeltz, Thanks for the update. I'll build a test kernel with all the latest patches and backport them to prior releases.

Just to confirm, so I build the test kernels with everything needed, we need the following:

New patch from comment #343:
Drivers: hv: vss: Operation timeouts should match host expectation

The updated patch from @faulpeltz posted in comment #345:
Tools: hv: vss: Thaw the filesystem and continue after freeze fails

And lastly, Alex's patch, which has already landed in mainline:
497af84 Drivers: hv: utils: Continue to poll VSS channel after handling requests.

I have not submitted an SRU request for these three patches yet. I just want to confirm all we need are these three patches, and have good testing feedback.

Alex Ng (alexng-v) wrote :

Thanks Joseph for compiling the list. The patches you've outlined should be sufficient.

Joseph Salisbury (jsalisbury) wrote :

I built Yakkety and Xenial test kernels with the three patches mentioned in comment #347. It would be great if these test kernels can be tested. If they resolve the bug, I'll submit an SRU request to have them included in Yakkety and Xenial.

The test kernels can be downloaded from:
Xenial: http://kernel.ubuntu.com/~jsalisbury/lp1470250/xenial/
Yakkety: http://kernel.ubuntu.com/~jsalisbury/lp1470250/yakkety

The Trusty kernel requires quite a few prerequisite commits and some backporting. I'll post a Trusty test kernel once it's available.

The Zesty kernel will pick up these patches when they land in mainline. The patches will be SRU'd to Zesty if they don't land in mainline in time for release.

Emsi (trash1-z) wrote :
Joseph Salisbury (jsalisbury) wrote :

@Emsi, right now there are only Xenial and Yakkety test kernels with all the updated patches. I will have a Trusty test kernel available shortly.

Joseph Salisbury (jsalisbury) wrote :

@Emsi and whoever can test Trusty. There is a test kernel available at:
http://kernel.ubuntu.com/~jsalisbury/lp1470250/trusty

This test kernel has the updated 3 patches. It also required 19 prerequisite patches.

Joseph Salisbury (jsalisbury) wrote :

For those interested, the Trusty test kernel contains the following commits:

e81d871 UBUNTU: SAUCE: (no-up) Tools: hv: vss: Thaw the filesystem and continue after freeze fails
01032c1 UBUNTU: SAUCE: (no-up) Drivers: hv: vss: Operation timeouts should match host expectation
4a6f680 Drivers: hv: utils: Continue to poll VSS channel after handling requests.
61c18ef Drivers: hv: kvp: fix IP Failover
db70a1c Drivers: hv: utils: Invoke the poll function after handshake
3c8486d Drivers: hv: utils: run polling callback always in interrupt context
e6e9811 Drivers: hv: fcopy: full handshake support
8f0d521 Drivers: hv: vss: full handshake support
8b167aa Tools: hv: vss: use misc char device to communicate with kernel
6712c24 Drivers: hv: kvp: convert to hv_utils_transport
e5bd829 Drivers: hv: fcopy: convert to hv_utils_transport
9ec74ad Drivers: hv: vss: convert to hv_utils_transport
a2adece Drivers: hv: util: introduce hv_utils_transport abstraction
7aa0716 Drivers: hv: fcopy: switch to using the hvutil_device_state state machine
d804513 Drivers: hv: vss: switch to using the hvutil_device_state state machine
5c9bfa1 Drivers: hv: kvp: switch to using the hvutil_device_state state machine
11aef70 Drivers: hv: fcopy: rename fcopy_work -> fcopy_timeout_work
742f132 Drivers: hv: kvp: rename kvp_work -> kvp_timeout_work
a5a8e26 Drivers: hv: kvp,vss: Fast propagation of userspace communication failure
77fbf9a Drivers: hv: vss: Introduce timeout for communication with userspace
d1401e0 Tools: hv: vssdaemon: ignore the EBUSY on multiple freezing the same partition
fcee4902 connector: add portid to unicast in addition to broadcasting

Joshua R. Poulson (jrp) wrote :

Microsoft will test the trusty kernel in the coming days. The Xenial kernel may be getting all the patches from the 4.9 rebase that's currently in progress.

Emsi (trash1-z) wrote :

3.13.0-103-generic #150~lp1470250 SMP looks promising.
Several days without crash. I encountered some IO errors but the filesystem remains rw.

[Wed Jan 11 13:47:35 2017] hv_storvsc vmbus_0_1: cmd 0x35 scsi status 0x2 srb status 0x82
[Wed Jan 11 13:47:35 2017] hv_storvsc vmbus_0_1: stor pkt ffff8800f0b0f728 autosense data valid - len 18
[Wed Jan 11 13:47:35 2017] storvsc: Sense Key : Unit Attention [current]
[Wed Jan 11 13:47:35 2017] storvsc: Add. Sense: Changed operating definition
[Wed Jan 11 13:47:35 2017] sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.
[Wed Jan 11 13:47:35 2017] end_request: I/O error, dev sda, sector 0
[Wed Jan 11 13:48:56 2017] hv_storvsc vmbus_0_1: cmd 0x28 scsi status 0x2 srb status 0x82
[Wed Jan 11 13:48:56 2017] hv_storvsc vmbus_0_1: stor pkt ffff8800f0fa0668 autosense data valid - len 18
[Wed Jan 11 13:48:56 2017] storvsc: Sense Key : Unit Attention [current]
[Wed Jan 11 13:48:56 2017] storvsc: Add. Sense: Changed operating definition
[Wed Jan 11 13:48:56 2017] sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.
[Wed Jan 11 14:07:56 2017] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x82
[Wed Jan 11 14:07:56 2017] hv_storvsc vmbus_0_1: stor pkt ffff8800f0fa50a8 autosense data valid - len 18
[Wed Jan 11 14:07:56 2017] storvsc: Sense Key : Unit Attention [current]
[Wed Jan 11 14:07:56 2017] storvsc: Add. Sense: Changed operating definition
[Wed Jan 11 14:07:56 2017] sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.

James Straub (jstraub) wrote :

It's been more than 1.5 years!
When can we expect this fix to be released?

$ uname -a
Linux Server123 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty

Jan 30 01:54:55 Server123 kernel: [78577.687381] cifs_vfs_err: 18 callbacks suppressed
Jan 30 01:54:55 Server123 kernel: [78577.687389] CIFS VFS: open dir failed
Jan 30 01:54:55 Server123 kernel: [78577.692074] CIFS VFS: open dir failed
Jan 30 01:54:56 Server123 kernel: [78578.027096] CIFS VFS: open dir failed
Jan 30 01:54:56 Server123 kernel: [78578.031664] CIFS VFS: open dir failed
Jan 30 03:00:27 Server123 Hyper-V VSS: VSS: freeze of /archive: Success
Jan 30 03:00:27 Server123 Hyper-V VSS: VSS: freeze of /: Success
Jan 30 03:00:27 Server123 kernel: [82509.382621] hv_storvsc vmbus_0_1: cmd 0x28 scsi status 0x2 srb status 0x82
Jan 30 03:00:27 Server123 kernel: [82509.382730] hv_storvsc vmbus_0_1: stor pkt ffff8800d306d368 autosense data valid - len 18
Jan 30 03:00:27 Server123 kernel: [82509.382735] storvsc: Sense Key : Unit Attention [current]
Jan 30 03:00:27 Server123 kernel: [82509.382741] storvsc: Add. Sense: Changed operating definition
Jan 30 03:00:27 Server123 kernel: [82509.382811] sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.
Jan 30 03:00:27 Server123 Hyper-V VSS: VSS: thaw of /archive: Success
Jan 30 03:00:27 Server123 Hyper-V VSS: VSS: thaw of /: Success
Jan 30 03:00:57 Server123 kernel: [82539.424627] hv_storvsc vmbus_0_1: cmd 0x2a scsi status 0x2 srb status 0x82
Jan 30 03:00:57 Server123 kernel: [82539.424678] hv_storvsc vmbus_0_1: stor pkt ffff8800b89df468 autosense data valid - len 18
Jan 30 03:00:57 Server123 kernel: [82539.424683] storvsc: Sense Key : Unit Attention [current]
Jan 30 03:00:57 Server123 kernel: [82539.424689] storvsc: Add. Sense: Changed operating definition
Jan 30 03:00:57 Server123 kernel: [82539.424834] sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.
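
For anyone trying to sift these messages out of a long syslog, here is a minimal sketch (not part of any fix) that decodes the `hv_storvsc` error lines shown in the log excerpt above. The function name `decode_storvsc_line` and the lookup tables are made up for illustration; the tables only cover the values that appear in this thread (0x28 and 0x2a are the standard SCSI READ(10) and WRITE(10) opcodes, and SCSI status 0x2 is CHECK CONDITION). The srb status is returned as a raw number, not interpreted.

```python
import re

# Matches the hv_storvsc error line format seen in the log excerpt above.
STORVSC_RE = re.compile(
    r"hv_storvsc (?P<dev>\S+): cmd (?P<cmd>0x[0-9a-fA-F]+) "
    r"scsi status (?P<scsi>0x[0-9a-fA-F]+) srb status (?P<srb>0x[0-9a-fA-F]+)"
)

# Standard SCSI opcodes and status codes; only the ones seen in this thread.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
SCSI_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION"}


def decode_storvsc_line(line):
    """Return a dict describing one hv_storvsc error line, or None if no match."""
    m = STORVSC_RE.search(line)
    if m is None:
        return None
    cmd = int(m.group("cmd"), 16)
    scsi = int(m.group("scsi"), 16)
    return {
        "device": m.group("dev"),
        # Fall back to the raw hex value for codes not in the small tables.
        "command": OPCODES.get(cmd, hex(cmd)),
        "scsi_status": SCSI_STATUS.get(scsi, hex(scsi)),
        # Left undecoded on purpose; this sketch does not interpret it.
        "srb_status": int(m.group("srb"), 16),
    }
```

Feeding it each line of `/var/log/syslog` and keeping the non-`None` results gives a quick list of which reads and writes came back with CHECK CONDITION around the freeze/thaw window.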

David (c6d) wrote :

Yeah, an official fix would be nice. I am tired of getting up early to see if I have to reboot any VMs :)
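
Until a fixed kernel is installed, the morning check can be automated. Below is a minimal sketch, assuming the symptom is a filesystem that has been remounted read-only, as described in the bug description. The helper `read_only_mounts` is hypothetical; it only parses text in the `/proc/mounts` format (device, mount point, filesystem type, options) and reports mount points whose options include `ro`. The default list of filesystem types is an assumption and should be adjusted to the guest.

```python
def read_only_mounts(mounts_text, fstypes=("ext2", "ext3", "ext4", "xfs", "btrfs")):
    """Return mount points of the given filesystem types that are mounted read-only.

    mounts_text is the content of /proc/mounts, one mount per line:
    device, mount point, filesystem type, comma-separated options, ...
    """
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        # Compare whole option names so that e.g. "errors=remount-ro" is not a hit.
        if fstype in fstypes and "ro" in options.split(","):
            read_only.append(mountpoint)
    return read_only
```

On a guest this could be called from cron with the contents of `/proc/mounts` and wired to whatever alerting is already in place; a non-empty result means at least one disk-backed filesystem is read-only.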

Brian Vargyas (brianv) wrote :

So I'm the OP on this bug report from what seems like ages ago... While a lot of hard work has been done by some of the members posting here to try and replicate it, it's been very difficult to track the specific event that has been causing this. We "fixed" our problem a long time ago by moving all the affected VMs to one server and disabling Hyper-V nightly snapshots, instead opting to shut down all VMs once a month on a weekend, snapshot the whole system, and bring it back online, but it hasn't been optimal.

Lately, we've been upgrading the 14.04 LTS systems one at a time to 16.04 with the latest kernel and dropping them back into active nightly backup systems, and so far we have not seen a read-only file system failure on systems running the latest updates. These systems are not write-heavy, though, and heavy writes tend to trigger this bug more often. At least we've seen some improvement in our environment just by keeping up to date with the latest distribution updates. Outside of this, Microsoft has released Server 2016 in the meantime, and while we are not running it, it would be interesting to see if that Hyper-V server OS is any better or worse than 2012 R2.

Alex Ng (alexng-v) wrote :

Hi @jsalisbury,

Any status update on the patches for this issue?

It appears the test kernels have resolved the issue.

Let us know if you need additional testing or have questions about the patches.

Thanks,
Alex

Joseph Salisbury (jsalisbury) wrote :

Just waiting on test results for Trusty per comments #352 and #353. It sounds like the test kernels resolved the issue? If that is the case, I'll submit the SRU request.

Emsi (trash1-z) wrote :

I can confirm that the trusty patches are preventing the filesystem failure.

Joseph Salisbury (jsalisbury) wrote :

The patch from @faulpeltz posted in comment #345 has never landed in mainline. Are there plans for this patch to be sent upstream, or do we just want to include it as a SAUCE patch?

Joshua R. Poulson (jrp) wrote :

That patch was submitted upstream under the title "[PATCH] Tools: hv: recover after hv_vss_daemon freeze times out" but I don't see that it was committed. I'll poke around.

Joshua R. Poulson (jrp) wrote :

By the way, the "operation times should match host expectation" patch in #343 (and listed in #347) is upstream:

commit b357fd3908c1191f2f56e38aa77f2aecdae18bc8
Author: Alex Ng <email address hidden>
Date: Sun Nov 6 13:14:11 2016 -0800

    Drivers: hv: vss: Operation timeouts should match host expectation

    Increase the timeout of backup operations. When system is under I/O load,
    it needs more time to freeze. These timeout values should also match the
    host timeout values more closely.

    Signed-off-by: Alex Ng <email address hidden>
    Signed-off-by: K. Y. Srinivasan <email address hidden>
    Signed-off-by: Greg Kroah-Hartman <email address hidden>

Alex Ng (alexng-v) wrote :

The patch from @faulpeltz hasn't been mainlined because of some feedback that it shouldn't have to close and reopen the /dev/vmbus/hv_vss device after failure.

I addressed this comment in a modified version of @faulpeltz's patch (see comment #345).

I haven't heard from @faulpeltz whether he's tested it. From my testing, it seems fine and I expect to resubmit it upstream. If we can make it a SAUCE patch, that would be nice.

faulpeltz (mg-h) wrote :

Sorry guys, I had been swamped with other stuff and then simply forgot to test @alexng's patch.
I ran it overnight on my original test setup and it also worked for me.

Brad Figg (brad-figg)
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Yakkety):
status: In Progress → Fix Committed
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Zesty):
status: In Progress → Fix Released
Emsi (trash1-z) wrote :

How about Trusty?

Joseph Salisbury (jsalisbury) wrote :

Trusty requires quite a few prereq commits (see comment #353), so its SRU will be a little behind. However, it will be SRU'd shortly as well.

Joseph Salisbury (jsalisbury) wrote :

The fixes for Xenial, Yakkety and Zesty are committed or released.

However, I built one more Trusty test kernel with all of the prereq commits and the patches. Would it be possible to have this kernel tested one more time before it is SRU'd? It would be good to confirm no regressions are introduced in Trusty due to the number of changes needed. It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1470250/trusty/

Thadeu Lima de Souza Cascardo (cascardo) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Thadeu Lima de Souza Cascardo (cascardo) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-yakkety
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-75.96

---------------
linux (4.4.0-75.96) xenial; urgency=low

  * linux: 4.4.0-75.96 -proposed tracker (LP: #1684441)

  * [Hyper-V] hv: util: move waiting for release to hv_utils_transport itself
    (LP: #1682561)
    - Drivers: hv: util: move waiting for release to hv_utils_transport itself

linux (4.4.0-74.95) xenial; urgency=low

  * linux: 4.4.0-74.95 -proposed tracker (LP: #1682041)

  * [Hyper-V] hv: vmbus: Raise retry/wait limits in vmbus_post_msg()
    (LP: #1681893)
    - Drivers: hv: vmbus: Raise retry/wait limits in vmbus_post_msg()

linux (4.4.0-73.94) xenial; urgency=low

  * linux: 4.4.0-73.94 -proposed tracker (LP: #1680416)

  * CVE-2017-6353
    - sctp: deny peeloff operation on asocs with threads sleeping on it

  * vfat: missing iso8859-1 charset (LP: #1677230)
    - [Config] NLS_ISO8859_1=y

  * Regression: KVM modules should be on main kernel package (LP: #1678099)
    - [Config] powerpc: Add kvm-hv and kvm-pr to the generic inclusion list

  * linux-lts-xenial 4.4.0-63.84~14.04.2 ADT test failure with linux-lts-xenial
    4.4.0-63.84~14.04.2 (LP: #1664912)
    - SAUCE: apparmor: fix link auditing failure due to, uninitialized var

  * regession tests failing after stackprofile test is run (LP: #1661030)
    - SAUCE: fix regression with domain change in complain mode

  * Permission denied and inconsistent behavior in complain mode with 'ip netns
    list' command (LP: #1648903)
    - SAUCE: fix regression with domain change in complain mode

  * unexpected errno=13 and disconnected path when trying to open /proc/1/ns/mnt
    from a unshared mount namespace (LP: #1656121)
    - SAUCE: apparmor: null profiles should inherit parent control flags

  * apparmor refcount leak of profile namespace when removing profiles
    (LP: #1660849)
    - SAUCE: apparmor: fix ns ref count link when removing profiles from policy

  * tor in lxd: apparmor="DENIED" operation="change_onexec"
    namespace="root//CONTAINERNAME_<var-lib-lxd>" profile="unconfined"
    name="system_tor" (LP: #1648143)
    - SAUCE: apparmor: Fix no_new_privs blocking change_onexec when using stacked
      namespaces

  * apparmor oops in bind_mnt when dev_path lookup fails (LP: #1660840)
    - SAUCE: apparmor: fix oops in bind_mnt when dev_path lookup fails

  * apparmor auditing denied access of special apparmor .null fi\ le
    (LP: #1660836)
    - SAUCE: apparmor: Don't audit denied access of special apparmor .null file

  * apparmor label leak when new label is unused (LP: #1660834)
    - SAUCE: apparmor: fix label leak when new label is unused

  * apparmor reference count bug in label_merge_insert() (LP: #1660833)
    - SAUCE: apparmor: fix reference count bug in label_merge_insert()

  * apparmor's raw_data file in securityfs is sometimes truncated (LP: #1638996)
    - SAUCE: apparmor: fix replacement race in reading rawdata

  * unix domain socket cross permission check failing with nested namespaces
    (LP: #1660832)
    - SAUCE: apparmor: fix cross ns perm of unix domain sockets

  * Xenial update to v4.4.59 stable release (LP: #1678960)
    - xfrm: policy: init locks early
    - virtio_balloon: init ...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-49.52

---------------
linux (4.8.0-49.52) yakkety; urgency=low

  * linux: 4.8.0-49.52 -proposed tracker (LP: #1684427)

  * [Hyper-V] hv: util: move waiting for release to hv_utils_transport itself
    (LP: #1682561)
    - Drivers: hv: util: move waiting for release to hv_utils_transport itself

linux (4.8.0-48.51) yakkety; urgency=low

  * linux: 4.8.0-48.51 -proposed tracker (LP: #1682034)

  * [Hyper-V] hv: vmbus: Raise retry/wait limits in vmbus_post_msg()
    (LP: #1681893)
    - Drivers: hv: vmbus: Raise retry/wait limits in vmbus_post_msg()

linux (4.8.0-47.50) yakkety; urgency=low

  * linux: 4.8.0-47.50 -proposed tracker (LP: #1679678)

  * CVE-2017-6353
    - sctp: deny peeloff operation on asocs with threads sleeping on it

  * CVE-2017-5986
    - sctp: avoid BUG_ON on sctp_wait_for_sndbuf

  * vfat: missing iso8859-1 charset (LP: #1677230)
    - [Config] NLS_ISO8859_1=y

  * [Hyper-V] pci-hyperv: Use device serial number as PCI domain (LP: #1667527)
    - net/mlx4_core: Use cq quota in SRIOV when creating completion EQs

  * Regression: KVM modules should be on main kernel package (LP: #1678099)
    - [Config] powerpc: Add kvm-hv and kvm-pr to the generic inclusion list

  * linux-lts-xenial 4.4.0-63.84~14.04.2 ADT test failure with linux-lts-xenial
    4.4.0-63.84~14.04.2 (LP: #1664912)
    - SAUCE: apparmor: fix link auditing failure due to, uninitialized var

  * regession tests failing after stackprofile test is run (LP: #1661030)
    - SAUCE: fix regression with domain change in complain mode

  * Permission denied and inconsistent behavior in complain mode with 'ip netns
    list' command (LP: #1648903)
    - SAUCE: fix regression with domain change in complain mode

  * unexpected errno=13 and disconnected path when trying to open /proc/1/ns/mnt
    from a unshared mount namespace (LP: #1656121)
    - SAUCE: apparmor: null profiles should inherit parent control flags

  * apparmor refcount leak of profile namespace when removing profiles
    (LP: #1660849)
    - SAUCE: apparmor: fix ns ref count link when removing profiles from policy

  * tor in lxd: apparmor="DENIED" operation="change_onexec"
    namespace="root//CONTAINERNAME_<var-lib-lxd>" profile="unconfined"
    name="system_tor" (LP: #1648143)
    - SAUCE: apparmor: Fix no_new_privs blocking change_onexec when using stacked
      namespaces

  * apparmor oops in bind_mnt when dev_path lookup fails (LP: #1660840)
    - SAUCE: apparmor: fix oops in bind_mnt when dev_path lookup fails

  * apparmor auditing denied access of special apparmor .null fi\ le
    (LP: #1660836)
    - SAUCE: apparmor: Don't audit denied access of special apparmor .null file

  * apparmor label leak when new label is unused (LP: #1660834)
    - SAUCE: apparmor: fix label leak when new label is unused

  * apparmor reference count bug in label_merge_insert() (LP: #1660833)
    - SAUCE: apparmor: fix reference count bug in label_merge_insert()

  * apparmor's raw_data file in securityfs is sometimes truncated (LP: #1638996)
    - SAUCE: apparmor: fix replacement race in reading rawdata

  * unix domain socket cross permission check failing with n...

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
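
To check whether a guest is already at or past the package versions named in the two Launchpad Janitor messages above (4.4.0-75.96 for Xenial, 4.8.0-49.52 for Yakkety), something like the following sketch may help. It assumes the running kernel release has the usual `uname -r` shape, for example `4.4.0-75-generic`, and that comparing the ABI number after the dash against 75 or 49 is good enough. The names `FIXED_ABI`, `parse_release`, and `has_fix` are illustrative only, and kernels from other series (Trusty's 3.13, HWE or custom builds) are reported as unknown, not guessed at.

```python
import re

# Upstream base version -> first ABI number reported as fixed in this bug.
FIXED_ABI = {
    "4.4.0": 75,  # linux 4.4.0-75.96 (xenial)
    "4.8.0": 49,  # linux 4.8.0-49.52 (yakkety)
}


def parse_release(release):
    """Split a release string like '4.4.0-75-generic' into ('4.4.0', 75)."""
    m = re.match(r"^(\d+\.\d+\.\d+)-(\d+)", release)
    if m is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return m.group(1), int(m.group(2))


def has_fix(release):
    """Return True/False for known series, or None if the series is not listed."""
    base, abi = parse_release(release)
    if base not in FIXED_ABI:
        return None
    return abi >= FIXED_ABI[base]
```

On a guest, `has_fix(platform.release())` (after `import platform`) gives the answer for the running kernel; a `None` result means this table says nothing about that series.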
Brad Figg (brad-figg)
Changed in linux (Ubuntu Trusty):
status: In Progress → Won't Fix
Brad Figg (brad-figg)
tags: added: cscc