[Hyper-V] Kernel patches for storvsc

Bug #1445195 reported by Robert C Jennings on 2015-04-16
This bug affects 15 people
Affects           Importance   Assigned to
linux (Ubuntu)    High         Andy Whitcroft
Trusty            High         Andy Whitcroft
Utopic            High         Andy Whitcroft
Vivid             High         Andy Whitcroft
Wily              High         Andy Whitcroft

Bug Description

Storage driver performance updates for vivid

K. Y. Srinivasan (7):
  scsi: storvsc: Increase the ring buffer size
  scsi: storvsc: Size the queue depth based on the ringbuffer size
  scsi: storvsc: Always send on the selected outgoing channel
  scsi: storvsc: Retrieve information about the capability of the target
  scsi: storvsc: Fix a bug in copy_from_bounce_buffer()
  scsi: storvsc: Don't assume that the scatterlist is not chained
  scsi: storvsc: Set the tablesize based on the information given by the host

tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Vivid):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Andy Whitcroft (apw)
Joseph Salisbury (jsalisbury) wrote :

I built a Vivid test kernel with the 7 requested commits. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1445195/

Can you test this kernel and see if it resolves this bug?

I'll also build a test kernel for bug 1452074 and post a link to it in that bug report.

Joshua R. Poulson (jrp) wrote :

We will test this kernel. This needs to go to 15.10, 15.04, 14.10, 14.04, and 14.04 HWE.

Changed in linux (Ubuntu Utopic):
assignee: nobody → Andy Whitcroft (apw)
Changed in linux (Ubuntu Trusty):
assignee: nobody → Andy Whitcroft (apw)
Changed in linux (Ubuntu Precise):
assignee: nobody → Andy Whitcroft (apw)
Changed in linux (Ubuntu Utopic):
importance: Undecided → High
Changed in linux (Ubuntu Trusty):
importance: Undecided → High
Changed in linux (Ubuntu Precise):
importance: Undecided → High
Changed in linux (Ubuntu Utopic):
status: New → Triaged
Changed in linux (Ubuntu Trusty):
status: New → Triaged
Changed in linux (Ubuntu Precise):
status: New → Triaged
no longer affects: linux (Ubuntu Precise)
Frederik Bosch (f-bosch) wrote :

The problem we encountered with HyperV backups, filed here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1456985, should be solved by these patches too.

Our bug is an exact copy of the initial bug that was reported here: https://social.technet.microsoft.com/Forums/windowsserver/en-US/8807f61c-565e-45bc-abc4-af09abf59de2/ubuntu-14042-lts-generation-2-scsi-errors-on-vss-based-backups.

Joseph Salisbury (jsalisbury) wrote :

We now have test kernels available for Trusty, Utopic and Vivid. They are all available from:
http://kernel.ubuntu.com/~jsalisbury/lp1445195/

It would be great if you could test the kernels for all releases.

Joseph Salisbury (jsalisbury) wrote :

It sounds like the following commit is also needed to solve bug 1456985 :

commit dc45708ca9988656d706940df5fd102672c5de92
Author: K. Y. Srinivasan <email address hidden>
Date: Fri May 1 11:03:02 2015 -0700

    storvsc: Set the SRB flags correctly when no data transfer is needed

That commit was requested in bug 1454758 . I'm going to mark bug 1454758 as a duplicate of this bug and build all of the commits requested in that bug and this bug together in a test kernel.

Joseph Salisbury (jsalisbury) wrote :

I built Trusty, Utopic and Vivid test kernels, with the commits requested in this bug, any pre-reqs and commit dc45708ca per bug 1454758 . The details are as follows:

Trusty:
  dc45708 storvsc: Set the SRB flags correctly when no data transfer is needed <- Added per bug 1454758

  b9ec3a5 scsi: storvsc: Increase the ring buffer size
  f458aad scsi: storvsc: Size the queue depth based on the ringbuffer size
  0147dab scsi: storvsc: Always send on the selected outgoing channel
  5117b93 scsi: storvsc: Retrieve information about the capability of the target
  8de5807 scsi: storvsc: Fix a bug in copy_from_bounce_buffer()
  aaced99 scsi: storvsc: Don't assume that the scatterlist is not chained
  be0cf6c scsi: storvsc: Set the tablesize based on the information given by the host

  d61031e Drivers: hv: vmbus: Support a vmbus API for efficiently sending page arrays <- Pre-req needed for all releases.
  011a7c3 Drivers: hv: vmbus: Cleanup the packet send path <- Pre-req only needed for Trusty.

Utopic:
  dc45708 storvsc: Set the SRB flags correctly when no data transfer is needed <- Added per bug 1454758

  b9ec3a5 scsi: storvsc: Increase the ring buffer size
  f458aad scsi: storvsc: Size the queue depth based on the ringbuffer size
  0147dab scsi: storvsc: Always send on the selected outgoing channel
  5117b93 scsi: storvsc: Retrieve information about the capability of the target
  8de5807 scsi: storvsc: Fix a bug in copy_from_bounce_buffer()
  aaced99 scsi: storvsc: Don't assume that the scatterlist is not chained
  be0cf6c scsi: storvsc: Set the tablesize based on the information given by the host

  d61031e Drivers: hv: vmbus: Support a vmbus API for efficiently sending page arrays <- Pre-req needed for all releases.

Vivid:
  dc45708 storvsc: Set the SRB flags correctly when no data transfer is needed <- Added per bug 1454758

  b9ec3a5 scsi: storvsc: Increase the ring buffer size
  f458aad scsi: storvsc: Size the queue depth based on the ringbuffer size
  0147dab scsi: storvsc: Always send on the selected outgoing channel
  5117b93 scsi: storvsc: Retrieve information about the capability of the target
  8de5807 scsi: storvsc: Fix a bug in copy_from_bounce_buffer()
  aaced99 scsi: storvsc: Don't assume that the scatterlist is not chained
  be0cf6c scsi: storvsc: Set the tablesize based on the information given by the host

  d61031e Drivers: hv: vmbus: Support a vmbus API for efficiently sending page arrays <- Pre-req needed for all releases.
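
The three lists differ only in the Trusty-only pre-req. A minimal sketch of that bookkeeping, using the abbreviated hashes listed above; the helper names `extra_commits` and `missing_commits` are hypothetical and not part of any Ubuntu tooling:

```python
# Abbreviated commit hashes copied from the per-release lists in this comment.
STORVSC = [
    "dc45708", "b9ec3a5", "f458aad", "0147dab",
    "5117b93", "8de5807", "aaced99", "be0cf6c",
]
PREREQ_ALL = ["d61031e"]     # pre-req needed for all releases
PREREQ_TRUSTY = ["011a7c3"]  # pre-req only needed for Trusty

COMMITS = {
    "trusty": STORVSC + PREREQ_ALL + PREREQ_TRUSTY,
    "utopic": STORVSC + PREREQ_ALL,
    "vivid": STORVSC + PREREQ_ALL,
}


def extra_commits(release, baseline="vivid"):
    """Commits listed for `release` that the baseline release does not need."""
    base = set(COMMITS[baseline])
    return [c for c in COMMITS[release] if c not in base]


def missing_commits(release, applied):
    """Commits expected for `release` that are absent from `applied`."""
    have = set(applied)
    return [c for c in COMMITS[release] if c not in have]


if __name__ == "__main__":
    print(extra_commits("trusty"))
    print(missing_commits("vivid", STORVSC[:5] + STORVSC[6:] + PREREQ_ALL))
```

A check like `missing_commits` could be run against the hashes actually present in a test build before it is posted.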

The test kernels can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1445195/

Can you test these kernels and post back the results?

Thanks in advance!

Changed in linux (Ubuntu Precise):
importance: Undecided → High
status: New → Triaged
assignee: nobody → Andy Whitcroft (apw)
Joshua R. Poulson (jrp) wrote :

Testing in progress.

Frederik Bosch (f-bosch) wrote :

Testing in progress.

Frederik Bosch (f-bosch) wrote :

No luck with the current build.

Linux VPS-Genkgo 3.19.0-18-generic #18 SMP Wed May 20 17:57:26 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

First backup was successful, second one failed.

[ 2379.049088] sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
[ 2379.049109] sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
[ 2379.049131] sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.
[ 3439.820142] sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
[ 3439.820163] sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
[ 3439.820341] sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.
[ 3439.820350] blk_update_request: I/O error, dev sda, sector 201644056
[ 3439.820394] Aborting journal on device sda1-8.
[ 3439.824253] EXT4-fs error (device sda1): ext4_journal_check_start:56: Detected aborted journal
[ 3439.824285] EXT4-fs (sda1): Remounting filesystem read-only

Frederik Bosch (f-bosch) wrote :

Another confirmation that the reported bugs are not fixed in the vivid build. We have had another crash this morning, same log messages. It took quite some hours longer than yesterday though, which confirms the occurrences are still random (we are not able to see the pattern).

Joseph Salisbury (jsalisbury) wrote :

@Joshua R. Poulson , Are you still seeing issues as well with the test kernel?

Joshua R. Poulson (jrp) wrote :

We did not reproduce this issue with our performance testing, but it is difficult to reproduce. We have a query into our development team to analyze f-bosch's reports.

Frederik Bosch (f-bosch) wrote :

@jrp: did you test the 3.19 kernel or did you use the Trusty and/or Utopic build?

Joseph Salisbury (jsalisbury) wrote :

I think I may have found an issue. The test kernel I built did not contain:
  8de5807 scsi: storvsc: Fix a bug in copy_from_bounce_buffer()

That commit is getting applied to the Vivid kernel when it gets the upstream 3.19.7 updates. Because of this I didn't cherry pick it. That commit is only applied to the Vivid master-next branch and I built with the master branch.

I'll build a new test kernel and ensure this commit is included.

Frederik Bosch (f-bosch) wrote :

@jrp: could you confirm that this might solve my issue with the 3.19 kernel? Is that specific commit required?

Because I was still seeing "Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters." with every backup I did.

Joshua R. Poulson (jrp) wrote :

@jsalisbury: yes, that patch is important, but it is difficult to exercise.

@f-bosch: the missing patch might; it really depends on whether your I/O pattern exercises the bug. That warning, however, is benign. We tested our submission upstream as a set, not individually.
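
To separate the benign warning from the lines that actually precede a read-only remount, the kernel log can be filtered by message text. A rough sketch, assuming the message strings quoted earlier in this thread; the function names are hypothetical, and the pattern lists are not exhaustive:

```python
import re

# Messages described as benign in this thread (Unit Attention notifications).
BENIGN = [
    re.compile(r"Sense Key : Unit Attention"),
    re.compile(r"Add\. Sense: Changed operating definition"),
    re.compile(r"operating parameters on this target have changed"),
]

# Messages seen when the filesystem ended up read-only.
FATAL = [
    re.compile(r"blk_update_request: I/O error"),
    re.compile(r"Aborting journal on device"),
    re.compile(r"Detected aborted journal"),
    re.compile(r"Remounting filesystem read-only"),
]


def classify(line):
    """Return 'fatal', 'benign' or 'other' for one kernel log line."""
    if any(p.search(line) for p in FATAL):
        return "fatal"
    if any(p.search(line) for p in BENIGN):
        return "benign"
    return "other"


def has_failure(lines):
    """True if any line matches one of the failure patterns."""
    return any(classify(line) == "fatal" for line in lines)
```

Feeding it the output of `dmesg` line by line would flag the second backup in the log above, but not a backup that only produced the Unit Attention warning.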

Joseph Salisbury (jsalisbury) wrote :

I confirmed the Utopic and Trusty test kernels I posted did in fact have that patch.

I rebuilt the Vivid test kernel with the master-next branch and confirmed it does have that patch now. It is posted to:
http://kernel.ubuntu.com/~jsalisbury/lp1445195/vivid/

Chris Valean (cvalean) wrote :

Joseph, does the new Trusty kernel built yesterday have any changes?
I tried to install it on top of an up-to-date 14.04.2. The kernel installed fine, but the VM fails to boot after a reboot.

Files used: http://kernel.ubuntu.com/~jsalisbury/lp1445195/trusty/

Looking now to get a serial log and to see if it's reproducible on other builds.

Frederik Bosch (f-bosch) wrote :

We are starting backup tests with new Vivid kernel now.

Frederik Bosch (f-bosch) wrote :

One thing I have noticed already is new log entries regarding received packets.

May 27 16:30:18 Hyper-V VSS: VSS: op=FREEZE: succeeded
May 27 16:30:18 Hyper-V VSS: VSS: op=FREEZE: succeeded
May 27 16:30:18 Hyper-V VSS: Received packet from untrusted pid:2102
May 27 16:30:18 Hyper-V VSS: Received packet from untrusted pid:1086
May 27 16:30:18 Hyper-V VSS: VSS: op=THAW: succeeded
May 27 16:30:18 Hyper-V VSS: VSS: op=THAW: succeeded
May 27 16:30:18 Hyper-V VSS: Received packet from untrusted pid:2102
May 27 16:30:18 Hyper-V VSS: Received packet from untrusted pid:1086

And pids (2102 and 1086) are hv_vss_daemon processes.

root 1086 0.0 0.0 4332 1512 ? Ss 16:04 0:00 /usr/lib/linux-tools/3.19.0-19-generic/hv_vss_daemon
root 2102 0.0 0.0 4332 1516 ? Ss 16:06 0:00 /usr/lib/linux-tools/3.19.0-19-generic/hv_vss_daemon

Current kernel: 3.19.0-19-generic #19~lp1445195 SMP Tue May 26 17:43:16 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux.

Frederik Bosch (f-bosch) wrote :

In addition to the previous message, there seemed to be two instances each of the vss and kvp daemons running (while there is only one instance of fcopy). This is weird (I think), but I have seen it before with other kernels. A reboot helps in that case. This could also explain the "Received packet from untrusted pid" message. Not on topic, but I wonder why there were two vss and kvp daemons running.
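
A quick way to spot duplicated daemons is to count them in `ps aux` output. A small sketch, assuming the 11-column layout shown above; only `hv_vss_daemon` appears in this thread, so the names `hv_kvp_daemon` and `hv_fcopy_daemon` are assumed to follow the same pattern, and the helper names are hypothetical:

```python
import os
from collections import Counter

# hv_vss_daemon is taken from the ps output above; the other two names are
# assumptions based on the same naming pattern.
DAEMONS = ("hv_vss_daemon", "hv_kvp_daemon", "hv_fcopy_daemon")


def count_daemons(ps_output):
    """Count Hyper-V daemon instances in `ps aux`-style text."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # 11th field is the full command line
        if len(fields) < 11:
            continue
        executable = fields[10].split()[0]
        name = os.path.basename(executable)
        if name in DAEMONS:
            counts[name] += 1
    return counts


def duplicated(ps_output):
    """Return only the daemons that run more than once."""
    return {n: c for n, c in count_daemons(ps_output).items() if c > 1}
```

On the two `ps` lines quoted above, `duplicated` reports `hv_vss_daemon` with a count of 2.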

Joshua R. Poulson (jrp) wrote :

cloud-tools is tied to the kernel version, you may have ended up with two instances of it.

Frederik Bosch (f-bosch) wrote :

@jrp: Ok. After reboot it was fine: just one process per daemon. However, now I do not see any logs of FREEZE and THAW anymore (while backups are successful). I do see the warning that the operating parameters have changed. Probably nothing to worry about.

Nonetheless, being a programmer myself, I do like to see consistency in the logs. It usually helps in solving the problem. And when the logs are not consistent, it is often an indication that something is wrong.

Brian Vargyas (brianv) wrote :

I downloaded the 3.16-39 kernels and files and applied them to a test system we have running on Hyper-V. I'm not able to get the -39 kernel to fully boot. The startup hangs on the following line:

EXT4-fs (sda2) re-mounted. Opts: errors=remount-ro

If I restart the system with kernel -38, I can boot, but of course without the patches here. I'm running 14.04 LTS with the 3.16 kernel, but I could try the 3.13 kernel or 3.19. Another problem with the install is that linux-tools required a different revision of binutils; from what I could tell, it wanted a revision lower than what I had. I ignored it, but that shouldn't be the reason the system won't finish mounting the file system and booting.

Frederik Bosch (f-bosch) wrote :

@brianv, you should run dpkg --ignore-depends=binutils -i *.deb to install. The binutils update can be ignored (if you look at the changelog it is a minor version bump without api changes). Maybe you should try the 3.19 kernel: we do not have any booting problems while we are also on trusty.

Brian Vargyas (brianv) wrote :

Thanks! So, the 3.19 kernel runs fine. Something is wrong with the 3.16 build. I show the VSS and KVP daemons registered as well. I'm going to let this run a few days and see what happens. I should start to see the errors during VSS backup tonight if they occur. I may apply this kernel to a few other production VMs that regularly go into RO filesystem mode overnight, but I'll run it on our test VM for now.

Joshua R. Poulson (jrp) wrote :

@brianv I would avoid taking this to production. When these patches go into -proposed I'll recommend testing again with the rest of proposed updates to other packages, and other kernel changes that come from other bugs, and that would be a better time to consider broader tests. I would recommend avoiding even -proposed for production use.

Frederik Bosch (f-bosch) wrote :

Our tests are positive so far. Already creating backups for 19 hours, with a backup every half an hour.

Brian Vargyas (brianv) wrote :

1st Night Backups Ran okay, with no strange SCSI errors. The following was logged into syslog:

May 28 00:01:22 test Hyper-V VSS: VSS: op=FREEZE: succeeded
May 28 00:01:21 test kernel: [22188.912452] sd 0:0:0:0: [storvsc] Sense Key : Unit Attention [current]
May 28 00:01:21 test kernel: [22188.912465] sd 0:0:0:0: [storvsc] Add. Sense: Changed operating definition
May 28 00:01:21 test kernel: [22188.912596] sd 0:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.
May 28 00:01:21 test Hyper-V VSS: VSS: op=THAW: succeeded
May 28 00:17:01 test CRON[1323]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 28 01:17:01 test CRON[1329]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
May 28 01:36:17 test kernel: [27884.812212] sd 0:0:0:0: [storvsc] Sense Key : Unit Attention [current]
May 28 01:36:17 test kernel: [27884.812229] sd 0:0:0:0: [storvsc] Add. Sense: Changed operating definition
May 28 01:36:17 test kernel: [27884.812361] sd 0:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.

So we're still getting the messages about, hey, something changed on the Disk, but I can live with that for the price of a proper VSS snapshot.
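
One way to check that the snapshot sequence completed is to confirm that every FREEZE in syslog has a matching THAW. A minimal sketch, assuming the "Hyper-V VSS: VSS: op=...: succeeded" format shown above; the function name is hypothetical, and any non-"succeeded" status is simply ignored because none appears in this thread:

```python
import re

# Matches the VSS daemon lines quoted above, e.g.
# "May 28 00:01:22 test Hyper-V VSS: VSS: op=FREEZE: succeeded"
VSS_OP = re.compile(r"Hyper-V VSS: VSS: op=(FREEZE|THAW): succeeded")


def unmatched_freezes(lines):
    """Return how many successful FREEZE ops have no later successful THAW."""
    depth = 0
    for line in lines:
        match = VSS_OP.search(line)
        if not match:
            continue
        if match.group(1) == "FREEZE":
            depth += 1
        elif depth > 0:  # THAW closes the most recent open FREEZE
            depth -= 1
    return depth
```

The counter works in file order rather than by timestamp, since the syslog excerpt above shows the FREEZE and THAW lines stamped a second apart in reverse order.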

Again, this is with the 3.19 -39 kernel on 14.04.2 LTS. I'll keep an eye on it. I guess I'll wait for production changes, but it's so tempting :-)

Frederik Bosch (f-bosch) wrote :

@brianv We are seeing exactly the same. Backups are not crashing while we are still creating backups every half an hour. Our perception is that the bug is fixed, but we are going to add more boxes to see what happens then.

Joseph Salisbury (jsalisbury) wrote :

@Chris Valean, Are you still seeing issues with the Trusty test kernel?

@Brian and @Frederik, Did you also test the Trusty and Utopic test kernels, or just Vivid?

Thanks again for all the help!

Frederik Bosch (f-bosch) wrote :

@jsalisbury: I only tested the Vivid build.

Frederik Bosch (f-bosch) wrote :

Very unfortunate but the bug has been triggered again.

[154272.293488] sd 2:0:0:0: [storvsc] Sense Key : Unit Attention [current]
[154272.293508] sd 2:0:0:0: [storvsc] Add. Sense: Changed operating definition
[154272.293665] sd 2:0:0:0: Warning! Received an indication that the operating parameters on this target have changed. The Linux SCSI layer does not automatically adjust these parameters.
[154272.293671] blk_update_request: I/O error, dev sda, sector 201805560
[154272.293718] Aborting journal on device sda1-8.
[154272.314119] EXT4-fs error (device sda1): ext4_journal_check_start:56: Detected aborted journal
[154272.314154] EXT4-fs (sda1): Remounting filesystem read-only

Frederik Bosch (f-bosch) wrote :

Build: 3.19.0-19-generic #19~lp1445195 SMP Tue May 26 17:43:16 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Frederik Bosch (f-bosch) wrote :

@jrp Could you indicate exactly which commits you tested on your side? I understand from jsalisbury that the base for the current kernel is the master-next branch. He applied the patches mentioned in his earlier post to that kernel. Are we missing any other commits?

Another issue filed on read-only with Hyper-V also still occurs, see https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1439780. Can that issue be related to this issue?

Frederik Bosch (f-bosch) wrote :

@jrp Do you have any suggestions on my remarks above? After the machine went into read-only we rebooted and started the whole sequence again. We have not had any problems since then. So it seems the window of opportunity for the bug to manifest has decreased drastically. However, I also think it has not been fixed completely.

Joshua R. Poulson (jrp) wrote :

@f-bosch We've tested all of the upstream commits as we regularly test linux-next on Hyper-V (we don't submit patches to Ubuntu until they are accepted upstream except in extreme situations).

We are continuing to investigate reports of problems, but we are having difficulty reproducing. I think this current patchset should go through, as it greatly reduces the chances of the problem.

Frederik Bosch (f-bosch) wrote :

@jrp I have tried to discover if there were any differences between the backup that failed and the other ones. I noticed this one thing. Before or during backup (cannot recall exactly atm) a 'composer install' process was running. Composer is a php package manager. It downloads and deploys many small files on the filesystem (download of multiple tar.gz files that are being unpacked). This machine normally does nothing more than serving http requests. It is the opposite of a high-load server. So the composer process is an outlier compared to the usual operations. Maybe this helps to reproduce the io pattern.

Thanks anyway for the work already. It makes a difference for sure. Hope there will be a final solution anytime soon.

Joshua R. Poulson (jrp) wrote :

The Utopic kernel does not work for us, it crashes. Continuing with Vivid testing.

Joshua R. Poulson (jrp) wrote :

[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 3.16.0-39-generic (root@gloin) (gcc version 4.9.1 (Ubuntu 4.9.1-16ubuntu6) ) #53~lp1445195 SMP Tue May 26 19:27:10 UTC 2015 (Ubuntu 3.16.0-39.53~lp1445195-generic 3.16.7-ckt11)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.16.0-39-generic root=UUID=95966cd1-2fc3-498d-b72d-721997da6608 ro console=tty1 console=ttyS0 earlyprintk=ttyS0 rootdelay=300 nomdmonddf nomdmonisw
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Centaur CentaurHauls
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x0000000000...

Frederik Bosch (f-bosch) wrote :

@jrp How did your Vivid testing do? Were you able to find any particularities?

Joseph Salisbury (jsalisbury) wrote :

@Frederik, would it be possible for you to test the Utopic(3.16) kernel to see if it fails to boot for you like Brian mentioned on comment #27?

Frederik Bosch (f-bosch) wrote :

@jsalisbury I guess so, I will try it asap.

Stephen A. Zarkos (stevez) wrote :

The test team has tested the Vivid kernel and confirmed that it works as expected. So let's please ack these patches for the next Vivid kernel.

The Utopic kernel provided crashes at boot time, so we're unable to test that. Can you help us debug that issue?

Frederik Bosch (f-bosch) wrote :

@stevez That means you were not able to replicate the read-only issue we had during our tests of the patched Vivid kernel?

Joshua R. Poulson (jrp) wrote :

@f-bosch No, we have not seen that in our testing.

Brian Vargyas (brianv) wrote :

Our test system running 14.04 LTS with the Vivid -19 patched kernel has been running okay for 14 days now without dropping into read-only, while non-patched systems have dropped into read-only 3-4 times during this time frame. I've been unable to get the Utopic kernel to work; it crashes as Stephen mentions. The system I have this test kernel running on is very lightly used with very few disk writes going on, so it's hardly a real-world test. I'm going to install the patched kernel on one of our database systems that goes read-only on a frequent basis and monitor what happens.

Frederik Bosch (f-bosch) wrote :

@brianv We will do similar tests this weekend and next week. Since we actually faced the read-only mode once more, I am not feeling comfortable yet with the current solution. I am going to try to reproduce the io pattern that we had when the system went into read-only. It was something like `git clone` followed by `composer install` (PHP package manager) followed by a php script that creates and migrates a MariaDB database.

Joseph Salisbury (jsalisbury) wrote :

At some point it was believed that bug 1454758 was a duplicate of this bug, or that this patch set would also fix bug 1454758. See comment #5. However, the original purpose of this bug was to get storage driver performance updates into Vivid.

The current testing seems to be more specific to bug 1454758 and is focused on backup failures and not the performance improvements.

Maybe we should split these two bugs back apart so the patches listed in the original description can be tested to measure the performance improvement?

Then we can focus on getting the backup issue resolved and tested back in bug 1454758

It's good to hear the patches do what is expected per comment #47. I'll start the SRU process for Vivid. I'll also look into building another Utopic test kernel since it's failing to boot.

Brad Figg (brad-figg) on 2015-06-16
Changed in linux (Ubuntu Vivid):
status: Triaged → Fix Committed
Joseph Salisbury (jsalisbury) wrote :

I built a new utopic test kernel which can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1445195/utopic/

@Stephen can you check whether this kernel boots now and delivers the expected performance gains?

Stephen A. Zarkos (stevez) wrote :

We'll test this new kernel. It appears the directory "/~jsalisbury/lp1445195/utopic" is empty, but it looks like the kernel was placed in the "/~jsalisbury/lp1445195" directory instead. I assume this is the correct kernel, but to be safe can you confirm?

Thanks!
Steve

Joseph Salisbury (jsalisbury) wrote :

Yes that is the correct kernel. I've moved the files to the utopic directory now.

Stephen A. Zarkos (stevez) wrote :

We tested the Utopic kernel and it works fine now. No crashes or issues and performance is better.

Thanks!

Joseph Salisbury (jsalisbury) wrote :

Thanks for the feedback, Stephen. I'll start the SRU process for Utopic.

I'm also working on backports for Trusty and Precise. Just working to identify all the needed prereq commits.

Thanks again!

Joseph Salisbury (jsalisbury) wrote :

I built a Trusty test kernel with the requested patches and a few prerequisites. It can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1445195/trusty

@Stephen can you check whether this kernel boots now and delivers the expected performance gains?

Brian Vargyas (brianv) wrote :

I wanted to give an update into testing the Vivid kernel on 14.04 LTS re: https://bugs.launchpad.net/bugs/1454758

Our non-critical database production server ran for almost two weeks, but this morning finally died when the file system went RO overnight due to VSS backups. This is with the vivid-19 kernel patches posted here. This also confirms @f-bosch's findings that he is still having problems with the new kernel. I went back to https://bugs.launchpad.net/bugs/1454758 and it looks like it's still listed as a duplicate of this bug ID. Given that @jsalisbury wants to keep this on topic for performance improvements, I'm not sure where that leaves those of us who are more concerned about the reliability aspects of these improvements than the performance.

Looking for best next steps on this one.

ubuntu (h-lbuntu-2) wrote :

Been running this kernel on Utopic 14.10 since it was released. The new kernel reduced the frequency of RO errors from almost daily to about once a week or so but has not eliminated them.

Ditto to previous poster that we're more concerned about reliability than performance.

NicholasC (5-nicholas) wrote :

I'm echoing the previous poster's sentiments:

Reliability > Performance

Joseph Salisbury (jsalisbury) wrote :

I agree it's best to focus on performance in this bug, and use bug 1454758 to focus on the reliability issues.

Joseph Salisbury (jsalisbury) wrote :

For those affected by the reliability issues and not performance, we will be using bug 1454758 or bug 1452074 to focus on that issue.

Joshua R. Poulson (jrp) wrote :

@f-bosch, @brianv please paste the relevant logs into https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1452074 and we will continue the reliability discussion there. It could be the networking piece is having a side effect.

Joseph Salisbury (jsalisbury) wrote :

A new bug was opened: bug 1470250 . Let's now use that bug to focus on the VSS Backup issues. That way we can use bug 1452074 to focus on the patch that affects networking and not storage.

Stephen A. Zarkos (stevez) wrote :

@Joseph, I swear I thought I responded to this bug, sorry for the delay. We have finished testing the Trusty kernel you posted at http://kernel.ubuntu.com/~jsalisbury/lp1445195/trusty, and confirmed performance is better.

Thanks!
Steve

Stephen A. Zarkos (stevez) wrote :

Hi Joseph,

Do you know whether these patches will be available in the next kernel maintenance release?

Thanks!
Steve

Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-vivid' to 'verification-done-vivid'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-vivid
Stephen A. Zarkos (stevez) wrote :

Our team has tested the Vivid proposed kernel and found no regressions. Performance is comparable to previous test kernels.

Thanks!

tags: added: verification-done-vivid
removed: verification-needed-vivid
Dustin (dander88) wrote :

We just saw this same error on a GEN 1 - Linux 14.04 LTS as well, anyone else seeing this on GEN 1?

Frederik Bosch (f-bosch) wrote :

@dander88 Are you talking about the backup bug? Please report it in bug 1470250.

Dustin (dander88) wrote :

@F-Bosch - yep that VSS backup issue - I will report it in that spot. Thanks

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.19.0-23.24

---------------
linux (3.19.0-23.24) vivid; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1472346

  [ Chris J Arges ]

  * SAUCE: Don't use atomic read in evlist.c
    - LP: #1410673

linux (3.19.0-23.23) vivid; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1472048

  [ Chris J Arges ]

  * [Config] Add CRYPTO_DEV_NX_*, 842_* as modules
    - LP: #1454687

  [ Lu, Han ]

  * SAUCE: i915_bpo: drm/i915/audio: add codec wakeup override
    enabled/disable callback
    - LP: #1460674

  [ Timo Aaltonen ]

  * SAUCE: Backport I915_OVERLAY_DISABLE_DEST_COLORKEY
    - LP: #1460674
  * SAUCE: i915_bpo: Rebase to drm-intel-next-fixes-2015-05-29
    - LP: #1460674
  * SAUCE: i915_bpo: Revert "drm/i915: Implement the intel_dp_autotest_edid
    function for DP EDID complaince tests"
    - LP: #1460674
  * SAUCE: i915_bpo: Revert "drm/i915: Add debugfs test control files for
    Displayport compliance testing"
    - LP: #1460674
  * SAUCE: Load i915_bpo from the hda driver on SKL/CHV
    - LP: #1460674
  * SAUCE: i915_bpo: Don't try to support BXT
    - LP: #1460674
  * SAUCE: i915_bpo: drm/i915/skl: Fix DMC API version.

  [ Upstream Kernel Changes ]

  * Revert "usb: dwc2: add bus suspend/resume for dwc2"
    - LP: #1471252
  * Revert "HID: logitech-hidpp: support combo keyboard touchpad TK820"
    - LP: #1471252
  * Revert "KVM: x86: drop fpu_activate hook"
    - LP: #1471252
  * Revert "libceph: clear r_req_lru_item in __unregister_linger_request()"
    - LP: #1471252
  * drm/i915: add component support
    - LP: #1460661
  * ALSA: hda: export struct hda_intel
    - LP: #1460661
  * ALSA: hda: pass intel_hda to all i915 interface functions
    - LP: #1460661
  * ALSA: hda: add component support
    - LP: #1460661
  * drm/atomic-helpers: Fix documentation typos and wrong copy&paste
    - LP: #1460674
  * drm/atomic: Rename drm_atomic_helper_commit_pre_planes() state argument
    - LP: #1460674
  * drm/atomic-helper: Rename commmit_post/pre_planes
    - LP: #1460674
  * drm/atomic-helpers: make mode_set hooks optional
    - LP: #1460674
  * drm/atomic-helper: Fix kerneldoc for prepare_planes
    - LP: #1460674
  * drm: Complete moving rotation property to core
    - LP: #1460674
  * drm: Share plane pixel format check code between legacy and atomic
    - LP: #1460674
  * drm/atomic: Constify a bunch of functions pointer structs
    - LP: #1460674
  * drm: Fix some typo mistake of the annotations
    - LP: #1460674
  * drm: change connector to tmp_connector
    - LP: #1460674
  * drm: atomic: Expose CRTC active property
    - LP: #1460674
  * drm: atomic: Allow setting CRTC active property
    - LP: #1460674
  * drm/atomic-helpers: Properly avoid full modeset dance
    - LP: #1460674
  * drm/atomic: Add helpers for state-subclassing drivers
    - LP: #1460674
  * drm: Fix some typos
    - LP: #1460674
  * drm/atomic: Add for_each_{connector,crtc,plane}_in_state helper macros
    - LP: #1460674
  * drm/atomic-helper: Don't call atomic_update_plane when it stays off
    - LP: #1460674
  * drm/atomic-helper: Really recover pre-atomic plane/cursor behavior
 ...

Changed in linux (Ubuntu Vivid):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Wily):
status: Triaged → Fix Released
Changed in linux (Ubuntu Utopic):
status: Triaged → Fix Committed
Changed in linux (Ubuntu Trusty):
status: Triaged → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
Stephen A. Zarkos (stevez) wrote :

Testing complete, no issues found. Thanks!

tags: added: verification-done-trusty
removed: verification-needed-trusty
Frederik Bosch (f-bosch) wrote :

Since this also helps on the backup issue: please merge these patches!

Joseph Salisbury (jsalisbury) wrote :

@Frederik These patches are now in Wily and Vivid. They are queued up to be in the next Utopic and Trusty releases.

I'm still working on backporting the patches to Precise. There are a lot of prerequisites to get them to apply to Precise.

Chris Valean (cvalean) wrote :

Joseph, for Trusty proposed, is 3.13.0-62.101 the kernel we should be looking at?
That's what I get on the latest 14.04.3.

Joseph Salisbury (jsalisbury) wrote :

@Chris, yes the Trusty proposed kernel version is 3.13.0-62.101

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Utopic):
status: Fix Committed → Fix Released
andfra74 (1-andrea) wrote :

any news for precise (12.04) patch?

Changed in linux (Ubuntu Precise):
status: Triaged → In Progress
Joseph Salisbury (jsalisbury) wrote :

I was able to create a Precise test kernel with the requested patches. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1445195/precise/

There were quite a few prerequisites required to get the requested patches into Precise. Due to this, these may be too many changes for an SRU. In that case, the lts-backport kernel would have to be used with Precise.

The following is a list of the needed prerequisites (note that the SHA1 is not the upstream SHA1, but the subject is the same as upstream):

134b913 scsi: storvsc: Size the queue depth based on the ringbuffer size
6410547 scsi: storvsc: Increase the ring buffer size
39fb497 storvsc: use cmd_size to allocate per-command data
45c7382 Drivers: hv: vmbus: Support a vmbus API for efficiently sending page arrays
8dac1a7 virtio_scsi: use cmd_size
705de1c scsi: add a blacklist flag which enables VPD page inquiries
e99a978 [SCSI] add support for per-host cmd pools
529db02 hv: Add hyperv.h to uapi headers
33200a3 [SCSI] storvsc: Implement multi-channel support
06f9766 [SCSI] storvsc: Update the storage protocol to win8 level
db2e1ac Drivers: hv: vmbus: Implement multi-channel support
1ecf050 [SCSI] Allow error handling timeout to be specified
34c3fe0 [SCSI] storvsc: Restructure error handling code on command completion
b652ce5 Drivers: hv: Manage signaling state on a per-connection basis
3250c7d Drivers: hv: Extend/modify vmbus_channel_offer_channel for win7 and beyond
b0b97ff Drivers: hv: Setup a mapping for Hyper-V's notion cpu ID
dd2de78 Drivers: hv: Optimize the signaling on the write path
2741cbd Drivers: hv: Move vmbus version definitions to hyperv.h
dfa47ec Drivers: hv: Save and export negotiated vmbus version
b718621 Drivers: hv: Support handling multiple VMBUS versions
f43b693 [SCSI] Disable DIF on Hitachi Ultrastar 15K300
d3c4496 [SCSI] Handle disk devices which can not process medium access commands

Adrian Suhov (asuhov) wrote :

I installed the kernel from http://kernel.ubuntu.com/~jsalisbury/lp1445195/precise/ on an Ubuntu 12.04 x64 VM on Hyper-V. After installing, the VM won't boot (NULL pointer dereference).

I attached the boot log.

Joseph Salisbury (jsalisbury) wrote :

I built a second Precise test kernel. It looks like some additional prerequisite commits are required.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1445195/precise/

Can you test this kernel to ensure it boots and to see if it resolves this bug?

Thanks in advance!

Ovidiu Rusu (orusu) wrote :

I successfully installed Joseph's second Precise test kernel on Ubuntu Precise 12.04.02, and the VM boots fine.
I should mention that the default kernel is 3.5.0-23-generic. The difference between the kernels is pretty large.
My question is: shouldn't we rebase to the 3.5 version?

Joseph Salisbury (jsalisbury) wrote :

12.04.02 reached end of life in Aug 2014, so only the 3.2 kernel and the 3.13 backport kernel from Trusty are now supported.

Joshua R. Poulson (jrp) wrote :

Indeed, the intention here was to go to the HWE kernel for Precise, 3.13.

Joseph Salisbury (jsalisbury) wrote :

Thanks for the update, Joshua. So it sounds like these patches don't need to go into the 3.2 kernel. If that is the case, I'll mark the Precise bug task as invalid.

no longer affects: linux (Ubuntu Precise)
Michele Primavera (michyprima) wrote :

ext4fs still randomly goes read-only for me during both offline and online backups on 15.10 with linux 4.2.0-18. Should I go back to linux 3.x?

Joshua R. Poulson (jrp) wrote :

Michele, we are tracking the VSS issue you may be having in bug 1470250. Is this the same issue?
