file not initialized to 0s under some conditions

Bug #1371591 reported by Leann Ogasawara
This bug affects 2 people
Affects                      Status         Importance  Assigned to  Milestone
linux (Ubuntu)               In Progress    High        Unassigned
  Precise                    Invalid        Undecided   Unassigned
  Trusty                     Fix Committed  High        Unassigned
  Utopic                     Won't Fix      High        Unassigned
linux-lts-trusty (Ubuntu)    Confirmed      Undecided   Unassigned
  Precise                    Won't Fix      High        Unassigned
  Trusty                     Invalid        Undecided   Unassigned
  Utopic                     Invalid        Undecided   Unassigned

Bug Description

SRU Justification:

[Impact]

Under some conditions, after fallocate() the file is observed not to be completely initialized to 0s: some 4KB pages have left-over data from previous files that occupied those pages. Note that in addition to causing functional problems for applications expecting files to be initialized to 0s, this is a security issue because it allows data to "leak" from one file to another, bypassing file access controls.
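
The expectation described above can be checked from userspace. The following is a minimal sketch, assuming Python 3 on Linux; the helper names `nonzero_pages` and `allocate_and_check` are illustrative and are not part of the attached reproducer. It uses `os.posix_fallocate`, which wraps posix_fallocate(3) rather than calling fallocate(2) directly. It only checks the invariant and is not expected to trigger the bug by itself, since the report says a particular access pattern is needed.

```python
import os
import tempfile

PAGE = 4096  # 4 KiB pages, as in the report


def nonzero_pages(path, page=PAGE):
    """Return offsets of the pages in `path` that contain any non-zero byte."""
    bad = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            chunk = f.read(page)
            if not chunk:
                break
            # A clean page consists of zero bytes only.
            if chunk.count(0) != len(chunk):
                bad.append(offset)
            offset += len(chunk)
    return bad


def allocate_and_check(directory, size=8 * PAGE):
    """Preallocate a new file of `size` bytes and list its non-zero pages."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.posix_fallocate(fd, 0, size)  # Python wrapper for posix_fallocate(3)
        os.fsync(fd)
        return nonzero_pages(path)
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    leaked = allocate_and_check(tempfile.gettempdir())
    print("non-zero pages at offsets:", [hex(o) for o in leaked])
```

On a healthy system the list should be empty; any offset it reports is a page that was handed to the new file without being zeroed.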

The problem has been seen running under the following VMWare-based virtual environments:
Fusion 6.0.2
ESXi 5.1.0

And under the following versions of Ubuntu:
Ubuntu 12.04, 3.11.0-26-generic
Ubuntu 14.04.1, 3.13.0-32-generic
Ubuntu 14.04.1, 3.13.0-35-generic

But did not reproduce under the following version:
Ubuntu 10.04, 2.6.32-38-server

The problem reproduced under LVM, but did not reproduce without LVM.

[Test Case]

I reproduced the problem as follows under VMWare Fusion:
set up custom VM with default disk size (20 GB) and memory size (1 GB)
attach Ubuntu 14.04.1 ISO to CDROM, set it as boot device, boot up
select all defaults during installation _including_ LVM
install gcc
unpack the attached repro.tgz
run repro.sh

what it does:
* fills the disk with a file containing bytes of 0xcc then deletes it
* repeatedly runs the repro program which creates two files and accesses them in a certain pattern
* checks the file f0 with hexdump; it should contain all 0s, but if pages 0x1000-0x7000 contain 0xcc you have reproduced the problem
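
The hexdump check in the last step can also be done programmatically. This is a rough sketch in Python, not the contents of repro.sh: the function names are hypothetical, and the sample data in the demo is synthetic, built only to mimic the failure signature described above (the exact end of the stale range is an assumption).

```python
PAGE = 0x1000  # 4 KiB pages, as in the report
FILL = 0xCC    # byte the repro script fills the disk with


def classify_pages(data, page=PAGE, fill=FILL):
    """Map each page offset in `data` to 'zero', 'stale' (only fill bytes) or 'other'."""
    result = {}
    for offset in range(0, len(data), page):
        chunk = data[offset:offset + page]
        if chunk.count(0) == len(chunk):
            result[offset] = "zero"
        elif chunk.count(fill) == len(chunk):
            result[offset] = "stale"
        else:
            result[offset] = "other"
    return result


def reproduced(data):
    """True if any page holds only the fill byte, i.e. old data leaked into the file."""
    return "stale" in classify_pages(data).values()


if __name__ == "__main__":
    # Synthetic stand-in for f0: 8 pages, bytes 0x1000 up to 0x7000 hold 0xcc.
    sample = bytearray(8 * PAGE)
    sample[0x1000:0x7000] = bytes([FILL]) * 0x6000
    for off, kind in classify_pages(bytes(sample)).items():
        print(f"{off:#07x} {kind}")
    print("reproduced:", reproduced(bytes(sample)))
```

To apply it to a real run, read f0 with `open("f0", "rb").read()` and pass the bytes to `reproduced`.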

If the problem does not appear to reproduce, please try waiting a bit and checking the f0 files with hexdump again. This behavior was observed by a customer reproducing the problem under ESXi. I have since added a sync after running the repro binary, which I think will fix that.

If you still can't reproduce the problem please let me know if there's anything I can do to help. For example can we trace the disk accesses at the SCSI level to verify whether the appropriate SCSI commands are being sent? This may help determine whether the problem is in Linux or in VMWare.

[Fix]

mptfusion: enable no_write_same in scsi_host_template
commit 4089b71cc820a426d601283c92fcd4ffeb5139c2 upstream

https://lkml.org/lkml/2014/9/25/482

(Note this patch may be reverted in the future as there is active discussion upstream about a more generic fix)

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1371591

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → Critical
description: updated
Chris J Arges (arges)
Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → Chris J Arges (arges)
Bruce Lucas (bruce-lucas) wrote : Re: file not initialized to 0s under some conditions on VMWare
summary: - FS Corruption with Ubuntu and VMWare
+ file not initialized to 0s under some conditions on VMWare
Chris J Arges (arges) wrote :

So are you sure 20GB is enough disk space for the VM? I tried in a KVM VM (just to try the reproducer) and I get 'No space left on device' errors from fallocate.

In addition, can you post the machine information somewhere in this bug so I can reproduce the exact setup on my end with VMware?

Also to be clear, LVM _is_ required to reproduce the issue.

Thanks for the overall great bug description and reproducer script!
--chris

Bruce Lucas (bruce-lucas) wrote :

20 GB should be more than enough. It should run the repro binary several times, using more disk space each time since it leaves each run in place when it goes on to the next, and then get a failure from fallocate on the last one when the disk is filled up. I've attached a file showing a successful repro run on a 20 GB disk. Could the no space errors you are seeing be a ulimit issue?

Yes, LVM is required to reproduce the issue; sorry about the error in the repro steps; fixed now.

Can you clarify what machine information you're looking for?

description: updated
Bruce Lucas (bruce-lucas) wrote :
Chris J Arges (arges) wrote :

When creating the VM, did you use the default settings, or was the hardware configuration changed at all? I.e. did you add extra processors, or just use 1?

I'll look into my errors a bit more while getting VMWare Workstation set up. Has this been reproduced only in VMWare Fusion? Also, are there specific versions of VMWare * you have used to repro the issue? And what kernel version can repro this issue (uname -a)?

Chris J Arges (arges)
Changed in linux (Ubuntu):
status: Incomplete → In Progress
Chris J Arges (arges) wrote :

@bruce-lucas:

Ok! I can reproduce this issue in VMWare Workstation. I'll start investigating more deeply.
In my KVM instance previously I did not reproduce the issue.

Bruce Lucas (bruce-lucas) wrote :

I accepted all the default settings when creating the VM (which for Fusion was 20 GB disk, 1 GB memory, single processor).

On Fusion at least, it is important to do the manual install: select installation method / more options / create a custom VM, mount the CD, set as boot device, boot up, go through the Ubuntu installer, accept the default to use LVM. If you go through select installation method / install from disc it does not take you through the Ubuntu installer UI, and it does not use LVM. At least that's what I see in Fusion; can't say for sure about Workstation.

I reproduced it on VMWare Fusion, a customer reproduced it on VMWare ESXi (versions listed in original description).

The kernel versions where it has been reproduced are also listed above next to the Ubuntu distro version.

Some more information:
* was able to reproduce the problem on Fedora 20 (3.11.10-301, with LVM)
* did not reproduce the problem on Centos 7.0 (3.10.0-123, with LVM)

So far all repros have been on 3.11 kernels or later.

information type: Public → Public Security
Bruce Lucas (bruce-lucas) wrote :

Awesome, thanks.

Bruce Lucas (bruce-lucas) wrote :

By the way, a couple more pieces of information that may be relevant:

The problem is sensitive to the particular pattern of access to map0. If you remove either of the two writes (at 0x0 and 0x7000) the problem disappears, or if you change 0x7000 to a higher page it also disappears.

We also observed a message about a WRITE SAME failure in syslog. I believe this means that the platform does not implement SCSI WRITE SAME, which I imagine would be used to zero pages under some circumstances, but instead uses a fallback of "manually" zeroing the pages. Perhaps there is a problem in this area?

tags: added: kernel-da-key
Chris J Arges (arges) wrote :

@bruce-lucas

OK I have reproduced this with the latest 3.17-rc kernel. So most likely this is an upstream issue.
However, I'll start testing previous versions to see if this is a regression between 3.10 and 3.11; this will help us zero in on the code changes that may have introduced this behavior.

I do get the same error on my runs:
[ 176.092640] dm-0: WRITE SAME failed. Manually zeroing.
I'll investigate this in parallel.

tags: added: kernel-bug-exists-upstream
Chris J Arges (arges) wrote :

The 'regression' is between 3.9 and 3.10-rc1. I'll bisect between these tags to see where the issue is.

Chris J Arges (arges) wrote :

The bisect resulted in the following:

dc019b21fb92d620a3b52ccecc135ac968a7c7ec is the first bad commit
commit dc019b21fb92d620a3b52ccecc135ac968a7c7ec
Author: Mike Snitzer <email address hidden>
Date: Fri May 10 14:37:16 2013 +0100

    dm table: fix write same support

    If device_not_write_same_capable() returns true then the iterate_devices
    loop in dm_table_supports_write_same() should return false.

    Reported-by: Bharata B Rao <email address hidden>
    Signed-off-by: Mike Snitzer <email address hidden>
    Cc: <email address hidden> # v3.8+
    Signed-off-by: Alasdair G Kergon <email address hidden>

:040000 040000 d8b62d18789b5c9e5b52c076abcf4c8c066b5d59 71a5511a8ea76f43bd167524a9186c1d78407bce M drivers

--

However, I don't think the issue is with this patch. The function 'device_not_write_same_capable()' correctly returns:
   return q && !q->limits.max_write_same_sectors;
If max_write_same_sectors is 0 (write_same not supported), then true is returned and thus 'not_write_same_capable'.

Likewise the function 'dm_table_supports_write_same' iterates through dm tables and checks

        if (!ti->type->iterate_devices ||
            ti->type->iterate_devices(ti, device_not_write_same_capable, NULL))
                return false;

So if iterate_devices is NULL, the function returns false. Otherwise iterate_devices is called with device_not_write_same_capable as the callback, and if that returns 'true' then the function returns 'false'. (A bit confusing, but essentially the parent function is 'supports_write_same' and it uses a 'not_write_same_capable' helper to answer the question.)
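
The double negative is easier to follow in a small model. This is a simplified sketch in Python, not the kernel code: the `Device` and `Target` classes are hypothetical stand-ins, `devices=None` models a target type without an iterate_devices callback, and iterate_devices is modelled as "true if the callback is true for any device". Any other checks the real function performs are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Device:
    name: str
    max_write_same_sectors: int  # 0 means WRITE SAME is not supported


@dataclass
class Target:
    # None models a target type that has no iterate_devices callback.
    devices: Optional[List[Device]]


def device_not_write_same_capable(dev: Device) -> bool:
    """Mirror of the kernel helper: true when the device cannot do WRITE SAME."""
    return dev.max_write_same_sectors == 0


def table_supports_write_same(targets: List[Target]) -> bool:
    """The table supports WRITE SAME only if no target vetoes it."""
    for ti in targets:
        if ti.devices is None:
            return False
        if any(device_not_write_same_capable(d) for d in ti.devices):
            return False
    return True


if __name__ == "__main__":
    good = Device("sda", 65535)
    bad = Device("sdb", 0)
    print(table_supports_write_same([Target([good])]))
    print(table_supports_write_same([Target([good, bad])]))
```

A single incapable device anywhere in the table is enough to make the whole dm device report no WRITE SAME support.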

That logic was introduced in d54eaa5a0fde0a202e4e91f200f818edcef15bee (v3.8-rc1), which means that before that commit we might not see the same behavior. That could account for 2.6.32 not failing this test case.

Relevant thread: http://www.spinics.net/lists/dm-devel/msg19583.html

--

Looking at the affected VM:

This explains why only LVM setups are affected, and it explains the helpful kernel message output. If we check the dm devices for our LVM VGs we see the following:

ubuntu@ubuntu:~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 20G 0 disk
├─sda1 8:1 0 243M 0 part /boot
├─sda2 8:2 0 1K 0 part
└─sda5 8:5 0 19.8G 0 part
  ├─ubuntu--vg-root (dm-0) 252:0 0 18.8G 0 lvm /
  └─ubuntu--vg-swap_1 (dm-1) 252:1 0 1020M 0 lvm [SWAP]
sr0 11:0 1 572M 0 rom

ubuntu@ubuntu:~$ cat /sys/dev/block/252\:1/queue/write_same_max_bytes
33553920
ubuntu@ubuntu:~$ cat /sys/dev/block/252\:0/queue/write_same_max_bytes
33553920

So write_same support is enabled, but then that causes the failure. So at this point, I wonder if the underlying virtual SCSI is at fault.
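
The same sysfs attribute can be collected for every block device at once. A minimal sketch in Python, assuming the `write_same_max_bytes` queue attribute shown above is present on the running kernel (devices without it are simply skipped); the function name and the `sys_block` parameter are illustrative.

```python
from pathlib import Path


def write_same_limits(sys_block="/sys/block"):
    """Return {device name: write_same_max_bytes} for devices exposing the attribute."""
    limits = {}
    for attr in sorted(Path(sys_block).glob("*/queue/write_same_max_bytes")):
        device = attr.parent.parent.name  # <device>/queue/write_same_max_bytes
        limits[device] = int(attr.read_text().strip())
    return limits


if __name__ == "__main__":
    for device, limit in write_same_limits().items():
        state = "advertised" if limit else "disabled"
        print(f"{device}: write_same_max_bytes={limit} ({state})")
```

Comparing the value for a dm device with the value for the disk underneath it, before and after a failed WRITE SAME, shows whether the two layers still agree.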

Chris J Arges (arges) wrote :

FWIW, I also tested this on a similar KVM instance using a SCSI disk device, with Ubuntu installed using LVM. Running the same test case does not result in the failure, even though write_same is enabled for those devices.

ubuntu@lp1371591:~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 30G 0 disk
├─sda1 8:1 0 243M 0 part /boot
├─sda2 8:2 0 1K 0 part
└─sda5 8:5 0 29.8G 0 part
  ├─lp1371591--vg-root (dm-0) 252:0 0 28.8G 0 lvm /
  └─lp1371591--vg-swap_1 (dm-1) 252:1 0 1G 0 lvm [SWAP]
sr0 11:0 1 1024M 0 rom
ubuntu@lp1371591:~$ cat /sys/block/dm-*/queue/write_same_max_bytes
33553920
33553920

Chris J Arges (arges) wrote :

As you mentioned in #10, the test still shows failures even though blkdev_issue_zeroout (the function that prints 'WRITE SAME failed. Manually zeroing.') should fall back to manually zeroing the block device when the hardware write_same fails. A test I could try is to short-circuit BLKZEROOUT to always use the manual case and see if there are still failures.

Chris J Arges (arges) wrote :

OK, I ran my experiment with the following change (and I know Launchpad will mangle my spacing...):

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 9b5b561..03ad981 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -283,6 +283,7 @@ int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
                         sector_t nr_sects, gfp_t gfp_mask)
 {
+#if 0
        if (bdev_write_same(bdev)) {
                unsigned char bdn[BDEVNAME_SIZE];

@@ -293,7 +294,7 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
                bdevname(bdev, bdn);
                pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
        }
-
+#endif
        return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
 }
 EXPORT_SYMBOL(blkdev_issue_zeroout);

Forcing manual zeroout works fine, so the 'hardware' zeroout is the problem.

Chris J Arges (arges) wrote :

As a workaround, you can use IDE/SATA disks instead of SCSI (the default). I'm not sure of the performance implications, but the test case passes.

Chris J Arges (arges) wrote :

Can you try this build and see if it fixes the issue? Please install it into the guest VM:
http://people.canonical.com/~arges/lp1371591/

Thanks,

Bruce Lucas (bruce-lucas) wrote :

I can confirm that this fixes the issue, both for the mongod repro and for the standalone repro attached to this report.

Chris J Arges (arges) wrote :
Bruce Lucas (bruce-lucas) wrote : Re: [Bug 1371591] Re: file not initialized to 0s under some conditions on VMWare

Thanks Chris.

I take it from the other thread that bubbling the setting up from the lower
layer to the dm-* layer won't be possible.

Do you know if this patch will fix it for ESXi as well as for the VMWare
desktop products? I don't have access to ESXi, but the issue was originally
reported to us by a customer running ESXi. I could ask them to try the
patched kernel, but maybe we could do a simpler check first. Is there a way
to determine the vendor id for their virtual SCSI disks on a running system
so we can verify that the vendor id is the same?

Thanks,
Bruce

On Wed, Sep 24, 2014 at 10:00 AM, Chris J Arges <email address hidden>
wrote:

> https://lkml.org/lkml/2014/9/23/509

Chris J Arges (arges) wrote : Re: file not initialized to 0s under some conditions on VMWare

Bruce,

The following should get some of the info needed:
tail /sys/class/scsi_device/*/device/{vendor,model}

Bruce Lucas (bruce-lucas) wrote : Re: [Bug 1371591] Re: file not initialized to 0s under some conditions on VMWare

This is what the customer reports. Ignoring the CD-ROM, this is slightly
different from what I see on my VMWare Fusion installation, which reports
"VMware, " (with a comma and a space). Presumably this is an insignificant
difference?
Bruce

tail /sys/class/scsi_device/*/device/{vendor,model}
==> /sys/class/scsi_device/1:0:0:0/device/vendor <==
NECVMWar

==> /sys/class/scsi_device/2:0:0:0/device/vendor <==
VMware

==> /sys/class/scsi_device/1:0:0:0/device/model <==
VMware IDE CDR10

==> /sys/class/scsi_device/2:0:0:0/device/model <==
Virtual disk


Chris J Arges (arges) wrote : Re: file not initialized to 0s under some conditions on VMWare
Andy Whitcroft (apw)
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Trusty):
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.16.0-20.27

---------------
linux (3.16.0-20.27) utopic; urgency=low

  [ Tim Gardner ]

  * [Config] CONFIG_CXL=m
  * Release Tracking Bug
    - LP: #1376354

  [ Andi Kleen ]

  * SAUCE: perf tools: Fix perf record as non root with kptr_restrict == 1
    - LP: #1375441

  [ Chris J Arges ]

  * SAUCE: Revert "sd: don't use scsi_setup_blk_pc_cmnd for flush requests"
    - LP: #1375452

  [ Ian Munsie ]

  * SAUCE: (no-up) powerpc/cell: Move spu_handle_mm_fault() out of cell platform
  * SAUCE: (no-up) powerpc/cell: Move data segment faulting code out of cell platform
  * SAUCE: (no-up) powerpc/msi: Improve IRQ bitmap allocator
  * SAUCE: (no-up) powerpc/mm: Export mmu_kernel_ssize and mmu_linear_psize
  * SAUCE: (no-up) powerpc/powernv: Split out set MSI IRQ chip code
  * SAUCE: (no-up) cxl: Add new header for call backs and structs
  * SAUCE: (no-up) powerpc/powerpc: Add new PCIe functions for allocating cxl interrupts
  * SAUCE: (no-up) powerpc/mm: Add new hash_page_mm()
  * SAUCE: (no-up) powerpc/opal: Add PHB to cxl mode call
  * SAUCE: (no-up) powerpc/mm: Add hooks for cxl
  * SAUCE: (no-up) cxl: Add base builtin support
  * SAUCE: (no-up) cxl: Driver code for powernv PCIe based cards for userspace access
  * SAUCE: (no-up) cxl: Userspace header file.
  * SAUCE: (no-up) cxl: Add driver to Kbuild and Makefiles
  * SAUCE: (no-up) cxl: Add documentation for userspace APIs
 -- Tim Gardner <email address hidden> Tue, 30 Sep 2014 13:05:27 -0600

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Arvind Kumar (arvindkumar) wrote :

Hi Chris,

This is Arvind Kumar from VMware. The issue discussed in this bug was recently brought to VMware's attention. We looked at the patch (https://lkml.org/lkml/2014/9/23/509) that was made to address the issue. Since the patch is in the mptsas driver, it addresses the issue only on the lsilogic controller; if the user uses some other controller, e.g. pvscsi or buslogic, the issue remains. Moreover, the patch disables WRITE SAME completely on lsilogic, which implies that VMware will never be able to support WRITE SAME on lsilogic. As I understand from the bug, it was concluded that WRITE SAME is not properly implemented by VMware. Actually, we don't support WRITE SAME at all.

We investigated the issue internally. As per our understanding, the issue is not VMware-specific but rather lies in the kernel, and it could very well happen on real hardware too if the disk doesn't support the WRITE SAME command. Below are the details of the investigation by Petr Vandrovec.

--

In blk-lib.c on line 294 it checks whether bdev supports write_same. With LVM, bdev here is dm-0. It says yes, it is supported, and so write_same is invoked (note that the check is racy in case the device loses write_same capability between the test and the moment the bio is issued):

    291 int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
    292                          sector_t nr_sects, gfp_t gfp_mask)
    293 {
    294         if (bdev_write_same(bdev)) {
    295                 unsigned char bdn[BDEVNAME_SIZE];
    296
    297                 if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
    298                                              ZERO_PAGE(0)))
    299                         return 0;
    300
    301                 bdevname(bdev, bdn);
    302                 pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
    303         }
    304
    305         return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
    306 }
    307 EXPORT_SYMBOL(blkdev_issue_zeroout);

Then it gets to LVM, and LVM forwards the request to sda. When it fails, the kernel clears bdev_write_same() on sda and returns -121 (EREMOTEIO).

Now the next request comes. Nobody cleared bdev_write_same() on dm-0 (it was cleared only on sda), so the request gets to LVM, which forwards it to sda. There it hits a snag in blk-core.c (line 1824):

    if (bio->bi_rw & REQ_WRITE_SAME && !bdev_write_same(bio->bi_bdev)) {
            err = -EOPNOTSUPP;
            goto end_io;
    }

bi_bdev here is sda, and the I/O fails with EOPNOTSUPP without WRITE_SAME ever being issued. It then hits the completion code, which treats EOPNOTSUPP as success:

    static void bio_batch_end_io(struct bio *bio, int err)
    {
            struct bio_batch *bb = bio->bi_private;

            if (err && (err != -EOPNOTSUPP))
                    clear_bit(BIO_UPTODATE, &bb->flags);
            if (atomic_dec_and_test(&bb->done))
                    complete(bb->wait);
            bio_put(bio);
    }

So everybody outside of blkdev_issue_write_same() thinks that I/O succeeded, while in real...
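
The sequence in the analysis above can be followed in a small userspace model. This is a sketch in Python, not kernel code: the `BlockDevice` class and the function names are hypothetical, and only the two error codes, the flag being cleared on the lower device alone, and the special-casing of EOPNOTSUPP are taken from the excerpts. The errno values are the Linux ones, hardcoded so the model does not depend on the host platform.

```python
EOPNOTSUPP = 95   # Linux errno value
EREMOTEIO = 121   # Linux errno value, the -121 mentioned above


class BlockDevice:
    """Toy block device; `lower` is the device a dm device maps onto."""

    def __init__(self, name, lower=None, hw_write_same=False):
        self.name = name
        self.lower = lower
        self.hw_write_same = hw_write_same  # does the disk accept WRITE SAME?
        self.write_same_flag = True         # what bdev_write_same() would report

    def submit_write_same(self):
        """Return 0 or a negative errno, like a bio completion status."""
        if not self.write_same_flag:
            return -EOPNOTSUPP              # rejected before any command is sent
        if self.lower is not None:
            return self.lower.submit_write_same()  # dm forwards to the disk
        if not self.hw_write_same:
            self.write_same_flag = False    # cleared on the disk only, not on dm-0
            return -EREMOTEIO
        return 0


def issue_write_same(dev):
    err = dev.submit_write_same()
    # Same test as bio_batch_end_io(): EOPNOTSUPP is not counted as a failure.
    if err and err != -EOPNOTSUPP:
        return err
    return 0


def issue_zeroout(dev, log):
    """Return which path the caller believes zeroed the range."""
    if dev.write_same_flag:
        if issue_write_same(dev) == 0:
            return "write_same"
        log.append(f"{dev.name}: WRITE SAME failed. Manually zeroing.")
    return "manual"


if __name__ == "__main__":
    sda = BlockDevice("sda")
    dm0 = BlockDevice("dm-0", lower=sda)
    log = []
    print(issue_zeroout(dm0, log))  # first request
    print(issue_zeroout(dm0, log))  # second request
    print(log)
```

In this model the first request falls back to manual zeroing and logs the message once. The second request reports the write_same path even though no command reached the disk, which matches the stale pages seen in the reproducer.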


Petr Vandrovec (petr-vmware) wrote :

Hi Chris,
  can you revert the original patch, which blacklists VMware for no good reason, and instead apply the attached patch or its equivalent? As explained above, the bug affects ALL disks that do not support WRITE_SAME when used with stackable block devices, like LVM. Due to the bug, the Linux kernel stops issuing WRITE_SAME after the first failure and treats later requests as succeeding, instead of falling back to the non-write-same code.

I have doubts about the handling of discards too, but I'm not a Linux storage guru, so I left discard handling returning EOPNOTSUPP rather than switching it to EREMOTEIO too.

Unfortunately I could not yet figure out how to reopen this bug so that the correct fix can be tracked.

Thanks,
Petr Vandrovec

Disclosure: I'm VMware employee.

Chris J Arges (arges) wrote :

@petr-vmware, arvindkumar:

Thanks for looking into this!
I think your analysis makes more sense, because we'd expect the manual zeroing to actually work instead of causing corruption. What kind of testing have you done, and on which configurations?

Can you submit this patch upstream and reference my original patch (or reply to the thread with the original bug)? Once this receives some acks upstream and my patch gets reverted, we can do the same in the Ubuntu kernel.

I'll reopen the bug.

Thanks,

Changed in linux (Ubuntu):
status: Fix Released → In Progress
importance: Critical → High
tags: added: patch
Robert C Jennings (rcj) wrote :

I've nominated Precise as well based on the description indicating a recreate there.

Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
Chris J Arges (arges)
no longer affects: linux (Ubuntu Precise)
Changed in linux-lts-trusty (Ubuntu Trusty):
status: New → Invalid
Changed in linux (Ubuntu Precise):
status: New → Invalid
Changed in linux-lts-trusty (Ubuntu Precise):
status: New → Triaged
Swapneel Kekre (skekre)
summary: - file not initialized to 0s under some conditions on VMWare
+ file not initialized to 0s under some conditions
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-lts-trusty (Ubuntu):
status: New → Confirmed
Kane York (kanepyork) wrote :

Fix is in Christoph Hellwig's tree, on path to 3.18 and backport to stable. https://lkml.org/lkml/2014/9/25/482

Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Chris J Arges (arges)
description: updated
Chris J Arges (arges) wrote :

Verified in my trusty VM.

tags: added: verification-done-trusty
removed: verification-needed-trusty
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-39.66

---------------
linux (3.13.0-39.66) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1386629

  [ Upstream Kernel Changes ]

  * KVM: x86: Check non-canonical addresses upon WRMSR
    - LP: #1384539
    - CVE-2014-3610
  * KVM: x86: Prevent host from panicking on shared MSR writes.
    - LP: #1384539
    - CVE-2014-3610
  * KVM: x86: Improve thread safety in pit
    - LP: #1384540
    - CVE-2014-3611
  * KVM: x86: Fix wrong masking on relative jump/call
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: Warn if guest virtual address space is not 48-bits
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: Emulator fixes for eip canonical checks on near branches
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: emulating descriptor load misses long-mode case
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: Handle errors when RIP is set during far jumps
    - LP: #1384545
    - CVE-2014-3647
  * kvm: vmx: handle invvpid vm exit gracefully
    - LP: #1384544
    - CVE-2014-3646
  * Input: synaptics - gate forcepad support by DMI check
    - LP: #1381815

linux (3.13.0-38.65) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1379244

  [ Andy Whitcroft ]

  * Revert "SAUCE: scsi: hyper-v storsvc switch up to SPC-3"
    - LP: #1354397
  * [Config] linux-image-extra is additive to linux-image
    - LP: #1375310
  * [Config] linux-image-extra postrm is not needed on purge
    - LP: #1375310

  [ Upstream Kernel Changes ]

  * Revert "KVM: x86: Increase the number of fixed MTRR regs to 10"
    - LP: #1377564
  * Revert "USB: option,zte_ev: move most ZTE CDMA devices to zte_ev"
    - LP: #1377564
  * aufs: bugfix, stop calling security_mmap_file() again
    - LP: #1371316
  * ipvs: fix ipv6 hook registration for local replies
    - LP: #1349768
  * Drivers: add blist flags
    - LP: #1354397
  * sd: fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O timeout
    - LP: #1354397
  * drm/i915/bdw: Add 42ms delay for IPS disable
    - LP: #1374389
  * drm/i915: add null render states for gen6, gen7 and gen8
    - LP: #1374389
  * drm/i915/bdw: 3D_CHICKEN3 has write mask bits
    - LP: #1374389
  * drm/i915/bdw: Disable idle DOP clock gating
    - LP: #1374389
  * drm/i915: call lpt_init_clock_gating on BDW too
    - LP: #1374389
  * drm/i915: shuffle panel code
    - LP: #1374389
  * drm/i915: extract backlight minimum brightness from VBT
    - LP: #1374389
  * drm/i915: respect the VBT minimum backlight brightness
    - LP: #1374389
  * drm/i915/bdw: Apply workarounds in render ring init function
    - LP: #1374389
  * drm/i915/bdw: Cleanup pre prod workarounds
    - LP: #1374389
  * drm/i915: Replace hardcoded cacheline size with macro
    - LP: #1374389
  * drm/i915: Refactor Broadwell PIPE_CONTROL emission into a helper.
    - LP: #1374389
  * drm/i915: Add the WaCsStallBeforeStateCacheInvalidate:bdw workaround.
    - LP: #1374389
  * drm/i915/bdw: Remove BDW preproduction W/As until C stepping.
    - LP: #1374389
  * mptfusion: enable no_write_same for vmware scsi disks
    - LP: #1371591
  * iommu/amd: Fix cleanup_domai...

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-lts-trusty - 3.13.0-39.66~precise1

---------------
linux-lts-trusty (3.13.0-39.66~precise1) precise; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1386866

  [ Upstream Kernel Changes ]

  * KVM: x86: Check non-canonical addresses upon WRMSR
    - LP: #1384539
    - CVE-2014-3610
  * KVM: x86: Prevent host from panicking on shared MSR writes.
    - LP: #1384539
    - CVE-2014-3610
  * KVM: x86: Improve thread safety in pit
    - LP: #1384540
    - CVE-2014-3611
  * KVM: x86: Fix wrong masking on relative jump/call
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: Warn if guest virtual address space is not 48-bits
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: Emulator fixes for eip canonical checks on near branches
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: emulating descriptor load misses long-mode case
    - LP: #1384545
    - CVE-2014-3647
  * KVM: x86: Handle errors when RIP is set during far jumps
    - LP: #1384545
    - CVE-2014-3647
  * kvm: vmx: handle invvpid vm exit gracefully
    - LP: #1384544
    - CVE-2014-3646
  * Input: synaptics - gate forcepad support by DMI check
    - LP: #1381815

linux (3.13.0-38.65) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1379244

  [ Andy Whitcroft ]

  * Revert "SAUCE: scsi: hyper-v storsvc switch up to SPC-3"
    - LP: #1354397
  * [Config] linux-image-extra is additive to linux-image
    - LP: #1375310
  * [Config] linux-image-extra postrm is not needed on purge
    - LP: #1375310

  [ Upstream Kernel Changes ]

  * Revert "KVM: x86: Increase the number of fixed MTRR regs to 10"
    - LP: #1377564
  * Revert "USB: option,zte_ev: move most ZTE CDMA devices to zte_ev"
    - LP: #1377564
  * aufs: bugfix, stop calling security_mmap_file() again
    - LP: #1371316
  * ipvs: fix ipv6 hook registration for local replies
    - LP: #1349768
  * Drivers: add blist flags
    - LP: #1354397
  * sd: fix a bug in deriving the FLUSH_TIMEOUT from the basic I/O timeout
    - LP: #1354397
  * drm/i915/bdw: Add 42ms delay for IPS disable
    - LP: #1374389
  * drm/i915: add null render states for gen6, gen7 and gen8
    - LP: #1374389
  * drm/i915/bdw: 3D_CHICKEN3 has write mask bits
    - LP: #1374389
  * drm/i915/bdw: Disable idle DOP clock gating
    - LP: #1374389
  * drm/i915: call lpt_init_clock_gating on BDW too
    - LP: #1374389
  * drm/i915: shuffle panel code
    - LP: #1374389
  * drm/i915: extract backlight minimum brightness from VBT
    - LP: #1374389
  * drm/i915: respect the VBT minimum backlight brightness
    - LP: #1374389
  * drm/i915/bdw: Apply workarounds in render ring init function
    - LP: #1374389
  * drm/i915/bdw: Cleanup pre prod workarounds
    - LP: #1374389
  * drm/i915: Replace hardcoded cacheline size with macro
    - LP: #1374389
  * drm/i915: Refactor Broadwell PIPE_CONTROL emission into a helper.
    - LP: #1374389
  * drm/i915: Add the WaCsStallBeforeStateCacheInvalidate:bdw workaround.
    - LP: #1374389
  * drm/i915/bdw: Remove BDW preproduction W/As until C stepping.
    - LP: #1374389
  * mptfusion: enable no_write_same for vmware scsi disks
    - LP: ...

Changed in linux-lts-trusty (Ubuntu Precise):
status: Triaged → Fix Released
Greg Swallow (gswallow-b) wrote :

Hi,

We just hit this bug on 3.13.0-39.66~precise1, running MongoDB 2.6.3. We're running Precise, with the trusty HWE enabled. We're on VMware, though from reading this bug report it doesn't matter.

Is there anything we can report to be helpful?

Chris J Arges (arges) wrote :

@greg,
Hi, which VMware platform are you on: ESX or Fusion/WS?
Could you try running the test case as described in the description to see if it is the same issue?
Thanks,
--chris

Greg Swallow (gswallow-b) wrote :

We are on ESXi, version 5.5, according to our hosting provider.

To run the test, I created a new, 40GB virtual disk. It's attached as SCSI ID 1:3, on a VMware Paravirtual SCSI adapter. I set it up as a PV and put ext4 on it the "fast way," which is what I always do:

  pvcreate /dev/sdd
  vgextend mongo03-vg00 /dev/sdd
  lvcreate -n test -l 100%PV mongo03-vg00 /dev/sdd
  mount
  mkfs.ext4 -E lazy_itable_init=1 -O uninit_bg /dev/mongo03-vg00/test

Then I ran the test. Initially it passed, twice:

/test/db.1 /test
repro in db.1
creating f0
creating f1
touching files
hexdump f0:
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
...

But then, after doing some "multitasking" (we're working on an upgrade to MongoDB 2.6.5) and noticing that it was finished, I ran "sync" three times and checked the f0 files again:

hexdump db.1/f0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000 cccc cccc cccc cccc cccc cccc cccc cccc
*
0007000 0000 0000 0000 0000 0000 0000 0000 0000
*
1000000
*
hexdump db.10/f0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
1000000
*
hexdump db.2/f0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000 cccc cccc cccc cccc cccc cccc cccc cccc
*
0007000 0000 0000 0000 0000 0000 0000 0000 0000
*
1000000
*
hexdump db.3/f0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000 cccc cccc cccc cccc cccc cccc cccc cccc
*
0007000 0000 0000 0000 0000 0000 0000 0000 0000
*
1000000
*
hexdump db.4/f0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000 cccc cccc cccc cccc cccc cccc cccc cccc
*
0007000 0000 0000 0000 0000 0000 0000 0000 0000
*
1000000
*
hexdump db.5/f0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
1000000
*
hexdump db.6/f0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
1000000
*
hexdump db.7/f0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
1000000
*
hexdump db.8/f0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
1000000
*
hexdump db.9/f0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
1000000
*
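
The manual hexdump check above could be automated with something like the following. This is only a minimal sketch, not part of the attached repro: the function name nonzero_page_ranges and the 4 KiB page size are assumptions chosen to match the 4KB granularity mentioned in the bug description. It reads a file page by page and reports the byte ranges that contain anything other than zeros.

```python
import sys

PAGE_SIZE = 4096  # assumed 4 KiB pages, as in the bug description


def nonzero_page_ranges(path, page_size=PAGE_SIZE):
    """Return [(start, end), ...] byte ranges of consecutive pages holding non-zero data."""
    ranges = []
    run_start = None  # offset where the current non-zero run began, if any
    offset = 0
    with open(path, "rb") as f:
        while True:
            page = f.read(page_size)
            if not page:
                break
            dirty = any(page)  # True if any byte in this page is non-zero
            if dirty and run_start is None:
                run_start = offset
            elif not dirty and run_start is not None:
                ranges.append((run_start, offset))
                run_start = None
            offset += len(page)
    if run_start is not None:
        # The file ended while still inside a non-zero run.
        ranges.append((run_start, offset))
    return ranges


if __name__ == "__main__":
    # Hypothetical usage: python check_zero.py db.*/f0
    for path in sys.argv[1:]:
        bad = nonzero_page_ranges(path)
        if not bad:
            print(f"{path}: all zeros")
        for start, end in bad:
            print(f"{path}: non-zero data in 0x{start:07x}-0x{end:07x}")
```

For a file like db.1/f0 above, the function would be expected to return a single range covering 0x1000 to 0x7000. As in the report, it is worth running sync and checking again later, since the first check passed here and the stale data only showed up afterwards.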

Andrew Reis (areis422) wrote :

This bug also exists in 14.10 (kernel version 3.16.0-25.33).

Chris J Arges (arges)
Changed in linux-lts-trusty (Ubuntu Utopic):
status: New → Invalid
Changed in linux (Ubuntu Utopic):
assignee: nobody → Chris J Arges (arges)
status: New → Triaged
Chris J Arges (arges) wrote :

OK, so I think the issue here is that the device in ESXi is not properly blacklisted.
The patch from:
https://lkml.org/lkml/2014/10/17/442

blacklists multiple devices and should be a better solution than my original patch.
I've created some test builds against 3.16 and 3.13. If those affected by this issue could test them on their affected systems, I can reply to the original thread and try to get this applied upstream and in the Ubuntu kernels.

Here are the builds:
http://people.canonical.com/~arges/lp1371591v2/

Note that if you are using the Precise backport of 3.13, this kernel should install just fine. Keep in mind that these are TEST kernels only; they shouldn't be installed on production systems and should be installed when testing is over.

Thanks!
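
When testing one of these kernels, it may help to confirm whether the kernel still considers WRITE SAME usable on the affected disk. The following is a rough sketch under stated assumptions, not something taken from the patch: it assumes the running kernel exposes the write_same_max_bytes queue attribute in sysfs, where a value of 0 means write-same is not supported (or has been disabled) for that device. The function names and the sysfs_root parameter are made up for illustration; the device name sdd is taken from the earlier comment.

```python
from pathlib import Path


def write_same_max_bytes(device, sysfs_root="/sys/block"):
    """Return the device's write_same_max_bytes value, or None if it is not exposed."""
    attr = Path(sysfs_root) / device / "queue" / "write_same_max_bytes"
    try:
        return int(attr.read_text().strip())
    except FileNotFoundError:
        # Attribute (or device) not present on this kernel.
        return None


def describe_write_same(device, sysfs_root="/sys/block"):
    """Summarise the WRITE SAME state of a block device as a one-line string."""
    value = write_same_max_bytes(device, sysfs_root)
    if value is None:
        return f"{device}: write_same_max_bytes not exposed"
    if value == 0:
        return f"{device}: WRITE SAME disabled (write_same_max_bytes = 0)"
    return f"{device}: WRITE SAME enabled, up to {value} bytes per command"


if __name__ == "__main__":
    # Hypothetical device name; substitute the disk backing the LVM volume.
    print(describe_write_same("sdd"))
```

If the blacklist applies to the disk, the expectation is that the value reads 0 on the test kernel. This is a supplementary check only; running the test case from the description remains the real verification.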

Changed in linux-lts-trusty (Ubuntu Precise):
status: Fix Released → In Progress
Changed in linux (Ubuntu Utopic):
status: Triaged → In Progress
Changed in linux-lts-trusty (Ubuntu Precise):
assignee: nobody → Chris J Arges (arges)
Changed in linux (Ubuntu Trusty):
assignee: nobody → Chris J Arges (arges)
status: Fix Released → In Progress
Changed in linux (Ubuntu):
status: Fix Released → In Progress
Changed in linux (Ubuntu Trusty):
importance: Undecided → High
Changed in linux (Ubuntu Utopic):
importance: Undecided → High
Changed in linux-lts-trusty (Ubuntu Precise):
importance: Undecided → High
Chris J Arges (arges) wrote :

* should be uninstalled when testing is over.

Mathew Hodson (mhodson)
tags: added: kernel-bug-exists-upstream-v3.17-rc1
Andy Whitcroft (apw)
Changed in linux (Ubuntu Utopic):
status: In Progress → Fix Committed
Changed in linux-lts-trusty (Ubuntu Precise):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Chris J Arges (arges)
Changed in linux (Ubuntu):
assignee: Chris J Arges (arges) → nobody
Rolf Leggewie (r0lf) wrote :

Utopic has reached the end of its life and is no longer receiving any updates. Marking the Utopic task for this ticket as "Won't Fix".

Changed in linux (Ubuntu Utopic):
status: Fix Committed → Won't Fix
Chris J Arges (arges)
Changed in linux (Ubuntu Trusty):
assignee: Chris J Arges (arges) → nobody
Changed in linux (Ubuntu Utopic):
assignee: Chris J Arges (arges) → nobody
Changed in linux-lts-trusty (Ubuntu Precise):
assignee: Chris J Arges (arges) → nobody
Steve Langasek (vorlon) wrote :

Precise Pangolin has reached end of life, so this bug will not be fixed for that release.

Changed in linux-lts-trusty (Ubuntu Precise):
status: Fix Committed → Won't Fix
This report contains Public Security information  
Everyone can see this security related information.
