Silent data corruption in Linux kernel 4.15

Bug #1796542 reported by Vasil Kolev on 2018-10-07
24
This bug affects 3 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Undecided
Unassigned
Bionic
High
Colin Ian King

Bug Description

== SRU Justification [BIONIC] ==

A silent data corruption was introduced in v4.10-rc1 with commit
72ecad22d9f198aafee64218512e02ffa7818671 and was fixed in v4.18-rc7
with commit 17d51b10d7773e4618bcac64648f30f12d4078fb. It affects
users of O_DIRECT, in our case a KVM virtual machine with drives
which use qemu's "cache=none" option.

== Fix ==

Upstream commits:

0aa69fd32a5f766e997ca8ab4723c5a1146efa8b
  block: add a lower-level bio_add_page interface

b403ea2404889e1227812fa9657667a1deb9c694
  block: bio_iov_iter_get_pages: fix size of last iovec

9362dd1109f87a9d0a798fbc890cb339c171ed35
  blkdev: __blkdev_direct_IO_simple: fix leak in error case

17d51b10d7773e4618bcac64648f30f12d4078fb
  block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs

The first 3 patches are required for a clean application of the final
patch that actually addresses the problem with a fix to this known
issue.

== Regression Potential ==

This touches the block layer, so there is risk potential in data
corruption. The fixes have several weeks in the upstream kernel and
so far, I see no subsequent fixes required.

== Test Case ==

Build the program listed below [1]
kudos to Jan Kara, and run with:

dd if=/dev/zero if=loop.img bs=1M count=2048
sudo losetup /dev/loop0 loop.img

./blkdev-dio-test /dev/loop0 0 &
./blkdev-dio-test /dev/loop0 2048 &

Without the fix, ones lost writes fairly soon. Without the fix, this
runs without any losy write messages.

blkdev-dio-test.c:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <sys/uio.h>

#define PAGE_SIZE 4096
#define SECT_SIZE 512
#define BUF_OFF (2*SECT_SIZE)

int main(int argc, char **argv)
{
 int fd = open(argv[1], O_RDWR | O_DIRECT);
 int ret;
 char *buf;
 loff_t off;
 struct iovec iov[2];
 unsigned int seq;

 if (fd < 0) {
  perror("open");
  return 1;
 }

 off = strtol(argv[2], NULL, 10);

 buf = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

 iov[0].iov_base = buf;
 iov[0].iov_len = SECT_SIZE;
 iov[1].iov_base = buf + BUF_OFF;
 iov[1].iov_len = SECT_SIZE;

 seq = 0;
 memset(buf, 0, PAGE_SIZE);
 while (1) {
  *(unsigned int *)buf = seq;
  *(unsigned int *)(buf + BUF_OFF) = seq;
  ret = pwritev(fd, iov, 2, off);
  if (ret < 0) {
   perror("pwritev");
   return 1;
  }
  if (ret != 2*SECT_SIZE) {
   fprintf(stderr, "Short pwritev: %d\n", ret);
   return 1;
  }
  ret = pread(fd, buf, PAGE_SIZE, off);
  if (ret < 0) {
   perror("pread");
   return 1;
  }
  if (ret != PAGE_SIZE) {
   fprintf(stderr, "Short read: %d\n", ret);
   return 1;
  }
  if (*(unsigned int *)buf != seq ||
      *(unsigned int *)(buf + SECT_SIZE) != seq) {
   printf("Lost write %u: %u %u\n", seq, *(unsigned int *)buf, *(unsigned int *)(buf + SECT_SIZE));
   return 1;
  }
  seq++;
 }

 return 0;
}

References:
[1] https://www.spinics.net/lists/linux-block/msg28507.html

====================

TLDR: commit 72ecad22d9f198aafee64218512e02ffa7818671 (in v4.10) introduced silent data corruption for O_DIRECT uses, it's fixed in 17d51b10d7773e4618bcac64648f30f12d4078fb (in v4.18)

A silent data corruption was introduced in v4.10-rc1 with commit 72ecad22d9f198aafee64218512e02ffa7818671 and was fixed in v4.18-rc7 with commit 17d51b10d7773e4618bcac64648f30f12d4078fb. It affects users of O_DIRECT, in our case a KVM virtual machine with drives which use qemu's "cache=none" option.

This is the commit which fixes the issue:
---------------------
commit 17d51b10d7773e4618bcac64648f30f12d4078fb
Author: Martin Wilck <email address hidden>
Date: Wed Jul 25 23:15:09 2018 +0200

    block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs

    bio_iov_iter_get_pages() currently only adds pages for the next non-zero
    segment from the iov_iter to the bio. That's suboptimal for callers,
    which typically try to pin as many pages as fit into the bio. This patch
    converts the current bio_iov_iter_get_pages() into a static helper, and
    introduces a new helper that allocates as many pages as

     1) fit into the bio,
     2) are present in the iov_iter,
     3) and can be pinned by MM.

    Error is returned only if zero pages could be pinned. Because of 3), a
    zero return value doesn't necessarily mean all pages have been pinned.
    Callers that have to pin every page in the iov_iter must still call this
    function in a loop (this is currently the case).

    This change matters most for __blkdev_direct_IO_simple(), which calls
    bio_iov_iter_get_pages() only once. If it obtains less pages than
    requested, it returns a "short write" or "short read", and
    __generic_file_write_iter() falls back to buffered writes, which may
    lead to data corruption.

    Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
    Reviewed-by: Christoph Hellwig <email address hidden>
    Signed-off-by: Martin Wilck <email address hidden>
    Signed-off-by: Jens Axboe <email address hidden>
-------------------------

Since there were a lot of components involved in the initial report to us (xfs, guest kernel, guest virtio drivers, qemu, host kernel, storage system), we had to isolate it. This is the commit which fixes the data corruption bug. We created a reliable reproduction and tested with the patch and without the patch. We also created a version of the kernel which prints when the data-corrupting path in the kernel is triggered.

> 1) The release of Ubuntu you are using, via 'lsb_release -rd' or System -> About Ubuntu

# lsb_release -rd
Description: Ubuntu 18.04.1 LTS
Release: 18.04

> 2) The version of the package you are using, via 'apt-cache policy pkgname' or by checking in Software Center

# apt-cache policy linux-image-4.15.0-36-generic
linux-image-4.15.0-36-generic:
 Installed: 4.15.0-36.39
 Candidate: 4.15.0-36.39
 Version table:
*** 4.15.0-36.39 500
    500 http://archive.ubuntu.com/ubuntu bionic-security/main amd64 Packages
    500 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
    100 /var/lib/dpkg/status

> 3) What you expected to happen

We ran a fio random write workload over 8x 512MB files over XFS in guest OS, over qemu/kvm, over kernel 4.15.0-36.39-generic.
qemu-system was configured with cache=none, which means Direct IO. This is a very common configuration.
qemu-system was with aio=threads -- the default.

We were expecting no data corruption.

> 4) What happened instead

The guest filesystem was corrupted.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed (Ubuntu):
status: New → Confirmed
Vasil Kolev (krokodilerian) wrote :

Thank you for looking at this bug report.

We have some additional information.

qemu/kvm with cache=none (common) and aio=threads (default) is required to trigger the bug. aio=native (another common option) does not trigger it in our testing.

We applied the following patches on top of Ubuntu-4.15.0-36.39.

commit 0aa69fd32a5f766e997ca8ab4723c5a1146efa8b - block: add a lower-level bio_add_page interface
commit b403ea2404889e1227812fa9657667a1deb9c694 - block: bio_iov_iter_get_pages: fix size of last iovec
commit 9362dd1109f87a9d0a798fbc890cb339c171ed35 - blkdev: __blkdev_direct_IO_simple: fix leak in error case
commit 17d51b10d7773e4618bcac64648f30f12d4078fb - block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs

The first one introduces a helper function, and the next three are the upstream patchset that fixes the issue.

Vasil Kolev (krokodilerian) wrote :

Some extra information: a tool to reproduce the problem is available in the linux-block mailing list: https://www.spinics.net/lists/linux-block/msg28507.html

Andy Whitcroft (apw) on 2018-10-09
affects: linux-signed (Ubuntu) → linux (Ubuntu)
Stefan Bader (smb) wrote :

According to bug information this is resolved in 4.18 and therefore in Cosmic/18.10.

Changed in linux (Ubuntu Bionic):
importance: Undecided → High
status: New → Triaged
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Stefan Bader (smb) wrote :

Thanks for the report, we are investigating the issue and fix.

Changed in linux (Ubuntu Bionic):
assignee: nobody → Colin Ian King (colin-king)
no longer affects: linux-signed-hwe (Ubuntu)
no longer affects: linux-signed-hwe (Ubuntu Bionic)
Colin Ian King (colin-king) wrote :

OK, I've explored the fixes, it addresses the issue and I can't reproduce the problem with the fix testing with the provided reproducer as referenced in comment #3

description: updated
description: updated
summary: - Silent corruption in Linux kernel 4.15
+ Silent data corruption in Linux kernel 4.15
Changed in linux (Ubuntu Bionic):
status: Triaged → Fix Released
Stefan Bader (smb) wrote :

Please do not change the status. It will get updated once we have more details.

Changed in linux (Ubuntu Bionic):
status: Fix Released → Triaged
Changed in linux (Ubuntu Bionic):
status: Triaged → In Progress
Stefan Bader (smb) on 2018-10-10
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Vasil Kolev (krokodilerian) wrote :

I have verified that the issue doesn't get reproduced with kernel 4.15.0-38-generic from bionic-proposed.

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :
Download full text (20.0 KiB)

This bug was fixed in the package linux - 4.15.0-38.41

---------------
linux (4.15.0-38.41) bionic; urgency=medium

  * linux: 4.15.0-38.41 -proposed tracker (LP: #1797061)

  * Silent data corruption in Linux kernel 4.15 (LP: #1796542)
    - block: add a lower-level bio_add_page interface
    - block: bio_iov_iter_get_pages: fix size of last iovec
    - blkdev: __blkdev_direct_IO_simple: fix leak in error case
    - block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs

linux (4.15.0-37.40) bionic; urgency=medium

  * linux: 4.15.0-37.40 -proposed tracker (LP: #1795564)

  * hns3: enable ethtool rx-vlan-filter on supported hw (LP: #1793394)
    - net: hns3: Add vlan filter setting by ethtool command -K

  * hns3: Modifying channel parameters will reset ring parameters back to
    defaults (LP: #1793404)
    - net: hns3: Fix desc num set to default when setting channel

  * hisi_sas: Add SATA FIX check for v3 hw (LP: #1794151)
    - scsi: hisi_sas: Add SATA FIS check for v3 hw

  * Fix potential corruption using SAS controller on HiSilicon arm64 boards
    (LP: #1794156)
    - scsi: hisi_sas: add memory barrier in task delivery function

  * hisi_sas: Reduce unnecessary spin lock contention (LP: #1794165)
    - scsi: hisi_sas: Tidy hisi_sas_task_prep()

  * Add functional level reset support for the SAS controller on HiSilicon D06
    systems (LP: #1794166)
    - scsi: hisi_sas: tidy host controller reset function a bit
    - scsi: hisi_sas: relocate some common code for v3 hw
    - scsi: hisi_sas: Implement handlers of PCIe FLR for v3 hw

  * HiSilicon SAS controller doesn't recover from PHY STP link timeout
    (LP: #1794172)
    - scsi: hisi_sas: tidy channel interrupt handler for v3 hw
    - scsi: hisi_sas: Fix the failure of recovering PHY from STP link timeout

  * getxattr: always handle namespaced attributes (LP: #1789746)
    - getxattr: use correct xattr length

  * Fix unusable NVIDIA GPU after S3 (LP: #1793338)
    - PCI: Reprogram bridge prefetch registers on resume

  * Fails to boot under Xen PV: BUG: unable to handle kernel paging request at
    edc21fd9 (LP: #1789118)
    - x86/EISA: Don't probe EISA bus for Xen PV guests

  * qeth: use vzalloc for QUERY OAT buffer (LP: #1793086)
    - s390/qeth: use vzalloc for QUERY OAT buffer

  * SRU: Enable middle button of touchpad on ThinkPad P72 (LP: #1793463)
    - Input: elantech - enable middle button of touchpad on ThinkPad P72

  * Dell new AIO requires a new uart backlight driver (LP: #1727235)
    - SAUCE: platform/x86: dell-uart-backlight: new backlight driver for DELL AIO
    - updateconfigs for Dell UART backlight driver

  * [Ubuntu] s390/crypto: Fix return code checking in cbc_paes_crypt.
    (LP: #1794294)
    - s390/crypto: Fix return code checking in cbc_paes_crypt()

  * hns3: Retrieve RoCE MSI-X config from firmware (LP: #1793221)
    - net: hns3: Fix MSIX allocation issue for VF
    - net: hns3: Refine the MSIX allocation for PF

  * net: hns: Avoid hang when link is changed while handling packets
    (LP: #1792209)
    - net: hns: add the code for cleaning pkt in chip
    - net: hns: add netif_carrier_off before change speed and duplex

  * Page leaki...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers