TLDR: commit 72ecad22d9f198aafee64218512e02ffa7818671 (in v4.10) introduced silent data corruption for O_DIRECT users; it is fixed by commit 17d51b10d7773e4618bcac64648f30f12d4078fb (in v4.18).
A silent data corruption was introduced in v4.10-rc1 with commit 72ecad22d9f198aafee64218512e02ffa7818671 and was fixed in v4.18-rc7 with commit 17d51b10d7773e4618bcac64648f30f12d4078fb. It affects users of O_DIRECT, in our case a KVM virtual machine with drives which use qemu's "cache=none" option.
This is the commit which fixes the issue:
---------------------
commit 17d51b10d7773e4618bcac64648f30f12d4078fb
Author: Martin Wilck <email address hidden>
Date: Wed Jul 25 23:15:09 2018 +0200
block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
bio_iov_iter_get_pages() currently only adds pages for the next non-zero
segment from the iov_iter to the bio. That's suboptimal for callers,
which typically try to pin as many pages as fit into the bio. This patch
converts the current bio_iov_iter_get_pages() into a static helper, and
introduces a new helper that allocates as many pages as
1) fit into the bio,
2) are present in the iov_iter,
3) and can be pinned by MM.
Error is returned only if zero pages could be pinned. Because of 3), a
zero return value doesn't necessarily mean all pages have been pinned.
Callers that have to pin every page in the iov_iter must still call this
function in a loop (this is currently the case).
This change matters most for __blkdev_direct_IO_simple(), which calls
bio_iov_iter_get_pages() only once. If it obtains less pages than
requested, it returns a "short write" or "short read", and
__generic_file_write_iter() falls back to buffered writes, which may
lead to data corruption.
Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
Reviewed-by: Christoph Hellwig <email address hidden>
Signed-off-by: Martin Wilck <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
-------------------------
Since a lot of components were involved in the initial report to us (xfs, guest kernel, guest virtio drivers, qemu, host kernel, storage system), we had to isolate the problem. The commit above is the one that fixes the data corruption bug. We created a reliable reproduction and tested both with and without the patch. We also created a version of the kernel which prints when the data-corrupting path in the kernel is triggered.
> 1) The release of Ubuntu you are using, via 'lsb_release -rd' or System -> About Ubuntu
# lsb_release -rd
Description: Ubuntu 18.04.1 LTS
Release: 18.04
> 2) The version of the package you are using, via 'apt-cache policy pkgname' or by checking in Software Center
# apt-cache policy linux-image-4.15.0-36-generic
linux-image-4.15.0-36-generic:
  Installed: 4.15.0-36.39
  Candidate: 4.15.0-36.39
  Version table:
 *** 4.15.0-36.39 500
        500 http://archive.ubuntu.com/ubuntu bionic-security/main amd64 Packages
        500 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
        100 /var/lib/dpkg/status
> 3) What you expected to happen
We ran a fio random write workload over 8x 512MB files over XFS in the guest OS, over qemu/kvm, over kernel 4.15.0-36.39-generic.
qemu-system was configured with cache=none, which means direct I/O (O_DIRECT) on the host. This is a very common configuration.
qemu-system was run with aio=threads -- the default.
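As a rough reproduction sketch (the image path, mount point, and most fio parameters below are assumptions; only cache=none, aio=threads, XFS, and the 8x 512MB file layout come from the description above):

```shell
# Host: guest disk attached with cache=none (O_DIRECT) and aio=threads
qemu-system-x86_64 ... \
    -drive file=/path/to/guest.img,if=virtio,cache=none,aio=threads

# Guest: fio random write workload over 8x 512MB files on an XFS mount
fio --name=randwrite --rw=randwrite --directory=/mnt/xfs \
    --nrfiles=8 --filesize=512m --bs=4k
```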
We were expecting no data corruption.
> 4) What happened instead
The guest filesystem was corrupted.