Data corruption with hio driver

Bug #1701316 reported by Peter Sabaini on 2017-06-29
This bug affects 6 people
Affects            Importance  Assigned to
linux (Ubuntu)     High        Seth Forshee
  Xenial           High        Seth Forshee
  Yakkety          High        Seth Forshee
  Zesty            High        Seth Forshee
  Artful           High        Seth Forshee

Bug Description

Impact: Data corruption is seen when using the hio driver with 4.10 and later kernels.

Fix: Patch to fix incorrect use of enumerated values as bitmasks.

Test case: See below.

Regression potential: Very low. Changes are simple and Obviously Correct (TM), and they only affect the hio driver.
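
The class of bug named in the "Fix" line can be illustrated without the driver source. The sketch below is not the hio patch. It uses made-up operation codes (OP_READ, OP_WRITE, OP_FLUSH, OP_DISCARD are hypothetical names and values chosen for illustration) to show why testing a consecutively numbered enumerated value with a bitwise AND misclassifies requests, while an equality comparison does not.

```bash
#!/usr/bin/env bash
# Hypothetical operation codes, numbered consecutively like an enum.
# These names and values are illustrative only, not taken from the kernel.
OP_READ=0
OP_WRITE=1
OP_FLUSH=2
OP_DISCARD=3

# Buggy check: treats the enumerated value as if it were a flag bit.
is_discard_bitmask() {
    (( $1 & OP_DISCARD ))
}

# Correct check: compares the operation code for equality.
is_discard_equal() {
    (( $1 == OP_DISCARD ))
}

# Show how each check classifies every operation code.
for op in OP_READ OP_WRITE OP_FLUSH OP_DISCARD; do
    val=${!op}
    is_discard_bitmask "$val" && mask=yes || mask=no
    is_discard_equal "$val" && eq=yes || eq=no
    printf '%-10s bitmask=%s equality=%s\n' "$op" "$mask" "$eq"
done
```

With these values the bitmask check also reports OP_WRITE and OP_FLUSH as discards, because 1 and 2 each share a bit with 3. Mistakes of this kind can cause a request to be handled as the wrong operation.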

---

We are seeing data corruption when using the hio driver with kernel 4.10.0:

# uname -a
Linux arbok 4.10.0-26-generic #30~16.04.1-Ubuntu SMP Tue Jun 27 09:40:14 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Creating an XFS filesystem fails:

root@arbok:~# mkfs.xfs /dev/hioa
meta-data=/dev/hioa              isize=512    agcount=4, agsize=48835584 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0
data     =                       bsize=4096   blocks=195342336, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=95382, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
bad magic number
bad magic number
Metadata corruption detected at xfs_sb block 0x0/0x200
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x200

The drive appears to be healthy. Firmware has been upgraded to ver 656:

root@arbok:~# hio_info -d /dev/hioa
hioa Serial number: 022XWV10G2000325
        Size(GB): 800
        Max size(GB): 800
        Hardware version: 1.0
        Firmware version: 656
        Driver version: 2.1.0.28
        Work mode: MLC
        Run time (sec.): 8910490
        Total read(MB): 8499
        Total write(MB): 0
        Lifetime remaining: 99.844%
        Max bad block rate: 0.167%
        Health: OK
        Comment: NA

There are no relevant entries about read/write errors in dmesg.

Also, simply copying 8 GiB of random data to the device and reading it back gives a hash mismatch:
root@arbok:~# dd if=/dev/urandom of=test.dat bs=1G count=8 iflag=fullblock
8+0 records in
8+0 records out
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 85.6076 s, 100 MB/s
root@arbok:~# dd if=test.dat of=/dev/hioa bs=1G count=8 iflag=fullblock
8+0 records in
8+0 records out
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 10.6034 s, 810 MB/s
root@arbok:~# dd if=/dev/hioa of=read-back.dat bs=1G count=8 iflag=fullblock
8+0 records in
8+0 records out
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 66.1872 s, 130 MB/s
root@arbok:~# sha256sum test.dat read-back.dat
6376d245a07c42c990589a3c17c44e63d826d1cb583fc5a065deff9dae69fd3a test.dat
ebfb4ef19ae410f190327b5ebd312711263bc7579970e87d9c1e2d84e06b3c25 read-back.dat
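
A hash mismatch only says that the two files differ, not where or by how much. As a possible follow-up, the sketch below uses `cmp -l` to report the first differing byte offset and the number of differing bytes. The function names `first_mismatch` and `count_mismatches` are made up for this example.

```bash
#!/usr/bin/env bash
# Print the 1-based offset of the first differing byte, or nothing if the
# files are identical over their common length.
first_mismatch() {
    cmp -l "$1" "$2" 2>/dev/null | awk 'NR == 1 { print $1; exit }'
}

# Print how many byte positions differ (0 if none).
count_mismatches() {
    cmp -l "$1" "$2" 2>/dev/null | awk 'END { print NR }'
}
```

For the files above this would be called as `first_mismatch test.dat read-back.dat`. On multi-gigabyte files with many differences `count_mismatches` can take a while, since `cmp -l` emits one line per differing byte.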

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.12 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-key
Peter Sabaini (peter-sabaini) wrote :

Indeed, this issue started after upgrading to 4.10.0-26-generic. We had been using 4.8.0-34-generic previously and did not see the issue there.

The hio driver is not present upstream; it was included in xenial via Bug #1603483.

Also, we've had consistency issues with this driver before (cf. Bug #1646643), but as far as I can see the patches from that bug are included in 4.10.0-26-generic.

Brad Figg (brad-figg) on 2017-07-06
Changed in linux (Ubuntu):
assignee: nobody → Seth Forshee (sforshee)
Seth Forshee (sforshee) on 2017-07-07
Changed in linux (Ubuntu Xenial):
assignee: nobody → Seth Forshee (sforshee)
importance: Undecided → High
status: New → In Progress
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Seth Forshee (sforshee)
status: New → In Progress
importance: Undecided → High
Changed in linux (Ubuntu Zesty):
assignee: nobody → Seth Forshee (sforshee)
importance: Undecided → High
status: New → In Progress
Changed in linux (Ubuntu Artful):
status: Confirmed → In Progress
Seth Forshee (sforshee) wrote :

We've received a patch from Huawei which is said to fix this issue. The patch is an obviously correct fix, so I've applied it to artful. I'll provide a test build shortly for zesty.

I will also backport it to yakkety, since there's potential for bugs there as well. Xenial is unaffected because the code paths in question are only relevant to 4.8 and later (I mistakenly nominated this for xenial and will mark that invalid).

Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Xenial):
status: In Progress → Invalid
Seth Forshee (sforshee) wrote :

Here's the test build, please let me know if it fixes the data corruption. Thanks!

http://people.canonical.com/~sforshee/lp1701316/

tags: removed: kernel-key
Jill Rouleau (jillrouleau) wrote :

So far we've tested this build on one machine, running the comparison below more than 30 times, and it looks good.

# i=0; while :; do i=$(($i+1)); echo -n "." ; dd if=/dev/urandom of=test.dat bs=1G count=8 iflag=fullblock ; dd if=test.dat of=/dev/hioa bs=1G count=8 iflag=fullblock ; dd if=/dev/hioa of=read-back.dat bs=1G count=8 ; cmp=$(md5sum test.dat read-back.dat|awk '{print $1}'|sort -u|wc -l); if [ "$cmp" -ne 1 ]; then echo "MISMATCH" ; md5sum test.dat read-back.dat ; echo $i; return 1; fi; done
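
For readability, the same write/read-back comparison can be sketched as a bounded function. This is an illustrative variant, not the command that was actually run: the function name `roundtrip_loop` and its parameters are made up here, it compares with `cmp -s` instead of counting distinct md5 sums, and it relies on GNU dd options (`iflag=fullblock`, `conv=fsync`, `status=none`). The target may be a block device, in which case its contents are overwritten, or an ordinary scratch file for a dry run.

```bash
#!/usr/bin/env bash
# Write random data to TARGET, read it back, and compare, ROUNDS times.
# Usage: roundtrip_loop TARGET SIZE_MB ROUNDS
# WARNING: TARGET is overwritten. Use a scratch file unless the device
# really is free to be destroyed.
roundtrip_loop() {
    local target=$1 size_mb=$2 rounds=$3
    local src back i
    src=$(mktemp) || return 2
    back=$(mktemp) || { rm -f "$src"; return 2; }
    for (( i = 1; i <= rounds; i++ )); do
        # Fresh random payload for every round.
        dd if=/dev/urandom of="$src" bs=1M count="$size_mb" iflag=fullblock status=none
        # Write the payload and flush it before reading back.
        dd if="$src" of="$target" bs=1M count="$size_mb" conv=fsync status=none
        dd if="$target" of="$back" bs=1M count="$size_mb" iflag=fullblock status=none
        if ! cmp -s "$src" "$back"; then
            echo "MISMATCH in round $i"
            rm -f "$src" "$back"
            return 1
        fi
    done
    rm -f "$src" "$back"
    echo "OK after $rounds rounds"
}
```

A run equivalent in size to the one above would be `roundtrip_loop /dev/hioa 8192 30`. Note that buffered reads may be served from the page cache rather than the device, so a passing run is weaker evidence when the payload is small relative to RAM.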

Seth Forshee (sforshee) on 2017-07-11
description: updated
Junien Fridrick (axino) wrote :

We tested both the Huawei driver 2.1.0.40 plus the patch on 4.10.0-26-generic (from the repo) and 4.10.0-25.29+lp1701316v201707070815 (which includes 2.1.0.28). Both work, at least for mkfs.xfs and the loop jillrouleau mentioned above.

Changed in linux (Ubuntu Zesty):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Yakkety):
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-zesty' to 'verification-done-zesty'. If the problem still exists, change the tag 'verification-needed-zesty' to 'verification-failed-zesty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-zesty

This bug was nominated against a series that is no longer supported, i.e. yakkety. The bug task representing the yakkety nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Won't Fix
Peter Sabaini (peter-sabaini) wrote :

Test: writing 32x10G of random data and comparing SHA hashes.

Tested with 4.10.0-29-generic and 4.11.0-12-generic from xenial-proposed. No diffs, no I/O errors.

tags: added: verification-done-zesty
removed: verification-needed-zesty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.10.0-30.34

---------------
linux (4.10.0-30.34) zesty; urgency=low

  * CVE-2017-7533
    - dentry name snapshots

linux (4.10.0-29.33) zesty; urgency=low

  * linux: 4.10.0-29.33 -proposed tracker (LP: #1704961)

  * Opal and POWER9 DD2 (LP: #1702159)
    - powerpc/powernv: Tell OPAL about our MMU mode on POWER9
    - powerpc/powernv: Fix boot on Power8 bare metal due to opal_configure_cores()

  * CVE-2017-1000364
    - mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stack
    - mm/mmap.c: expand_downwards: don't require the gap if !vm_prev

  * [Xenial] nvme: Quirks for PM1725 controllers (LP: #1704435)
    - nvme: Quirks for PM1725 controllers

  * hns: under heavy load, NIC may fail and require reboot (LP: #1704146)
    - net: hns: Bugfix for Tx timeout handling in hns driver

  * New ACPI identifiers for ThunderX SMMU (LP: #1703437)
    - iommu/arm-smmu: Plumb in new ACPI identifiers

  * CVE-2017-7482
    - rxrpc: Fix several cases where a padded len isn't checked in ticket decode

  * CVE-2017-1000365
    - fs/exec.c: account for argv/envp pointers

  * CVE-2017-10810
    - drm/virtio: don't leak bo on drm_gem_object_init failure

  * Data corruption with hio driver (LP: #1701316)
    - SAUCE: hio: Fix incorrect use of enum req_opf values

  * arm64: fix crash reading /proc/kcore (LP: #1702749)
    - fs/proc: kcore: use kcore_list type to check for vmalloc/module address
    - arm64: mm: select CONFIG_ARCH_PROC_KCORE_TEXT

  * cxlflash update request in the Xenial SRU stream (LP: #1702521)
    - scsi: cxlflash: Refactor context reset to share reset logic
    - scsi: cxlflash: Support SQ Command Mode
    - scsi: cxlflash: Cleanup prints
    - scsi: cxlflash: Cancel scheduled workers before stopping AFU
    - scsi: cxlflash: Enable PCI device ID for future IBM CXL Flash AFU
    - scsi: cxlflash: Separate RRQ processing from the RRQ interrupt handler
    - scsi: cxlflash: Serialize RRQ access and support offlevel processing
    - scsi: cxlflash: Implement IRQ polling for RRQ processing
    - scsi: cxlflash: Update sysfs helper routines to pass config structure
    - scsi: cxlflash: Support dynamic number of FC ports
    - scsi: cxlflash: Remove port configuration assumptions
    - scsi: cxlflash: Hide FC internals behind common access routine
    - scsi: cxlflash: SISlite updates to support 4 ports
    - scsi: cxlflash: Support up to 4 ports
    - scsi: cxlflash: Fence EEH during probe
    - scsi: cxlflash: Remove unnecessary DMA mapping
    - scsi: cxlflash: Fix power-of-two validations
    - scsi: cxlflash: Fix warnings/errors
    - scsi: cxlflash: Improve asynchronous interrupt processing
    - scsi: cxlflash: Support multiple hardware queues
    - scsi: cxlflash: Add hardware queues attribute
    - scsi: cxlflash: Introduce hardware queue steering
    - cxl: Enable PCI device IDs for future IBM CXL adapters
    - scsi: cxlflash: Select IRQ_POLL
    - scsi: cxlflash: Combine the send queue locks
    - scsi: cxlflash: Update cxlflash_afu_sync() to return errno
    - scsi: cxlflash: Reset hardware queue context via specified register
    - scsi: cxlflash: Schedule asynchronous res...


Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.11.0-13.19

---------------
linux (4.11.0-13.19) artful; urgency=low

  * CVE-2017-7533
    - dentry name snapshots

linux (4.11.0-12.18) artful; urgency=low

  * linux: 4.11.0-12.18 -proposed tracker (LP: #1707635)
    - no change rebuild to pick up the new binutils.

  * Adt tests of src:linux time out often on armhf lxc containers (LP: #1705495)
    - [Packaging] tests -- reduce rebuild test to one flavour
    - [Packaging] tests -- reduce rebuild test to one flavour -- use filter

  * [ARM64] config EDAC_GHES=y depends on EDAC_MM_EDAC=y (LP: #1706141)
    - [Config] set EDAC_MM_EDAC=y for ARM64

  * [Hyper-V] hv_netvsc: Exclude non-TCP port numbers from vRSS hashing
    (LP: #1690174)
    - hv_netvsc: Exclude non-TCP port numbers from vRSS hashing

  * ath10k doesn't report full RSSI information (LP: #1706531)
    - ath10k: add per chain RSSI reporting

  * ideapad_laptop don't support v310-14isk (LP: #1705378)
    - platform/x86: ideapad-laptop: Add several models to no_hw_rfkill

  * Ubuntu 16.04.3: Qemu fails on P9 (LP: #1686019)
    - KVM: PPC: Pass kvm* to kvmppc_find_table()
    - KVM: PPC: Use preregistered memory API to access TCE list
    - KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
    - powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange()
    - powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
    - powerpc/vfio_spapr_tce: Add reference counting to iommu_table
    - powerpc/mmu: Add real mode support for IOMMU preregistered memory
    - KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
    - KVM: PPC: Book3S HV: Add radix checks in real-mode hypercall handlers

  * hns: ethtool selftest crashes system (LP: #1705712)
    - net/hns:bugfix of ethtool -t phy self_test

  * ThunderX: soft lockup on 4.8+ kernels when running qemu-efi with vhost=on
    (LP: #1673564)
    - KVM: arm/arm64: vgic-v3: Use PREbits to infer the number of ICH_APxRn_EL2
      registers
    - KVM: arm/arm64: vgic-v3: Fix nr_pre_bits bitfield extraction
    - arm64: Add a facility to turn an ESR syndrome into a sysreg encoding
    - KVM: arm/arm64: vgic-v3: Add accessors for the ICH_APxRn_EL2 registers
    - KVM: arm64: Make kvm_condition_valid32() accessible from EL2
    - KVM: arm64: vgic-v3: Add hook to handle guest GICv3 sysreg accesses at EL2
    - KVM: arm64: vgic-v3: Add ICV_BPR1_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_IGRPEN1_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_IAR1_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_EOIR1_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_AP1Rn_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_HPPIR1_EL1 handler
    - KVM: arm64: vgic-v3: Enable trapping of Group-1 system registers
    - KVM: arm64: Enable GICv3 Group-1 sysreg trapping via command-line
    - KVM: arm64: vgic-v3: Add ICV_BPR0_EL1 handler
    - KVM: arm64: vgic-v3: Add ICV_IGNREN0_EL1 handler
    - KVM: arm64: vgic-v3: Add misc Group-0 handlers
    - KVM: arm64: vgic-v3: Enable trapping of Group-0 system registers
    - KVM: arm64: Enable GICv3 Group-0 sysreg trapping via command-line
    - arm64: Add MIDR values for Cavium cn83XX SoCs
    - arm64: Add wor...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released