amd_iommu possible data corruption

Bug #1823037 reported by Jeff Lane 
This bug affects 2 people
Affects          Status         Importance   Assigned to   Milestone
linux (Ubuntu)   Fix Released   Medium       Jeff Lane
  Bionic         Fix Released   Medium       Jeff Lane
  Cosmic         Invalid        Medium       Jeff Lane
  Disco          Fix Released   Medium       Jeff Lane

Bug Description

[Impact]
If a device has an exclusion range specified in the IVRS
table, this region needs to be reserved in the iova-domain
of that device. This hasn't happened until now and can cause
data corruption on data transferred with these devices.
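
As a rough illustration of why the reservation matters, the toy model below shows an I/O virtual address allocator that refuses to hand out addresses inside a reserved range. This is only a sketch, not the kernel's iova code: `struct resv_range`, `overlaps_reserved()` and `alloc_iova()` are hypothetical names invented for this example, and the bump-allocation strategy is an assumption chosen for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a reserved region [start, start + length). */
struct resv_range {
    uint64_t start;
    uint64_t length;
};

/* Returns true if [iova, iova + size) shares at least one byte with the
 * reserved region. Empty ranges never overlap. */
static bool overlaps_reserved(const struct resv_range *r,
                              uint64_t iova, uint64_t size)
{
    if (r->length == 0 || size == 0)
        return false;

    uint64_t r_last = r->start + r->length - 1; /* last reserved byte */
    uint64_t a_last = iova + size - 1;          /* last allocated byte */

    return iova <= r_last && r->start <= a_last;
}

/* Toy bump allocator: hands out the address at *next, unless that would
 * touch the reserved region, in which case it skips past the region. */
static uint64_t alloc_iova(const struct resv_range *r,
                           uint64_t *next, uint64_t size)
{
    uint64_t iova = *next;

    if (overlaps_reserved(r, iova, size))
        iova = r->start + r->length; /* first byte after the region */

    *next = iova + size;
    return iova;
}
```

Without the overlap check, the allocator would hand a device an address inside the exclusion range, where accesses are not translated, which is one way the corruption described above can arise.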

The exclusion range limit register needs to contain the
base-address of the last page that is part of the range, as
bits 0-11 of this register are treated as 0xfff by the
hardware for comparisons.

So correctly set the exclusion range in the hardware to the
last page which is _in_ the range.
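
The limit calculation can be sketched as follows. This is a minimal standalone example, assuming 4 KiB pages; `exclusion_limit()` and `effective_last_byte()` are hypothetical helpers written for illustration and are not the names used in the driver.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Hypothetical helper: base address of the last page that lies inside
 * the exclusion range [start, start + length). length must be non-zero. */
static uint64_t exclusion_limit(uint64_t start, uint64_t length)
{
    uint64_t base = start & PAGE_MASK;

    /* Subtract 1 so we land on the last byte in the range before
     * rounding down to its page base. */
    return (base + length - 1) & PAGE_MASK;
}

/* Last byte the hardware effectively compares against, given that
 * bits 0-11 of the limit are treated as 0xfff. */
static uint64_t effective_last_byte(uint64_t limit)
{
    return limit | (PAGE_SIZE - 1);
}
```

For a range starting at 0x10000 with length 0x2000, the limit is 0x11000 and the hardware treats the range as ending at 0x11fff. Rounding `start + length` without the `- 1` would instead give 0x12000, which names the first page past the range and would exclude one page too many.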

[Fixes]
Cherry pick the following from Mainline
fd3b3448cf5adc2a2f09b70eaad03c27fe79e7a6 iommu/amd: Reserve exclusion range in iova-domain
3c677d206210f53a4be972211066c0f1cd47fe12 iommu/amd: Set exclusion range correctly

These can be picked from my branches here:
Bionic:
https://code.launchpad.net/~bladernr/ubuntu/+source/linux/+git/bionic 1823037-amd_iommu-cherrypick
b7abdb585d0ae4193add1f7038476cd8d6dd7f41 iommu/amd: Reserve exclusion range in iova-domain
ee4a87e6b7fe7e039c1f02c50c53da63e4270732 iommu/amd: Set exclusion range correctly

Cosmic:
https://code.launchpad.net/~bladernr/ubuntu/+source/linux/+git/cosmic 1823037-amd_iommu-cherrypick
0acc8c0e862b46b12c233abd76593420f4a20e0c iommu/amd: Reserve exclusion range in iova-domain
fbead5d71bfc2a49ffca8d181e2708fd616e9a7f iommu/amd: Set exclusion range correctly

Disco:
https://code.launchpad.net/~bladernr/ubuntu/+source/linux/+git/disco 1823037-amd_iommu-cherrypick
6ddc207ca0d442b413ea7f059937ba90e5139753 iommu/amd: Set exclusion range correctly

Note: Disco only required 3c677d206210f53a4be972211066c0f1cd47fe12 ("iommu/amd: Set exclusion range correctly"), as the first patch was already picked in LP: #1830934.

Not necessary for Eoan as these are already upstream in 5.2-rc1

[Regression Risk]
Low. This adds memory protection to the driver; it has already been applied upstream and was tested by Dell for 24 hours with no failures noted.

Jeff Lane  (bladernr) wrote :
no longer affects: linux (Ubuntu Xenial)
Changed in linux (Ubuntu Cosmic):
assignee: nobody → Jeff Lane (bladernr)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Jeff Lane (bladernr)
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Bionic):
status: New → Confirmed
Changed in linux (Ubuntu Cosmic):
status: New → Confirmed
Jeff Lane  (bladernr) wrote :

Test kernels are up here:
https://people.canonical.com/~jlane/testkernels/

Please test each and report back

Changed in linux (Ubuntu Cosmic):
status: Confirmed → In Progress
Changed in linux (Ubuntu Bionic):
status: Confirmed → In Progress
Michael Reed (mreed8855) wrote :

An additional patch has been identified that will be needed to solve this issue.

https://github.com/torvalds/linux/commit/3c677d206210f53a4be972211066c0f1cd47fe12#diff-083bf3d2f128b616e730b7f9f8fc65c4

Antti Tönkyrä (atonkyra) wrote :

Confirming that I was unable to reproduce the issue on linux-image-4.15.0-46-generic (4.15.0-46.49) from the test kernels repository with my test setup.

Previously I was able to cause filesystem corruption within 1 hour of testing under a test load (verified by btrfs checksums) across all disks on 2 different 32-core systems.

Jeff Lane  (bladernr) wrote :

So I'm a bit confused because I've been told that A: additional patches are necessary to solve the issue, and B: that the issue is solved without those additional patches.

When you did the test, were you using the older, failing firmware levels, or have you updated to the patched firmware that is also part of this fix?

Jerry Clement (jerry-clement) wrote :

Jeff, the second patch improves the fix against the amount of memory in the system. We still need both patches.

Michael Reed (mreed8855) wrote :

Hi Jerry,

The test kernels are located here and have "1" appended to them.

https://people.canonical.com/~jlane/testkernels/

Jerry Clement (jerry-clement) wrote :

All three test kernels have passed 24+ hour testing.

tags: added: verification-done-bionic
tags: added: verification-done-cosmic
removed: verification-done-bionic
tags: added: verification-done-bionic verification-done-disco
Jeff Lane  (bladernr)
description: updated
Jeff Lane  (bladernr)
description: updated
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
Changed in linux (Ubuntu Cosmic):
importance: Undecided → Medium
Jeff Lane  (bladernr)
description: updated
description: updated
Jeff Lane  (bladernr)
description: updated
Jeff Lane  (bladernr)
description: updated
Jeff Lane  (bladernr)
description: updated
description: updated
Jeff Lane  (bladernr)
description: updated
Jeff Lane  (bladernr) wrote :

SRU Request submitted

Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Cosmic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Disco):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-55.60

---------------
linux (4.15.0-55.60) bionic; urgency=medium

  * linux: 4.15.0-55.60 -proposed tracker (LP: #1834954)

  * Request backport of ceph commits into bionic (LP: #1834235)
    - ceph: use atomic_t for ceph_inode_info::i_shared_gen
    - ceph: define argument structure for handle_cap_grant
    - ceph: flush pending works before shutdown super
    - ceph: send cap releases more aggressively
    - ceph: single workqueue for inode related works
    - ceph: avoid dereferencing invalid pointer during cached readdir
    - ceph: quota: add initial infrastructure to support cephfs quotas
    - ceph: quota: support for ceph.quota.max_files
    - ceph: quota: don't allow cross-quota renames
    - ceph: fix root quota realm check
    - ceph: quota: support for ceph.quota.max_bytes
    - ceph: quota: update MDS when max_bytes is approaching
    - ceph: quota: add counter for snaprealms with quota
    - ceph: avoid iput_final() while holding mutex or in dispatch thread

  * QCA9377 isn't being recognized sometimes (LP: #1757218)
    - SAUCE: USB: Disable USB2 LPM at shutdown

  * hns: fix ICMP6 neighbor solicitation messages discard problem (LP: #1833140)
    - net: hns: fix ICMP6 neighbor solicitation messages discard problem
    - net: hns: fix unsigned comparison to less than zero

  * Fix occasional boot time crash in hns driver (LP: #1833138)
    - net: hns: Fix probabilistic memory overwrite when HNS driver initialized

  * use-after-free in hns_nic_net_xmit_hw (LP: #1833136)
    - net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw()

  * hns: attempt to restart autoneg when disabled should report error
    (LP: #1833147)
    - net: hns: Restart autoneg need return failed when autoneg off

  * systemd 237-3ubuntu10.14 ADT test failure on Bionic ppc64el (test-seccomp)
    (LP: #1821625)
    - powerpc: sys_pkey_alloc() and sys_pkey_free() system calls
    - powerpc: sys_pkey_mprotect() system call

  * [UBUNTU] pkey: Indicate old mkvp only if old and curr. mkvp are different
    (LP: #1832625)
    - pkey: Indicate old mkvp only if old and current mkvp are different

  * [UBUNTU] kernel: Fix gcm-aes-s390 wrong scatter-gather list processing
    (LP: #1832623)
    - s390/crypto: fix gcm-aes-s390 selftest failures

  * System crashes on hot adding a core with drmgr command (4.15.0-48-generic)
    (LP: #1833716)
    - powerpc/numa: improve control of topology updates
    - powerpc/numa: document topology_updates_enabled, disable by default

  * Kernel modules generated incorrectly when system is localized to a non-
    English language (LP: #1828084)
    - scripts: override locale from environment when running recordmcount.pl

  * [UBUNTU] kernel: Fix wrong dispatching for control domain CPRBs
    (LP: #1832624)
    - s390/zcrypt: Fix wrong dispatching for control domain CPRBs

  * CVE-2019-11815
    - net: rds: force to destroy connection if t_sock is NULL in
      rds_tcp_kill_sock().

  * Sound device not detected after resume from hibernate (LP: #1826868)
    - drm/i915: Force 2*96 MHz cdclk on glk/cnl when audio power is enabled
    - drm/i915: Save the old CDCLK atomic state
...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.0.0-21.22

---------------
linux (5.0.0-21.22) disco; urgency=medium

  * linux: 5.0.0-21.22 -proposed tracker (LP: #1834902)

  * Disco update: 5.0.15 upstream stable release (LP: #1834529)
    - net: stmmac: Use bfsize1 in ndesc_init_rx_desc
    - Drivers: hv: vmbus: Remove the undesired put_cpu_ptr() in hv_synic_cleanup()
    - ubsan: Fix nasty -Wbuiltin-declaration-mismatch GCC-9 warnings
    - staging: greybus: power_supply: fix prop-descriptor request size
    - staging: wilc1000: Avoid GFP_KERNEL allocation from atomic context.
    - staging: most: cdev: fix chrdev_region leak in mod_exit
    - staging: most: sound: pass correct device when creating a sound card
    - ASoC: tlv320aic3x: fix reset gpio reference counting
    - ASoC: hdmi-codec: fix S/PDIF DAI
    - ASoC: stm32: sai: fix iec958 controls indexation
    - ASoC: stm32: sai: fix exposed capabilities in spdif mode
    - ASoC: stm32: sai: fix race condition in irq handler
    - ASoC:soc-pcm:fix a codec fixup issue in TDM case
    - ASoC:hdac_hda:use correct format to setup hda codec
    - ASoC:intel:skl:fix a simultaneous playback & capture issue on hda platform
    - ASoC: dpcm: prevent snd_soc_dpcm use after free
    - ASoC: nau8824: fix the issue of the widget with prefix name
    - ASoC: nau8810: fix the issue of widget with prefixed name
    - ASoC: samsung: odroid: Fix clock configuration for 44100 sample rate
    - ASoC: rt5682: Check JD status when system resume
    - ASoC: rt5682: fix jack type detection issue
    - ASoC: rt5682: recording has no sound after booting
    - ASoC: wm_adsp: Add locking to wm_adsp2_bus_error
    - clk: meson-gxbb: round the vdec dividers to closest
    - ASoC: stm32: dfsdm: manage multiple prepare
    - ASoC: stm32: dfsdm: fix debugfs warnings on entry creation
    - ASoC: cs4270: Set auto-increment bit for register writes
    - ASoC: dapm: Fix NULL pointer dereference in snd_soc_dapm_free_kcontrol
    - drm/omap: hdmi4_cec: Fix CEC clock handling for PM
    - IB/hfi1: Clear the IOWAIT pending bits when QP is put into error state
    - IB/hfi1: Eliminate opcode tests on mr deref
    - IB/hfi1: Fix the allocation of RSM table
    - MIPS: KGDB: fix kgdb support for SMP platforms.
    - ASoC: tlv320aic32x4: Fix Common Pins
    - drm/mediatek: Fix an error code in mtk_hdmi_dt_parse_pdata()
    - perf/x86/intel: Fix handling of wakeup_events for multi-entry PEBS
    - perf/x86/intel: Initialize TFA MSR
    - linux/kernel.h: Use parentheses around argument in u64_to_user_ptr()
    - iov_iter: Fix build error without CONFIG_CRYPTO
    - xtensa: fix initialization of pt_regs::syscall in start_thread
    - ASoC: rockchip: pdm: fix regmap_ops hang issue
    - drm/amdkfd: Add picasso pci id
    - drm/amdgpu: Adjust IB test timeout for XGMI configuration
    - drm/amdgpu: amdgpu_device_recover_vram always failed if only one node in
      shadow_list
    - drm/amd/display: fix cursor black issue
    - ASoC: cs35l35: Disable regulators on driver removal
    - objtool: Add rewind_stack_do_exit() to the noreturn list
    - slab: fix a crash by reading /proc/slab_allocators
    - drm/sun4i: tcon top: Fix NULL/inv...

Changed in linux (Ubuntu Disco):
status: Fix Committed → Fix Released
Brad Figg (brad-figg)
tags: added: cscc
Terry Rudd (terrykrudd)
Changed in linux (Ubuntu Cosmic):
status: Fix Committed → Invalid
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
MasterCATZ (mastercatz) wrote :

Running Ubuntu 19.10 with kernel 5.4.3-050403-generic.

Ryzen 3900X, Gigabyte X570 UD mainboard, DDR 3200.

I have been getting data corruption on my 4x SSDs on SATA ports (btrfs/ext4),
on disks running from my LSI / Avago SAS RAID 9302-16e 12Gb/s PCIe 3.0 controller (btrfs/ext4),
and on all my disks running from my HBA (zfs).

I started off with PCIe gen 4 and dropped it to gen 3, as that is what my RAID controller is and I am not using NVMe yet.

All 64 GB of memory tests good, and the HDDs test good if tested individually; with any more than 8 being accessed at a time, they fail.

With 3 separate controllers and multiple disks affected, I am leaning towards a software issue. I remember having data issues before when I had my A10-5800K, and that was solved by disabling IOMMU. I had no issues with this setup while running my Intel 4790K.

MasterCATZ (mastercatz) wrote :

It seems I finally found the real issue.

In the BIOS of my Gigabyte X570 UD, when I have the 4G encode option enabled I get silent data corruption
(or at least if it's enabled alongside IOMMU).

I need to make another array that I can kill to keep testing with, but I can literally go into the BIOS, enable 4G, reboot, then start a zfs scrub and instantly have CRC errors pop up everywhere. If I reboot, disable it, and do a scrub, it's all good. It seems to only kick in when multiple disks are in use.

In the past (~2012), with my Gigabyte F2A85X-UP4 and AMD A10-5800K, I also had data issues with IOMMU enabled. I never had issues with my Intel i7 4790K build, so it could very well just be AMD IOMMU that is causing it for me.

Jeff Lane  (bladernr) wrote :

@MasterCATZ, please check this with the 20.04 5.4 kernel. If it's still an issue could you please file a separate kernel bug for your corruption issue?

Changed in linux (Ubuntu):
status: In Progress → Fix Released