amd_iommu possible data corruption

Bug #1823037 reported by Jeff Lane on 2019-04-03
This bug affects 2 people
Affects          Importance   Assigned to
linux (Ubuntu)   Medium       Jeff Lane
Bionic           Undecided    Jeff Lane
Cosmic           Undecided    Jeff Lane
Disco            Medium       Jeff Lane

Bug Description

[Impact]
If a device has an exclusion range specified in the IVRS table, this region needs to be reserved in the iova-domain of that device. Until now it has not been reserved, which can corrupt data transferred to or from these devices.

Treat exclusion ranges as reserved regions in the iommu-core to fix the problem.
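
To illustrate why the reservation matters, here is a minimal sketch in Python rather than kernel code. It assumes a toy bump allocator for I/O virtual addresses; the class IovaDomain, its methods, and the example addresses are hypothetical and are not taken from the amd_iommu driver. The idea it shows is only this: if the allocator does not know about the exclusion range, it can hand out a DMA address inside that range, and the device's access to that address would then not go through the mapping the driver set up.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed 4 KiB pages


@dataclass
class IovaDomain:
    """Toy bump allocator over an I/O virtual address space (hypothetical)."""
    start: int                                    # next candidate address
    end: int                                      # exclusive upper bound
    reserved: list = field(default_factory=list)  # (lo, hi) pairs, inclusive

    def reserve(self, lo, hi):
        """Mark [lo, hi] as never to be handed out."""
        self.reserved.append((lo, hi))

    def alloc(self, size):
        """Return the lowest address whose range avoids every reserved range.

        `size` is assumed to be a multiple of PAGE_SIZE.
        """
        addr = self.start
        while addr + size <= self.end:
            clash = next(((lo, hi) for lo, hi in self.reserved
                          if addr <= hi and lo <= addr + size - 1), None)
            if clash is None:
                self.start = addr + size
                return addr
            # Skip past the reserved range, keeping page alignment.
            addr = (clash[1] // PAGE_SIZE + 1) * PAGE_SIZE
        raise MemoryError("IOVA space exhausted")


if __name__ == "__main__":
    exclusion = (0x10000, 0x1FFFF)  # made-up exclusion range

    unfixed = IovaDomain(0x10000, 0x100000)
    print(hex(unfixed.alloc(PAGE_SIZE)))  # lands inside the exclusion range

    fixed = IovaDomain(0x10000, 0x100000)
    fixed.reserve(*exclusion)
    print(hex(fixed.alloc(PAGE_SIZE)))    # first address past the range
```

In the unreserved case the first allocation falls inside the exclusion range; once the range is reserved, the allocator skips to the first page after it.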

These are clean cherry picks from mainline of:
8aafaaf2212192012f5bae305bb31cdf7681d777
3c677d206210f53a4be972211066c0f1cd47fe12

[Test Case]

[Fixes]
Cherry pick the following from mainline:
fd3b3448cf5adc2a2f09b70eaad03c27fe79e7a6 iommu/amd: Reserve exclusion range in iova-domain
3c677d206210f53a4be972211066c0f1cd47fe12 iommu/amd: Set exclusion range correctly
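
This report does not describe what "Set exclusion range correctly" changes. As an illustration only, the sketch below assumes the fix concerns how the upper limit of the range is computed: that the limit should be the base address of the last page inside the range, and that the hardware treats the low 12 bits of the limit as set. Both points are assumptions to check against the commit itself. The function names and example values are hypothetical.

```python
PAGE_SIZE = 4096                 # assumed 4 KiB pages
PAGE_MASK = ~(PAGE_SIZE - 1)


def limit_exclusive_end(start, length):
    """Page base of the first byte after the range (assumed old behaviour)."""
    return (start + length) & PAGE_MASK


def limit_last_page(start, length):
    """Page base of the last byte inside the range (assumed fixed behaviour)."""
    return (start + length - 1) & PAGE_MASK


def last_covered_byte(limit):
    """Last address covered if the low 12 bits of the limit are treated as set."""
    return limit | (PAGE_SIZE - 1)


if __name__ == "__main__":
    start, length = 0x10000, 0x10000  # made-up range: 0x10000..0x1ffff
    print(hex(last_covered_byte(limit_exclusive_end(start, length))))
    print(hex(last_covered_byte(limit_last_page(start, length))))
```

Under these assumptions, the first computation covers one page more than the range that was asked for, while the second ends exactly at the last byte of the range.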

[Regression Risk]
Only affects the amd_iommu driver:
    drivers/iommu/amd_iommu*

Jeff Lane (bladernr) wrote :
no longer affects: linux (Ubuntu Xenial)
Changed in linux (Ubuntu Cosmic):
assignee: nobody → Jeff Lane (bladernr)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Jeff Lane (bladernr)
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Bionic):
status: New → Confirmed
Changed in linux (Ubuntu Cosmic):
status: New → Confirmed
Jeff Lane (bladernr) wrote :

Test kernels are up here:
https://people.canonical.com/~jlane/testkernels/

Please test each and report back

Changed in linux (Ubuntu Cosmic):
status: Confirmed → In Progress
Changed in linux (Ubuntu Bionic):
status: Confirmed → In Progress
Michael Reed (mreed8855) wrote :

An additional patch has been identified that will be needed to solve this issue.

https://github.com/torvalds/linux/commit/3c677d206210f53a4be972211066c0f1cd47fe12#diff-083bf3d2f128b616e730b7f9f8fc65c4

Antti Tönkyrä (atonkyra) wrote :

Confirming that I was unable to reproduce the issue on linux-image-4.15.0-46-generic (4.15.0-46.49) from the test kernels repository with my test setup.

Previously I was able to cause filesystem corruption within 1 hour of testing under a test load (verified by btrfs checksums) across all disks on 2 different 32-core systems.

Jeff Lane (bladernr) wrote :

So I'm a bit confused because I've been told that A: additional patches are necessary to solve the issue, and B: that the issue is solved without those additional patches.

When you did the test, were you using the older, failing firmware levels, or have you updated to the patched firmware that is also part of this fix?

Jerry Clement (jerry-clement) wrote :

Jeff, the second patch improves the fix against the amount of memory in the system. We still need both patches.

Michael Reed (mreed8855) wrote :

Hi Jerry,

The test kernels are located here and have "1" appended to them.

https://people.canonical.com/~jlane/testkernels/

Jerry Clement (jerry-clement) wrote :

All three test kernels have passed 24+ hour testing.

tags: added: verification-done-bionic
tags: added: verification-done-cosmic
removed: verification-done-bionic
tags: added: verification-done-bionic verification-done-disco
Jeff Lane (bladernr) on 2019-05-24
description: updated