DMAR: ERROR: DMA PTE for vPFN 0x7bf32 already set

Bug #1970453 reported by Marian Rainer-Harbach
This bug affects 9 people.

Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm running Ubuntu 22.04 with kernel 5.15.0.27.30 on an HPE ProLiant DL20 Gen9 server. The server has an HPE Smart HBA H240 SATA controller.

Since Ubuntu 22.04, the kernel runs into trouble after a few hours of uptime. The problem starts with a few instances of a message such as this:

Apr 26 12:03:37 <hostname> kernel: DMAR: ERROR: DMA PTE for vPFN 0x7bf32 already set (to 7bf32003 not 24c563801)
Apr 26 12:03:37 <hostname> kernel: ------------[ cut here ]------------
Apr 26 12:03:37 <hostname> kernel: WARNING: CPU: 1 PID: 10171 at drivers/iommu/intel/iommu.c:2391 __domain_mapping.cold+0x94/0xcb
Apr 26 12:03:37 <hostname> kernel: Modules linked in: tls rpcsec_gss_krb5 binfmt_misc ip6t_REJECT nf_reject_ipv6 xt_hl ip6_tables ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limi>
Apr 26 12:03:37 <hostname> kernel: drm_kms_helper aesni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci cec crypto_simd i2c_i801 rc_core cryptd drm xhci_pci_renesas ahci i2c_smbus tg3 hpsa>
Apr 26 12:03:37 <hostname> kernel: CPU: 1 PID: 10171 Comm: kworker/u4:0 Not tainted 5.15.0-27-generic #28-Ubuntu
Apr 26 12:03:37 <hostname> kernel: Hardware name: HP ProLiant DL20 Gen9/ProLiant DL20 Gen9, BIOS U22 04/01/2021
Apr 26 12:03:37 <hostname> kernel: Workqueue: writeback wb_workfn (flush-253:2)
Apr 26 12:03:37 <hostname> kernel: RIP: 0010:__domain_mapping.cold+0x94/0xcb
Apr 26 12:03:37 <hostname> kernel: Code: 27 9d 4c 89 4d b8 4c 89 45 c0 e8 03 c5 fa ff 8b 05 e7 e6 40 01 4c 8b 45 c0 4c 8b 4d b8 85 c0 74 09 83 e8 01 89 05 d2 e6 40 01 <0f> 0b e9 7e b2 b1 ff 89 ca 48 83 >
Apr 26 12:03:37 <hostname> kernel: RSP: 0018:ffffc077826b2fa0 EFLAGS: 00010202
Apr 26 12:03:37 <hostname> kernel: RAX: 0000000000000004 RBX: ffff9f0042062990 RCX: 0000000000000000
Apr 26 12:03:37 <hostname> kernel: RDX: 0000000000000000 RSI: ffff9f02b3d20980 RDI: ffff9f02b3d20980
Apr 26 12:03:37 <hostname> kernel: RBP: ffffc077826b2ff0 R08: 000000024c563801 R09: 000000000024c563
Apr 26 12:03:37 <hostname> kernel: R10: 00000000ffffffff R11: ffffffffc01550e0 R12: 000000000000000f
Apr 26 12:03:37 <hostname> kernel: R13: 000000000007bf32 R14: ffff9f00412f5800 R15: ffff9f0042062938
Apr 26 12:03:37 <hostname> kernel: FS: 0000000000000000(0000) GS:ffff9f02b3d00000(0000) knlGS:0000000000000000
Apr 26 12:03:37 <hostname> kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 26 12:03:37 <hostname> kernel: CR2: 00001530f676a01c CR3: 000000029c210001 CR4: 00000000002706e0
Apr 26 12:03:37 <hostname> kernel: Call Trace:
Apr 26 12:03:37 <hostname> kernel: <TASK>
Apr 26 12:03:37 <hostname> kernel: intel_iommu_map_pages+0xdc/0x120
Apr 26 12:03:37 <hostname> kernel: ? __alloc_and_insert_iova_range+0x203/0x240
Apr 26 12:03:37 <hostname> kernel: __iommu_map+0xda/0x270
Apr 26 12:03:37 <hostname> kernel: __iommu_map_sg+0x8e/0x120
Apr 26 12:03:37 <hostname> kernel: iommu_map_sg_atomic+0x14/0x20
Apr 26 12:03:37 <hostname> kernel: iommu_dma_map_sg+0x345/0x4d0
Apr 26 12:03:37 <hostname> kernel: __dma_map_sg_attrs+0x68/0x70
Apr 26 12:03:37 <hostname> kernel: dma_map_sg_attrs+0xe/0x20
Apr 26 12:03:37 <hostname> kernel: scsi_dma_map+0x39/0x50
Apr 26 12:03:37 <hostname> kernel: hpsa_scsi_ioaccel2_queue_command.constprop.0+0x11e/0x570 [hpsa]
Apr 26 12:03:37 <hostname> kernel: ? __blk_rq_map_sg+0x36/0x160
Apr 26 12:03:37 <hostname> kernel: hpsa_scsi_ioaccel_queue_command+0x82/0xd0 [hpsa]
Apr 26 12:03:37 <hostname> kernel: hpsa_ioaccel_submit+0x174/0x190 [hpsa]
Apr 26 12:03:37 <hostname> kernel: hpsa_scsi_queue_command+0x19c/0x240 [hpsa]
Apr 26 12:03:37 <hostname> kernel: ? recalibrate_cpu_khz+0x10/0x10
Apr 26 12:03:37 <hostname> kernel: scsi_dispatch_cmd+0x93/0x1f0
Apr 26 12:03:37 <hostname> kernel: scsi_queue_rq+0x2d1/0x690
Apr 26 12:03:37 <hostname> kernel: blk_mq_dispatch_rq_list+0x126/0x600
Apr 26 12:03:37 <hostname> kernel: ? __sbitmap_queue_get+0x1/0x10
Apr 26 12:03:37 <hostname> kernel: __blk_mq_do_dispatch_sched+0xba/0x2d0
Apr 26 12:03:37 <hostname> kernel: __blk_mq_sched_dispatch_requests+0x104/0x150
Apr 26 12:03:37 <hostname> kernel: blk_mq_sched_dispatch_requests+0x35/0x60
Apr 26 12:03:37 <hostname> kernel: __blk_mq_run_hw_queue+0x34/0xb0
Apr 26 12:03:37 <hostname> kernel: __blk_mq_delay_run_hw_queue+0x162/0x170
Apr 26 12:03:37 <hostname> kernel: blk_mq_run_hw_queue+0x83/0x120
Apr 26 12:03:37 <hostname> kernel: blk_mq_sched_insert_requests+0x69/0xf0
Apr 26 12:03:37 <hostname> kernel: blk_mq_flush_plug_list+0x103/0x1c0
Apr 26 12:03:37 <hostname> kernel: blk_flush_plug_list+0xdd/0x100
Apr 26 12:03:37 <hostname> kernel: blk_mq_submit_bio+0x2bd/0x600
Apr 26 12:03:37 <hostname> kernel: __submit_bio+0x1ea/0x220
Apr 26 12:03:37 <hostname> kernel: ? mempool_alloc_slab+0x17/0x20
Apr 26 12:03:37 <hostname> kernel: __submit_bio_noacct+0x85/0x1f0
Apr 26 12:03:37 <hostname> kernel: submit_bio_noacct+0x4e/0x120
Apr 26 12:03:37 <hostname> kernel: ? radix_tree_lookup+0xd/0x10
Apr 26 12:03:37 <hostname> kernel: ? bio_associate_blkg_from_css+0x1b2/0x310
Apr 26 12:03:37 <hostname> kernel: submit_bio+0x4a/0x130
Apr 26 12:03:37 <hostname> kernel: ? wbc_account_cgroup_owner+0x2c/0x80
Apr 26 12:03:37 <hostname> kernel: submit_bh_wbc+0x18d/0x1c0
Apr 26 12:03:37 <hostname> kernel: __block_write_full_page+0x227/0x4a0
Apr 26 12:03:37 <hostname> kernel: ? block_invalidatepage+0x150/0x150
Apr 26 12:03:37 <hostname> kernel: ? blkdev_llseek+0x60/0x60
Apr 26 12:03:37 <hostname> kernel: block_write_full_page+0x6f/0x90
Apr 26 12:03:37 <hostname> kernel: blkdev_writepage+0x18/0x20
Apr 26 12:03:37 <hostname> kernel: __writepage+0x1e/0x70
Apr 26 12:03:37 <hostname> kernel: write_cache_pages+0x1a9/0x460
Apr 26 12:03:37 <hostname> kernel: ? __set_page_dirty_no_writeback+0x40/0x40
Apr 26 12:03:37 <hostname> kernel: generic_writepages+0x58/0x90
Apr 26 12:03:37 <hostname> kernel: ? __blk_mq_do_dispatch_sched+0x7f/0x2d0
Apr 26 12:03:37 <hostname> kernel: blkdev_writepages+0xe/0x10
Apr 26 12:03:37 <hostname> kernel: do_writepages+0xda/0x200
Apr 26 12:03:37 <hostname> kernel: ? __percpu_counter_sum+0x6f/0xa0
Apr 26 12:03:37 <hostname> kernel: ? __blk_mq_sched_dispatch_requests+0x104/0x150
Apr 26 12:03:37 <hostname> kernel: ? mem_cgroup_css_rstat_flush+0x43a/0x870
Apr 26 12:03:37 <hostname> kernel: ? cpumask_next+0x23/0x30
Apr 26 12:03:37 <hostname> kernel: __writeback_single_inode+0x44/0x290
Apr 26 12:03:37 <hostname> kernel: writeback_sb_inodes+0x223/0x4d0
Apr 26 12:03:37 <hostname> kernel: __writeback_inodes_wb+0x56/0xf0
Apr 26 12:03:37 <hostname> kernel: wb_writeback+0x1cc/0x290
Apr 26 12:03:37 <hostname> kernel: wb_do_writeback+0x1a4/0x280
Apr 26 12:03:37 <hostname> kernel: wb_workfn+0x77/0x250
Apr 26 12:03:37 <hostname> kernel: ? psi_task_switch+0xc6/0x220
Apr 26 12:03:37 <hostname> kernel: ? finish_task_switch.isra.0+0xa6/0x270
Apr 26 12:03:37 <hostname> kernel: process_one_work+0x22b/0x3d0
Apr 26 12:03:37 <hostname> kernel: worker_thread+0x53/0x410
Apr 26 12:03:37 <hostname> kernel: ? process_one_work+0x3d0/0x3d0
Apr 26 12:03:37 <hostname> kernel: kthread+0x12a/0x150
Apr 26 12:03:37 <hostname> kernel: ? set_kthread_struct+0x50/0x50
Apr 26 12:03:37 <hostname> kernel: ret_from_fork+0x22/0x30
Apr 26 12:03:37 <hostname> kernel: </TASK>
Apr 26 12:03:37 <hostname> kernel: ---[ end trace 6eaabfe8ad4492e0 ]---

Afterwards, messages like

Apr 26 12:55:29 <hostname> kernel: dmar_fault: 152 callbacks suppressed
Apr 26 12:55:29 <hostname> kernel: DMAR: DRHD: handling fault status reg 2
Apr 26 12:55:29 <hostname> kernel: DMAR: [DMA Write NO_PASID] Request device [06:00.0] fault addr 0x7bf4a000 [fault reason 0x05] PTE Write access is not set

or

Apr 26 12:56:50 <hostname> kernel: dmar_fault: 152 callbacks suppressed
Apr 26 12:56:50 <hostname> kernel: DMAR: DRHD: handling fault status reg 2
Apr 26 12:56:50 <hostname> kernel: DMAR: [DMA Read NO_PASID] Request device [06:00.0] fault addr 0x7bf32000 [fault reason 0x06] PTE Read access is not set

are logged continuously. The logged device ID 06:00.0 is the HPE SATA controller.

The errors go away after a reboot, until the problem recurs after a few hours. In most cases the server even reports a hardware fault and the storage fan spins up to 100%.

The problem did _not_ occur in Ubuntu 20.04; the last 20.04 kernel I ran on this server was 5.13.0.39.44~20.04.24.

Setting the intel_iommu=off kernel boot parameter seems to work around the problem.
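For anyone wanting to apply this workaround persistently, a minimal sketch (assuming a stock GRUB-based Ubuntu install; back up the file first) is to edit /etc/default/grub so the default kernel command line includes the option:

```shell
# /etc/default/grub (hypothetical example; keep your existing options)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=off"
```

then run `sudo update-grub` and reboot; `cat /proc/cmdline` should afterwards show intel_iommu=off.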

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

Please give the mainline kernel a try:
https://wiki.ubuntu.com/Kernel/MainlineBuilds

Marian Rainer-Harbach (marianrh) wrote :

I tried the latest currently available 5.15 kernel (https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.36/amd64/, 5.15.36-051536-generic #202204271420 SMP Wed Apr 27 16:37:41 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux). The problem still occurs there.

Mitsuya Shibata (cosmos-door) wrote :
Marian Rainer-Harbach (marianrh) wrote :

If this question was meant for me: I cannot really say; in my case no VMs are involved at all.

Patrick Harold Donnelly (phdonnelly) wrote (last edit ):

I hit this bug upgrading my home server (ProLiant MicroServer Gen8), and it seems to cause memory corruption when it occurs (at least in combination with ZFS). Using a ZFS mirrored root, I experienced this issue after only a few minutes of uptime, with DMAR messages flooding the log and very high CPU usage.

After rebooting with intel_iommu=off things are back to normal, but a ZFS scrub reported several thousand checksum errors on the root volume, some of them unrecoverable, so the affected files had to be restored from backup; a separate ZFS RAIDZ1 volume suffered corrupted metadata and had to be rolled back with some data loss.

Trent Lloyd (lathiat) wrote :

Regarding the patch here:
https://lists.linuxfoundation.org/pipermail/iommu/2021-October/060115.html

That thread mentions this issue can occur when passing a PCI device through to a virtual machine guest. The patch seems never to have made it into the kernel. So I am curious whether you are running any virtual machines on this host, and whether any of them map in PCI devices from the host.

Marian Rainer-Harbach (marianrh) wrote :

In my case the host is not running any virtual machines. I have another similar HPE Gen9 server (without the HPE Smart HBA H240 SATA controller) that doesn't have the problem. So in my case I suspect the H240 controller is (at least part of) the problem.

@phdonnelly, are you using VMs and/or do you have an H240 controller?

Patrick Harold Donnelly (phdonnelly) wrote :

I'm not running any VMs, but I did have SR-IOV and VT-x enabled in the BIOS. The server has an HP B120i, which is a fakeraid device; the drives are accessed as plain AHCI devices.

TJ (tj) wrote :

This is likely the issue addressed in upstream stable v5.15.6 by commit 724ee060; the reports linked in that commit message seem very close:

commit 724ee060d0aba28f072fc7357a20366b0a519593
Author: Alex Williamson <email address hidden>
Date: Fri Nov 26 21:55:56 2021 +0800

    iommu/vt-d: Fix unmap_pages support

    [ Upstream commit 86dc40c7ea9c22f64571e0e45f695de73a0e2644 ]

    When supporting only the .map and .unmap callbacks of iommu_ops,
    the IOMMU driver can make assumptions about the size and alignment
    used for mappings based on the driver provided pgsize_bitmap. VT-d
    previously used essentially PAGE_MASK for this bitmap as any power
    of two mapping was acceptably filled by native page sizes.

    However, with the .map_pages and .unmap_pages interface we're now
    getting page-size and count arguments. If we simply combine these
    as (page-size * count) and make use of the previous map/unmap
    functions internally, any size and alignment assumptions are very
    different.

    As an example, a given vfio device assignment VM will often create
    a 4MB mapping at IOVA pfn [0x3fe00 - 0x401ff]. On a system that
    does not support IOMMU super pages, the unmap_pages interface will
    ask to unmap 1024 4KB pages at the base IOVA. dma_pte_clear_level()
    will recurse down to level 2 of the page table where the first half
    of the pfn range exactly matches the entire pte level. We clear the
    pte, increment the pfn by the level size, but (oops) the next pte is
    on a new page, so we exit the loop and pop back up a level. When we
    then update the pfn based on that higher level, we seem to assume
    that the previous pfn value was at the start of the level. In this
    case the level size is 256K pfns, which we add to the base pfn and
    get a result of 0x7fe00, which is clearly greater than 0x401ff,
    so we're done. Meanwhile we never cleared the ptes for the remainder
    of the range. When the VM remaps this range, we're overwriting valid
    ptes and the VT-d driver complains loudly, as reported by the user
    report linked below.

    The fix for this seems relatively simple, if each iteration of the
    loop in dma_pte_clear_level() is assumed to clear to the end of the
    level pte page, then our next pfn should be calculated from level_pfn
    rather than our working pfn.

    Fixes: 3f34f1259776 ("iommu/vt-d: Implement map/unmap_pages() iommu_ops callback")
    Reported-by: Ajay Garg <email address hidden>
    Signed-off-by: Alex Williamson <email address hidden>
    Tested-by: Giovanni Cabiddu <email address hidden>
    Link: https://<email address hidden>/
    Link: https://lore.kernel.org/r/163659074748.1617923.12716161410774184024.stgit@omen
    Signed-off-by: Lu Baolu <email address hidden>
    Link: https://<email address hidden>
    Signed-off-by: Joerg Roedel <email address hidden>
    Signed-off-by: Sasha Levin <email address hidden>

Jorge Bastos (jbastos5) wrote :

We found this exact same issue affecting several users with Unraid and kernel 5.15; they just needed to have VT-d enabled. It mostly affected HP servers of the same era, for example:

ProLiant MicroServer Gen8
ProLiant ML350p Gen8
ProLiant ML310e Gen8
ProLiant ML350 Gen9
ProLiant ML10 v2

but there was also a Lenovo X3100 M5, a Dell PowerEdge R710, and even a 6th-gen i5 ASRock desktop board, so we can't rule out any Intel-based computer with VT-d enabled being affected sooner or later. The HP servers listed were much more prone to the issue, though: we could sometimes trigger the bug in seconds, a couple of minutes tops, by doing something I/O intensive, for example starting a parity sync or check. The problem started as above:

DMAR: ERROR: DMA PTE for vPFN xxxxxx already set (to xxxxxx not xxxxxx)

Then it started corrupting data, sometimes more severely than others. It happened with both filesystems typically used with Unraid, XFS and btrfs, though it was usually noticed with btrfs, since btrfs detects data corruption errors. btrfs is also more vulnerable to memory corruption, so it was more common to find a fatally damaged btrfs filesystem after some use.

Limetech found that changing the IOMMU to pass-through mode instead of DMA translation mode fixed the problem. This can be done by changing this kernel config option:

CONFIG_IOMMU_DEFAULT_PASSTHROUGH: Passthrough

which is the same as adding iommu=pt (or iommu.passthrough=1) to the kernel boot options.
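As a sketch of the boot-option variant (assuming a GRUB-based system; the exact file and existing options may differ per distro):

```shell
# /etc/default/grub (hypothetical example; keep your existing options)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=pt"
```

followed by `sudo update-grub` and a reboot; `cat /proc/cmdline` can be used to confirm the option took effect.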

They also found this document explaining the difference between the two modes:

https://lenovopress.com/lp1467.pdf

IOMMU pass-through mode bypasses DMA translation, so it makes sense that DMA remapping errors like the above stop happening. Also note that, at least as of today, we haven't found any issues with PCIe device pass-through to a VM when using this mode, and according to the PDF above it is also being adopted as the new default by other Linux OSes.

P.S. The above-mentioned commit (v5.15.6 commit 724ee060) doesn't help; we still saw this issue with kernel 5.15.46 when IOMMU DMA translation mode was enabled.

Michael (mes2k) wrote :

Thanks a lot for Comment 11.

I also think that the mentioned patch does not solve the data corruption issue. From the comments I read here and in the linked threads, the patch can fix issues of the form

DMAR: ERROR: DMA PTE for vPFN xxxxxx already set (to yyyyyy not yyyyyy)

But there seems to be another issue where the log message looks like

DMAR: ERROR: DMA PTE for vPFN xxxxxx already set (to yyyyyy not zzzzzz)

@jbastos5 - does this match your observations?

I am running a MicroServer Gen8 (Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz) with four WD Red drives attached to the built-in controller, forming an md_raid5 array. Since I upgraded to Ubuntu 22.04 I have seen crashes of Syncthing due to a corrupt database. I have now also found files (e.g. Ubuntu ISO images) that became corrupt. I see no hint of a hardware failure (of course I have to trust smartctl here). In my case the DMA error always showed two different addresses (to yyyyyy not zzzzzz). The stack trace next to the error messages pointed at iommu and raid (as far as I remember; unfortunately I haven't stored the traces). I have disabled the IOMMU in the GRUB configuration, and the errors and Syncthing crashes are gone.

Since the issue causes data corruption, in my opinion it should be raised in priority.

Jorge Bastos (jbastos5) wrote (last edit ):

> But there seems to be another issue where the log message looks like
> DMAR: ERROR: DMA PTE for vPFN xxxxxx already set (to yyyyyy not zzzzzz)

> @jbastos5 - does this match your observations?

Yes it does; a few examples from different servers:

DMAR: ERROR: DMA PTE for vPFN 0xb5f80 already set (to b5f80003 not 14c3c5803)

DMAR: ERROR: DMA PTE for vPFN 0xb0780 already set (to b0780003 not 28dc74801)

DMAR: ERROR: DMA PTE for vPFN 0xfffc0 already set (to 106340003 not 17bcfa003)

We also confirmed this bug is still present with kernel 5.18.3.
