Prevent soft lockups during IOMMU streaming DMA mapping by limiting nvme max_hw_sectors_kb to cache optimised size
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Unassigned |
Jammy | In Progress | Medium | Matthew Ruffell |
Bug Description
BugLink: https:/
[Impact]
On systems with the IOMMU enabled, every streaming DMA mapping requires an IOVA to be allocated and later freed. For small mappings, IOVA sizes are normally cached, so IOVA allocations complete in a reasonable time. For larger mappings, things can be significantly slower, to the point where soft lockups occur due to lock contention on iova_rbtree_lock.
commit 9257b4a206fc ("iommu/iova: introduce per-cpu caching to iova allocation") introduced a scalable IOVA caching mechanism that improves performance for mappings of up to 128kb.
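To illustrate where that 128kb cutoff comes from, here is a minimal userspace C model of the per-CPU cache gate. Only IOVA_RANGE_CACHE_MAX_SIZE is taken from the upstream drivers/iommu/iova.c; the function names and surrounding logic are simplified stand-ins, not the kernel implementation, and 4K pages are assumed:

```c
#include <stdbool.h>
#include <stdio.h>

#define IOVA_RANGE_CACHE_MAX_SIZE 6  /* log2 of max cached IOVA range, in pages (drivers/iommu/iova.c) */
#define PAGE_SHIFT 12                /* assumes 4K pages */

/* Simplified model: does an allocation of 'size_pages' qualify for the
 * per-CPU IOVA cache, or must it take the slow rbtree path, which
 * serialises on iova_rbtree_lock? */
static bool iova_cacheable(unsigned long size_pages)
{
    unsigned int log_size = 0;

    while ((1UL << log_size) < size_pages)
        log_size++;  /* order of the allocation, i.e. ceil(log2(size)) */

    return log_size < IOVA_RANGE_CACHE_MAX_SIZE;
}

int main(void)
{
    unsigned long sizes_kb[] = { 4, 64, 128, 256, 2048 };

    for (unsigned int i = 0; i < sizeof(sizes_kb) / sizeof(sizes_kb[0]); i++) {
        unsigned long pages = (sizes_kb[i] << 10) >> PAGE_SHIFT;
        printf("%4lukb mapping (%3lu pages): %s\n", sizes_kb[i], pages,
               iova_cacheable(pages) ? "served from per-CPU cache"
                                     : "falls back to rbtree under iova_rbtree_lock");
    }
    return 0;
}
```

Ranges of up to 2^(IOVA_RANGE_CACHE_MAX_SIZE - 1) = 32 pages are cacheable, which with 4K pages is exactly 128kb; anything larger contends on iova_rbtree_lock.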
On systems that do larger streaming DMA mappings, e.g. an NVMe device with /sys/block/<nvme device>/queue/max_hw_sectors_kb set to:
2048
A 2048kb mapping takes significantly longer, causing lock contention on iova_rbtree_lock as other devices, such as Ethernet NICs, are also trying to acquire the lock.
We hit the following soft lockup:
watchdog: BUG: soft lockup - CPU#60 stuck for 24s!
CPU: 60 PID: 608304 Comm: segment-merger- Tainted: P W EL 5.15.0-76-generic #83~20.04.1-Ubuntu
RIP: 0010:_raw_
Call Trace:
<IRQ>
fq_flush_
? fq_ring_
call_timer_
run_timer_
? lapic_next_
? clockevents_
__do_softirq+
irq_exit_
sysvec_
</IRQ>
<TASK>
asm_sysvec_
RIP: 0010:_raw_
...
alloc_
alloc_
iommu_
? __kmalloc+
iommu_
__dma_
dma_map_
nvme_map_
? recalibrate_
? ktime_get+0x46/0xc0
nvme_queue_
? __update_
__blk_
blk_mq_
blk_mq_
blk_mq_
blk_mq_
blk_flush_
blk_mq_
__submit_
? ext4_inode_
submit_
? xa_load+0x61/0xa0
submit_
ext4_mpage_
? __mod_lruvec_
ext4_readahead
read_pages+
page_cache_
do_page_
ondemand_
page_cache_
filemap_
? filemap_
filemap_
generic_
ext4_file_
new_sync_
vfs_read+
ksys_pread64+
__x64_
unload_
do_syscall_
? do_syscall_
? do_syscall_
entry_
A workaround is to disable IOMMU with "iommu=off amd_iommu=off" on the kernel command line.
[Fix]
The fix is to clamp max_hw_sectors to the largest IOVA size that still fits in the cache, so that allocating and freeing IOVAs during streaming DMA mapping remains fast.
The fix requires two dependency commits, which introduce a function to find the optimal mapping size, dma_opt_mapping_size():
commit a229cc14f339531
Author: John Garry <email address hidden>
Date: Thu Jul 14 19:15:24 2022 +0800
Subject: dma-mapping: add dma_opt_mapping_size()
Link: https:/
commit 6d9870b7e5def24
Author: John Garry <email address hidden>
Date: Thu Jul 14 19:15:25 2022 +0800
Subject: dma-iommu: add iommu_dma_opt_mapping_size()
Link: https:/
The dependencies are present in 6.0-rc1 and later.
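The semantics the two dependency commits introduce can be summarised as: the IOMMU DMA ops report the largest IOVA size the per-CPU cache can serve, and dma_opt_mapping_size() returns the minimum of that and the hardware maximum. Below is a condensed, runnable userspace sketch of that behaviour; struct device_stub and its fields are invented stand-ins for illustration, not the kernel's struct device:

```c
#include <stdio.h>

#define PAGE_SHIFT 12
#define IOVA_RANGE_CACHE_MAX_SIZE 6      /* drivers/iommu/iova.c */
#define SIZE_MAX_STUB (~(size_t)0)

/* Invented stand-in for struct device plus its dma_map_ops. */
struct device_stub {
    size_t hw_max_mapping;   /* what dma_max_mapping_size() would report */
    int    behind_iommu;     /* does the device use the IOMMU DMA ops? */
};

/* Model of iommu_dma_opt_mapping_size(): the largest IOVA range the
 * per-CPU cache can serve, 2^(IOVA_RANGE_CACHE_MAX_SIZE-1) pages = 128kb. */
static size_t iommu_dma_opt_mapping_size(void)
{
    return (size_t)1 << (PAGE_SHIFT + IOVA_RANGE_CACHE_MAX_SIZE - 1);
}

/* Model of dma_opt_mapping_size(): min(hardware max, DMA-ops optimum). */
static size_t dma_opt_mapping_size(const struct device_stub *dev)
{
    size_t opt = dev->behind_iommu ? iommu_dma_opt_mapping_size()
                                   : SIZE_MAX_STUB;

    return dev->hw_max_mapping < opt ? dev->hw_max_mapping : opt;
}

int main(void)
{
    struct device_stub nvme = { .hw_max_mapping = 2048 * 1024, .behind_iommu = 1 };

    printf("optimal mapping size: %zu bytes (%zukb)\n",
           dma_opt_mapping_size(&nvme), dma_opt_mapping_size(&nvme) >> 10);
    return 0;
}
```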
The fix itself simply changes the max_hw_sectors calculation to use dma_opt_mapping_size() instead of dma_max_mapping_size():
commit 3710e2b056cb92a
Author: Adrian Huang <email address hidden>
Date: Fri Apr 21 16:08:00 2023 +0800
Subject: nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
Link: https:/
The fix is present in 6.4-rc3 and later.
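The effect of the change can be modelled with a few lines of arithmetic. This is a sketch based on the commit subject and the description above, not the upstream nvme-pci code; nvme_max_kb is an illustrative stand-in for the driver's own cap (NVME_MAX_KB_SZ upstream), and the sizes are examples:

```c
#include <stdio.h>

/* max_hw_sectors is in 512-byte sectors, clamped to both the driver's
 * cap and whatever mapping size the DMA layer reports. */
static unsigned int max_hw_sectors(unsigned int nvme_max_kb,
                                   size_t dma_mapping_size)
{
    unsigned int from_dma = (unsigned int)(dma_mapping_size >> 9);
    unsigned int from_drv = nvme_max_kb << 1;  /* kb -> 512-byte sectors */

    return from_drv < from_dma ? from_drv : from_dma;
}

int main(void)
{
    /* Before the fix: dma_max_mapping_size() allows e.g. 2048kb. */
    printf("before: max_hw_sectors = %u (%ukb)\n",
           max_hw_sectors(4096, 2048 * 1024),
           max_hw_sectors(4096, 2048 * 1024) / 2);

    /* After the fix: dma_opt_mapping_size() reports 128kb under the IOMMU. */
    printf("after:  max_hw_sectors = %u (%ukb)\n",
           max_hw_sectors(4096, 128 * 1024),
           max_hw_sectors(4096, 128 * 1024) / 2);
    return 0;
}
```

So under the IOMMU, max_hw_sectors_kb drops from 2048 to 128, keeping every streaming DMA mapping within the IOVA cache.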
[Testcase]
The system needs to be extremely busy. So busy, in fact, that we cannot reproduce it in lab environments, only in production.
The systems that hit this issue have 64 cores, ~90%+ sustained CPU usage, ~90%+ sustained memory usage, high disk I/O, and nearly saturated network throughput with 100Gb NICs.
The NVMe disk MUST have /sys/block/<nvme device>/queue/max_hw_sectors_kb larger than 128kb; in this case, 2048kb.
Leave the system at sustained load until IOVA allocations slow to a halt and soft or hard lockups occur waiting for iova_rbtree_lock.
A test kernel is available in the following ppa:
https:/
If you install the kernel and leave it running, the soft lockups will no longer occur.
[Where problems could occur]
We are changing the value of max_hw_sectors_kb for NVMe devices on systems with the IOMMU enabled. For systems without an IOMMU, or with the IOMMU disabled, the value remains the same as it is now.
The new value is the minimum of the maximum supported by the hardware and the largest size that fits in the IOVA cache. For some workloads this might have a small performance impact, since larger mappings must now be split into multiple smaller ones, but there should be a larger net gain, because each IOVA allocation now fits in the cache and completes much faster than a single large one.
If a regression were to occur, users could disable the IOMMU as a workaround.
Changed in linux (Ubuntu):
status: New → Fix Released
Changed in linux (Ubuntu Jammy):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Matthew Ruffell (mruffell)
tags: added: jammy sts
description: updated
Hey folks,
I think we may have encountered this or a variant of this while running extremely strenuous Ceph performance tests on a very high speed cluster we designed for a customer. We have a write-up that includes a section on needing to disable iommu here:
https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/
Good job figuring this one out to everyone involved!