Prevent soft lockups during IOMMU streaming DMA mapping by limiting nvme max_hw_sectors_kb to cache optimised size

Bug #2064999 reported by Matthew Ruffell
Affects: linux (Ubuntu)
  Status: Fix Released, Importance: Undecided, Assigned to: Unassigned
Affects: linux (Ubuntu) Jammy
  Status: In Progress, Importance: Medium, Assigned to: Matthew Ruffell

Bug Description

BugLink: https://bugs.launchpad.net/bugs/2064999

[Impact]

On systems with the IOMMU enabled, every streaming DMA mapping requires an IOVA to be allocated and freed. Small mappings are normally served from the IOVA cache, so their allocations complete in a reasonable time. Larger mappings bypass the cache and can be significantly slower, to the point where soft lockups occur due to lock contention on iova_rbtree_lock.

commit 9257b4a206fc ("iommu/iova: introduce per-cpu caching to iova allocation")
introduced a scalable per-CPU IOVA caching mechanism, which improves performance for mappings of up to 128kb.
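
For context, the 128kb figure comes from the size limit of the per-CPU IOVA range cache. A paraphrased sketch (not the verbatim 5.15 source) of the relevant check in drivers/iommu/iova.c:

    /* Paraphrased sketch of drivers/iommu/iova.c, not verbatim source */
    #define IOVA_RANGE_CACHE_MAX_SIZE 6 /* log of max cached IOVA range size (in pages) */

    static unsigned long iova_rcache_get(struct iova_domain *iovad,
                                         unsigned long size,
                                         unsigned long limit_pfn)
    {
            unsigned int log_size = order_base_2(size);

            /* Only allocations of up to 2^5 = 32 pages (128kb with 4kb pages)
             * are served from the per-cpu cache. Anything larger falls back
             * to the rbtree allocator guarded by iova_rbtree_lock. */
            if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
                    return 0;

            return __iova_rcache_get(&iovad->rcaches[log_size], limit_pfn - size);
    }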

On systems that do larger streaming DMA mappings, e.g. an NVMe device with:

/sys/block/nvme0n1/queue/max_hw_sectors_kb
2048

A 2048kb mapping takes significantly longer to allocate, causing lock contention on iova_rbtree_lock, as other devices such as Ethernet NICs are also trying to acquire the lock.

We hit the following soft lockup:

watchdog: BUG: soft lockup - CPU#60 stuck for 24s!
CPU: 60 PID: 608304 Comm: segment-merger- Tainted: P W EL 5.15.0-76-generic #83~20.04.1-Ubuntu
RIP: 0010:_raw_spin_unlock_irqrestore+0x25/0x30
Call Trace:
 <IRQ>
 fq_flush_timeout+0x82/0xc0
 ? fq_ring_free+0x170/0x170
 call_timer_fn+0x2e/0x120
 run_timer_softirq+0x433/0x4c0
 ? lapic_next_event+0x21/0x30
 ? clockevents_program_event+0xab/0x130
 __do_softirq+0xdd/0x2ee
 irq_exit_rcu+0x7d/0xa0
 sysvec_apic_timer_interrupt+0x80/0x90
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x1b/0x20
RIP: 0010:_raw_spin_unlock_irqrestore+0x25/0x30
...
 alloc_iova+0x1d8/0x1f0
 alloc_iova_fast+0x5c/0x3a0
 iommu_dma_alloc_iova.isra.0+0x128/0x170
 ? __kmalloc+0x1ab/0x4b0
 iommu_dma_map_sg+0x1a4/0x4c0
 __dma_map_sg_attrs+0x72/0x80
 dma_map_sg_attrs+0xe/0x20
 nvme_map_data+0xde/0x800 [nvme]
 ? recalibrate_cpu_khz+0x10/0x10
 ? ktime_get+0x46/0xc0
 nvme_queue_rq+0xaf/0x1f0 [nvme]
 ? __update_load_avg_se+0x2a2/0x2c0
 __blk_mq_try_issue_directly+0x15b/0x200
 blk_mq_request_issue_directly+0x51/0xa0
 blk_mq_try_issue_list_directly+0x7f/0xf0
 blk_mq_sched_insert_requests+0xa4/0xf0
 blk_mq_flush_plug_list+0x103/0x1c0
 blk_flush_plug_list+0xe3/0x110
 blk_mq_submit_bio+0x29d/0x600
 __submit_bio+0x1e5/0x220
 ? ext4_inode_block_valid+0x9f/0xc0
 submit_bio_noacct+0xac/0x2c0
 ? xa_load+0x61/0xa0
 submit_bio+0x50/0x140
 ext4_mpage_readpages+0x6a2/0xe20
 ? __mod_lruvec_page_state+0x6b/0xb0
 ext4_readahead+0x37/0x40
 read_pages+0x95/0x280
 page_cache_ra_unbounded+0x161/0x220
 do_page_cache_ra+0x3d/0x50
 ondemand_readahead+0x137/0x330
 page_cache_async_ra+0xa6/0xd0
 filemap_get_pages+0x224/0x660
 ? filemap_get_pages+0x9e/0x660
 filemap_read+0xbe/0x410
 generic_file_read_iter+0xe5/0x150
 ext4_file_read_iter+0x5b/0x190
 new_sync_read+0x110/0x1a0
 vfs_read+0x102/0x1a0
 ksys_pread64+0x71/0xa0
 __x64_sys_pread64+0x1e/0x30
 unload_network_ops_symbols+0xc4de/0xf750 [falcon_lsm_pinned_15907]
 do_syscall_64+0x5c/0xc0
 ? do_syscall_64+0x69/0xc0
 ? do_syscall_64+0x69/0xc0
 entry_SYSCALL_64_after_hwframe+0x61/0xcb

A workaround is to disable the IOMMU with "iommu=off amd_iommu=off" on the kernel command line.

[Fix]

The fix is to clamp max_hw_sectors to the largest mapping size that still fits in the IOVA cache, so that IOVA allocation and freeing during streaming DMA mapping stay fast.

The fix requires two dependency commits, which introduce dma_opt_mapping_size(), a function that returns this optimal mapping size.

commit a229cc14f3395311b899e5e582b71efa8dd01df0
Author: John Garry <email address hidden>
Date: Thu Jul 14 19:15:24 2022 +0800
Subject: dma-mapping: add dma_opt_mapping_size()
Link: https://github.com/torvalds/linux/commit/a229cc14f3395311b899e5e582b71efa8dd01df0

commit 6d9870b7e5def2450e21316515b9efc0529204dd
Author: John Garry <email address hidden>
Date: Thu Jul 14 19:15:25 2022 +0800
Subject: dma-iommu: add iommu_dma_opt_mapping_size()
Link: https://github.com/torvalds/linux/commit/6d9870b7e5def2450e21316515b9efc0529204dd

The dependencies are present in 6.0-rc1 and later.
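
Together, the two commits roughly amount to the following (paraphrased, not the verbatim upstream code): dma_opt_mapping_size() asks the DMA ops for an optimal size, and the IOMMU DMA ops report the IOVA cache limit:

    /* kernel/dma/mapping.c (paraphrased) */
    size_t dma_opt_mapping_size(struct device *dev)
    {
            const struct dma_map_ops *ops = get_dma_ops(dev);
            size_t size = SIZE_MAX;

            if (ops && ops->opt_mapping_size)
                    size = ops->opt_mapping_size();

            return min(dma_max_mapping_size(dev), size);
    }

    /* drivers/iommu/dma-iommu.c (paraphrased): report the largest IOVA size
     * the per-cpu cache can hold, i.e. 128kb with 4kb pages */
    static size_t iommu_dma_opt_mapping_size(void)
    {
            return iova_rcache_range();
    }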

The fix itself simply computes max_hw_sectors from dma_opt_mapping_size() instead of dma_max_mapping_size(). It needs a backport, as the code that sets dev->ctrl.max_hw_sectors lives in nvme_reset_work() in 5.15, but moved to nvme_pci_alloc_dev() in later releases.

commit 3710e2b056cb92ad816e4d79fa54a6a5b6ad8cbd
Author: Adrian Huang <email address hidden>
Date: Fri Apr 21 16:08:00 2023 +0800
Subject: nvme-pci: clamp max_hw_sectors based on DMA optimized limitation
Link: https://github.com/torvalds/linux/commit/3710e2b056cb92ad816e4d79fa54a6a5b6ad8cbd

The fix is present in 6.4-rc3 and later.
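
For reference, the change is a one-line substitution in drivers/nvme/host/pci.c (landing in nvme_reset_work() for the 5.15 backport); paraphrased, not verbatim:

    /* Before: clamp to the maximum size the DMA layer can map at all */
    dev->ctrl.max_hw_sectors = min_t(u32, NVME_MAX_KB_SZ << 1,
                                     dma_max_mapping_size(dev->dev) >> 9);

    /* After: clamp to the DMA-optimal mapping size instead, so that with an
     * IOMMU and 4kb pages, max_hw_sectors_kb drops from 2048 to 128 */
    dev->ctrl.max_hw_sectors = min_t(u32, NVME_MAX_KB_SZ << 1,
                                     dma_opt_mapping_size(dev->dev) >> 9);

The >> 9 converts bytes to 512-byte sectors, so a 128kb optimal mapping size yields max_hw_sectors = 256.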

[Testcase]

The system needs to be extremely busy. So busy, in fact, that we cannot reproduce the issue in lab environments, only in production.

The systems that hit this issue have 64 cores, ~90%+ sustained CPU usage, ~90%+ sustained memory usage, high disk I/O, and nearly saturated network throughput with 100Gb NICs.

The NVMe disk MUST have /sys/block/nvme0n1/queue/max_hw_sectors_kb greater than 128kb; in this case, it is 2048kb.

Leave the system under sustained load until IOVA allocations slow to a crawl and soft or hard lockups occur while waiting for iova_rbtree_lock.

A test kernel is available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf374805-test

If you install the kernel and leave it running, the soft lockups will no longer occur.

[Where problems could occur]

We are changing the value of max_hw_sectors_kb for NVMe devices on systems with the IOMMU enabled. On systems without an IOMMU, or with the IOMMU disabled, the value remains the same as it is now.

The new value is the minimum of the maximum transfer size supported by the hardware and the largest mapping that fits into the IOVA cache, e.g. a device advertising 2048kb is clamped to 128kb. For some workloads this might have a small impact on performance, since larger I/Os now have to be split into multiple smaller requests, but there should be a net gain, as the resulting IOVA allocations all fit in the cache and complete much faster than a single large allocation.

If a regression were to occur, users could disable the IOMMU as a workaround.

Tags: jammy sts
Changed in linux (Ubuntu):
status: New → Fix Released
Changed in linux (Ubuntu Jammy):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Matthew Ruffell (mruffell)
tags: added: jammy sts
Mark Nelson (mark-a-nelson) wrote :

Hey folks,

I think we may have encountered this or a variant of this while running extremely strenuous Ceph performance tests on a very high speed cluster we designed for a customer. We have a write-up that includes a section on needing to disable iommu here:

https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/

Good job figuring this one out to everyone involved!
