[UBUNTU 24.04] IOMMU DMA mode changed in kernel config causes massive throughput degradation for PCI-related network workloads

Bug #2071471 reported by bugproxy
This bug affects 1 person
Affects                     Status                       Importance   Assigned to             Milestone
Ubuntu on IBM z Systems     Fix Committed                High         Skipper Bug Screeners
linux (Ubuntu)              Status tracked in Oracular
  Noble                     Fix Committed                Medium       Unassigned
  Oracular                  Fix Committed                Medium       Canonical Kernel Team

Bug Description

SRU Justification:

[Impact]

 * With the introduction of c76c067e488c "s390/pci: Use dma-iommu layer"
   (upstream since kernel v6.7-rc1) there was a move (on s390x only)
   to a different dma-iommu implementation.

 * And with 92bce97f0c34 "s390/pci: Fix reset of IOMMU software counters"
   (again upstream since v6.7-rc1) the IOMMU_DEFAULT_DMA_LAZY kernel config
   option should now be set to 'y' by default for s390x.

 * Since CONFIG_IOMMU_DEFAULT_DMA_STRICT and CONFIG_IOMMU_DEFAULT_DMA_LAZY
   are part of the same Kconfig choice (and therefore mutually exclusive),
   CONFIG_IOMMU_DEFAULT_DMA_STRICT needs to be set to 'n' by default,
   which was done upstream by b2b97a62f055 "Revert "s390: update defconfigs"".

 * These changes are all upstream, but were not picked up by the Ubuntu
   kernel config.

 * And not having these config options set properly is causing significant
   PCI-related network throughput degradation (up to -72%).

 * This shows for almost all workloads and connection counts,
   and deteriorates further as the number of connections increases.

 * The drop is especially drastic for a high number of parallel connections
   (50 and 250) and for small and medium-size transactional workloads.
   However, the degradation is also clearly visible for streaming-type
   workloads (up to 48% degradation).

[Fix]

 * The (upstream-accepted) fix is to set
   CONFIG_IOMMU_DEFAULT_DMA_STRICT=n
   and
   CONFIG_IOMMU_DEFAULT_DMA_LAZY=y
   (which is needed for the changed DMA IOMMU implementation since v6.7).
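
 * To check which values a given Ubuntu kernel actually ships with, the two
   options can be read from the installed config file (a minimal sketch;
   /boot/config-$(uname -r) is the standard Ubuntu location):
   $ grep -E 'CONFIG_IOMMU_DEFAULT_DMA_(STRICT|LAZY)' /boot/config-$(uname -r)
   With the fix applied this is expected to show something like:
   # CONFIG_IOMMU_DEFAULT_DMA_STRICT is not set
   CONFIG_IOMMU_DEFAULT_DMA_LAZY=y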

[Test Case]

 * Set up two Ubuntu Server 24.04 systems (with kernel 6.8),
   one acting as server and one as client,
   that have (PCIe-attached) RoCE Express devices
   and are connected to each other.

 * Verify that the iommu_group type of the used PCI device is DMA-FQ:
   cat /sys/bus/pci/devices/<device>\:00\:00.0/iommu_group/type
   DMA-FQ

 * Sample workload rr1c-200x1000-250 with rr1c-200x1000-250.xml:
   <?xml version="1.0"?>
   <profile name="TCP_RR">
           <group nprocs="250">
                   <transaction iterations="1">
                           <flowop type="connect" options="remotehost=<remote IP> protocol=tcp tcp_nodelay" />
                   </transaction>
                   <transaction duration="300">
                           <flowop type="write" options="size=200"/>
                           <flowop type="read" options="size=1000"/>
                   </transaction>
                   <transaction iterations="1">
                           <flowop type="disconnect" />
                   </transaction>
           </group>
   </profile>

 * Install uperf on both systems, client and server.

 * Start uperf at server: uperf -s

 * Start uperf at client: uperf -vai 5 -m uperf-profile.xml

 * Switch from strict to lazy mode,
   either by using the new kernel (or the test build below)
   or by using the kernel command-line parameter iommu.strict=0
   (a GRUB-based sketch is shown at the end of this test case).

 * Restart uperf on server and client, like before.

 * Verification will be performed by IBM.
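
 * For the command-line route, a minimal sketch of making iommu.strict=0
   persistent across reboots (assuming the standard Ubuntu GRUB setup;
   adapt to the local boot configuration if it differs):
   # edit /etc/default/grub and append iommu.strict=0 to GRUB_CMDLINE_LINUX_DEFAULT, e.g.
   #   GRUB_CMDLINE_LINUX_DEFAULT="... iommu.strict=0"
   $ sudo update-grub
   $ sudo reboot
   # after the reboot, confirm the parameter is active:
   $ cat /proc/cmdline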

[Regression Potential]

 * There is a certain regression potential, since the behavior with
   the two modified kernel config options will change significantly.

 * This may solve the (network) throughput issue with PCI devices,
   but may also come with side effects on other PCIe-based devices
   (the old compression adapters or the new NVMe carrier cards).

[Other]

 * CCW devices are not affected.

 * This change is s390x-specific and hence will not affect any other architecture.

__________

Symptom:
Comparing Ubuntu 24.04 (kernel version 6.8.0-31-generic) against Ubuntu 22.04, all of our PCI-related network measurements on LPAR show massive throughput degradations (up to -72%). This shows for almost all workloads and connection counts, deteriorating as the number of connections increases. The drop is especially drastic for a high number of parallel connections (50 and 250) and for small and medium-size transactional workloads. However, the degradation is also clearly visible for streaming-type workloads (up to 48% degradation).

Problem:
With the kernel config setting CONFIG_IOMMU_DEFAULT_DMA_STRICT=y, the IOMMU DMA mode changed from lazy to strict, causing these massive degradations.
The behavior can also be changed with a kernel command-line parameter (iommu.strict) for easy verification.

The issue is known and was quickly fixed upstream in December 2023, after being present for a little less than two weeks.
Upstream fix: https://github.com/torvalds/linux/commit/b2b97a62f055dd638f7f02087331a8380d8f139a
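
As a quick cross-check of which mode a given PCI device is actually using, the iommu_group type can be read from sysfs (the device address below is a placeholder): 'DMA' indicates strict mode, 'DMA-FQ' indicates lazy (flush-queue) mode.

cat /sys/bus/pci/devices/<device>\:00\:00.0/iommu_group/type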

Repro:
rr1c-200x1000-250 with rr1c-200x1000-250.xml:

<?xml version="1.0"?>
<profile name="TCP_RR">
        <group nprocs="250">
                <transaction iterations="1">
                        <flowop type="connect" options="remotehost=<remote IP> protocol=tcp tcp_nodelay" />
                </transaction>
                <transaction duration="300">
                        <flowop type="write" options="size=200"/>
                        <flowop type="read" options="size=1000"/>
                </transaction>
                <transaction iterations="1">
                        <flowop type="disconnect" />
                </transaction>
        </group>
</profile>

0) Install uperf on both systems, client and server.
1) Start uperf at server: uperf -s
2) Start uperf at client: uperf -vai 5 -m uperf-profile.xml

3) Switch from strict to lazy mode using the kernel command-line parameter iommu.strict=0.
4) Repeat steps 1) and 2).

Example:
For the following example, we chose the workload named above (rr1c-200x1000-250):

iommu.strict=1 (strict): 233464.914 TPS
iommu.strict=0 (lazy): 835123.193 TPS

bugproxy (bugproxy)
tags: added: architecture-s39064 bugnameltc-207082 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → linux (Ubuntu)
Revision history for this message
Frank Heimes (fheimes) wrote :

I just had a look at the Ubuntu kernel noble master-next tree and can find the commit
"Revert "s390: update defconfigs"" under the hash b2b97a62f055,
and I can see that it reverts the config option:
$ git show b2b97a62f055 | grep CONFIG_IOMMU_DEFAULT_DMA_STRICT
    CONFIG_IOMMU_DEFAULT_DMA_STRICT option needs to be disabled.
-CONFIG_IOMMU_DEFAULT_DMA_STRICT=y
-CONFIG_IOMMU_DEFAULT_DMA_STRICT=y

But git also tells me that it has been in since kernel v6.8, and with that since the first Ubuntu 6.8 kernel we had: Ubuntu-6.8.0-6 -- so it should also be in Ubuntu-6.8.0-31.
But it does not seem to be reflected in the kernel config options of the Ubuntu kernel:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.04 LTS
Release: 24.04
Codename: noble
$ uname -a
Linux hwe0008 6.8.0-31-generic #31-Ubuntu SMP Sat Apr 20 00:14:26 UTC 2024 s390x s390x s390x GNU/Linux
$ grep CONFIG_IOMMU_DEFAULT_DMA_STRICT /boot/config-6.8.0-31-generic
CONFIG_IOMMU_DEFAULT_DMA_STRICT=y
also not in the current, updated kernel:
Linux hwe0008 6.8.0-36-generic #36-Ubuntu SMP Mon Jun 10 09:59:13 UTC 2024 s390x s390x s390x GNU/Linux
$ grep CONFIG_IOMMU_DEFAULT_DMA_STRICT /boot/config-6.8.0-36-generic
CONFIG_IOMMU_DEFAULT_DMA_STRICT=y
For some reason the change in the upstream commit was not taken over into the Ubuntu kernel configs ...

Changed in ubuntu-z-systems:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
Changed in linux (Ubuntu):
assignee: Skipper Bug Screeners (skipper-screen-team) → nobody
Changed in ubuntu-z-systems:
importance: Undecided → High
Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

Upstream commit b2b97a62f055dd638f7f02087331a8380d8f139a changes the s390x defconfigs, which don't impact the Ubuntu kernel configs.

Looking at the configs for the 22.04 generic kernel, 'CONFIG_IOMMU_DEFAULT_DMA_STRICT=y' seems to be set for s390x on all Ubuntu 5.15 generic kernel builds as well. Therefore it doesn't seem that the config has regressed, and although changing the kernel parameter at boot time improves the performance, it is likely not the only aspect causing the degradation.

Can IBM please confirm which 22.04 kernel version doesn't show the performance degradation, and what the config values for 'CONFIG_IOMMU_DEFAULT_DMA_STRICT' and 'CONFIG_IOMMU_DEFAULT_DMA_LAZY' are on the running kernel?

Before changing the kernel config I would like to gather some more information to assess the consequences of the change and whether we need to make other changes as well.

Thank you.

Revision history for this message
Frank Heimes (fheimes) wrote :

Btw, I just checked the jammy kernel config options, and jammy/22.04 had and still has strict=y:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
$ uname -a
Linux maasrrc1 5.15.0-112-generic #122-Ubuntu SMP Thu May 23 08:10:47 UTC 2024 s390x s390x s390x GNU/Linux
ubuntu@maasrrc1:~$ grep CONFIG_IOMMU_DEFAULT_DMA_STRICT /boot/config-5.15.0-112-generic
CONFIG_IOMMU_DEFAULT_DMA_STRICT=y
I also checked older jammy kernels and strict was always yes.

So this has obviously never changed from jammy to noble!

Have you modified/tweaked the setting manually on your side before testing on jammy?!

Revision history for this message
Frank Heimes (fheimes) wrote :

<Looks like Kleber and I commented in parallel ...>

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2024-07-01 04:49 EDT-------
For reference, the CONFIG_IOMMU_DEFAULT_DMA_* setting is relevant to s390 only
since upstream commit c76c067e488c ("s390/pci: Use dma-iommu layer").

This commit also includes the following hunk to default to CONFIG_IOMMU_DEFAULT_DMA_LAZY on s390x. Though I guess this wouldn't overwrite a pre-existing config value:

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index cd6727898b11..3199fd54b462 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -91,7 +91,7 @@ config IOMMU_DEBUGFS
 choice
 	prompt "IOMMU default domain type"
 	depends on IOMMU_API
-	default IOMMU_DEFAULT_DMA_LAZY if X86 || IA64
+	default IOMMU_DEFAULT_DMA_LAZY if X86 || IA64 || S390
 	default IOMMU_DEFAULT_DMA_STRICT
 	help
 	  Choose the type of IOMMU domain used to manage DMA API usage by

The line has since been modified to drop IA64, but it still contains S390 in current
upstream as well.

Revision history for this message
Frank Heimes (fheimes) wrote :

Hi Niklas, thanks for the background.

So to sum it up:

The Ubuntu kernel config options for IOMMU_DEFAULT_DMA_LAZY and IOMMU_DEFAULT_DMA_STRICT haven't changed since jammy/22.04:

5.15.0-1.1 to 5.15.0-115.125
CONFIG_IOMMU_DEFAULT_DMA_STRICT=y
# CONFIG_IOMMU_DEFAULT_DMA_LAZY is not set

6.8.0-6.6 to 6.8.0-38.38
CONFIG_IOMMU_DEFAULT_DMA_STRICT=y
# CONFIG_IOMMU_DEFAULT_DMA_LAZY is not set

But with the introduction of
c76c067e488c "s390/pci: Use dma-iommu layer",
which moves to a different dma-iommu implementation, and
92bce97f0c34 "s390/pci: Fix reset of IOMMU software counters"
(both upstream since v6.7-rc1),
the IOMMU_DEFAULT_DMA_LAZY kernel config option should be set to 'y' by default for s390x.

Does CONFIG_IOMMU_DEFAULT_DMA_STRICT need to be set to No on top? I don't think so - if lazy is 'yes'.

So I believe we can go with 'IOMMU_DEFAULT_DMA_LAZY=y' for s390x only and should be good again.

I assume that such situations do not happen very often,
so it would be ideal if IBM could give us a quick heads-up in such cases (where a kernel config default value for s390x is changed upstream), so that we can double-check and potentially take the changes over into the Ubuntu config by hand.

Revision history for this message
Frank Heimes (fheimes) wrote :

Looks like it's not sufficient to have IOMMU_DEFAULT_DMA_LAZY=y alone;
in addition, IOMMU_DEFAULT_DMA_STRICT=n seems to be needed on top (otherwise the build failed for me).
(Which I think is not super great upstream -- one option should be fine; with the two kernel options, there are now two unspecified cases: both 'n' and both 'y' - anyway ...)

I did a test build in PPA here:
https://launchpad.net/~fheimes/+archive/ubuntu/lp2071471
would be great if this can be tried.

Frank Heimes (fheimes)
description: updated
Changed in ubuntu-z-systems:
status: New → In Progress
Changed in linux (Ubuntu):
status: New → In Progress
Frank Heimes (fheimes)
description: updated
Revision history for this message
Frank Heimes (fheimes) wrote :

Modification was sent to kernel teams mailing list:
https://lists.ubuntu.com/archives/kernel-team/2024-July/thread.html#151928

Changed in linux (Ubuntu):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2024-07-03 05:24 EDT-------
Interesting with "make defconfig" on upstream I get

?
# CONFIG_IOMMU_DEFAULT_DMA_STRICT is not set
CONFIG_IOMMU_DEFAULT_DMA_LAZY=y
?

And that works well.

Revision history for this message
Frank Heimes (fheimes) wrote :

Well, 'make defconfig' is what I cannot use, since we have the Ubuntu-specific kernel options (and tools).
But since "not set" defaults to 'n', I guess it should be okay to explicitly set CONFIG_IOMMU_DEFAULT_DMA_STRICT=n.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2024-07-03 06:53 EDT-------
(In reply to comment #11)
> Looks like it's not sufficient to have IOMMU_DEFAULT_DMA_LAZY=y only
> in addition IOMMU_DEFAULT_DMA_STRICT=n seems to be needed on top (otherwise
> the build failed for me).
> (Which I think is not super great upstream, one option should be fine --
> with the two kernel options, there are now two unspecific cases: both n and
> both y - anyway ...)
>
> I did a test build in PPA here:
> https://launchpad.net/~fheimes/+archive/ubuntu/lp2071471
> would be great if this can be tried.

I gave this a quick test on a z/VM with a RoCE VF.

root@redacted:~# uname -a
Linux redacted 6.8.0-38-generic #38~lp2071471-Ubuntu SMP Wed Jul 3 07:09:08 UTC 2024 s390x s390x s390x GNU/Linux
root@redacted:~# cat /sys/bus/pci/devices/18d1\:00\:00.0/iommu_group/type
DMA-FQ

So this uses the DMA Flush Queue mechanism as it should. Thanks!

Revision history for this message
Frank Heimes (fheimes) wrote :

Many thanks for confirming, Niklas!

description: updated
Stefan Bader (smb)
Changed in linux (Ubuntu Noble):
importance: Undecided → Medium
status: New → Fix Committed
Changed in linux (Ubuntu Oracular):
importance: Undecided → Medium
status: In Progress → Fix Committed
Changed in linux (Ubuntu Noble):
status: Fix Committed → In Progress
Stefan Bader (smb)
Changed in linux (Ubuntu Noble):
status: In Progress → Fix Committed
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/6.8.0-40.40 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-noble-linux' to 'verification-done-noble-linux'. If the problem still exists, change the tag 'verification-needed-noble-linux' to 'verification-failed-noble-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-noble-linux-v2 verification-needed-noble-linux