Improve NVMe guest performance on Bionic QEMU
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| The Ubuntu-power-systems project | Fix Released | Medium | Canonical Server Team | |
| linux (Ubuntu) | Fix Released | Undecided | Unassigned | |
| linux (Ubuntu Bionic) | Won't Fix | Undecided | Unassigned | |
| qemu (Ubuntu) | Fix Released | Undecided | Ubuntu on IBM Power Systems Bug Triage | |
| qemu (Ubuntu Bionic) | Fix Released | Undecided | Unassigned | |
Bug Description
[Impact]
* In the past qemu has generally not allowed MSI-X BAR mapping on VFIO.
But there are platforms (like ppc64 spapr) that can and want to do
exactly that.
* Backport two patches from upstream (included since qemu 2.12 / Disco).
* These bring a tremendous speedup, especially useful with page sizes
bigger than 4k: MMIO accesses no longer have to be split into chunks,
and direct MMIO access becomes possible for the guest.
[Test Case]
* On ppc64 pass through an NVMe device to the guest and run I/O
benchmarks; see below for details on how to set that up.
Note: this needs the HWE kernel or another kernel fixup for [1].
Note: the test should also be done with the non-HWE kernel; the
expectation there is that it will not show the perf benefits, but
still work fine.
[Regression Potential]
* Changes:
a) if the host driver allows mapping of MSI-X data, the entire BAR is
mapped. This is only done if the kernel reports that capability [1].
This ensures that qemu only exposes the new behavior on kernels able
to do so (safe against regression in that regard).
b) on ppc64 MSI-X emulation is disabled for VFIO devices. This is local
to just this HW and will not affect other HW.
Generally the regressions that come to mind are slight changes in
behavior (real HW vs. the former emulation) that could cause trouble
on some weird/old guests. But even then it is limited to PPC, where
only a small set of certified HW is really allowed.
The mapping that might be added even on other platforms should not
consume too much extra memory as long as it isn't used. Further, since
it depends on the kernel capability, it isn't randomly issued on
kernels where we expect it to fail.
So while it is quite a change, it seems safe to me.
[Other Info]
* I know, one could as well call this a "feature", but it really is a
performance bug fix more than anything else, and the SRU policy allows
for such cases.
Therefore I think this is fine as an SRU.
== Comment: #0 - Murilo Opsfelder Araujo - 2019-10-11 14:16:14 ==
---Problem Description---
Back-port the following patches to Bionic QEMU to improve NVMe guest performance by more than 200%:
"vfio-pci: Allow mmap of MSIX BAR"
https:/
"ppc/spapr, vfio: Turn off MSIX emulation for VFIO devices"
https:/
---uname output---
na
---Additional Hardware Info---
0030:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172Xa/172Xb (rev 01)
Machine Type = AC922
---Debugger---
A debugger is not configured
---Steps to Reproduce---
Install or setup a guest image and boot it.
Once guest is running, passthrough the NVMe disk to the guest using the XML:
host$ cat nvme-disk.xml
<hostdev mode='subsystem' type='pci' managed='no'>
<driver name='vfio'/>
<source>
<address domain='0x0030' bus='0x01' slot='0x00' function='0x0'/>
</source>
</hostdev>
host$ virsh attach-device <domain> nvme-disk.xml --live
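For reference, the steps above can be sketched as a single shell sequence. The guest name `myguest` is an assumption (substitute your own domain), and the PCI address must match your own device:

```shell
# Sketch of the passthrough setup above. Assumptions: the guest is named
# "myguest", and the PCI address matches the NVMe controller as reported
# by "lspci -D" on the host.
cat > nvme-disk.xml <<'EOF'
<hostdev mode='subsystem' type='pci' managed='no'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0030' bus='0x01' slot='0x00' function='0x0'/>
  </source>
</hostdev>
EOF

# Sanity-check the XML before attaching it to the running guest.
grep -q "domain='0x0030'" nvme-disk.xml && echo "xml ok"

# Attach live (requires libvirt; the device must not be in use on the host):
# virsh attach-device myguest nvme-disk.xml --live
```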
On the guest, run fio benchmarks:
guest$ fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --runtime=60 --name=job1 --filename=
Results are similar with numjobs=4 and numjobs=64, respectively:
READ: bw=385MiB/s (404MB/s), 78.0MiB/s-115MiB/s (81.8MB/s-120MB/s), io=11.3GiB (12.1GB), run=30001-30001msec
READ: bw=382MiB/s (400MB/s), 2684KiB/s-12.6MiB/s (2749kB/
With the two patches applied, performance improved significantly for numjobs=4 and numjobs=64 cases, respectively:
READ: bw=1191MiB/s (1249MB/s), 285MiB/s-309MiB/s (299MB/s-324MB/s), io=34.9GiB (37.5GB), run=30001-30001msec
READ: bw=4273MiB/s (4481MB/s), 49.7MiB/s-113MiB/s (52.1MB/s-119MB/s), io=125GiB (134GB), run=30001-30005msec
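As a quick sanity check, the improvement implied by the aggregate READ bandwidths quoted above (numbers taken verbatim from this report) can be computed with shell arithmetic:

```shell
# Percent improvement from the aggregate READ bandwidths quoted above
# (MiB/s; before/after, numjobs=4 and numjobs=64 respectively).
before4=385;  after4=1191
before64=382; after64=4273

gain4=$(( (after4 - before4) * 100 / before4 ))
gain64=$(( (after64 - before64) * 100 / before64 ))

echo "numjobs=4:  +${gain4}%"    # well over the 200% claimed
echo "numjobs=64: +${gain64}%"
```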
Userspace tool common name: qemu
Userspace rpm: qemu
The userspace tool has the following bit modes: 64-bit
Userspace tool obtained from project website: na
Related branches
- Rafael David Tinoco (community): Approve on 2019-10-25
- Canonical Server Team: Pending requested 2019-10-15
- Ubuntu Server Dev import team: Pending requested 2019-10-15
-
Diff: 486 lines (+440/-0), 7 files modified:
debian/changelog (+10/-0)
debian/patches/lp-1847948-vfio-Use-a-trace-point-when-a-RAM-section-cannot-be-.patch (+81/-0)
debian/patches/lp-1847948-vfio-pci-Relax-DMA-map-errors-for-MMIO-regions.patch (+137/-0)
debian/patches/series (+5/-0)
debian/patches/ubuntu/lp-1842774-s390x-cpumodel-Add-the-z15-name-to-the-description-o.patch (+30/-0)
debian/patches/ubuntu/lp-1847948-ppc-spapr-vfio-Turn-off-MSIX-emulation-for-VFIO-devi.patch (+82/-0)
debian/patches/ubuntu/lp-1847948-vfio-pci-Allow-mmap-of-MSIX-BAR.patch (+95/-0)
tags: | added: architecture-ppc64le bugnameltc-181875 severity-medium targetmilestone-inin18045 |
Changed in ubuntu: | |
assignee: | nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) |
affects: | ubuntu → qemu (Ubuntu) |
Changed in ubuntu-power-systems: | |
status: | New → Triaged |
importance: | Undecided → Medium |
assignee: | nobody → Canonical Server Team (canonical-server) |
Christian Ehrhardt (paelzer) wrote : | #2 |
FYI kernel commit is [1] I haven't checked if any context is needed.
Christian Ehrhardt (paelzer) wrote : | #3 |
For Qemu the code is in 2.12, therefore as requested we only need to consider Bionic.
@Murilo: do you have a good testcase for this to verify the bug when it became an accepted SRU?
Could you outline what has to be done and if it requires special HW or a rather complex setup maybe even commit to do the tests then?
Changed in qemu (Ubuntu Bionic): | |
status: | New → Triaged |
Changed in linux (Ubuntu): | |
status: | New → Fix Released |
Changed in qemu (Ubuntu): | |
status: | New → Fix Released |
Christian,
The test case for this is to:
- passthrough the NVMe disk to guest;
- and run fio benchmarks against the passed-through NVMe disk in the guest.
I have detailed these steps and pasted the commands in the bug description. Please let me know if steps are not clear.
The only required HW is an NVMe disk. We (IBM) can verify and do the tests, if needed.
Thank you!
Murilo
Christian Ehrhardt (paelzer) wrote : | #5 |
I must have been blind; of course your testcase in the description is fine. Sorry for even asking.
I'll give it a try to see whether the improvement also shows on some HW I can access, but if not, thanks for the offer to test it on your side.
Christian Ehrhardt (paelzer) wrote : | #6 |
FYI: Test builds are in [1]. It isn't required, but would be great if you could give it a check as well ahead of the actual SRU.
[1]: https:/
Christian Ehrhardt (paelzer) wrote : | #7 |
All systems I could easily get a hold of already had their NVME drives in use so I couldn't pass them through easily. Therefore I'd like to come back to your offer to check the test build.
Christian Ehrhardt (paelzer) wrote : | #8 |
@Murilo: I was wondering - I think we might want/need also [1] to not run into fatal false positives. What is your opinion on this?
[1]: https:/
description: | updated |
Christian Ehrhardt (paelzer) wrote : | #9 |
And on top of the former question most likely then also [1].
[1]: https:/
Christian,
Using the QEMU from comment #6 improved the performance for the numjobs=4 and numjobs=64 cases, respectively:
READ: bw=1140MiB/s (1195MB/s), 271MiB/s-302MiB/s (284MB/s-317MB/s), io=66.8GiB (71.7GB), run=60001-60001msec
READ: bw=4271MiB/s (4479MB/s), 30.1MiB/s-82.6MiB/s (31.6MB/
$ dpkg -l | grep ppa1
ii qemu 1:2.11+
ii qemu-block-
ii qemu-kvm 1:2.11+
ii qemu-system 1:2.11+
ii qemu-system-arm 1:2.11+
ii qemu-system-common 1:2.11+
ii qemu-system-mips 1:2.11+
ii qemu-system-misc 1:2.11+
ii qemu-system-ppc 1:2.11+
ii qemu-system-s390x 1:2.11+
ii qemu-system-sparc 1:2.11+
ii qemu-system-x86 1:2.11+
ii qemu-user 1:2.11+
ii qemu-user-binfmt 1:2.11+
ii qemu-utils 1:2.11+
Note 0: We booted a newer kernel on the host that contained the patch
https:/
Using the Bionic kernel 4.15.0-65.74, performance didn't change,
most likely because the Bionic kernel doesn't contain the above patch yet.
Note 1: Kudos to Murilo Vicentini for running the tests and capturing
the numbers.
Christian,
Patch from comment 8 seems yet another nice improvement. We haven't tried it to have a strong opinion.
If it's included, then I think patch from comment 9 won't hurt being added too.
Murilo
Christian Ehrhardt (paelzer) wrote : | #12 |
Thanks for your feedback and tests Murilo*2 (There seems to be a team of Murilo's on this :-) - thanks for your efforts and fast responses bringing this case forward.)
With the main changes requested, I really think we need the patch from #8 to avoid issues on other HW setups that, due to the new code, will end up having non-aligned sections in vfio_memory-
I have respun the build in the PPA [1] which now is at 2.11+dfsg-
[1]: https:/
------- Comment From <email address hidden> 2019-10-16 09:05 EDT-------
Using the 2.11+dfsg-
READ: bw=1155MiB/s (1211MB/s), 285MiB/s-295MiB/s (299MB/s-309MB/s), io=67.7GiB (72.7GB), run=60001-60001msec
READ: bw=4273MiB/s (4481MB/s), 42.7MiB/s-106MiB/s (44.7MB/s-111MB/s), io=250GiB (269GB), run=60001-60005msec
# dpkg -l | grep ppa2
ii qemu 1:2.11+
ii qemu-block-
ii qemu-kvm 1:2.11+
ii qemu-system 1:2.11+
ii qemu-system-arm 1:2.11+
ii qemu-system-common 1:2.11+
ii qemu-system-mips 1:2.11+
ii qemu-system-misc 1:2.11+
ii qemu-system-ppc 1:2.11+
ii qemu-system-s390x 1:2.11+
ii qemu-system-sparc 1:2.11+
ii qemu-system-x86 1:2.11+
ii qemu-user 1:2.11+
ii qemu-user-binfmt 1:2.11+
ii qemu-utils 1:2.11+
As an important note, this respin was only tested with the newer custom kernel mentioned in the previous comment. Christian, let me know if you need this tested with the vanilla kernel 4.15.0-65.74 as well.
Christian Ehrhardt (paelzer) wrote : | #14 |
Thanks Murilo, I don't need a test with the base kernel yet.
But later, when we actually SRU this, it will be good to check both an old and a HWE kernel.
Let me add that to the verification steps ...
description: | updated |
Christian Ehrhardt (paelzer) wrote : | #15 |
I opened up the MP for review again after pushing the extra patches that you tested.
Once that is done I can push it to the SRU teams queue ...
Frank Heimes (fheimes) wrote : | #16 |
Regarding the kernel part of this LP ticket I plan to set it to Won't Fix - if nobody objects.
And that's because we usually do not add performance improvements to stable kernels, but have the hardware enablement (HWE) kernel instead.
I checked disco's git tree and "vfio-pci: Allow mapping MSIX BAR" a32295c is already included.
Therefore one can already use the HWE kernel that is shipped with 18.04.3 (that is based on disco's kernel 5.0).
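A rough version gate for this could look like the sketch below. The helper name is hypothetical, and the 4.16 cut-off is only a mainline heuristic (per the comments here; distro backports may differ):

```shell
# Hypothetical helper: returns success if kernel version $1 is >= 4.16,
# where "vfio-pci: Allow mapping MSIX BAR" landed upstream. This is only
# a version-number heuristic -- distro backports may differ.
msix_mmap_capable() {
    need=4.16
    have=${1%%-*}   # strip "-generic"/ABI suffixes
    [ "$(printf '%s\n' "$need" "$have" | sort -V | head -n1)" = "$need" ]
}

# The stock Bionic 4.15 kernel predates the fix; the HWE (5.0) one does not.
msix_mmap_capable "4.15.0-65.74"        || echo "4.15: needs HWE kernel"
msix_mmap_capable "5.0.0-23.24~18.04.1" && echo "5.0: capable"
```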
Agreed on Frank's last comment.
commit a32295c612c5799
Author: Alexey Kardashevskiy <email address hidden>
Date: Wed Dec 13 00:31:31 2017
vfio-pci: Allow mapping MSIX BAR
By default VFIO disables mapping of MSIX BAR to the userspace as
the userspace may program it in a way allowing spurious interrupts;
instead the userspace uses the VFIO_DEVICE_
In order to eliminate guessing from the userspace about what is
mmapable, VFIO also advertises a sparse list of regions allowed to mmap.
This works fine as long as the system page size equals to the MSIX
alignment requirement which is 4KB. However with a bigger page size
the existing code prohibits mapping non-MSIX parts of a page with MSIX
structures so these parts have to be emulated via slow reads/writes on
a VFIO device fd. If these emulated bits are accessed often, this has
serious impact on performance.
This allows mmap of the entire BAR containing MSIX vector table.
This removes the sparse capability for PCI devices as it becomes useless.
As the userspace needs to know for sure whether mmapping of the MSIX
vector containing data can succeed, this adds a new capability -
VFIO_
that the entire BAR can be mmapped.
This does not touch the MSIX mangling in the BAR read/write handlers as
we are doing this just to enable direct access to non MSIX registers.
Signed-off-by: Alexey Kardashevskiy <email address hidden>
[aw - fixup whitespace, trim function name]
Signed-off-by: Alex Williamson <email address hidden>
Seems a case for HWE kernels.
Changed in linux (Ubuntu Bionic): | |
status: | New → Won't Fix |
After reading this:
* I know, one could as well call that a "feature", but it really is a
performance bug fix more than anything else. Also the SRU policy allows
exploitation
Therefore I think this is fine as SRU.
I'm okay with that interpretation also. My doubt is about allowing userspace to map the MSI-X table, which is not currently allowed.
Actually ... it would be allowed and it would have an "undefined behavior" because of the sparse feature. So, perhaps you're right and it could be seen as a fix both ways: fixing undefined behavior and allowing the MSI-X table to be mmaped, which will allow better performance for QEMU.
After re-thinking, I'm +1 on the backport for Bionic Kernel also if others agree.
Christian Ehrhardt (paelzer) wrote : | #19 |
Uploaded for SRU Team review and acceptance into proposed.
Changed in ubuntu-power-systems: | |
status: | Triaged → Fix Released |
status: | Fix Released → In Progress |
Christian Ehrhardt (paelzer) wrote : | #20 |
This upload was superseded by a security upload while waiting for the SRU team.
I have uploaded rebased versions to -unapproved.
Hello bugproxy, or anyone else affected,
Accepted qemu into bionic-proposed. The package will build now and be available at https:/
Please help us by testing this new package. See https:/
If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-
Further information regarding the verification process can be found at https:/
N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.
Changed in qemu (Ubuntu Bionic): | |
status: | Triaged → Fix Committed |
tags: | added: verification-needed verification-needed-bionic |
Changed in ubuntu-power-systems: | |
status: | In Progress → Fix Committed |
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (qemu/1:2.11+dfsg-1ubuntu7.21) | #22 |
All autopkgtests for the newly accepted qemu (1:2.11+
The following regressions have been reported in tests triggered by the package:
vagrant-
systemd/
Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUp
https:/
[1] https:/
Thank you!
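A minimal way to tell whether a given qemu version string already contains this SRU, using the fixed version from the report above. `sort -V` only approximates dpkg version ordering (it is good enough for these same-epoch Ubuntu strings); on a real system prefer `dpkg --compare-versions`:

```shell
# has_fix: success if qemu version string $1 is at or above the fixed
# version from this SRU. sort -V approximates dpkg ordering well enough
# for these same-epoch version strings; on a real system prefer
#   dpkg --compare-versions "$installed" ge "$fixed"
has_fix() {
    fixed="1:2.11+dfsg-1ubuntu7.21"
    [ "$(printf '%s\n' "$fixed" "$1" | sort -V | head -n1)" = "$fixed" ]
}

has_fix "1:2.11+dfsg-1ubuntu7.20" || echo "7.20: still needs the SRU"
has_fix "1:2.11+dfsg-1ubuntu7.21" && echo "7.21: fix included"
```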
Christian Ehrhardt (paelzer) wrote : | #23 |
I found no obvious regressions with this build in a run of our usual set of tests.
@IBM - for SRU verification - can you give the actual testcase it was intended for a shot in your environment, as you had that set up in the past?
------- Comment From <email address hidden> 2019-11-25 08:42 EDT-------
(In reply to comment #24)
> I found no obvious regressions with this build in a run of our usual set of
> tests.
>
> @IBM - for SRu verification - can you give the actual testcase it was
> intended for a shot in your environment as you had that set up in the past?
Yeah, sure. I don't own the setup to do so, but I am trying to schedule some time on it to run the testcase. I'll get back with the results as soon as possible.
bugproxy (bugproxy) wrote : | #25 |
------- Comment From <email address hidden> 2019-11-25 16:06 EDT-------
Apologies for the delay, just tested with the 1:2.11+
# fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --runtime=60 --name=job1 --filename=
READ: bw=1216MiB/s (1275MB/s), 300MiB/s-315MiB/s (314MB/s-330MB/s), io=71.3GiB (76.5GB), run=60001-60002msec
# fio --direct=1 --rw=randrw --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=16 --runtime=60 --name=job1 --filename=
READ: bw=4081MiB/s (4280MB/s), 58.3MiB/s-69.8MiB/s (61.1MB/
The same note as before applies that I am using the patched custom kernel mentioned in previous comments.
$ dpkg -l | grep 1:2.11+
ii qemu 1:2.11+
ii qemu-block-
ii qemu-kvm 1:2.11+
ii qemu-system 1:2.11+
ii qemu-system-arm 1:2.11+
ii qemu-system-common 1:2.11+
ii qemu-system-mips 1:2.11+
ii qemu-system-misc 1:2.11+
ii qemu-system-ppc 1:2.11+
ii qemu-system-s390x 1:2.11+
ii qemu-system-sparc 1:2.11+
ii qemu-system-x86 1:2.11+
ii qemu-user 1:2.11+
ii qemu-user-binfmt 1:2.11+
ii qemu-utils 1:2.11+
Frank Heimes (fheimes) wrote : | #26 |
Thanks for the verification, Murilo.
I updated the tags based on comment #25.
tags: |
added: verification-done verification-done-bionic removed: verification-needed verification-needed-bionic |
Launchpad Janitor (janitor) wrote : | #28 |
This bug was fixed in the package qemu - 1:2.11+
---------------
qemu (1:2.11+
* d/p/lp-
update the z15 model name (LP: #1842774)
* d/p/u/lp-1847948-*: allow MSIX BAR mapping on VFIO in general and use that
instead of emulation on ppc64 increasing performance of e.g. NVME
passthrough (LP: #1847948)
-- Christian Ehrhardt <email address hidden> Tue, 15 Oct 2019 11:23:23 +0200
Changed in qemu (Ubuntu Bionic): | |
status: | Fix Committed → Fix Released |
The verification of the Stable Release Update for qemu has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
Changed in ubuntu-power-systems: | |
status: | Fix Committed → Fix Released |
The kernel part of this is in since 4.16, so it either needs a HWE kernel or a kernel task to backport the related change as well. I'll add a task, but it depends on the kernel team (doability) and the reporter's request (which kernel it is needed for).