Add basic support to NVLink2 passthrough
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
The Ubuntu-power-systems project | | High | Canonical Kernel Team |
linux (Ubuntu) | | Undecided | Unassigned |
Bionic | | Undecided | Unassigned |
Bug Description
This bug tracks basic support for NVLink2 passthrough on Ubuntu 18.04, for the guest side only. There is a relatively small patchset that I'm going to send to the Canonical Kernel Team using this buglink.
On the host side we'll be running a custom version of Ubuntu 18.04 (kernel + qemu). On the guest side, however, it is *very important* that clients can simply download Ubuntu 18.04 from Canonical's website and have NVLink2 working out of the box.
For that, we have put together a small patchset that uses only upstream patches and changes nothing outside our area.
As soon as I send the patchset to the mailing list I'll update this bug with a link to that message.
Thank you very much,
Jose R. Ziviani
Changed in linux (Ubuntu):
status: New → Incomplete
Jose Ricardo Ziviani (joserz) wrote: #2
The patchset is on the mailing list for review:
https:/
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu):
status: Confirmed → Invalid
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in ubuntu-power-systems:
status: New → Fix Committed
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Jose Ricardo Ziviani (joserz) wrote: #3
SRU:
[Impact]
* An important feature was developed for PowerPC upstream and backported to a custom version of Ubuntu Bionic 18.04.1. The feature, known as NVLink2[1] passthrough, allows physical GPUs to be accessed from a QEMU/KVM virtual machine. The problem arises when clients want to use that feature within their virtual machines: they need a custom version instead of the standard Ubuntu Bionic for PowerPC, which everyone knows where to find and how to install.
We understand that this has a large impact on the user experience: besides the extra difficulty of finding and installing the correct version, users who do not realize that a custom version is needed will think that the feature is simply broken.
Because the guest part (the code that runs in the virtual machine) is much simpler than the host part, we decided to send the patches as an SRU, fixing the user-experience problem without impacting existing use cases.
[1] https:/
[Test Case]
* To reproduce the issue, a Power9 system with NVLink2 and an NVIDIA GPU is required, with the customized Ubuntu Bionic (kernel + qemu) installed.
* Then create a virtual machine as follows:
Create a disk image:
$ qemu-img create -f qcow2 sda.qcow2 100G
Find the devices to be attached:
$ lspci | grep NVIDIA
...
0004:04:00.0 3D controller: NVIDIA Corporation GV100 [Tesla V100 SXM2] (rev a1)
...
Detach all devices (including devices that belong to the same IOMMU group) to be passed to the virtual machine (script detach.sh attached):
$ sudo ./detach.sh 0004:04:00.0
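The attached detach.sh is not reproduced in this report. As a rough sketch only, and not the attached script, the devices that share an IOMMU group with a given PCI address can be listed from sysfs as shown below. The function name list_iommu_group_devices and its optional second argument (an alternative sysfs root, handy for dry runs) are hypothetical; the sysfs layout under /sys/bus/pci/devices/<address>/iommu_group/devices is the standard one.

```shell
#!/bin/sh
# Hypothetical helper, NOT the detach.sh attached to this bug.
# Prints the PCI address of every device in the same IOMMU group as $1.
# $2 optionally overrides the sysfs root (defaults to /sys).
list_iommu_group_devices() {
    addr=$1
    sysfs_root=${2:-/sys}
    group_dir="$sysfs_root/bus/pci/devices/$addr/iommu_group/devices"

    # Without an IOMMU group the device cannot be handed to vfio-pci.
    if [ ! -d "$group_dir" ]; then
        echo "no IOMMU group found for $addr" >&2
        return 1
    fi

    # Each entry in the group directory is named after a PCI address.
    for dev in "$group_dir"/*; do
        [ -e "$dev" ] || continue
        basename "$dev"
    done
}

# Example (on the host):
#   list_iommu_group_devices 0004:04:00.0
```

Every address printed this way would have to be detached from its host driver before the group can be passed to the virtual machine.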
Run the virtual machine:
$ sudo qemu-system-ppc64 -nodefaults \
-chardev stdio,id=
-device spapr-vty,
-mon id=MON0,
-nographic -vga none -enable-kvm \
-device nec-usb-
-m 16384M \
-chardev socket,
-mon chardev=
-smp 16,threads=4 \
-netdev "user,id=
-device "virtio-
-drive file=sda.
-device virtio-
-device "vfio-pci,
-device "vfio-pci,
-device "vfio-pci,
-device "vfio-pci,
-global spapr-pci-
-global spapr-pci-
-cdrom ubuntu-
-machine pseries
Install the system in the virtual machine and reboot. After booting into the installed virtual machine, download and install the drivers from cuda-repo-
With all NVIDIA drivers installed, check the result of the following commands:
$ nvidia-smi
Mon Nov 5 21:11:33 2018
+------
| NVIDIA-SMI 410.72 Driver Version: 410.72 C...
Jose Ricardo Ziviani (joserz) wrote: #4
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-
If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.
See https:/
tags: added: verification-needed-bionic
Jose Ricardo Ziviani (joserz) wrote: #6
Hello, I tested the kernel with the changes and it works nicely!
Thank you
root@ubuntu:~# numactl -H
available: 3 nodes (0,251-252)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 0 size: 392745 MB
node 0 free: 390074 MB
node 251 cpus:
node 251 size: 32256 MB
node 251 free: 32253 MB
node 252 cpus:
node 252 size: 32256 MB
node 252 free: 32252 MB
node distances:
node 0 251 252
0: 10 40 40
251: 40 10 40
252: 40 40 10
root@ubuntu:~# nvidia-smi
Fri Apr 12 13:53:42 2019
+------
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|------
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|======
| 0 Tesla V100-SXM2... On | 00000001:00:02.0 Off | 0 |
| N/A 34C P0 41W / 300W | 3MiB / 32256MiB | 0% Default |
+------
| 1 Tesla V100-SXM2... On | 00000001:00:08.0 Off | 0 |
| N/A 38C P0 43W / 300W | 3MiB / 32256MiB | 0% Default |
+------
+------
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|======
| No running processes found |
+------
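In the output above, nodes 251 and 252 have no CPUs and 32256 MB each, which matches the 32256MiB that nvidia-smi reports per GPU, so they appear to be the GPU memory exposed to the guest as NUMA nodes. A minimal sketch for pulling such CPU-less nodes out of `numactl -H` output follows; the function name cpuless_nodes is hypothetical, and it only parses text in the format shown in this comment.

```shell
#!/bin/sh
# Hypothetical helper: reads `numactl -H` output on stdin and prints
# "<node> <size in MB>" for every NUMA node that lists no CPUs.
cpuless_nodes() {
    awk '
        # "node N cpus:" with nothing after the colon has exactly 3 fields.
        /^node [0-9]+ cpus:/ { if (NF == 3) cpuless[$2] = 1 }
        # "node N size: X MB": report the size of nodes flagged above.
        /^node [0-9]+ size:/ { if ($2 in cpuless) print $2, $4 }
    '
}

# Example (inside the guest):
#   numactl -H | cpuless_nodes
```

Comparing the reported sizes with the Memory-Usage column of nvidia-smi gives a quick sanity check that each passed-through GPU has a matching memory node.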
tags: added: verification-done-bionic; removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote: #7
This bug was fixed in the package linux - 4.15.0-48.51
---------------
linux (4.15.0-48.51) bionic; urgency=medium
* linux: 4.15.0-48.51 -proposed tracker (LP: #1822820)
* Packaging resync (LP: #1786013)
- [Packaging] update helper scripts
- [Packaging] resync retpoline extraction
* 3b080b2564287be
triggers system hang on i386 (LP: #1812845)
- btrfs: raid56: properly unmap parity page in finish_
* [P9][LTCTest]
(LP: #1719545)
- cpupower : Fix header name to read idle state name
* [amdgpu] screen corruption when using touchpad (LP: #1818617)
- drm/amdgpu/gmc: steal the appropriate amount of vram for fw hand-over (v3)
- drm/amdgpu: Free VGA stolen memory as soon as possible.
* [SRU][B/
- ACPICA: AML parser: attempt to continue loading table after error
- ACPI / property: Allow multiple property compatible _DSD entries
- PCI / ACPI: Identify untrusted PCI devices
- iommu/vt-d: Force IOMMU on for platform opt in hint
- iommu/vt-d: Do not enable ATS for untrusted devices
- thunderbolt: Export IOMMU based DMA protection support to userspace
- iommu/vt-d: Disable ATS support on untrusted devices
* Add basic support to NVLink2 passthrough (LP: #1819989)
- powerpc/
enabled
- powerpc/powernv: call OPAL_QUIESCE before OPAL_SIGNAL_
- powerpc/powernv: Export opal_check_token symbol
- powerpc/powernv: Make possible for user to force a full ipl cec reboot
- powerpc/
- powerpc/powernv: Move npu struct from pnv_phb to pci_controller
- powerpc/
- powerpc/
- powerpc/
- powerpc/pseries: Remove IOMMU API support for non-LPAR systems
- powerpc/
- powerpc/
* Huawei Hi1822 NIC has poor performance (LP: #1820187)
- net-next: hinic: fix a problem in free_tx_poll()
- hinic: remove ndo_poll_controller
- net-next/hinic: add checksum offload and TSO support
- hinic: Fix l4_type parameter in hinic_task_
- net-next/
- net-next/hinic:add rx checksum offload for HiNIC
- net-next/hinic:fix a bug in set mac address
- net-next/hinic: fix a bug in rx data flow
- net: hinic: fix null pointer dereference on pointer hwdev
- hinic: optmize rx refill buffer mechanism
- net-next/hinic:add shutdown callback
- net-next/hinic: replace disable_
* [CONFIG] please enable highdpi font FONT_TER16x32 (LP: #1819881)
- Fonts: New Terminus large console font
- [Config]: enable highdpi Terminus 16x32 font support
* [19.04 FEAT] qeth: Enhanced link...
Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1819989
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.