Comment 3 for bug 1819989

Jose Ricardo Ziviani (joserz) wrote :

SRU:

[Impact]

 * An important feature was developed upstream for PowerPC and backported to a custom version of Ubuntu Bionic 18.04.1. The feature, known as NVLink2[1] passthrough, allows physical GPUs to be accessed from any QEMU/KVM virtual machine. The problem appears when clients want to use that feature in their virtual machines: they need a custom version instead of the standard Ubuntu Bionic for PowerPC that everyone knows how to find and install.

We understand that this has a significant impact on the user experience: besides the extra difficulty of finding and installing the correct version, users who do not realize that a custom version is needed will conclude that the feature is simply broken.

Because the guest part (the code that runs in the virtual machine) is much simpler than the host part, we decided to send the patches as an SRU, fixing the user-experience problem without impacting existing use cases.

[1] https://wccftech.com/nvidia-volta-gv100-gpu-fast-pascal-gp100/

[Test Case]

 * Reproducing the issue requires a POWER9 system with an NVLink2-attached NVIDIA GPU and the customized Ubuntu Bionic (kernel + QEMU) installed.

 * Then create a virtual machine as follows:

Create a disk image:
$ qemu-img create sda.qcow2 -f qcow2 100G

Find the devices to be attached:
$ lspci | grep NVIDIA
...
0004:04:00.0 3D controller: NVIDIA Corporation GV100 [Tesla V100 SXM2] (rev a1)
...
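
Devices sharing an IOMMU group with the GPU must be passed through, and therefore detached, together. As an illustrative check, the group members can be listed via sysfs using the address from the lspci output above:
$ ls /sys/bus/pci/devices/0004:04:00.0/iommu_group/devices/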

Detach all devices to be passed to the virtual machine, including the devices that belong to the same IOMMU group (the script detach.sh is attached to this bug; a sketch of what such a script typically does follows the command below):
$ sudo ./detach.sh 0004:04:00.0
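
The sketch below is not the attached detach.sh; it only illustrates the usual approach, assuming the standard sysfs interfaces: unbind every device in the GPU's IOMMU group from its current driver and rebind it to vfio-pci via driver_override:

#!/bin/bash
# Illustrative sketch only -- not the attached detach.sh.
# Rebinds every device in the IOMMU group of $1 to vfio-pci.
dev="$1"                                  # e.g. 0004:04:00.0
modprobe vfio-pci                         # make sure the vfio-pci driver is loaded
for d in /sys/bus/pci/devices/"$dev"/iommu_group/devices/*; do
    addr=$(basename "$d")
    [ -e "$d/driver" ] && echo "$addr" > "$d/driver/unbind"  # detach from current driver
    echo vfio-pci > "$d/driver_override"  # prefer vfio-pci on the next probe
    echo "$addr" > /sys/bus/pci/drivers_probe
done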

Run the virtual machine:
$ sudo qemu-system-ppc64 -nodefaults \
-chardev stdio,id=STDIO0,signal=off,mux=on \
-device spapr-vty,id=svty0,reg=0x71000010,chardev=STDIO0 \
-mon id=MON0,chardev=STDIO0,mode=readline \
-nographic -vga none -enable-kvm \
-device nec-usb-xhci,id=nec-usb-xhci0 \
-m 16384M \
-chardev socket,id=SOCKET0,server,nowait,host=localhost,port=40000 \
-mon chardev=SOCKET0,mode=control \
-smp 16,threads=4 \
-netdev "user,id=USER0,hostfwd=tcp::2222-:22" \
-device "virtio-net-pci,id=vnet0,mac=C0:41:49:4b:00:00,netdev=USER0" \
-drive file=sda.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-device "vfio-pci,id=vfio0004_04_00_0,host=0004:04:00.0" \
-device "vfio-pci,id=vfio0006_00_00_0,host=0006:00:00.0" \
-device "vfio-pci,id=vfio0006_00_00_1,host=0006:00:00.1" \
-device "vfio-pci,id=vfio0006_00_00_2,host=0006:00:00.2" \
-global spapr-pci-host-bridge.pgsz=0x10011000 \
-global spapr-pci-vfio-host-bridge.pgsz=0x10011000 \
-cdrom ubuntu-18.04.1-server-ppc64el.iso \
-machine pseries

Install the system in the virtual machine and reboot. After booting into the installed virtual machine, download and install the drivers from cuda-repo-ubuntu1804-10-1-local-10.1.91-418.29_1.0-1_ppc64el.deb (available on the NVIDIA website), for example as sketched below.
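
The exact steps depend on the package; a typical sequence for an NVIDIA local-repository .deb looks like this (the repository key path is version-specific and is an assumption here):

$ sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.91-418.29_1.0-1_ppc64el.deb
$ sudo apt-key add /var/cuda-repo-10-1-local-10.1.91-418.29/7fa2af80.pub  # key path varies per release
$ sudo apt-get update
$ sudo apt-get install cuda-drivers
$ sudo reboot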

With all NVIDIA drivers installed, check the output of the following commands:

$ nvidia-smi
Mon Nov 5 21:11:33 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000001:00:00.0 Off | 0 |
| N/A 32C P0 51W / 300W | 0MiB / 32480MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
...
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

# numactl -H
available: 2 nodes (0,255)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 16332 MB
node 0 free: 13781 MB
node 255 cpus:
node 255 size: 16128 MB
node 255 free: 15190 MB
node distances:
node 0 255
  0: 10 40
 255: 40 10
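
Node 255 is the GPU memory exposed to the guest as a CPU-less NUMA node. As an additional illustrative check, that node can be inspected through sysfs:

$ cat /sys/devices/system/node/node255/cpulist    # expected to be empty (no CPUs)
$ grep MemTotal /sys/devices/system/node/node255/meminfo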

[Fix]
 * The patchset can be found here: https://lists.ubuntu.com/archives/kernel-team/2019-March/099243.html

There are 12 patches, but the most important pieces are:
 - 8/12: https://lists.ubuntu.com/archives/kernel-team/2019-March/099250.html
 - 9/12: https://lists.ubuntu.com/archives/kernel-team/2019-March/099254.html

NPU code already exists, so much of the work consists of updating that code for NPU2 and adding the GPU memory to the guest NUMA layout.

[Regression Potential]

 * The main regression risk involves physical devices passed through to virtual machines. The code in this patchset is well isolated and mostly touches NPU support, so regressions are more likely to happen on the host side. In any case, there are two use cases that are important to exercise (a guest-side verification sketch follows them):

    - CPU hotplug
        * Based on the command line above, use the following smp argument:
          -smp 8,sockets=1,cores=2,threads=8,maxcpus=16

          After boot, check that nvidia-smi and numactl -H look good.
          Go to the QEMU monitor and type:
          (qemu) device_add host-spapr-cpu-core,id=core8,core-id=8
          Return to the virtual machine console and check that the new core is plugged in; also check
          numactl -H. Then, back in the QEMU monitor, remove that core:
          (qemu) device_del core8
          On the virtual machine console, check that the core was removed and that numactl -H looks OK.

    - Memory hotplug
      * Using that same command line, include the maxmem argument:
        -m size=16384M,slots=256,maxmem=32768M

        After boot, check that nvidia-smi and numactl -H look good.
        Go to the QEMU monitor and type:
        (qemu) object_add memory-backend-ram,id=mem1,size=1G
        (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
        Return to the virtual machine console and check that the new DIMM is plugged in; also check
        numactl -H. Then, back in the QEMU monitor, remove that memory (deleting the backend object
        as well):
        (qemu) device_del dimm1
        (qemu) object_del mem1
        On the virtual machine console, check that the memory was removed and that numactl -H looks OK.
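
For both hotplug cases, a minimal guest-side sanity check (a sketch assuming standard tools are available in the guest) is:

# After CPU hot(un)plug: the online CPU count and NUMA layout should reflect the change.
$ lscpu | grep '^CPU(s):'
$ numactl -H

# After memory hot(un)plug: total memory and per-node sizes should reflect the change.
$ grep MemTotal /proc/meminfo
$ numactl -H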