Ubuntu 17.04 KVM: Can not do hotplug attach

Bug #1678322 reported by bugproxy on 2017-03-31

This bug report is a duplicate of: Bug #1679704: libvirt profile is blocking global setrlimit despite having no rlimit rule.

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	libvirt (Ubuntu)	Confirmed	Medium	Christian Ehrhardt 

Bug Description

---Problem Description---
I am trying to hotplug-attach a Mellanox CX3 card to a guest, but the attach fails:
virsh attach-device powerio-le12-ubuntu-17.04 ./add_cx3.xml
error: Failed to attach device from ./add_cx3.xml
error: internal error: unable to execute QEMU command 'device_add': vfio error: 0044:01:00.0: failed to setup container for group 6: RAM memory listener initialization failed for container
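The report does not include add_cx3.xml itself. As a rough sketch only, assuming the file is an ordinary libvirt PCI <hostdev> definition for the host device 0044:01:00.0 named in the error, it could be generated as below. The helper name hostdev_xml and the managed='yes' setting are assumptions for illustration, not something taken from the reporter's file.

```python
import xml.etree.ElementTree as ET


def hostdev_xml(pci_addr: str, managed: bool = True) -> str:
    """Build a libvirt <hostdev> element for a PCI address such as '0044:01:00.0'.

    Hypothetical helper: the real add_cx3.xml is not shown in the report.
    """
    domain, bus, slot_func = pci_addr.split(":")
    slot, func = slot_func.split(".")
    hostdev = ET.Element(
        "hostdev",
        mode="subsystem",
        type="pci",
        managed="yes" if managed else "no",  # assumption: let libvirt bind vfio-pci
    )
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        domain=f"0x{int(domain, 16):04x}",
        bus=f"0x{int(bus, 16):02x}",
        slot=f"0x{int(slot, 16):02x}",
        function=f"0x{int(func, 16):x}",
    )
    return ET.tostring(hostdev, encoding="unicode")


# Device address taken from the error message above.
print(hostdev_xml("0044:01:00.0"))
```

The printed element could be saved to a file and passed to virsh attach-device with --live, as in the command above.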

From the QEMU log file I see this:
2017-02-14T22:55:40.721108Z qemu-system-ppc64: backend does not support BE vnet headers; falling back on userspace virtio

This is with kernel 4.9.0-15-generic and qemu level:
dpkg --list| grep qemu
ii ipxe-qemu 1.0.0+git-20150424.a25a16d-1ubuntu2 all PXE boot firmware - ROM images for qemu
ii qemu 1:2.8+dfsg-2ubuntu1 ppc64el fast processor emulator
ii qemu-block-extra:ppc64el 1:2.8+dfsg-2ubuntu1 ppc64el extra block backend modules for qemu-system and qemu-utils
ii qemu-kvm 1:2.8+dfsg-2ubuntu1 ppc64el QEMU Full virtualization
ii qemu-slof 20161019+dfsg-1 all Slimline Open Firmware -- QEMU PowerPC version
ii qemu-system 1:2.8+dfsg-2ubuntu1 ppc64el QEMU full system emulation binaries
ii qemu-system-arm 1:2.8+dfsg-2ubuntu1 ppc64el QEMU full system emulation binaries (arm)
ii qemu-system-common 1:2.8+dfsg-2ubuntu1 ppc64el QEMU full system emulation binaries (common files)
ii qemu-system-mips 1:2.8+dfsg-2ubuntu1 ppc64el QEMU full system emulation binaries (mips)
ii qemu-system-misc 1:2.8+dfsg-2ubuntu1 ppc64el QEMU full system emulation binaries (miscellaneous)
ii qemu-system-ppc 1:2.8+dfsg-2ubuntu1 ppc64el QEMU full system emulation binaries (ppc)
ii qemu-system-sparc 1:2.8+dfsg-2ubuntu1 ppc64el QEMU full system emulation binaries (sparc)
ii qemu-system-x86 1:2.8+dfsg-2ubuntu1 ppc64el QEMU full system emulation binaries (x86)
ii qemu-user 1:2.8+dfsg-2ubuntu1 ppc64el QEMU user mode emulation binaries
ii qemu-user-binfmt 1:2.8+dfsg-2ubuntu1 ppc64el QEMU user mode binfmt registration for qemu-user
ii qemu-utils 1:2.8+dfsg-2ubuntu1 ppc64el QEMU utilities

---uname output---
4.9.0-15-generic #16-Ubuntu SMP Fri Jan 20 15:28:49 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = P8

---Steps to Reproduce---
Bring up a guest, then try to attach the device like this:
virsh attach-device powerio-le12-ubuntu-17.04 ./add_cx3.xml --live
error: Failed to attach device from ./add_cx3.xml
error: internal error: unable to execute QEMU command 'device_add': vfio error: 0044:01:00.0: failed to setup container for group 6: RAM memory listener initialization failed for container

When I retried the steps for add_cx3.xml on the same machine I noticed the following in the host logs:

[ 1374.276210] KVM guest htab at c000001e56000000 (order 26), LPID 1
[ 1383.824281] hrtimer: interrupt took 923 ns
[ 1447.479194] audit_printk_skb: 15 callbacks suppressed
[ 1447.479198] audit: type=1400 audit(1487194729.006:17): apparmor="DENIED" operation="setrlimit" profile="/usr/sbin/libvirtd" pid=6853 comm="libvirtd" rlimit=memlock value=8694792192
[ 1447.481927] pci 0044:01 : [PE# 002] Disabling 64-bit DMA bypass
[ 1447.481935] pci 0044:01 : [PE# 002] Removing DMA window #0
[ 1447.481978] pci 0044:01 : [PE# 002] Removing DMA window #0
[ 1447.481980] pci 0044:01 : [PE# 002] Removing DMA window #1
[ 1447.485667] pci 0044:01 : [PE# 002] Setting up window#0 0..7fffffff pg=1000
[ 1447.485670] pci 0044:01 : [PE# 002] Enabling 64-bit DMA bypass
[ 1517.030701] audit: type=1400 audit(1487194798.559:18): apparmor="DENIED" operation="setrlimit" profile="/usr/sbin/libvirtd" pid=6853 comm="libvirtd" rlimit=memlock value=8694792192
[ 1517.033286] pci 0044:01 : [PE# 002] Disabling 64-bit DMA bypass
[ 1517.033290] pci 0044:01 : [PE# 002] Removing DMA window #0
[ 1517.033322] pci 0044:01 : [PE# 002] Removing DMA window #0
[ 1517.033325] pci 0044:01 : [PE# 002] Removing DMA window #1
[ 1517.036971] pci 0044:01 : [PE# 002] Setting up window#0 0..7fffffff pg=1000
[ 1517.036974] pci 0044:01 : [PE# 002] Enabling 64-bit DMA bypass

I'm not sure if the apparmor issues are affecting functionality or not. That may be worth looking into in a separate bug, or a dupe of https://bugzilla.linux.ibm.com/show_bug.cgi?id=146192
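To make the denial easier to read, here is a minimal sketch that pulls the verdict, rlimit name, and requested value out of an audit line like the ones above. The function name parse_rlimit_audit and the regular expression are illustrative assumptions, written only against the line format shown in this report.

```python
import re

# Matches the apparmor fields of interest in a kernel audit line (format as seen in this report).
AUDIT_RE = re.compile(
    r'apparmor="(?P<verdict>[A-Z]+)"\s+operation="(?P<operation>[^"]+)"'
    r'\s+profile="(?P<profile>[^"]+)".*?rlimit=(?P<rlimit>\S+)\s+value=(?P<value>\d+)'
)


def parse_rlimit_audit(line):
    """Return a dict of the apparmor fields, or None if the line does not match."""
    match = AUDIT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["value"] = int(fields["value"])
    return fields


line = (
    '[ 1447.479198] audit: type=1400 audit(1487194729.006:17): apparmor="DENIED" '
    'operation="setrlimit" profile="/usr/sbin/libvirtd" pid=6853 comm="libvirtd" '
    'rlimit=memlock value=8694792192'
)
info = parse_rlimit_audit(line)
MIB = 1024 ** 2
print(info["verdict"], info["operation"], info["rlimit"], info["value"] // MIB, "MiB")
# prints: DENIED setrlimit memlock 8292 MiB
```

The requested value works out to 8292 MiB, that is 8192 MiB plus 100 MiB, so libvirtd is being denied a memlock limit slightly larger than 8 GiB.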

As noted there I did the following to work around it:

sudo aa-complain /usr/sbin/libvirtd
sudo aa-complain /etc/apparmor.d/libvirt/libvirt-????????-????-????-????-????????????

I still got the VFIO memory listener error, however. If I install QEMU 2.7.0 I no longer see the VFIO error, and things seem to succeed from a host perspective:

root@powerio-le11:/etc/libvirt/qemu# virsh attach-device powerio-le12-ubuntu-17.04 ./add_cx3.xml --live
Device attached successfully

root@powerio-le11:/etc/libvirt/qemu# dmesg | tail -6
[ 3880.813971] KVM guest htab at c000001e56000000 (order 26), LPID 1
[ 3917.656384] audit: type=1400 audit(1487197199.210:26): apparmor="ALLOWED" operation="setrlimit" profile="/usr/sbin/libvirtd" pid=6853 comm="libvirtd" rlimit=memlock value=8694792192
[ 3917.659276] pci 0044:01 : [PE# 002] Disabling 64-bit DMA bypass
[ 3917.659284] pci 0044:01 : [PE# 002] Removing DMA window #0
[ 3917.688803] vfio-pci 0044:01:00.0: enabling device (0400 -> 0402)
[ 3917.800106] vfio_ecap_init: 0044:01:00.0 hiding ecap 0x19@0x18c

In the guest things look okay initially:

[ 28.797667] RTAS: event: 1, Type: Unknown, Severity: 1
[ 29.062821] pci 0000:00:05.0: [15b3:1007] type 00 class 0x020000
[ 29.063118] pci 0000:00:05.0: reg 0x10: [mem 0x100a0000000-0x100a00fffff 64bit]
[ 29.063341] pci 0000:00:05.0: reg 0x18: [mem 0x2c0200000000-0x2c0201ffffff 64bit pref]
[ 29.063701] pci 0000:00:05.0: reg 0x30: [mem 0x00000000-0x000fffff pref]
[ 29.065237] iommu: Adding device 0000:00:05.0 to group 0
[ 29.065332] pci 0000:00:05.0: BAR 2: assigned [mem 0x10122000000-0x10123ffffff 64bit pref]
[ 29.065675] pci 0000:00:05.0: BAR 0: assigned [mem 0x10121800000-0x101218fffff 64bit]
[ 29.066010] pci 0000:00:05.0: BAR 6: assigned [mem 0x100a0000000-0x100a00fffff pref]
[ 29.066105] mlx4_core: Mellanox ConnectX core driver v4.0-1.0.1 (29 Jan 2017)
[ 29.066127] mlx4_core: Initializing 0000:00:05.0
[ 29.066210] mlx4_core 0000:00:05.0: enabling device (0000 -> 0002)
[ 29.076273] mlx4_core 0000:00:05.0: Using 64-bit direct DMA at offset 800000000000000

but eventually I see the following error:

[ 89.925954] mlx4_core 0000:00:05.0: device is going to be reset
[ 99.923755] mlx4_core 0000:00:05.0: Failed to obtain HW semaphore, aborting
[ 99.924052] mlx4_core 0000:00:05.0: Fail to reset HCA
[ 99.924305] kernel BUG at /var/lib/dkms/mlnx-ofed-kernel/4.0/build/drivers/net/ethernet/mellanox/mlx4/catas.c:193!
[ 99.924643] Oops: Exception in kernel mode, sig: 5 [#1]
[ 99.924811] SMP NR_CPUS=2048 [ 99.924889] NUMA
[ 99.924968] pSeries
[ 99.925048] Modules linked in: rdma_ucm(OE) ib_ucm(OE) ib_ipoib(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_ib(OE) mlx4_en(OE) mlx4_core(OE) devlink vmx_crypto ib_iser rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_core(OE) mlx_compat(OE) configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi knem(OE) ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ibmvscsi crc32c_vpmsum virtio_net virtio_blk
[ 99.927029] CPU: 10 PID: 4600 Comm: drmgr Tainted: G OE 4.9.0-12-generic #13-Ubuntu
[ 99.927316] task: c0000001dfc27e00 task.stack: c0000001dd630000
[ 99.927515] NIP: d000000003c62794 LR: d000000003c6277c CTR: c0000000006c4a80
[ 99.927752] REGS: c0000001dd6332a0 TRAP: 0700 Tainted: G OE (4.9.0-12-generic)
[ 99.928029] MSR: 800000010282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>[ 99.928645] CR: 48022222 XER: 20000000
[ 99.928764] CFAR: c000000000710c28 SOFTE: 1
GPR00: d000000003c6277c c0000001dd633520 d000000003cbca7c 0000000000000029
GPR04: 0000000000000001 000000000000014d 6573657420484341 0d0a6c20746f2072
GPR08: 0000000000000007 0000000000000001 c0000001e2a94300 0000000000000006
GPR12: 0000000000002200 c00000000fb85a00 c0000001dd9a0060 0000000000000000
GPR16: 00000000024080c0 00000000024202c2 0000000000000000 d000000003cb7418
GPR20: 0000000000000000 d0000800802c0680 c0000001dd9a04e8 c0000001dd9a0518
GPR24: 000000000000ea60 0000000000000000 c000000001443a00 c0000001dd6336f0
GPR28: 0000000000000000 0000000000000004 c0000001e2a94360 c0000001dd9a0060
NIP [d000000003c62794] mlx4_enter_error_state.part.0+0x35c/0x460 [mlx4_core]
[ 99.931952] LR [d000000003c6277c] mlx4_enter_error_state.part.0+0x344/0x460 [mlx4_core]
[ 99.932190] Call Trace:
[ 99.932278] [c0000001dd633520] [d000000003c6277c] mlx4_enter_error_state.part.0+0x344/0x460 [mlx4_core] (unreliable)
[ 99.932647] [c0000001dd6335b0] [d000000003c66df8] __mlx4_cmd+0x720/0x970 [mlx4_core]
[ 99.932946] [c0000001dd633680] [d000000003c73d88] mlx4_QUERY_FW+0x90/0x420 [mlx4_core]
[ 99.933238] [c0000001dd633730] [d000000003c7fd28] mlx4_load_one+0x440/0x1ac0 [mlx4_core]
[ 99.933520] [c0000001dd633850] [d000000003c81a40] mlx4_init_one+0x698/0x7c0 [mlx4_core]
[ 99.933922] [c0000001dd633960] [c00000000063049c] local_pci_probe+0x6c/0x140
[ 99.934171] [c0000001dd6339f0] [c0000000006312e8] pci_device_probe+0x178/0x200
[ 99.934430] [c0000001dd633a50] [c000000000716970] driver_probe_device+0x240/0x540
[ 99.934657] [c0000001dd633ae0] [c00000000071344c] bus_for_each_drv+0x8c/0xf0
[ 99.934848] [c0000001dd633b30] [c0000000007164f0] __device_attach+0x140/0x210
[ 99.935057] [c0000001dd633bc0] [c000000000621d38] pci_bus_add_device+0x78/0x100
[ 99.935270] [c0000001dd633c30] [c000000000621e20] pci_bus_add_devices+0x60/0xe0
[ 99.935488] [c0000001dd633c70] [c000000000625b44] pci_rescan_bus+0x44/0x70
[ 99.935666] [c0000001dd633ca0] [c000000000631ee4] bus_rescan_store+0x84/0xb0
[ 99.935840] [c0000001dd633ce0] [c000000000712fb4] bus_attr_store+0x44/0x70
[ 99.936039] [c0000001dd633d00] [c0000000003d52b8] sysfs_kf_write+0x68/0xa0
[ 99.936210] [c0000001dd633d20] [c0000000003d417c] kernfs_fop_write+0x17c/0x250
[ 99.936407] [c0000001dd633d70] [c00000000031924c] __vfs_write+0x3c/0x70
[ 99.936583] [c0000001dd633d90] [c00000000031a4b4] vfs_write+0xd4/0x240
[ 99.936760] [c0000001dd633de0] [c00000000031c018] SyS_write+0x68/0x110
[ 99.936934] [c0000001dd633e30] [c00000000000bd84] system_call+0x38/0xe0
[ 99.937102] Instruction dump:
[ 99.937188] e93f0000 3d020000 e8888078 e8690000 386300a0 4803f8f1 e8410018 e95f0000
[ 99.937472] e92a0000 81290098 2f890001 409efea0 <0fe00000> 60000000 60420000 e93f0000
[ 99.937726] ---[ end trace 66826e43e8c8b7ba ]---
[ 99.937832]

It's not clear to me if this new guest issue is specific to QEMU 2.7, or something that would also be present on 2.8 if not for the VFIO issue originally noted in this bug. First step I think will be to root-cause the VFIO issue, fix it, and see if the guest issue remains afterward. If it does we can track that as a separate bug (or perhaps we've already seen this somewhere? seems vaguely familiar).

Need to hop off the machine for today, but can look at it more tomorrow.

(In reply to comment #10)

> [ 1517.030701] audit: type=1400 audit(1487194798.559:18): apparmor="DENIED"
> operation="setrlimit" profile="/usr/sbin/libvirtd" pid=6853 comm="libvirtd"
> rlimit=memlock value=8694792192

> I'm not sure if the apparmor issues are affecting functionality or not. That
> may be worth looking into in a separate bug, or a dupe of
> https://bugzilla.linux.ibm.com/show_bug.cgi?id=146192
>
Let me check the Ubuntu 16.10 system again: I followed the same steps to update /etc/libvirt/qemu.conf on Ubuntu 17.04 as I did on 16.10, but I still see it. Not sure if I did something else.

>
> It's not clear to me if this new guest issue is specific to QEMU 2.7, or
> something that would also be present on 2.8 if not for the VFIO issue
> originally noted in this bug. First step I think will be to root-cause the
> VFIO issue, fix it, and see if the guest issue remains afterward. If it does
> we can track that as a separate bug (or perhaps we've already seen this
> somewhere? seems vaguely familiar).
>
> Need to hop off the machine for today, but can look at it more tomorrow.

I see this guest issue with Ubuntu 16.10 KVM as well; the commands time out as if the DMAs are not reaching the hardware. I see it with every Mellanox card I have tried. I can open a separate bug specific to 16.10 if you want.

== Comment: #15 - MICHAEL D. ROTH <email address hidden> - 2017-02-22 13:22:53 ==
I tried a bisect between 2.7.0 and 2.8.0/hostos to find the origin of these errors:

root@powerio-le11:/etc/libvirt/qemu# virsh attach-device powerio-le12-ubuntu-17.04 ./add_cx3.xml --live
error: Failed to attach device from ./add_cx3.xml
error: internal error: unable to execute QEMU command 'device_add': Device initialization failed

The commit that caused the "breakage" was:

root@powerio-le11:~/mdroth/qemu.git# git bisect good
01905f58f166646619c35a2ebfc3ca3ed4cad62d is the first bad commit
commit 01905f58f166646619c35a2ebfc3ca3ed4cad62d
Author: Eric Auger <email address hidden>
Date: Mon Oct 17 10:57:59 2016 -0600

vfio: Pass an Error object to vfio_connect_container

However, all that commit does is turn vfio init errors into fatal errors that are passed on to libvirt, as opposed to just logging them in the background and continuing execution. If I go back to 2.7.0 and re-test, I find that while libvirt reports the attach is successful, the log file still shows:

LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/kvm -name guest=powerio-le12-ubuntu-17.04,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-17-powerio-le12-ubuntu-/master-key.aes -machine pseries-2.7,accel=kvm,usb=off,dump-guest-core=off -m 8192 -realtime mlock=off -smp 16,sockets=1,cores=2,threads=8 -uuid bd3248c2-5686-4e18-b86e-799292bf4ad3 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-17-powerio-le12-ubuntu-/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device pci-ohci,id=usb,bus=pci.0,addr=0x2 -device spapr-vscsi,id=scsi0,reg=0x2000 -drive file=/var/lib/libvirt/images/powerio-le12-ubuntu-17.04.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive if=none,id=drive-scsi0-0-0-0,readonly=on -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:eb:a9:da,bus=pci.0,addr=0x1 -chardev pty,id=charserial0 -device spapr-vty,chardev=charserial0,reg=0x30000000 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on
Domain id=17 is tainted: high-privileges
char device redirected to /dev/pts/5 (label charserial0)
vfio: RAM memory listener initialization failed for container

So this issue seems to have existed since before 2.7.0, assuming it stems from QEMU and is not related to the kernel. Will look into it more.

== Comment: #16 - MICHAEL D. ROTH <email address hidden> - 2017-02-22 18:02:36 ==
I think this is some sort of permissions/rlimit issue after all.

If I invoke QEMU directly without libvirt, then do the attach from the QEMU monitor, I see the device added successfully with no error, and I also don't see the subsequent crashes within the guest relating to mlx4_QUERY_FW:

root@powerio-le11:~/mdroth/qemu-build# ppc64-softmmu/qemu-system-ppc64 -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-powerio-le12-ubuntu-/master-key.aes -machine pseries-2.7,accel=kvm,usb=off,dump-guest-core=off -m 8192 -realtime mlock=off -smp 16,sockets=1,cores=2,threads=8 -uuid bd3248c2-5686-4e18-b86e-799292bf4ad3 -display none -no-user-config -nodefaults -rtc base=utc -no-shutdown -boot strict=on -device pci-ohci,id=usb,bus=pci.0,addr=0x2 -device spapr-vscsi,id=scsi0,reg=0x2000 -drive file=/var/lib/libvirt/images/powerio-le12-ubuntu-17.04.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive if=none,id=drive-scsi0-0-0-0,readonly=on -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:eb:a9:da,bus=pci.0,addr=0x1 -device spapr-vty,chardev=charserial0,reg=0x30000000 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on -vga none -nographic -chardev stdio,mux=on,id=charserial0 -monitor chardev:charserial0

root@powerio-le11:~/mdroth# ./vfio-bind 0044:01:00.0
unbinding 0044:01:00.0 via /sys/bus/pci/devices/0044:01:00.0/driver/unbind
binding 0044:01:00.0
echo 0x15b3 0x1007 >/sys/bus/pci/drivers/vfio-pci/new_id

(qemu) device_add vfio-pci,host=0044:01:00.0,id=hp0

root@powerio-le12:~# dmesg | tail -36
[ 236.294903] RTAS: event: 1, Type: Unknown, Severity: 1
[ 236.574958] pci 0000:00:00.0: [15b3:1007] type 00 class 0x020000
[ 236.575630] pci 0000:00:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit]
[ 236.575986] pci 0000:00:00.0: reg 0x18: [mem 0x00000000-0x01ffffff 64bit pref]
[ 236.576592] pci 0000:00:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref]
[ 236.578890] iommu: Adding device 0000:00:00.0 to group 0
[ 236.578985] pci 0000:00:00.0: BAR 2: assigned [mem 0x10122000000-0x10123ffffff 64bit pref]
[ 236.580466] pci 0000:00:00.0: BAR 0: assigned [mem 0x10121800000-0x101218fffff 64bit]
[ 236.580921] pci 0000:00:00.0: BAR 6: assigned [mem 0x100a0000000-0x100a00fffff pref]
[ 236.581011] mlx4_core: Mellanox ConnectX core driver v4.0-1.0.1 (29 Jan 2017)
[ 236.581162] mlx4_core: Initializing 0000:00:00.0
[ 236.581272] mlx4_core 0000:00:00.0: enabling device (0000 -> 0002)
[ 236.583876] mlx4_core 0000:00:00.0: Using 64-bit direct DMA at offset 800000000000000
[ 242.122882] mlx4_core: device is working in RoCE mode: Roce V1
[ 242.122884] mlx4_core: UD QP Gid type is: V1
[ 243.652901] mlx4_core 0000:00:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
[ 243.652904] mlx4_core 0000:00:00.0: PCIe link width is x8, device supports x8
[ 243.877392] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-1.0.1 (29 Jan 2017)
[ 243.877592] mlx4_en 0000:00:00.0: Activating port:1
[ 243.904087] mlx4_en: 0000:00:00.0: Port 1: Using 128 TX rings
[ 243.904090] mlx4_en: 0000:00:00.0: Port 1: Using 8 RX rings
[ 243.904093] mlx4_en: 0000:00:00.0: Port 1: frag:0 - size:1522 prefix:0 stride:1536
[ 243.904770] mlx4_en: 0000:00:00.0: Port 1: Initializing port
[ 243.905354] mlx4_en 0000:00:00.0: registered PHC clock
[ 243.906985] mlx4_en 0000:00:00.0: Activating port:2
[ 243.917716] mlx4_core 0000:00:00.0 enp0s0: renamed from eth0
[ 243.919899] mlx4_en: 0000:00:00.0: Port 2: Using 128 TX rings
[ 243.919901] mlx4_en: 0000:00:00.0: Port 2: Using 8 RX rings
[ 243.919903] mlx4_en: 0000:00:00.0: Port 2: frag:0 - size:1522 prefix:0 stride:1536
[ 243.920694] mlx4_en: 0000:00:00.0: Port 2: Initializing port
[ 243.941713] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-1.0.1 (29 Jan 2017)
[ 244.039494] <mlx4_ib> mlx4_ib_add: counter index 2 for port 1 allocated 1
[ 244.039520] <mlx4_ib> mlx4_ib_add: counter index 3 for port 2 allocated 1
[ 244.098796] mlx4_core 0000:00:00.0 enp0s0d1: renamed from eth0
[ 245.266775] mlx4_en: enp0s0: Link Up
[ 245.266891] mlx4_en: enp0s0d1: Link Up

Everything appears to be functioning. Also worth noting, the host doesn't report any apparmor messages:

[ 3683.945997] KVM guest htab at c000001e5a000000 (order 26), LPID 2
[ 3878.433033] br0: port 2(vnet0) entered disabled state
[ 3878.436993] device vnet0 left promiscuous mode
[ 3878.436995] br0: port 2(vnet0) entered disabled state
[ 3927.505181] pci 0044:01 : [PE# 02] Disabling 64-bit DMA bypass
[ 3927.505188] pci 0044:01 : [PE# 02] Removing DMA window #0
[ 3928.018862] pci 0044:01 : [PE# 02] Setting up window#0 0..3fffffff pg=1000
[ 3928.024266] pci 0044:01 : [PE# 02] Setting up window#1 800000000000000..8000001ffffffff pg=10000
[ 3928.403651] vfio-pci 0044:01:00.0: enabling device (0400 -> 0402)
[ 3928.514975] vfio_ecap_init: 0044:01:00.0 hiding ecap 0x19@0x18c

If I try to hotplug the device via libvirt, I see the vfio listener registration failure originally noted. If I enable traces in QEMU, I see where that listener failure is stemming from:

LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/kvm -name guest=powerio-le12-ubuntu-17.04,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-powerio-le12-ubuntu-/master-key.aes -machine pseries-2.7,accel=kvm,usb=off,dump-guest-core=off -m 8192 -realtime mlock=off -smp 16,sockets=1,cores=2,threads=8 -uuid bd3248c2-5686-4e18-b86e-799292bf4ad3 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-2-powerio-le12-ubuntu-/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device pci-ohci,id=usb,bus=pci.0,addr=0x2 -device spapr-vscsi,id=scsi0,reg=0x2000 -drive file=/var/lib/libvirt/images/powerio-le12-ubuntu-17.04.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive if=none,id=drive-scsi0-0-0-0,readonly=on -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:eb:a9:da,bus=pci.0,addr=0x1 -chardev pty,id=charserial0 -device spapr-vty,chardev=charserial0,reg=0x30000000 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on
Domain id=2 is tainted: high-privileges
2017-02-22T23:01:35.080908Z qemu-system-ppc64: -chardev pty,id=charserial0: char device redirected to /dev/pts/6 (label charserial0)
9697@1487804633.619707:vfio_realize (0044:01:00.0) group 6
9697@1487804633.619783:vfio_prereg_register va=3ffd2bff0000 size=200000000 ret=-12
9697@1487804633.619788:vfio_prereg_listener_region_add_skip 10080000020 - 1008000003f
9697@1487804633.619791:vfio_prereg_listener_region_add_skip 10080000040 - 1008000007f
9697@1487804633.619794:vfio_prereg_listener_region_add_skip 10080000080 - 1008000009f
9697@1487804633.619797:vfio_prereg_listener_region_add_skip 100e0000000 - 100e000001f
9697@1487804633.619799:vfio_prereg_listener_region_add_skip 100e0000800 - 100e0000807
9697@1487804633.619802:vfio_prereg_listener_region_add_skip 100e0001000 - 100e00010ff
9697@1487804633.619804:vfio_prereg_listener_region_add_skip 100e0002000 - 100e000202f
9697@1487804633.619806:vfio_prereg_listener_region_add_skip 100e0002800 - 100e0002807
9697@1487804633.619809:vfio_prereg_listener_region_add_skip 10120000000 - 10120000fff
9697@1487804633.619811:vfio_prereg_listener_region_add_skip 10120001000 - 10120001fff
9697@1487804633.619814:vfio_prereg_listener_region_add_skip 10120002000 - 10120002fff
9697@1487804633.619816:vfio_prereg_listener_region_add_skip 10120003000 - 10120402fff
9697@1487804633.619819:vfio_prereg_listener_region_add_skip 10120800000 - 10120800fff
9697@1487804633.619821:vfio_prereg_listener_region_add_skip 10120801000 - 10120801fff
9697@1487804633.619823:vfio_prereg_listener_region_add_skip 10120802000 - 10120802fff
9697@1487804633.619826:vfio_prereg_listener_region_add_skip 10120803000 - 10120c02fff
9697@1487804633.619828:vfio_prereg_listener_region_add_skip 10121000000 - 10121000fff
9697@1487804633.619831:vfio_prereg_listener_region_add_skip 10121001000 - 10121001fff
9697@1487804633.619833:vfio_prereg_listener_region_add_skip 10121002000 - 10121002fff
9697@1487804633.619835:vfio_prereg_listener_region_add_skip 10121003000 - 10121402fff

vfio_prereg_register's ret=-12 is the errno value set by:

ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);

which indicates that VFIO_IOMMU_SPAPR_REGISTER_MEMORY is failing with "Cannot allocate memory". In the host, I see an apparmor message:
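As a quick cross-check of that reading, the sketch below decodes the ret and size fields from the vfio_prereg_register trace line. The helper name decode_prereg is made up for illustration, and it assumes the size field is hexadecimal, as the addresses in the trace appear to be.

```python
import errno
import os
import re


def decode_prereg(trace_line):
    """Extract size (assumed hex, in bytes) and the errno name from a vfio_prereg_register trace line."""
    match = re.search(r"size=([0-9a-f]+) ret=(-?\d+)", trace_line)
    size = int(match.group(1), 16)
    ret = int(match.group(2))
    # A negative ret carries the errno value; 0 means success.
    name = errno.errorcode.get(-ret, "unknown") if ret < 0 else "OK"
    return size, ret, name


trace = "9697@1487804633.619783:vfio_prereg_register va=3ffd2bff0000 size=200000000 ret=-12"
size, ret, name = decode_prereg(trace)
print(name, "-", os.strerror(-ret))
print(size // 1024 ** 3, "GiB")
```

On Linux, errno 12 is ENOMEM, and the size decodes to 8 GiB, which matches the -m 8192 guest memory on the command line above. In other words, the failing call is the attempt to register the whole of guest RAM.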

[ 1607.260426] KVM guest htab at c000001e56000000 (order 26), LPID 1
[ 1745.761165] audit: type=1400 audit(1487804633.611:18): apparmor="ALLOWED" operation="setrlimit" profile="/usr/sbin/libvirtd" pid=5329 comm="libvirtd" rlimit=memlock value=8694792192
[ 1745.763764] pci 0044:01 : [PE# 02] Disabling 64-bit DMA bypass
[ 1745.763771] pci 0044:01 : [PE# 02] Removing DMA window #0
[ 1745.763864] pci 0044:01 : [PE# 02] Removing DMA window #0
[ 1745.763867] pci 0044:01 : [PE# 02] Removing DMA window #1
[ 1745.767676] pci 0044:01 : [PE# 02] Setting up window#0 0..7fffffff pg=1000
[ 1745.767679] pci 0044:01 : [PE# 02] Enabling 64-bit DMA bypass

Originally these were "DENIED" errors, but in comment #10 I noted I'd worked around that via:

sudo aa-complain /usr/sbin/libvirtd
sudo aa-complain /etc/apparmor.d/libvirt/libvirt-????????-????-????-????-????????????

as noted in https://bugzilla.linux.ibm.com/show_bug.cgi?id=146192

But either that workaround is insufficient, or some other issue relating to libvirt privilege levels is at play, given that QEMU has no problems when run directly as root.

Can you try now? I was using the system over the weekend; the card was dead, and the guest was also doing PCI passthrough of the card. I took the card out of the guest XML and I can recreate the problem again:
virsh attach-device powerio-le12-ubuntu-17.04 ./add_hydepark.xml --live
error: Failed to attach device from ./add_hydepark.xml
error: internal error: unable to execute QEMU command 'device_add': vfio error: 0040:01:00.0: failed to setup container for group 5: RAM memory listener initialization failed for container

This is because of the memlock hard limits that libvirt sets. Upstream 2.5.0 doesn't have the problem.

libvirt starts with a certain value for max memlock and adjusts it during hotplug. For my guest, which has <memory unit='KiB'>16777216</memory>, upstream 2.5.0 correctly adjusts it on hotplug to
"Max locked memory 17368612864 17368612864 bytes", whereas the Ubuntu libvirt does not.
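The "Max locked memory" figures quoted above have the shape of a line from /proc/<pid>/limits. A minimal sketch for reading that line for a running QEMU process is shown below; the function name max_locked_memory and the sample text are illustrative assumptions, and the PID of the guest's QEMU process would have to be looked up separately.

```python
def max_locked_memory(limits_text):
    """Return (soft, hard) memlock limits in bytes from /proc/<pid>/limits text.

    A value of None stands for 'unlimited'. Returns None if the line is absent.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max locked memory"):
            fields = line.split()
            soft, hard = fields[3], fields[4]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    return None


# Sample text built from the values quoted in this comment.
sample = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max locked memory         17368612864          17368612864          bytes\n"
)
print(max_locked_memory(sample))
# prints: (17368612864, 17368612864)
```

On a live host the same function could be fed the contents of /proc/<pid>/limits, read before and after the hotplug attempt, to see whether the limit was actually raised.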

This can be worked around by hard-coding the limits with the following element in the XML of the guest powerio-le14-ubuntu-17.04:
  <memtune>
    <hard_limit unit='KiB'>16961536</hard_limit>
    <soft_limit unit='KiB'>16961536</soft_limit>
  </memtune>
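As a sanity check on these numbers (plain unit conversion, with no claim about how libvirt derives the value), the hard_limit in the workaround is exactly the memlock value that upstream reached on hotplug:

```python
KIB = 1024

guest_memory_kib = 16777216            # <memory unit='KiB'> of the guest
hard_limit_kib = 16961536              # <hard_limit unit='KiB'> from the workaround
observed_memlock_bytes = 17368612864   # "Max locked memory" seen with upstream

hard_limit_bytes = hard_limit_kib * KIB
headroom_mib = (hard_limit_kib - guest_memory_kib) // KIB

print(hard_limit_bytes == observed_memlock_bytes)  # True
print(guest_memory_kib // KIB ** 2, "GiB guest memory,", headroom_mib, "MiB headroom")
# prints: 16 GiB guest memory, 180 MiB headroom
```

So the workaround pins the limit at guest memory (16 GiB) plus 180 MiB, the same figure upstream arrived at dynamically.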

Trying to figure out the patch which might be missing on Ubuntu libvirt.

I went through the code and figured the required patches are all there. The package apparmor-profiles was missing and I installed that.

I had to add #include <abstractions/libvirt-qemu> to /etc/apparmor.d/usr.bin.libvirt and add "/dev/vfio/vfio rw," to /etc/apparmor.d/abstractions/libvirt-qemu to get the hotplug working.

I did the above three things together to get it working, and I am not sure which of them actually fixed it (most likely including libvirt-qemu), because apparmor keeps the profiles in a cache and reinstalling libvirt-daemon-system (which provides /etc/apparmor.d/usr.bin.libvirt) didn't reinstall the file (!!).

Apparmor seems to keep the profiles cached somewhere, and reloading does not help. Everything is working fine now, which makes it hard to say exactly which of the two edits fixed it, or whether installing apparmor-profiles did the trick.

Carol, let me know if you are planning to re-image sometime so we can see exactly which of the 3 helps get rid of the problem.

Would it be sufficient to just document this issue?

For now, maybe we can document the steps.

All steps except step 3 (add /dev/vfio/vfio rw to abstractions/libvirt-qemu) are unavoidable. Step 3 can be avoided if we change the default libvirt-qemu file in the distro.

bugproxy (bugproxy) on 2017-03-31

tags:	added: architecture-ppc64le bugnameltc-151486 severity-high targetmilestone-inin1704
Changed in ubuntu:
assignee:	nobody → Taco Screen team (taco-screen-team)
affects:	ubuntu → qemu (Ubuntu)

Michael Hohnbaum (hohnbaum) wrote on 2017-03-31: Re: [Bug 1678322] [NEW] Ubuntu 17.04 KVM: Can not do hotplug attach

Jon,

Looks like a qemu issue for the Server Team to take a look at.

Michael

On 03/31/2017 02:59 PM, Launchpad Bug Tracker wrote:
> bugproxy (bugproxy) has assigned this bug to you for Ubuntu:
>
> ---Problem Description---
> I am trying to do hotplug attach with Mellanox CX3 card to a guest but I get failure.
> virsh attach-device powerio-le12-ubuntu-17.04 ./add_cx3.xml
> error: Failed to attach device from ./add_cx3.xml
> error: internal error: unable to execute QEMU command 'device_add': vfio error: 0044:01:00.0: failed to setup container for group 6: RAM memory listener initialization failed for container
>
>
> from the log file from qemu I see this:
> 2017-02-14T22:55:40.721108Z qemu-system-ppc64: backend does not support BE vnet headers; falling back on use rspace virtio
>
> This is with kernel 4.9.0-15-generic and qemu level:
> dpkg --list| grep qemu
> ii  ipxe-qemu                                     1.0.0+git-20150424.a25a16d-1ubuntu2      all          PXE boot firmware - ROM images for qemu
> ii  qemu                                          1:2.8+dfsg-2ubuntu1                      ppc64el      fast processor emulator
> ii  qemu-block-extra:ppc64el                      1:2.8+dfsg-2ubuntu1                      ppc64el      extra block backend modules for qemu-system and qemu-utils
> ii  qemu-kvm                                      1:2.8+dfsg-2ubuntu1                      ppc64el      QEMU Full virtualization
> ii  qemu-slof                                     20161019+dfsg-1                          all          Slimline Open Firmware -- QEMU PowerPC version
> ii  qemu-system                                   1:2.8+dfsg-2ubuntu1                      ppc64el      QEMU full system emulation binaries
> ii  qemu-system-arm                               1:2.8+dfsg-2ubuntu1                      ppc64el      QEMU full system emulation binaries (arm)
> ii  qemu-system-common                            1:2.8+dfsg-2ubuntu1                      ppc64el      QEMU full system emulation binaries (common files)
> ii  qemu-system-mips                              1:2.8+dfsg-2ubuntu1                      ppc64el      QEMU full system emulation binaries (mips)
> ii  qemu-system-misc                              1:2.8+dfsg-2ubuntu1                      ppc64el      QEMU full system emulation binaries (miscellaneous)
> ii  qemu-system-ppc                               1:2.8+dfsg-2ubuntu1                      ppc64el      QEMU full system emulation binaries (ppc)
> ii  qemu-system-sparc                             1:2.8+dfsg-2ubuntu1                      ppc64el      QEMU full system emulation binaries (sparc)
> ii  qemu-system-x86                               1:2.8+dfsg-2ubuntu1                      ppc64el      QEMU full system emulation binaries (x86)
> ii  qemu-user                                     1:2.8+dfsg-2ubuntu1                      ppc64el      QEMU user mode emulation binaries
> ii  qemu-user-binfmt                              1:2.8+dfsg-2ubuntu1                      ppc64el      QEMU user mode binfmt registration for qemu-user
> ii  qemu-utils                                    1:2.8+dfsg-2ubuntu1                      ppc64el      QEMU utilities
>
>  
> ---uname output---
> 4.9.0-15-generic #16-Ubuntu SMP Fri Jan 20 15:28:49 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux
>  
> Machine Type = P8 
>  
> ---Steps to Reproduce---
>  Bring up a guest and then try to attach the device like this:
>  virsh attach-device powerio-le12-ubuntu-17.04 ./add_cx3.xml --live
> error: Failed to attach device from ./add_cx3.xml
> error: internal error: unable to execute QEMU command 'device_add': vfio error: 0044:01:00.0: failed to setup container for group 6: RAM memory listener initialization failed for container
>
> When I retried the steps for add_cx3.xml on the same machine I noticed
> the following in the host logs:
>
> [ 1374.276210] KVM guest htab at c000001e56000000 (order 26), LPID 1
> [ 1383.824281] hrtimer: interrupt took 923 ns
> [ 1447.479194] audit_printk_skb: 15 callbacks suppressed
> [ 1447.479198] audit: type=1400 audit(1487194729.006:17): apparmor="DENIED" operation="setrlimit" profile="/usr/sbin/libvirtd" pid=6853 comm="libvirtd" rlimit=memlock value=8694792192
> [ 1447.481927] pci 0044:01     : [PE# 002] Disabling 64-bit DMA bypass
> [ 1447.481935] pci 0044:01     : [PE# 002] Removing DMA window #0
> [ 1447.481978] pci 0044:01     : [PE# 002] Removing DMA window #0
> [ 1447.481980] pci 0044:01     : [PE# 002] Removing DMA window #1
> [ 1447.485667] pci 0044:01     : [PE# 002] Setting up window#0 0..7fffffff pg=1000
> [ 1447.485670] pci 0044:01     : [PE# 002] Enabling 64-bit DMA bypass
> [ 1517.030701] audit: type=1400 audit(1487194798.559:18): apparmor="DENIED" operation="setrlimit" profile="/usr/sbin/libvirtd" pid=6853 comm="libvirtd" rlimit=memlock value=8694792192
> [ 1517.033286] pci 0044:01     : [PE# 002] Disabling 64-bit DMA bypass
> [ 1517.033290] pci 0044:01     : [PE# 002] Removing DMA window #0
> [ 1517.033322] pci 0044:01     : [PE# 002] Removing DMA window #0
> [ 1517.033325] pci 0044:01     : [PE# 002] Removing DMA window #1
> [ 1517.036971] pci 0044:01     : [PE# 002] Setting up window#0 0..7fffffff pg=1000
> [ 1517.036974] pci 0044:01     : [PE# 002] Enabling 64-bit DMA bypass
>
> I'm not sure if the apparmor issues are affecting functionality or not.
> That may be worth looking into as a separate bug, or as a dupe of
> https://bugzilla.linux.ibm.com/show_bug.cgi?id=146192
>
> As noted there I did the following to work around it:
>
> sudo aa-complain /usr/sbin/libvirtd
> sudo aa-complain /etc/apparmor.d/libvirt/libvirt-????????-????-????-????-????????????
>
> I still got the VFIO memory listener error, however. If I install QEMU
> 2.7.0 I no longer see the VFIO error and things seem to succeed from a
> host perspective:
>
> root@powerio-le11:/etc/libvirt/qemu# virsh attach-device powerio-le12-ubuntu-17.04 ./add_cx3.xml --live
> Device attached successfully
>
> root@powerio-le11:/etc/libvirt/qemu# dmesg | tail -6
> [ 3880.813971] KVM guest htab at c000001e56000000 (order 26), LPID 1
> [ 3917.656384] audit: type=1400 audit(1487197199.210:26): apparmor="ALLOWED" operation="setrlimit" profile="/usr/sbin/libvirtd" pid=6853 comm="libvirtd" rlimit=memlock value=8694792192
> [ 3917.659276] pci 0044:01     : [PE# 002] Disabling 64-bit DMA bypass
> [ 3917.659284] pci 0044:01     : [PE# 002] Removing DMA window #0
> [ 3917.688803] vfio-pci 0044:01:00.0: enabling device (0400 -> 0402)
> [ 3917.800106] vfio_ecap_init: 0044:01:00.0 hiding ecap 0x19@0x18c
>
> In the guest things look okay initially:
>
> [   28.797667] RTAS: event: 1, Type: Unknown, Severity: 1
> [   29.062821] pci 0000:00:05.0: [15b3:1007] type 00 class 0x020000
> [   29.063118] pci 0000:00:05.0: reg 0x10: [mem 0x100a0000000-0x100a00fffff 64bit]
> [   29.063341] pci 0000:00:05.0: reg 0x18: [mem 0x2c0200000000-0x2c0201ffffff 64bit pref]
> [   29.063701] pci 0000:00:05.0: reg 0x30: [mem 0x00000000-0x000fffff pref]
> [   29.065237] iommu: Adding device 0000:00:05.0 to group 0
> [   29.065332] pci 0000:00:05.0: BAR 2: assigned [mem 0x10122000000-0x10123ffffff 64bit pref]
> [   29.065675] pci 0000:00:05.0: BAR 0: assigned [mem 0x10121800000-0x101218fffff 64bit]
> [   29.066010] pci 0000:00:05.0: BAR 6: assigned [mem 0x100a0000000-0x100a00fffff pref]
> [   29.066105] mlx4_core: Mellanox ConnectX core driver v4.0-1.0.1 (29 Jan 2017)
> [   29.066127] mlx4_core: Initializing 0000:00:05.0
> [   29.066210] mlx4_core 0000:00:05.0: enabling device (0000 -> 0002)
> [   29.076273] mlx4_core 0000:00:05.0: Using 64-bit direct DMA at offset 800000000000000
>
>
> but eventually I see the following error:
>
>
> [   89.925954] mlx4_core 0000:00:05.0: device is going to be reset
> [   99.923755] mlx4_core 0000:00:05.0: Failed to obtain HW semaphore, aborting
> [   99.924052] mlx4_core 0000:00:05.0: Fail to reset HCA
> [   99.924305] kernel BUG at /var/lib/dkms/mlnx-ofed-kernel/4.0/build/drivers/net/ethernet/mellanox/mlx4/catas.c:193!
> [   99.924643] Oops: Exception in kernel mode, sig: 5 [#1]
> [   99.924811] SMP NR_CPUS=2048 [   99.924889] NUMA 
> [   99.924968] pSeries
> [   99.925048] Modules linked in: rdma_ucm(OE) ib_ucm(OE) ib_ipoib(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_ib(OE) mlx4_en(OE) mlx4_core(OE) devlink vmx_crypto ib_iser rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_core(OE) mlx_compat(OE) configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi knem(OE) ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ibmvscsi crc32c_vpmsum virtio_net virtio_blk
> [   99.927029] CPU: 10 PID: 4600 Comm: drmgr Tainted: G           OE   4.9.0-12-generic #13-Ubuntu
> [   99.927316] task: c0000001dfc27e00 task.stack: c0000001dd630000
> [   99.927515] NIP: d000000003c62794 LR: d000000003c6277c CTR: c0000000006c4a80
> [   99.927752] REGS: c0000001dd6332a0 TRAP: 0700   Tainted: G           OE    (4.9.0-12-generic)
> [   99.928029] MSR: 800000010282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>[   99.928645]   CR: 48022222  XER: 20000000
> [   99.928764] CFAR: c000000000710c28 SOFTE: 1 
> GPR00: d000000003c6277c c0000001dd633520 d000000003cbca7c 0000000000000029 
> GPR04: 0000000000000001 000000000000014d 6573657420484341 0d0a6c20746f2072 
> GPR08: 0000000000000007 0000000000000001 c0000001e2a94300 0000000000000006 
> GPR12: 0000000000002200 c00000000fb85a00 c0000001dd9a0060 0000000000000000 
> GPR16: 00000000024080c0 00000000024202c2 0000000000000000 d000000003cb7418 
> GPR20: 0000000000000000 d0000800802c0680 c0000001dd9a04e8 c0000001dd9a0518 
> GPR24: 000000000000ea60 0000000000000000 c000000001443a00 c0000001dd6336f0 
> GPR28: 0000000000000000 0000000000000004 c0000001e2a94360 c0000001dd9a0060 
> NIP [d000000003c62794] mlx4_enter_error_state.part.0+0x35c/0x460 [mlx4_core]
> [   99.931952] LR [d000000003c6277c] mlx4_enter_error_state.part.0+0x344/0x460 [mlx4_core]
> [   99.932190] Call Trace:
> [   99.932278] [c0000001dd633520] [d000000003c6277c] mlx4_enter_error_state.part.0+0x344/0x460 [mlx4_core] (unreliable)
> [   99.932647] [c0000001dd6335b0] [d000000003c66df8] __mlx4_cmd+0x720/0x970 [mlx4_core]
> [   99.932946] [c0000001dd633680] [d000000003c73d88] mlx4_QUERY_FW+0x90/0x420 [mlx4_core]
> [   99.933238] [c0000001dd633730] [d000000003c7fd28] mlx4_load_one+0x440/0x1ac0 [mlx4_core]
> [   99.933520] [c0000001dd633850] [d000000003c81a40] mlx4_init_one+0x698/0x7c0 [mlx4_core]
> [   99.933922] [c0000001dd633960] [c00000000063049c] local_pci_probe+0x6c/0x140
> [   99.934171] [c0000001dd6339f0] [c0000000006312e8] pci_device_probe+0x178/0x200
> [   99.934430] [c0000001dd633a50] [c000000000716970] driver_probe_device+0x240/0x540
> [   99.934657] [c0000001dd633ae0] [c00000000071344c] bus_for_each_drv+0x8c/0xf0
> [   99.934848] [c0000001dd633b30] [c0000000007164f0] __device_attach+0x140/0x210
> [   99.935057] [c0000001dd633bc0] [c000000000621d38] pci_bus_add_device+0x78/0x100
> [   99.935270] [c0000001dd633c30] [c000000000621e20] pci_bus_add_devices+0x60/0xe0
> [   99.935488] [c0000001dd633c70] [c000000000625b44] pci_rescan_bus+0x44/0x70
> [   99.935666] [c0000001dd633ca0] [c000000000631ee4] bus_rescan_store+0x84/0xb0
> [   99.935840] [c0000001dd633ce0] [c000000000712fb4] bus_attr_store+0x44/0x70
> [   99.936039] [c0000001dd633d00] [c0000000003d52b8] sysfs_kf_write+0x68/0xa0
> [   99.936210] [c0000001dd633d20] [c0000000003d417c] kernfs_fop_write+0x17c/0x250
> [   99.936407] [c0000001dd633d70] [c00000000031924c] __vfs_write+0x3c/0x70
> [   99.936583] [c0000001dd633d90] [c00000000031a4b4] vfs_write+0xd4/0x240
> [   99.936760] [c0000001dd633de0] [c00000000031c018] SyS_write+0x68/0x110
> [   99.936934] [c0000001dd633e30] [c00000000000bd84] system_call+0x38/0xe0
> [   99.937102] Instruction dump:
> [   99.937188] e93f0000 3d020000 e8888078 e8690000 386300a0 4803f8f1 e8410018 e95f0000 
> [   99.937472] e92a0000 81290098 2f890001 409efea0 <0fe00000> 60000000 60420000 e93f0000 
> [   99.937726] ---[ end trace 66826e43e8c8b7ba ]---
> [   99.937832]
>
> It's not clear to me if this new guest issue is specific to QEMU 2.7, or
> something that would also be present on 2.8 if not for the VFIO issue
> originally noted in this bug. First step I think will be to root-cause
> the VFIO issue, fix it, and see if the guest issue remains afterward. If
> it does we can track that as a separate bug (or perhaps we have already
> seen this somewhere? seems vaguely familiar).
>
> Need to hop off the machine for today, but can look at it more tomorrow.
>
> (In reply to comment #10)
>
>> [ 1517.030701] audit: type=1400 audit(1487194798.559:18): apparmor="DENIED"
>> operation="setrlimit" profile="/usr/sbin/libvirtd" pid=6853 comm="libvirtd"
>> rlimit=memlock value=8694792192
>> I'm not sure if the apparmor issues are affecting functionality or not. That
>> may be worth looking into as a separate bug, or as a dupe of
>> https://bugzilla.linux.ibm.com/show_bug.cgi?id=146192
>>
> Let me check the Ubuntu 16.10 system again, because I updated /etc/libvirt/qemu.conf on Ubuntu 17.04 with the same steps I used on 16.10 but still see the problem. Not sure if I did something else.
>
>> It's not clear to me if this new guest issue is specific to QEMU 2.7, or
>> something that would also be present on 2.8 if not for the VFIO issue
>> originally noted in this bug. First step I think will be to root-cause the
>> VFIO issue, fix it, and see if the guest issue remains afterward. If it does
>> we can track that as a separate bug (or perhaps we have already seen this
>> somewhere? seems vaguely familiar).
>>
>> Need to hop off the machine for today, but can look at it more tomorrow.
> I see this with Ubuntu 16.10 KVM as well; the issue is that the commands time out, as if the DMAs are not getting to the HW. I see this with every Mellanox card I have tried. I can open a separate bug specific to 16.10 if you want.
>
> == Comment: #15 - MICHAEL D. ROTH <mdroth@us.ibm.com> - 2017-02-22 13:22:53 ==
> I tried a bisect between 2.7.0 and 2.8.0/hostos to find the origin of these errors:
>
> root@powerio-le11:/etc/libvirt/qemu# virsh attach-device powerio-le12-ubuntu-17.04 ./add_cx3.xml --live
> error: Failed to attach device from ./add_cx3.xml
> error: internal error: unable to execute QEMU command 'device_add': Device initialization failed
>
> The commit that caused the "breakage" was:
>
> root@powerio-le11:~/mdroth/qemu.git# git bisect good
> 01905f58f166646619c35a2ebfc3ca3ed4cad62d is the first bad commit
> commit 01905f58f166646619c35a2ebfc3ca3ed4cad62d
> Author: Eric Auger <eric.auger@redhat.com>
> Date:   Mon Oct 17 10:57:59 2016 -0600
>
>     vfio: Pass an Error object to vfio_connect_container
>
>
> However, all that commit does is turn vfio init errors into fatal errors that are passed on to libvirt, as opposed to just logging them in the background and continuing execution. If I go back to 2.7.0 and re-test, I find that while libvirt reports the attach as successful, the log file still shows:
>
> LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/kvm -name guest=powerio-le12-ubuntu-17.04,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-17-powerio-le12-ubuntu-/master-key.aes -machine pseries-2.7,accel=kvm,usb=off,dump-guest-core=off -m 8192 -realtime mlock=off -smp 16,sockets=1,cores=2,threads=8 -uuid bd3248c2-5686-4e18-b86e-799292bf4ad3 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-17-powerio-le12-ubuntu-/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device pci-ohci,id=usb,bus=pci.0,addr=0x2 -device spapr-vscsi,id=scsi0,reg=0x2000 -drive file=/var/lib/libvirt/images/powerio-le12-ubuntu-17.04.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive if=none,id=drive-scsi0-0-0-0,readonly=on -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:eb:a9:da,bus=pci.0,addr=0x1 -chardev pty,id=charserial0 -device spapr-vty,chardev=charserial0,reg=0x30000000 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on
> Domain id=17 is tainted: high-privileges
> char device redirected to /dev/pts/5 (label charserial0)
> vfio: RAM memory listener initialization failed for container
>
> So this issue seems to have existed since before 2.7.0, assuming it
> stems from QEMU and is not related to the kernel. Will look into it more.
>
> == Comment: #16 - MICHAEL D. ROTH <mdroth@us.ibm.com> - 2017-02-22 18:02:36 ==
> I think this is some sort of permissions/rlimit issue after all.
>
> If I invoke QEMU directly without libvirt, then do the attach from the
> QEMU monitor, I see the device added successfully with no error, and I
> also don't see the subsequent crashes within the guest relating to
> mlx4_QUERY_FW:
>
> root@powerio-le11:~/mdroth/qemu-build# ppc64-softmmu/qemu-system-ppc64 -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-powerio-le12-ubuntu-/master-key.aes -machine pseries-2.7,accel=kvm,usb=off,dump-guest-core=off -m 8192 -realtime mlock=off -smp 16,sockets=1,cores=2,threads=8 -uuid bd3248c2-5686-4e18-b86e-799292bf4ad3 -display none -no-user-config -nodefaults -rtc base=utc -no-shutdown -boot strict=on -device pci-ohci,id=usb,bus=pci.0,addr=0x2 -device spapr-vscsi,id=scsi0,reg=0x2000 -drive file=/var/lib/libvirt/images/powerio-le12-ubuntu-17.04.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive if=none,id=drive-scsi0-0-0-0,readonly=on -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:eb:a9:da,bus=pci.0,addr=0x1 -device spapr-vty,chardev=charserial0,reg=0x30000000 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on -vga none -nographic -chardev stdio,mux=on,id=charserial0 -monitor chardev:charserial0
>
> root@powerio-le11:~/mdroth# ./vfio-bind 0044:01:00.0 
> unbinding 0044:01:00.0 via /sys/bus/pci/devices/0044:01:00.0/driver/unbind
> binding 0044:01:00.0
> echo 0x15b3 0x1007 >/sys/bus/pci/drivers/vfio-pci/new_id
>
> (qemu) device_add vfio-pci,host=0044:01:00.0,id=hp0
>
> root@powerio-le12:~# dmesg | tail -36
> [  236.294903] RTAS: event: 1, Type: Unknown, Severity: 1
> [  236.574958] pci 0000:00:00.0: [15b3:1007] type 00 class 0x020000
> [  236.575630] pci 0000:00:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit]
> [  236.575986] pci 0000:00:00.0: reg 0x18: [mem 0x00000000-0x01ffffff 64bit pref]
> [  236.576592] pci 0000:00:00.0: reg 0x30: [mem 0x00000000-0x000fffff pref]
> [  236.578890] iommu: Adding device 0000:00:00.0 to group 0
> [  236.578985] pci 0000:00:00.0: BAR 2: assigned [mem 0x10122000000-0x10123ffffff 64bit pref]
> [  236.580466] pci 0000:00:00.0: BAR 0: assigned [mem 0x10121800000-0x101218fffff 64bit]
> [  236.580921] pci 0000:00:00.0: BAR 6: assigned [mem 0x100a0000000-0x100a00fffff pref]
> [  236.581011] mlx4_core: Mellanox ConnectX core driver v4.0-1.0.1 (29 Jan 2017)
> [  236.581162] mlx4_core: Initializing 0000:00:00.0
> [  236.581272] mlx4_core 0000:00:00.0: enabling device (0000 -> 0002)
> [  236.583876] mlx4_core 0000:00:00.0: Using 64-bit direct DMA at offset 800000000000000
> [  242.122882] mlx4_core: device is working in RoCE mode: Roce V1
> [  242.122884] mlx4_core: UD QP Gid type is: V1
> [  243.652901] mlx4_core 0000:00:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
> [  243.652904] mlx4_core 0000:00:00.0: PCIe link width is x8, device supports x8
> [  243.877392] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-1.0.1 (29 Jan 2017)
> [  243.877592] mlx4_en 0000:00:00.0: Activating port:1
> [  243.904087] mlx4_en: 0000:00:00.0: Port 1: Using 128 TX rings
> [  243.904090] mlx4_en: 0000:00:00.0: Port 1: Using 8 RX rings
> [  243.904093] mlx4_en: 0000:00:00.0: Port 1:   frag:0 - size:1522 prefix:0 stride:1536
> [  243.904770] mlx4_en: 0000:00:00.0: Port 1: Initializing port
> [  243.905354] mlx4_en 0000:00:00.0: registered PHC clock
> [  243.906985] mlx4_en 0000:00:00.0: Activating port:2
> [  243.917716] mlx4_core 0000:00:00.0 enp0s0: renamed from eth0
> [  243.919899] mlx4_en: 0000:00:00.0: Port 2: Using 128 TX rings
> [  243.919901] mlx4_en: 0000:00:00.0: Port 2: Using 8 RX rings
> [  243.919903] mlx4_en: 0000:00:00.0: Port 2:   frag:0 - size:1522 prefix:0 stride:1536
> [  243.920694] mlx4_en: 0000:00:00.0: Port 2: Initializing port
> [  243.941713] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v4.0-1.0.1 (29 Jan 2017)
> [  244.039494] <mlx4_ib> mlx4_ib_add: counter index 2 for port 1 allocated 1
> [  244.039520] <mlx4_ib> mlx4_ib_add: counter index 3 for port 2 allocated 1
> [  244.098796] mlx4_core 0000:00:00.0 enp0s0d1: renamed from eth0
> [  245.266775] mlx4_en: enp0s0: Link Up
> [  245.266891] mlx4_en: enp0s0d1: Link Up
>
> Everything appears to be functioning. Also worth noting, the host
> doesn't report any apparmor messages:
>
> [ 3683.945997] KVM guest htab at c000001e5a000000 (order 26), LPID 2
> [ 3878.433033] br0: port 2(vnet0) entered disabled state
> [ 3878.436993] device vnet0 left promiscuous mode
> [ 3878.436995] br0: port 2(vnet0) entered disabled state
> [ 3927.505181] pci 0044:01     : [PE# 02] Disabling 64-bit DMA bypass
> [ 3927.505188] pci 0044:01     : [PE# 02] Removing DMA window #0
> [ 3928.018862] pci 0044:01     : [PE# 02] Setting up window#0 0..3fffffff pg=1000
> [ 3928.024266] pci 0044:01     : [PE# 02] Setting up window#1 800000000000000..8000001ffffffff pg=10000
> [ 3928.403651] vfio-pci 0044:01:00.0: enabling device (0400 -> 0402)
> [ 3928.514975] vfio_ecap_init: 0044:01:00.0 hiding ecap 0x19@0x18c
>
> If I try to hotplug the device via libvirt, I see the vfio listener
> registration failure originally noted. If I enable traces in QEMU, I
> see where that listener failure is stemming from:
>
> LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/kvm -name guest=powerio-le12-ubuntu-17.04,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-powerio-le12-ubuntu-/master-key.aes -machine pseries-2.7,accel=kvm,usb=off,dump-guest-core=off -m 8192 -realtime mlock=off -smp 16,sockets=1,cores=2,threads=8 -uuid bd3248c2-5686-4e18-b86e-799292bf4ad3 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-2-powerio-le12-ubuntu-/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device pci-ohci,id=usb,bus=pci.0,addr=0x2 -device spapr-vscsi,id=scsi0,reg=0x2000 -drive file=/var/lib/libvirt/images/powerio-le12-ubuntu-17.04.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive if=none,id=drive-scsi0-0-0-0,readonly=on -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:eb:a9:da,bus=pci.0,addr=0x1 -chardev pty,id=charserial0 -device spapr-vty,chardev=charserial0,reg=0x30000000 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on
> Domain id=2 is tainted: high-privileges
> 2017-02-22T23:01:35.080908Z qemu-system-ppc64: -chardev pty,id=charserial0: char device redirected to /dev/pts/6 (label charserial0)
> 9697@1487804633.619707:vfio_realize  (0044:01:00.0) group 6
> 9697@1487804633.619783:vfio_prereg_register va=3ffd2bff0000 size=200000000 ret=-12
> 9697@1487804633.619788:vfio_prereg_listener_region_add_skip 10080000020 - 1008000003f
> 9697@1487804633.619791:vfio_prereg_listener_region_add_skip 10080000040 - 1008000007f
> 9697@1487804633.619794:vfio_prereg_listener_region_add_skip 10080000080 - 1008000009f
> 9697@1487804633.619797:vfio_prereg_listener_region_add_skip 100e0000000 - 100e000001f
> 9697@1487804633.619799:vfio_prereg_listener_region_add_skip 100e0000800 - 100e0000807
> 9697@1487804633.619802:vfio_prereg_listener_region_add_skip 100e0001000 - 100e00010ff
> 9697@1487804633.619804:vfio_prereg_listener_region_add_skip 100e0002000 - 100e000202f
> 9697@1487804633.619806:vfio_prereg_listener_region_add_skip 100e0002800 - 100e0002807
> 9697@1487804633.619809:vfio_prereg_listener_region_add_skip 10120000000 - 10120000fff
> 9697@1487804633.619811:vfio_prereg_listener_region_add_skip 10120001000 - 10120001fff
> 9697@1487804633.619814:vfio_prereg_listener_region_add_skip 10120002000 - 10120002fff
> 9697@1487804633.619816:vfio_prereg_listener_region_add_skip 10120003000 - 10120402fff
> 9697@1487804633.619819:vfio_prereg_listener_region_add_skip 10120800000 - 10120800fff
> 9697@1487804633.619821:vfio_prereg_listener_region_add_skip 10120801000 - 10120801fff
> 9697@1487804633.619823:vfio_prereg_listener_region_add_skip 10120802000 - 10120802fff
> 9697@1487804633.619826:vfio_prereg_listener_region_add_skip 10120803000 - 10120c02fff
> 9697@1487804633.619828:vfio_prereg_listener_region_add_skip 10121000000 - 10121000fff
> 9697@1487804633.619831:vfio_prereg_listener_region_add_skip 10121001000 - 10121001fff
> 9697@1487804633.619833:vfio_prereg_listener_region_add_skip 10121002000 - 10121002fff
> 9697@1487804633.619835:vfio_prereg_listener_region_add_skip 10121003000 - 10121402fff
>
> vfio_prereg_register's ret=-12 is the errno value set by:
>
>     ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
>
> which indicates that VFIO_IOMMU_SPAPR_REGISTER_MEMORY is failing with
> "Cannot allocate memory". In the host, I see an apparmor message:
>
> [ 1607.260426] KVM guest htab at c000001e56000000 (order 26), LPID 1
> [ 1745.761165] audit: type=1400 audit(1487804633.611:18): apparmor="ALLOWED" operation="setrlimit" profile="/usr/sbin/libvirtd" pid=5329 comm="libvirtd" rlimit=memlock value=8694792192
> [ 1745.763764] pci 0044:01     : [PE# 02] Disabling 64-bit DMA bypass
> [ 1745.763771] pci 0044:01     : [PE# 02] Removing DMA window #0
> [ 1745.763864] pci 0044:01     : [PE# 02] Removing DMA window #0
> [ 1745.763867] pci 0044:01     : [PE# 02] Removing DMA window #1
> [ 1745.767676] pci 0044:01     : [PE# 02] Setting up window#0 0..7fffffff pg=1000
> [ 1745.767679] pci 0044:01     : [PE# 02] Enabling 64-bit DMA bypass
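
As a side note on the trace above, the vfio_prereg_register fields can be decoded with a few lines of Python. This is only a sketch: the helper name decode_prereg is made up for illustration, and it assumes the size field is printed in hexadecimal (which gives 8 GiB, consistent with the guest's -m 8192).

```python
import errno
import os

# Trace line copied from the QEMU trace output quoted above.
TRACE = "vfio_prereg_register va=3ffd2bff0000 size=200000000 ret=-12"

def decode_prereg(trace):
    """Return (size_in_bytes, errno_name) for a vfio_prereg_register trace line."""
    # Skip the event name, then split each "key=value" token.
    fields = dict(item.split("=") for item in trace.split()[1:])
    size = int(fields["size"], 16)               # assumption: size is hex
    name = errno.errorcode[-int(fields["ret"])]  # ret is a negated errno
    return size, name

size, name = decode_prereg(TRACE)
print(size // 2**30, "GiB", name, os.strerror(errno.ENOMEM))
```

Under that assumption the failed registration covers 8 GiB, slightly less than the memlock value of 8694792192 bytes that libvirtd tried to set.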
>
> Originally these were "DENIED" errors, but in comment #10 I noted I'd
> worked around that via:
>
> sudo aa-complain /usr/sbin/libvirtd
> sudo aa-complain /etc/apparmor.d/libvirt/libvirt-????????-????-????-????-????????????
>
> as noted in https://bugzilla.linux.ibm.com/show_bug.cgi?id=146192
>
> But either that workaround is insufficient, or there is some other
> issue relating to libvirt privilege levels, given that QEMU doesn't
> have any issues when run directly as root.
>
>
> Can you try now? I was using the system over the weekend and the card was dead; the guest was also doing PCI passthrough of the card. So I took the card out of the guest XML and I can recreate the problem again.
> virsh attach-device powerio-le12-ubuntu-17.04 ./add_hydepark.xml --live
> error: Failed to attach device from ./add_hydepark.xml
> error: internal error: unable to execute QEMU command 'device_add': vfio error: 0040:01:00.0: failed to setup container for group 5: RAM memory listener initialization failed for container
>
> This is caused by the memlock hard limit that libvirt sets. Upstream
> 2.5.0 doesn't have the problem.
>
> libvirt starts with a certain value for max memlock and adjusts it during hotplug. For my guest with <memory unit='KiB'>16777216</memory>, upstream 2.5.0 correctly adjusts it on hotplug to
> Max locked memory         17368612864          17368612864          bytes
> whereas the Ubuntu libvirt does not.
>
> The same can be worked around by hard-coding the limits with the following element for the guest powerio-le14-ubuntu-17.04:
>   <memtune>
>     <hard_limit unit='KiB'>16961536</hard_limit>
>     <soft_limit unit='KiB'>16961536</soft_limit>
>   </memtune>
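
For reference, the numbers above fit together as plain arithmetic. A minimal Python sketch (the variable names are only illustrative, and nothing here is libvirt's own calculation) shows that the hard_limit in the workaround is exactly the "Max locked memory" value reported for upstream 2.5.0, expressed in KiB:

```python
KIB = 1024

guest_memory_kib = 16777216           # <memory unit='KiB'> of the guest
hard_limit_kib = 16961536             # <hard_limit unit='KiB'> from the workaround
upstream_memlock_bytes = 17368612864  # "Max locked memory" with upstream 2.5.0

# The workaround's hard_limit, converted to bytes, equals the upstream value.
assert hard_limit_kib * KIB == upstream_memlock_bytes

# Headroom that the limit leaves above the guest memory size.
headroom_mib = (hard_limit_kib - guest_memory_kib) // KIB
print("headroom above guest memory:", headroom_mib, "MiB")
```

So the workaround pins the limit to guest memory plus 180 MiB, the same value the upstream build arrives at on hotplug.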
>
> I am trying to figure out which patch might be missing from the Ubuntu libvirt.
>
> I went through the code and figured the required patches are all there.
> The package apparmor-profiles was missing and I installed that.
>
> I had to add #include <abstractions/libvirt-qemu> to
> /etc/apparmor.d/usr.bin.libvirt and add /dev/vfio/vfio rw, to
> /etc/apparmor.d/abstractions/libvirt-qemu so I could get the hotplug
> working.
>
> I did the above three things together to get it working and am not sure
> which of them actually fixed it (most likely including libvirt-qemu), as
> apparmor keeps the profiles in a cache and reinstalling
> libvirt-daemon-system (which provides /etc/apparmor.d/usr.bin.libvirt)
> didn't reinstall the file(!!).
>
> Apparmor seems to keep the profiles cached somewhere, and reloading does
> not help. Everything is working fine now, which makes it hard to say
> exactly which of the two steps fixed it, or whether installing
> apparmor-profiles did the trick.
>
> Carol, let me know if you are planning a re-image sometime so we can
> see exactly which of the three helps get rid of the problem.
>
> Would it be sufficient to just document this issue?
>
> For now, maybe we can document the steps.
>
> All steps except step 3 (add /dev/vfio/vfio rw to
> abstractions/libvirt-qemu) are unavoidable. Step 3 can be avoided if we
> change the default libvirt-qemu file in the distro.
>
> ** Affects: ubuntu
>      Importance: Undecided
>      Assignee: Taco Screen team (taco-screen-team)
>          Status: New
>
>
> ** Tags: architecture-ppc64le bugnameltc-151486 severity-high targetmilestone-inin1704

-- 
Michael Hohnbaum
OIL Program Manager
Power (ppc64el) Development Project Manager
Canonical, Ltd.

Christian Ehrhardt  (paelzer) wrote on 2017-04-03:

Hi,
Wow, a lot of info to parse through - thank you for your report!

I really think we need to understand the apparmor DENY.
apparmor="DENIED" operation="setrlimit" profile="/usr/sbin/libvirtd" pid=6853 comm="libvirtd" rlimit=memlock value=8694792192
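
To make that denial easier to read, here is a minimal Python sketch that splits such an audit record into its key=value fields and converts the requested memlock value to MiB. The helper name parse_audit_fields is hypothetical; the record is the one from the report.

```python
import re

# Audit record copied from the host log quoted in this report.
RECORD = ('audit: type=1400 audit(1487194798.559:18): apparmor="DENIED" '
          'operation="setrlimit" profile="/usr/sbin/libvirtd" pid=6853 '
          'comm="libvirtd" rlimit=memlock value=8694792192')

def parse_audit_fields(line):
    """Return the key=value pairs of an audit record as a dict (quotes stripped)."""
    return {key: value.strip('"')
            for key, value in re.findall(r'(\w+)=("[^"]*"|\S+)', line)}

fields = parse_audit_fields(RECORD)
requested_mib = int(fields["value"]) // (1024 * 1024)
print(fields["apparmor"], fields["operation"], fields["rlimit"], requested_mib, "MiB")
```

The requested value works out to 8292 MiB, which is 100 MiB more than the 8192 MiB given to the guest with -m 8192.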

#1 Until then you can either put apparmor into complain mode as you did:
$ sudo aa-complain /usr/sbin/libvirtd
$ sudo aa-complain /etc/apparmor.d/libvirt/libvirt-<UUID>

#2 Or use the libvirt XML change that Michael outlined for you; to list all three workarounds here, that one would be:
  <memtune>
    <hard_limit unit='KiB'>16961536</hard_limit>
    <soft_limit unit='KiB'>16961536</soft_limit>
  </memtune>
But I'd assume that hits the same apparmor block (I'll check)

#3 Or you can just raise the limit beforehand on the running qemu
# get the qemu pid
$ prlimit --pid <qemu pid> --memlock=unlimited
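
The same check and change can be scripted. The following is a rough Python sketch of workaround #3 using the standard resource module (Linux only); the two function names are made up for illustration, and raising the limit of another process needs sufficient privileges, so the demo below only reads the limit of the current process.

```python
import os
import resource

def memlock_limits(pid):
    """Return the (soft, hard) RLIMIT_MEMLOCK of a process, in bytes."""
    return resource.prlimit(pid, resource.RLIMIT_MEMLOCK)

def raise_memlock_unlimited(pid):
    """Rough counterpart of `prlimit --pid <pid> --memlock=unlimited`.

    Needs privileges (e.g. root); returns the former (soft, hard) limits.
    """
    unlimited = (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
    return resource.prlimit(pid, resource.RLIMIT_MEMLOCK, unlimited)

# Harmless demo on the current process; a real check would pass the qemu pid.
soft, hard = memlock_limits(os.getpid())
print("memlock soft/hard:", soft, hard)
```

Reading the limit before and after the hotplug attempt is a quick way to see whether libvirt managed to adjust it.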

IIRC /dev/vfio/vfio rw, should be in via cgroup_device_acl in qemu.conf, but I'll recreate your case and make sure.
Back with more updates sometime later today.

Christian Ehrhardt  (paelzer) wrote on 2017-04-03:

I was struggling with HW issues that made my IOMMU-capable system fail, so I'll continue tomorrow.
Some questions for now already.

You mentioned you edited /etc/libvirt/qemu.conf but didn't outline what you changed.
As for vfio, libvirt/qemu already knows about and has access to "/dev/vfio/vfio" by default.
But when you create your new group it will need to be made known as well.
So e.g. if you got /dev/vfio/6 this will have to be added in /etc/libvirt/qemu.conf
at the cgroup_device_acl statement.
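
As an illustration of that edit, here is a small Python sketch that renders a cgroup_device_acl statement with the group node appended. The function name is hypothetical and the device list in the example is deliberately shortened and only illustrative; the real list should be taken from your own /etc/libvirt/qemu.conf.

```python
def render_cgroup_device_acl(devices, vfio_group):
    """Render a qemu.conf cgroup_device_acl statement with /dev/vfio/<group> added."""
    group_node = "/dev/vfio/%d" % vfio_group
    entries = list(devices)
    if group_node not in entries:
        entries.append(group_node)
    body = ",\n".join('    "%s"' % path for path in entries)
    return "cgroup_device_acl = [\n%s\n]" % body

# Illustrative (shortened) device list, not the distribution default.
example = render_cgroup_device_acl(["/dev/null", "/dev/kvm", "/dev/vfio/vfio"], 6)
print(example)
```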

Since this is a hot add, virt-aa-helper fixes won't help, as the device is brought to the guest
after the initial profile is created.

BTW, while not 100% needed, a copy of your add_cx3.xml would be nice.
I have other devices anyway, but it might help to make sure the options and format are the same.

From here I need to sort out:
1. which of the changes is fixing it
  1.1 apparmor-profiles not installed? This is a no-op for your case as libvirt brings all the profiles needed and thereby not the fix.
  1.2 vfio in apparmor abstractions
  1.3 limit in the guest XML

I'll try to do so and get back to you, but answering the questions above will help to not get stuck on missing info.

Changed in qemu (Ubuntu):
assignee:	Taco Screen team (taco-screen-team) → ChristianEhrhardt (paelzer)

Christian Ehrhardt  (paelzer) wrote on 2017-04-04:

Got over the HW issues by fixing Firmware on that §$%&/ server.
Now ready to test your case, thanks for your patience.

I mostly did VF attaches via monitor commands recently, and limits there are not handled by libvirt.
So I didn't realize it might be an issue; reproducing your case now to confirm.

Obviously this is not on power, but I had no matching machine around atm.
But I was able to prove this is not arch dependent, which helps testing in general.

I realized that older howtos guide users to change conf files. I'll sort that out, and we might include some of it in the new profile, depending on how much it exposes the system to potential misuse of vfio.

Christian Ehrhardt  (paelzer) wrote on 2017-04-04:

# on x86, vfio needs the IOMMU enabled via a kernel parameter (if it is not the default for you)
$ echo 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT intel_iommu=on"' | sudo tee /etc/default/grub.d/99-force-iommu.cfg
$ sudo update-grub
# reboot

# prep one of my net devices for VF with vfio-pci
$ sudo rmmod ixgbe
$ sudo modprobe ixgbe max_vfs=8
$ lspci -n -s 0000:04:10.0
04:10.0 0200: 8086:1515 (rev 01)
$ sudo modprobe vfio-pci
# assign id to vfio-pci
$ echo 8086 1515 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
# unbind old driver
$ echo "0000:04:10.0" | sudo tee /sys/bus/pci/devices/0000\:04\:10.0/driver/unbind
# usually auto-bound now but be sure
$ echo "0000:04:10.0" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind
0000:04:10.0
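
The three sysfs writes above follow a fixed pattern, so they can be generated for any PCI function. A minimal Python sketch as a dry run (the function name vfio_bind_plan is made up, and it only prints the writes instead of performing them):

```python
def vfio_bind_plan(bdf, vendor, device):
    """Return the (path, value) sysfs writes that hand one PCI function to vfio-pci."""
    return [
        # teach vfio-pci the vendor/device id
        ("/sys/bus/pci/drivers/vfio-pci/new_id", "%s %s" % (vendor, device)),
        # detach the function from its current driver
        ("/sys/bus/pci/devices/%s/driver/unbind" % bdf, bdf),
        # bind explicitly in case new_id did not auto-bind it
        ("/sys/bus/pci/drivers/vfio-pci/bind", bdf),
    ]

plan = vfio_bind_plan("0000:04:10.0", "8086", "1515")
for path, value in plan:
    print("echo %s > %s" % (value, path))
```

Keeping it a dry run makes it easy to review the paths before running the real commands with sudo.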

#1 get a basic guest
$ sudo apt install uvtool-libvirt
$ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=amd64 label=daily release=zesty
# if you have no keys around also run "ssh-keygen"
$ uvt-kvm create --password=ubuntu zesty-vfio release=zesty arch=amd64 label=daily

# Prep and attach the device to the guest
$ cat vf-04.10.0-pci.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x04' slot='0x10' function='0x0'/>
  </source>
</hostdev>
$ virsh attach-device z-testguest vf-04.10.0-pci.xml
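
If you need such a snippet for other devices, the hostdev XML can be derived from the PCI address. This is just a sketch with a hypothetical helper name, using Python's standard xml.etree module; it assumes a full domain:bus:slot.function address as input.

```python
import xml.etree.ElementTree as ET

def hostdev_xml(bdf):
    """Build a libvirt <hostdev> element for a PCI address like '0000:04:10.0'."""
    domain, bus, rest = bdf.split(":")
    slot, function = rest.split(".")
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address",
                  domain="0x" + domain, bus="0x" + bus,
                  slot="0x" + slot, function="0x" + function)
    return ET.tostring(hostdev, encoding="unicode")

print(hostdev_xml("0000:04:10.0"))
```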

With that in place I could confirm your report.
I see:
1. the setrlimit deny (all workarounds mentioned before work, but they should not be needed)
2. the deny on /dev/vfio/vfio (I'll check the old guides and whether we want to make them work more out of the box)
3. an apparmor profile reload, whose content I need to check

Changed in qemu (Ubuntu):
status:	New → Confirmed
importance:	Undecided → Medium

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-04-04:

On 2.
The profile reload automatically gave the guest (and only that guest) the right device - in my case /dev/vfio/41. So the profile reload is good.

[ 2652.751699] audit: type=1400 audit(1491303691.711:24): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-17a61b87-5132-497c-b928-421ac2ee0c8a" pid=8757 comm="apparmor_parser"

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-04-04:

actually the former was on #3 - I should not change my numbers too often.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-04-04:

[ 2652.756712] audit: type=1400 audit(1491303691.719:25): apparmor="DENIED" operation="open" profile="libvirt-17a61b87-5132-497c-b928-421ac2ee0c8a" name="/dev/vfio/vfio" pid=8486 comm="qemu-system-x86" requested_mask="wr" denied_mask="wr" fsuid=64055 ouid=0

Guides usually tell a user who wants to provide vfio to uncomment the cgroup_device_acl setting that ships commented out by default. I was able to confirm that even with that change the case still fails with the aforementioned apparmor deny.

As suggested the right solution is to add it to the base abstraction being /etc/apparmor.d/abstractions/libvirt-qemu like:
# allow guest access to the generic base vfio interface (LP: #1678322)
/dev/vfio/vfio rw,
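
To see whether a given system already carries that rule, something like the following could be used. It is only a sketch: the helper name has_vfio_rule is hypothetical, and the default path is the abstraction file named above.

```shell
# Hypothetical helper: succeed if an AppArmor abstraction file contains
# an active (not commented out) "/dev/vfio/vfio rw," rule.
has_vfio_rule() {
    # $1 = file to check; defaults to the abstraction named in this report
    grep -Eq '^[[:space:]]*/dev/vfio/vfio[[:space:]]+rw,' \
        "${1:-/etc/apparmor.d/abstractions/libvirt-qemu}" 2>/dev/null
}

if has_vfio_rule; then
    echo "vfio base rule present"
else
    echo "vfio base rule missing"
fi
```

A commented-out rule does not count, because the pattern only allows whitespace before the path.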

The base device should be safe as it has "all but a couple version and extension query interfaces locked away" [1].

This is not new; the open on this path has been in the code since 2014, so I wonder whether everyone using it just disabled it or tweaked things manually.
This part should definitely be added to the base profile.

Looking into the setrlimit next.

[1]: https://www.kernel.org/doc/Documentation/vfio.txt

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-04-04:

Doing a double check with the security team on adding the base vfio ...

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-04-04:

#10

Ok, I have an ack from security on that change - going on with the setrlimit now

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-04-04:

#11

on #1
profile for /usr/sbin/libvirtd
[ 2652.571679] audit: type=1400 audit(1491303691.531:23): apparmor="DENIED" operation="setrlimit" profile="/usr/sbin/libvirtd" pid=7587 comm="libvirtd" rlimit=memlock value=1610612736
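
When comparing several of these denials it helps to pull single fields out of the audit line. The snippet below is a rough sketch for that; audit_field is a hypothetical helper, and it assumes field values contain no spaces, which holds for the lines quoted in this report.

```shell
# Hypothetical helper: print the value of one key=value field from an
# AppArmor audit line, with surrounding double quotes removed.
audit_field() {
    # $1 = field name, $2 = full audit line
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p" | tr -d '"'
}

line='audit: type=1400 audit(1491303691.531:23): apparmor="DENIED" operation="setrlimit" profile="/usr/sbin/libvirtd" pid=7587 comm="libvirtd" rlimit=memlock value=1610612736'

audit_field operation "$line"
audit_field rlimit "$line"
# requested limit converted from bytes to MiB
echo "$(( $(audit_field value "$line") / 1048576 ))"
```

For the denial above this prints setrlimit, memlock and 1536, so the blocked request was for a 1536 MiB memlock limit.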

It isn't really clear to me why or where AppArmor is blocking that access.
After a decent debugging session with the security team it turned out that even if it worked it would not help. When allowed, the setrlimit changes the global limit but not the limit of the qemu process, so the vfio allocation failure stays.

It actually is a bug that it is blocked, but since allowing it does not increase the limit of the target qemu, fixing that alone does not get us any further.
Nevertheless I created a spin-off bug 1679704 for that.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-04-04:

#12

The call to setrlimit should be from qemuDomainAdjustMaxMemLock->virProcessSetMaxMemLock
And it has a pid switch:
0 - increase the global limit via setrlimit
<pid> increase the limit of a process via prlimit

That should have a pid set in our case, then use prlimit and thereby actually increase the limit we need. I'll continue debugging this with logs and gdb to see where this is coming from and why we don't have a vm-pid.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-04-04:

#13

After another breakdown of my usual testbed I went to ppc64el with an Emulex to continue tests.
Never should have trusted in x86 right :-)

# Prep simple guest
$ sudo apt install uvtool-libvirt
$ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=ppc64el label=daily release=zesty
$ cat template-ppc.xml
<domain type='kvm'>
  <os>
    <type arch="ppc64le" machine="pseries">hvm</type>
    <boot dev='hd' />
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <devices>
    <interface type='network'>
      <source network='default'/>
      <model type='virtio'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/3'/>
      <target port='0'/>
    </serial>
    <graphics type='vnc' autoport='yes' listen='127.0.0.1'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>
    <video/>
  </devices>
</domain>
$ ssh-keygen
$ sudo ppc64_cpu --smt=off
$ uvt-kvm create --password=ubuntu --template template-ppc.xml z-test release=zesty arch=ppc64el label=daily

# Prep VFs and attach
$ echo 4 | sudo tee /sys/bus/pci/devices/0005\:01\:00.0/sriov_numvfs

$ sudo modprobe vfio-pci
$ lspci -n -s 0005:01:01.3
0005:01:01.3 0200: 10df:e228 (rev 10)

$ cat VF-5.1.1.3.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0005' bus='0x01' slot='0x01' function='0x3'/>
  </source>
</hostdev>
$ virsh attach-device z-test VF-5.1.1.3.xml

While the Emulex card I have struggles with being a proper VF, it is still enough (for now) to continue debug.
[ 864.164676] be2net 0005:01:01.3: MSIx enable failed
[ 864.164699] be2net 0005:01:01.3: Emulex OneConnect(Lancer) initialization failed
[ 864.164852] be2net: probe of 0005:01:01.3 failed with error -34

If you happen to know the Emulex issue, let me know.
Maybe just some special tweak on using those devices VFs on ppc?
Otherwise I'll keep it as-is until I understand why setrlimit is used instead of prlimit to change the target qemu attributes.

Revision history for this message

bugproxy (bugproxy) wrote on 2017-04-04: Comment bridged from LTC Bugzilla

#14

------- Comment From <email address hidden> 2017-04-04 14:36 EDT-------
(In reply to comment #50)
> If you happen to know the Emulex issue, let me know.
> Maybe just some special tweak on using those devices VFs on ppc?
> Otherwise I'll keep it as-is until I understood why setrlimit is used
> instead of prlimit to change the target qemu attributes.

When we opened this issue it was with SR-IOV, but I can hit this without SR-IOV as well, for example with a Mellanox CX3 without any SR-IOV.
Maybe try the Emulex card in dedicated mode.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-04-06:

#15

Note to myself: minimum steps to redo after a reboot, once the earlier setup is in place
$ sudo ppc64_cpu --smt=off
$ sudo modprobe vfio-pci
# unsure if needed on ppc
$ echo 10df e228 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
$ echo 4 | sudo tee /sys/bus/pci/devices/0005\:01\:00.0/sriov_numvfs
$ virsh attach-device z-test VF-5.1.1.3.xml

Christian Ehrhardt  (paelzer) on 2017-04-06

affects:

qemu (Ubuntu) → libvirt (Ubuntu)

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-04-06:

#16

I've verified that setting the profile to complain mode (e.g. via aa-complain) to let the setrlimit through does not fix the issue. So while it is an issue that this shows up as denied, allowing it would not get the VF attachment working.

What "fixed" it in your case was adding the memtune options that raise the limits when qemu is started.
Another alternative to get it working is to raise them via "sudo prlimit ..." dynamically as libvirt would do.

Both confirm that, as I assumed, we have to debug (or understand, as I might still be off here) why virProcessSetMaxMemLock does not have the pid available to set the target limit via prlimit. That should be the root cause of this issue.
This effort will continue to be tracked in this bug.

I've forked off several of the issues in bugs of their own.
- bug 1679704 against apparmor for the blocking of setrlimit
- bug 1680384 against libvirt to add missing apparmor profile statements
- bug 1680386 against libvirt to add virt-aa-helper code for devspec
I'd ask you to reverse mirror them so you can track and work on them as needed.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-04-06:

#17

Took out the memtune, attached gdb to libvirtd and set breakpoints on:
- virProcessSetMaxMemLock
- virProcessPrLimit
Check limits, qemu by default is on "16777216"
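
One way to do that limit check without extra tools is to read the kernel's per-process limits table. A minimal sketch, assuming a Linux /proc; the helper name memlock_limits is hypothetical:

```shell
# Hypothetical helper: print "<soft> <hard>" for the locked-memory limit
# of a process, read from /proc/<pid>/limits.
memlock_limits() {
    # $1 = pid of the process to inspect, e.g. the qemu pid of the guest
    awk '/^Max locked memory/ { print $4, $5 }' "/proc/$1/limits"
}

# Example call against the current shell; for the guest, pass the qemu pid.
memlock_limits "$$"
```

Running it against the qemu pid before and after the attach attempt shows whether the limit actually moved away from the default.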

Then run
virsh attach-device z-test ~/VF-5.1.1.3.xml

Thread 2 "libvirtd" hit Breakpoint 1, 0x00003fffa85e58f8 in virProcessSetMaxMemLock (pid=35967, bytes=2164260864)

At this time we have:
- 35967 is the correct pid of the target qemu
- 2164260864 (2 GiB + 16 MiB) is higher than the current limit and is presumably what libvirt thinks it needs now

So all should be right and in fact it is - virProcessPrLimit is auto-inlined which makes it less obvious.
But it goes the right path to call virProcessPrLimit.
This is implemented as:
prlimit(pid, resource, new_limit, old_limit);

Exactly on this call I see the setrlimit DENY appear in the log.
A quick check revealed that this is how prlimit is implemented in glibc.
So the direct setrlimit call in virProcessSetMaxMemLock was a bit of a red herring.
It took the right path via prlimit and then the apparmor block kills it.

On prlimit I see correctly:
$4 = {rlim_cur = 2164260864, rlim_max = 2164260864}

According to the doc of prlimit that means capabilities are needed:
To set or get the resources of a process other than itself, the caller must have
"the CAP_SYS_RESOURCE capability, or the real, effective, and saved set user IDs of the target process must match the real user ID of the caller and the real, effective, and saved set group IDs of the target process must match the real group ID of the caller."

Will discuss that on the spin-off apparmor bug 1679704

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-04-06:

#18

Prepared a Test PPA for the apparmor profile misses in bug 1680384 at:
https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2708/

Once there is also a solution at hand for the apparmor issue at bug 1679704 a test with both together would be the next big step to do.

For now waiting ...

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-05-09:

#19

- Added and filled associated SRU Template
- checked once more the dep8 tests on the test bileto ppa
- checked regression tests
- added tasks for affected releases

With that in place uploaded to the unapproved queue of these releases

Please do note that Trusty is a bit old for the SRU as-is; I'll need to re-evaluate it once the others have passed.
Reasoning: a) no complaints so far about that release
           b) I'll need to re-use my test system and heavily modify it as it usually isn't working
              on trusty.
           c) not stalling the fixes for those we can verify more easily

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-05-09:

#20

uh my mistake - that update should have gone to the sibling bug about the apparmor rules.
For this one here we also need the apparmor bug about setrlimit resolved before testing again.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-05-24:

#21

The apparmor rules delivered by libvirt are fixed and released now, what remains blocking this is apparmor bug 1679704 - once that is resolved this will work.

Not sure how you want to track - close this as a dup to the remaining blocker would be my preferred way - opinions?

Revision history for this message

bugproxy (bugproxy) wrote on 2017-05-24:

#22

------- Comment From <email address hidden> 2017-05-24 11:56 EDT-------
(In reply to comment #57)
> The apparmor rules delivered by libvirt are fixed and released now, what
> remains blocking this is apparmor bug 1679704 - once that is resolved this
> will work.
>
> Not sure how you want to track - close this as a dup to the remaining
> blocker would be my preferred way - opinions?

Maybe dup to the remaining blocker so we get notification. Thanks.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2017-05-29:

#23

Ok, will do so - just didn't want to mess up the proxying for you.
Dupping now to the remaining blocker.
Once that is resolved we can (and should) also verify this case over here.

Revision history for this message

bugproxy (bugproxy) wrote on 2017-06-19:

#24

------- Comment From <email address hidden> 2017-06-19 02:57 EDT-------
Owning team - canonical.

Revision history for this message

bugproxy (bugproxy) wrote on 2017-06-21:

#25

------- Comment From <email address hidden> 2017-06-20 20:06 EDT-------
*** This bug has been marked as a duplicate of bug 153457 ***

Remote bug watches

auto-bugzilla.linux.ibm.com #146192