Crash on Cavium ThunderX when using Openvswitch-DPDK: nicvf_eth_dev_init(): Failed to get ready message from PF / eal-intr-thread[41505]: unhandled level 2 translation fault

Bug #1718638 reported by Christian Ehrhardt 
Affects: dpdk (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Crash on Cavium ThunderX when using Openvswitch-DPDK.

Please don't mind the "failed to bind to /var/run/openvswitch/vhost-user-1" errors; they are just a side effect of openvswitch restarting over and over again because of the crash.

The real issue seems to be around:
2017-09-21T10:38:20.785Z|00044|dpdk|ERR|PMD: nicvf_eth_dev_init(): Failed to get ready message from PF
And then, on the way out of this failure, it even crashes:
2017-09-21T10:38:20.786Z|00049|dpdk|ERR|EAL: Driver cannot attach the device (0002:01:01.0)
2017-09-21T10:38:20.786Z|00050|netdev_dpdk|WARN|Error attaching device '0002:01:01.0' to DPDK
[...]
2017-09-21T10:38:23.364Z|00002|daemon_unix|ERR|fork child died before signaling startup (killed (Segmentation fault), core dumped)

In the kernel log there is:
[ 8268.140181] eal-intr-thread[41505]: unhandled level 2 translation fault (11) at 0xffff92e00200, esr 0x92000006, in librte_pmd_thunderx_nicvf.so.17.05[ffff95281000+11000]

Openvswitch Log:
The log below is from starting an OVS that has a ThunderX port configured:
2017-09-21T10:38:06.413Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2017-09-21T10:38:06.424Z|00002|ovs_numa|INFO|Discovered 48 CPU cores on NUMA node 0
2017-09-21T10:38:06.424Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 48 CPU cores
2017-09-21T10:38:06.424Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2017-09-21T10:38:06.424Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2017-09-21T10:38:06.433Z|00006|dpdk|INFO|DPDK Enabled - initializing...
2017-09-21T10:38:06.433Z|00007|dpdk|INFO|No vhost-sock-dir provided - defaulting to /var/run/openvswitch
2017-09-21T10:38:06.433Z|00008|dpdk|INFO|EAL ARGS: ovs-vswitchd -m 2048 --pci-whitelist 0002:01:01.0 --vhost-owner libvirt-qemu:kvm --vhost-perm 0666 -c 0x00000001
2017-09-21T10:38:06.440Z|00009|dpdk|INFO|EAL: Detected 48 lcore(s)
2017-09-21T10:38:06.442Z|00010|dpdk|INFO|EAL: socket owner specified as libvirt-qemu:kvm (64055:117)
2017-09-21T10:38:06.442Z|00011|dpdk|INFO|EAL: socket perm specified as '0666' from '0666'
2017-09-21T10:38:06.481Z|00012|dpdk|INFO|EAL: Probing VFIO support...
2017-09-21T10:38:06.481Z|00013|dpdk|INFO|EAL: VFIO support initialized
2017-09-21T10:38:19.208Z|00014|dpdk|INFO|EAL: PCI device 0002:01:01.0 on NUMA socket 0
2017-09-21T10:38:19.208Z|00015|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:38:19.208Z|00016|dpdk|INFO|EAL: using IOMMU type 1 (Type 1)
2017-09-21T10:38:19.458Z|00017|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=7 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:38:19.458Z|00018|dpdk|INFO|DPDK pdump packet capture enabled
2017-09-21T10:38:19.468Z|00019|dpdk|INFO|DPDK Enabled - initialized
2017-09-21T10:38:19.484Z|00020|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
2017-09-21T10:38:19.484Z|00021|ofproto_dpif|INFO|netdev@ovs-netdev: VLAN header stack length probed as 1
2017-09-21T10:38:19.484Z|00022|ofproto_dpif|INFO|netdev@ovs-netdev: MPLS label stack length probed as 3
2017-09-21T10:38:19.484Z|00023|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports truncate action
2017-09-21T10:38:19.484Z|00024|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports unique flow ids
2017-09-21T10:38:19.484Z|00025|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports clone action
2017-09-21T10:38:19.484Z|00026|ofproto_dpif|INFO|netdev@ovs-netdev: Max sample nesting level probed as 10
2017-09-21T10:38:19.484Z|00027|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports eventmask in conntrack action
2017-09-21T10:38:19.484Z|00028|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_state
2017-09-21T10:38:19.485Z|00029|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_zone
2017-09-21T10:38:19.485Z|00030|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_mark
2017-09-21T10:38:19.485Z|00031|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_label
2017-09-21T10:38:19.485Z|00032|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_state_nat
2017-09-21T10:38:19.485Z|00033|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_orig_tuple
2017-09-21T10:38:19.485Z|00034|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_orig_tuple6
2017-09-21T10:38:19.768Z|00035|bridge|INFO|bridge ovsdpdkbr0: added interface ovsdpdkbr0 on port 65534
2017-09-21T10:38:19.773Z|00036|dpdk|INFO|VHOST_CONFIG: vhost-user server: socket created, fd: 68
2017-09-21T10:38:19.773Z|00037|netdev_dpdk|INFO|Socket /var/run/openvswitch/vhost-user-1 created for vhost-user port vhost-user-1
2017-09-21T10:38:19.778Z|00038|dpdk|ERR|VHOST_CONFIG: failed to bind to /var/run/openvswitch/vhost-user-1: Address already in use; remove it and try again
2017-09-21T10:38:19.778Z|00039|netdev_dpdk|ERR|rte_vhost_driver_start failed for vhost user port: vhost-user-1
2017-09-21T10:38:19.778Z|00040|netdev_dpdk|WARN|dpdkvhostuser ports are considered deprecated; please migrate to dpdkvhostuserclient ports.
2017-09-21T10:38:19.778Z|00041|bridge|WARN|could not open network device vhost-user-1 (Unknown error -1)
2017-09-21T10:38:19.779Z|00042|dpdk|INFO|EAL: PCI device 0002:01:01.0 on NUMA socket 0
2017-09-21T10:38:19.779Z|00043|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:38:20.785Z|00044|dpdk|ERR|PMD: nicvf_eth_dev_init(): Failed to get ready message from PF
2017-09-21T10:38:20.785Z|00045|dpdk|INFO|EAL: Releasing pci mapped resource for 0002:01:01.0
2017-09-21T10:38:20.785Z|00046|dpdk|INFO|EAL: Calling pci_unmap_resource for 0002:01:01.0 at 0xffff92e00000
2017-09-21T10:38:20.785Z|00047|dpdk|INFO|EAL: Calling pci_unmap_resource for 0002:01:01.0 at 0xffff93000000
2017-09-21T10:38:20.785Z|00048|dpdk|WARN|EAL: Requested device 0002:01:01.0 cannot be used
2017-09-21T10:38:20.786Z|00049|dpdk|ERR|EAL: Driver cannot attach the device (0002:01:01.0)
2017-09-21T10:38:20.786Z|00050|netdev_dpdk|WARN|Error attaching device '0002:01:01.0' to DPDK
2017-09-21T10:38:20.786Z|00051|netdev|WARN|dpdk0: could not set configuration (Invalid argument)
2017-09-21T10:38:20.786Z|00052|bridge|INFO|bridge ovsdpdkbr0: using datapath ID 0000e699e66bd34d
2017-09-21T10:38:20.786Z|00053|connmgr|INFO|ovsdpdkbr0: added service controller "punix:/var/run/openvswitch/ovsdpdkbr0.mgmt"
2017-09-21T10:38:23.364Z|00002|daemon_unix|ERR|fork child died before signaling startup (killed (Segmentation fault), core dumped)
2017-09-21T10:38:23.364Z|00003|daemon_unix|EMER|could not initiate process monitoring

Along with that, the following fault shows up in the kernel log:
[ 8266.735308] vfio-pci 0002:01:01.0: enabling device (0004 -> 0006)
[ 8268.140181] eal-intr-thread[41505]: unhandled level 2 translation fault (11) at 0xffff92e00200, esr 0x92000006, in librte_pmd_thunderx_nicvf.so.17.05[ffff95281000+11000]
[ 8268.140196] CPU: 0 PID: 41505 Comm: eal-intr-thread Not tainted 4.13.0-11-generic #12-Ubuntu
[ 8268.140198] Hardware name: GIGABYTE R120-T33/MT30-GS1, BIOS T45 06/14/2017
[ 8268.140201] task: ffff800aaf4d1e00 task.stack: ffff800aeac18000
[ 8268.140205] PC is at 0xffff95284014
[ 8268.140208] LR is at 0xffff95284d00
[ 8268.140210] pc : [<0000ffff95284014>] lr : [<0000ffff95284d00>] pstate: 20000000
[ 8268.140212] sp : 0000ffff94d1d3b0
[ 8268.140214] x29: 0000ffff94d1d3d0 x28: 0000ffff961044b0
[ 8268.140219] x27: 0000ffff94d1d4c8 x26: 0000000000000001
[ 8268.140224] x25: 0000000000000001 x24: 0000ffff94d1d488
[ 8268.140229] x23: 0000ffff960cd000 x22: 0000ffff96104000
[ 8268.140234] x21: 0000000000000001 x20: 0000ffff92dca980
[ 8268.140239] x19: 0000ffff961044d0 x18: 0000000000000000
[ 8268.140244] x17: 0000ffff95d308f0 x16: 0000ffff960ce5c0
[ 8268.140249] x15: 00003add06000000 x14: 000793bc30000000
[ 8268.140253] x13: 000000000000204c x12: 0000000000000018
[ 8268.140258] x11: 0000000007ce9936 x10: 000000000000204c
[ 8268.140263] x9 : 003b9aca00000000 x8 : 00ffffffffffffff
[ 8268.140268] x7 : 00000000001eb026 x6 : 0000ffff96361000
[ 8268.140273] x5 : 0000ffff94d1ddc0 x4 : 0000ffff96363948
[ 8268.140278] x3 : bb6c0f87d4ebcca4 x2 : 0000000007ce9936
[ 8268.140283] x1 : 0000ffff92e00000 x0 : 0000ffff92e00200

Possibly related are the messages that show up when assigning devices to vfio-pci:
[ 7761.499229] Failed to set up IOMMU for device 0002:01:02.0; retaining platform DMA ops
[ 7762.017296] Failed to set up IOMMU for device 0002:01:01.1; retaining platform DMA ops
[...]

After talking with Jerin I assigned all NICs except the one carrying my ssh connection to vfio-pci (to avoid secondary queue sets not being available). So all of them are bound now; the binding commands are sketched below, followed by the resulting status:
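A minimal sketch of that binding step (assuming the vfio-pci module and the dpdk-devbind tool shipped with the dpdk package; the bind is repeated for every VF except 0002:01:00.1, which carries the ssh session):
$ sudo modprobe vfio-pci
$ sudo dpdk-devbind --bind=vfio-pci 0002:01:01.0    # repeat for each of the other VFs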

$ dpdk-devbind --status
Network devices using DPDK-compatible driver
============================================
0002:01:00.2 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:00.3 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:00.4 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:00.5 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:00.6 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:00.7 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.0 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.1 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.2 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.3 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.4 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.5 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.6 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.7 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.0 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.1 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.2 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.3 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.4 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.5 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.6 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.7 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.0 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.1 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.2 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.3 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.4 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.5 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.6 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=

Network devices using kernel driver
===================================
0000:01:10.0 'THUNDERX BGX (Common Ethernet Interface) a026' if= drv=thunder-BGX unused=vfio-pci
0000:01:10.1 'THUNDERX BGX (Common Ethernet Interface) a026' if= drv=thunder-BGX unused=vfio-pci
0002:01:00.0 'THUNDERX Network Interface Controller a01e' if= drv=thunder-nic unused=vfio-pci
0002:01:00.1 'THUNDERX Network Interface Controller virtual function a034' if=enP2p1s0f1 drv=thunder-nicvf unused=vfio-pci *Active*

And all of them are in different IOMMU groups:
$ find /sys/kernel/iommu_groups/ -type l | sort -n
/sys/kernel/iommu_groups/0/devices/0000:00:01.0
/sys/kernel/iommu_groups/1/devices/0000:00:09.0
/sys/kernel/iommu_groups/10/devices/0000:01:06.1
/sys/kernel/iommu_groups/11/devices/0000:01:06.2
/sys/kernel/iommu_groups/12/devices/0000:01:06.3
/sys/kernel/iommu_groups/13/devices/0000:01:06.4
/sys/kernel/iommu_groups/14/devices/0000:01:06.5
/sys/kernel/iommu_groups/15/devices/0000:01:06.6
/sys/kernel/iommu_groups/16/devices/0000:01:06.7
/sys/kernel/iommu_groups/17/devices/0000:01:07.0
/sys/kernel/iommu_groups/18/devices/0000:01:07.1
/sys/kernel/iommu_groups/19/devices/0000:01:07.2
/sys/kernel/iommu_groups/2/devices/0000:00:10.0
/sys/kernel/iommu_groups/20/devices/0000:01:07.3
/sys/kernel/iommu_groups/21/devices/0000:01:07.4
/sys/kernel/iommu_groups/22/devices/0000:01:07.5
/sys/kernel/iommu_groups/23/devices/0000:01:07.6
/sys/kernel/iommu_groups/24/devices/0000:01:07.7
/sys/kernel/iommu_groups/25/devices/0000:01:09.2
/sys/kernel/iommu_groups/26/devices/0000:01:09.4
/sys/kernel/iommu_groups/27/devices/0000:01:0a.0
/sys/kernel/iommu_groups/28/devices/0000:01:0a.1
/sys/kernel/iommu_groups/29/devices/0000:01:10.0
/sys/kernel/iommu_groups/3/devices/0000:00:11.0
/sys/kernel/iommu_groups/30/devices/0000:01:10.1
/sys/kernel/iommu_groups/31/devices/0000:02:00.0
/sys/kernel/iommu_groups/32/devices/0001:00:08.0
/sys/kernel/iommu_groups/33/devices/0001:00:09.0
/sys/kernel/iommu_groups/34/devices/0001:00:0a.0
/sys/kernel/iommu_groups/35/devices/0001:00:0b.0
/sys/kernel/iommu_groups/36/devices/0002:00:02.0
/sys/kernel/iommu_groups/37/devices/0002:00:03.0
/sys/kernel/iommu_groups/38/devices/0002:01:00.0
/sys/kernel/iommu_groups/39/devices/0004:1f:00.0
/sys/kernel/iommu_groups/39/devices/0004:20:00.0
/sys/kernel/iommu_groups/39/devices/0004:21:00.0
/sys/kernel/iommu_groups/4/devices/0000:00:14.0
/sys/kernel/iommu_groups/40/devices/0002:01:00.1
/sys/kernel/iommu_groups/41/devices/0002:01:00.2
/sys/kernel/iommu_groups/42/devices/0002:01:00.3
/sys/kernel/iommu_groups/43/devices/0002:01:00.4
/sys/kernel/iommu_groups/44/devices/0002:01:00.5
/sys/kernel/iommu_groups/45/devices/0002:01:00.6
/sys/kernel/iommu_groups/46/devices/0002:01:00.7
/sys/kernel/iommu_groups/47/devices/0002:01:01.0
/sys/kernel/iommu_groups/48/devices/0002:01:01.1
/sys/kernel/iommu_groups/49/devices/0002:01:01.2
/sys/kernel/iommu_groups/5/devices/0000:01:00.0
/sys/kernel/iommu_groups/50/devices/0002:01:01.3
/sys/kernel/iommu_groups/51/devices/0002:01:01.4
/sys/kernel/iommu_groups/52/devices/0002:01:01.5
/sys/kernel/iommu_groups/53/devices/0002:01:01.6
/sys/kernel/iommu_groups/54/devices/0002:01:01.7
/sys/kernel/iommu_groups/55/devices/0002:01:02.0
/sys/kernel/iommu_groups/56/devices/0002:01:02.1
/sys/kernel/iommu_groups/57/devices/0002:01:02.2
/sys/kernel/iommu_groups/58/devices/0002:01:02.3
/sys/kernel/iommu_groups/59/devices/0002:01:02.4
/sys/kernel/iommu_groups/6/devices/0000:01:00.1
/sys/kernel/iommu_groups/60/devices/0002:01:02.5
/sys/kernel/iommu_groups/61/devices/0002:01:02.6
/sys/kernel/iommu_groups/62/devices/0002:01:02.7
/sys/kernel/iommu_groups/63/devices/0002:01:03.0
/sys/kernel/iommu_groups/64/devices/0002:01:03.1
/sys/kernel/iommu_groups/65/devices/0002:01:03.2
/sys/kernel/iommu_groups/66/devices/0002:01:03.3
/sys/kernel/iommu_groups/67/devices/0002:01:03.4
/sys/kernel/iommu_groups/68/devices/0002:01:03.5
/sys/kernel/iommu_groups/69/devices/0002:01:03.6
/sys/kernel/iommu_groups/7/devices/0000:01:01.3
/sys/kernel/iommu_groups/70/devices/0000:00:09.1
/sys/kernel/iommu_groups/8/devices/0000:01:01.4
/sys/kernel/iommu_groups/9/devices/0000:01:06.0
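A quicker way to check the group of a single device, using the same sysfs layout:
$ readlink /sys/bus/pci/devices/0002:01:01.0/iommu_group    # ends in .../iommu_groups/47 for the port used above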

Since in the past it was required to whitelist some cards for DPDK I still had that configured, hence the "... --pci-whitelist 0002:01:01.0 ..." in the EAL arguments above.
With that removed the logs are slightly longer, since more devices get probed, but the issue is essentially the same:
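How the whitelist gets into the EAL arguments depends on the OVS version and packaging; on OVS 2.6 and later the usual knob is other_config:dpdk-extra, roughly like this (a sketch only - the exact mechanism used on this box is an assumption):
$ ovs-vsctl set Open_vSwitch . other_config:dpdk-extra="--pci-whitelist 0002:01:01.0"
$ ovs-vsctl remove Open_vSwitch . other_config dpdk-extra    # drop the whitelist again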

Openvswitch:
2017-09-21T10:52:48.200Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2017-09-21T10:52:48.207Z|00002|ovs_numa|INFO|Discovered 48 CPU cores on NUMA node 0
2017-09-21T10:52:48.207Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 48 CPU cores
2017-09-21T10:52:48.207Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2017-09-21T10:52:48.207Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2017-09-21T10:52:48.215Z|00006|dpdk|INFO|DPDK Enabled - initializing...
2017-09-21T10:52:48.215Z|00007|dpdk|INFO|No vhost-sock-dir provided - defaulting to /var/run/openvswitch
2017-09-21T10:52:48.216Z|00008|dpdk|INFO|EAL ARGS: ovs-vswitchd -m 2048 --vhost-owner libvirt-qemu:kvm --vhost-perm 0666 -c 0x00000001
2017-09-21T10:52:48.222Z|00009|dpdk|INFO|EAL: Detected 48 lcore(s)
2017-09-21T10:52:48.223Z|00010|dpdk|INFO|EAL: socket owner specified as libvirt-qemu:kvm (64055:117)
2017-09-21T10:52:48.223Z|00011|dpdk|INFO|EAL: socket perm specified as '0666' from '0666'
2017-09-21T10:52:48.263Z|00012|dpdk|INFO|EAL: Probing VFIO support...
2017-09-21T10:52:48.263Z|00013|dpdk|INFO|EAL: VFIO support initialized
2017-09-21T10:53:00.964Z|00014|dpdk|INFO|EAL: PCI device 0002:01:00.1 on NUMA socket 0
2017-09-21T10:53:00.964Z|00015|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:00.964Z|00016|dpdk|INFO|EAL: PCI device 0002:01:00.2 on NUMA socket 0
2017-09-21T10:53:00.964Z|00017|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:00.965Z|00018|dpdk|INFO|EAL: using IOMMU type 1 (Type 1)
2017-09-21T10:53:01.214Z|00019|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=1 mode=tns-bypass sqs=false loopback_supported=true
2017-09-21T10:53:01.264Z|00020|dpdk|INFO|PMD: nicvf_eth_dev_init(): Port 0 (177d:a034) mac=1c:1b:0d:0d:52:d7
2017-09-21T10:53:01.264Z|00021|dpdk|INFO|EAL: PCI device 0002:01:00.3 on NUMA socket 0
2017-09-21T10:53:01.264Z|00022|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.314Z|00023|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=2 mode=tns-bypass sqs=false loopback_supported=true
2017-09-21T10:53:01.364Z|00024|dpdk|INFO|PMD: nicvf_eth_dev_init(): Port 1 (177d:a034) mac=1c:1b:0d:0d:52:d8
2017-09-21T10:53:01.364Z|00025|dpdk|INFO|EAL: PCI device 0002:01:00.4 on NUMA socket 0
2017-09-21T10:53:01.365Z|00026|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.415Z|00027|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=3 mode=tns-bypass sqs=false loopback_supported=true
2017-09-21T10:53:01.465Z|00028|dpdk|INFO|PMD: nicvf_eth_dev_init(): Port 2 (177d:a034) mac=1c:1b:0d:0d:52:d9
2017-09-21T10:53:01.465Z|00029|dpdk|INFO|EAL: PCI device 0002:01:00.5 on NUMA socket 0
2017-09-21T10:53:01.465Z|00030|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.516Z|00031|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=4 mode=tns-bypass sqs=false loopback_supported=true
2017-09-21T10:53:01.566Z|00032|dpdk|INFO|PMD: nicvf_eth_dev_init(): Port 3 (177d:a034) mac=1c:1b:0d:0d:52:da
2017-09-21T10:53:01.566Z|00033|dpdk|INFO|EAL: PCI device 0002:01:00.6 on NUMA socket 0
2017-09-21T10:53:01.566Z|00034|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.617Z|00035|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=5 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.617Z|00036|dpdk|INFO|EAL: PCI device 0002:01:00.7 on NUMA socket 0
2017-09-21T10:53:01.617Z|00037|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.667Z|00038|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=6 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.667Z|00039|dpdk|INFO|EAL: PCI device 0002:01:01.0 on NUMA socket 0
2017-09-21T10:53:01.667Z|00040|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.718Z|00041|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=7 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.718Z|00042|dpdk|INFO|EAL: PCI device 0002:01:01.1 on NUMA socket 0
2017-09-21T10:53:01.718Z|00043|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.769Z|00044|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=8 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.769Z|00045|dpdk|INFO|EAL: PCI device 0002:01:01.2 on NUMA socket 0
2017-09-21T10:53:01.769Z|00046|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.819Z|00047|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=9 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.819Z|00048|dpdk|INFO|EAL: PCI device 0002:01:01.3 on NUMA socket 0
2017-09-21T10:53:01.819Z|00049|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.870Z|00050|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=10 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.870Z|00051|dpdk|INFO|EAL: PCI device 0002:01:01.4 on NUMA socket 0
2017-09-21T10:53:01.870Z|00052|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.920Z|00053|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=11 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.920Z|00054|dpdk|INFO|EAL: PCI device 0002:01:01.5 on NUMA socket 0
2017-09-21T10:53:01.920Z|00055|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.971Z|00056|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=12 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.971Z|00057|dpdk|INFO|EAL: PCI device 0002:01:01.6 on NUMA socket 0
2017-09-21T10:53:01.971Z|00058|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.022Z|00059|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=13 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.022Z|00060|dpdk|INFO|EAL: PCI device 0002:01:01.7 on NUMA socket 0
2017-09-21T10:53:02.022Z|00061|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.072Z|00062|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=14 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.072Z|00063|dpdk|INFO|EAL: PCI device 0002:01:02.0 on NUMA socket 0
2017-09-21T10:53:02.072Z|00064|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.139Z|00065|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=15 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.139Z|00066|dpdk|INFO|EAL: PCI device 0002:01:02.1 on NUMA socket 0
2017-09-21T10:53:02.139Z|00067|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.189Z|00068|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=16 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.189Z|00069|dpdk|INFO|EAL: PCI device 0002:01:02.2 on NUMA socket 0
2017-09-21T10:53:02.189Z|00070|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.240Z|00071|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=17 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.240Z|00072|dpdk|INFO|EAL: PCI device 0002:01:02.3 on NUMA socket 0
2017-09-21T10:53:02.240Z|00073|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.290Z|00074|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=18 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.290Z|00075|dpdk|INFO|EAL: PCI device 0002:01:02.4 on NUMA socket 0
2017-09-21T10:53:02.290Z|00076|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.341Z|00077|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=19 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.341Z|00078|dpdk|INFO|EAL: PCI device 0002:01:02.5 on NUMA socket 0
2017-09-21T10:53:02.341Z|00079|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.391Z|00080|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=20 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.391Z|00081|dpdk|INFO|EAL: PCI device 0002:01:02.6 on NUMA socket 0
2017-09-21T10:53:02.391Z|00082|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.442Z|00083|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=21 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.442Z|00084|dpdk|INFO|EAL: PCI device 0002:01:02.7 on NUMA socket 0
2017-09-21T10:53:02.442Z|00085|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.493Z|00086|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=22 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.493Z|00087|dpdk|INFO|EAL: PCI device 0002:01:03.0 on NUMA socket 0
2017-09-21T10:53:02.493Z|00088|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.543Z|00089|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=23 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.543Z|00090|dpdk|INFO|EAL: PCI device 0002:01:03.1 on NUMA socket 0
2017-09-21T10:53:02.543Z|00091|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.594Z|00092|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=24 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.594Z|00093|dpdk|INFO|EAL: PCI device 0002:01:03.2 on NUMA socket 0
2017-09-21T10:53:02.594Z|00094|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.644Z|00095|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=25 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.644Z|00096|dpdk|INFO|EAL: PCI device 0002:01:03.3 on NUMA socket 0
2017-09-21T10:53:02.644Z|00097|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.695Z|00098|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=26 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.695Z|00099|dpdk|INFO|EAL: PCI device 0002:01:03.4 on NUMA socket 0
2017-09-21T10:53:02.695Z|00100|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.745Z|00101|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=27 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.745Z|00102|dpdk|INFO|EAL: PCI device 0002:01:03.5 on NUMA socket 0
2017-09-21T10:53:02.745Z|00103|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.796Z|00104|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=28 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.796Z|00105|dpdk|INFO|EAL: PCI device 0002:01:03.6 on NUMA socket 0
2017-09-21T10:53:02.796Z|00106|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.847Z|00107|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=29 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.847Z|00108|dpdk|INFO|DPDK pdump packet capture enabled
2017-09-21T10:53:02.857Z|00109|dpdk|INFO|DPDK Enabled - initialized
2017-09-21T10:53:02.873Z|00110|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
2017-09-21T10:53:02.873Z|00111|ofproto_dpif|INFO|netdev@ovs-netdev: VLAN header stack length probed as 1
2017-09-21T10:53:02.873Z|00112|ofproto_dpif|INFO|netdev@ovs-netdev: MPLS label stack length probed as 3
2017-09-21T10:53:02.873Z|00113|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports truncate action
2017-09-21T10:53:02.874Z|00114|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports unique flow ids
2017-09-21T10:53:02.874Z|00115|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports clone action
2017-09-21T10:53:02.874Z|00116|ofproto_dpif|INFO|netdev@ovs-netdev: Max sample nesting level probed as 10
2017-09-21T10:53:02.874Z|00117|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports eventmask in conntrack action
2017-09-21T10:53:02.874Z|00118|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_state
2017-09-21T10:53:02.874Z|00119|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_zone
2017-09-21T10:53:02.874Z|00120|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_mark
2017-09-21T10:53:02.874Z|00121|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_label
2017-09-21T10:53:02.874Z|00122|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_state_nat
2017-09-21T10:53:02.874Z|00123|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_orig_tuple
2017-09-21T10:53:02.874Z|00124|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_orig_tuple6
2017-09-21T10:53:03.146Z|00125|bridge|INFO|bridge ovsdpdkbr0: added interface ovsdpdkbr0 on port 65534
2017-09-21T10:53:03.150Z|00126|dpdk|INFO|VHOST_CONFIG: vhost-user server: socket created, fd: 152
2017-09-21T10:53:03.150Z|00127|netdev_dpdk|INFO|Socket /var/run/openvswitch/vhost-user-1 created for vhost-user port vhost-user-1
2017-09-21T10:53:03.155Z|00128|dpdk|INFO|VHOST_CONFIG: bind to /var/run/openvswitch/vhost-user-1
2017-09-21T10:53:03.155Z|00129|dpdk|INFO|EAL: Socket /var/run/openvswitch/vhost-user-1 changed permissions to 0666
2017-09-21T10:53:03.155Z|00130|dpdk|INFO|EAL: Socket /var/run/openvswitch/vhost-user-1 changed ownership to 64055:117.
2017-09-21T10:53:03.155Z|00131|netdev_dpdk|WARN|dpdkvhostuser ports are considered deprecated; please migrate to dpdkvhostuserclient ports.
2017-09-21T10:53:03.163Z|00132|dpif_netdev|INFO|PMD thread on numa_id: 0, core id: 13 created.
2017-09-21T10:53:03.163Z|00133|dpif_netdev|INFO|There are 1 pmd threads on numa node 0
2017-09-21T10:53:03.339Z|00134|bridge|INFO|bridge ovsdpdkbr0: added interface vhost-user-1 on port 1
2017-09-21T10:53:03.340Z|00135|dpdk|INFO|EAL: PCI device 0002:01:01.0 on NUMA socket 0
2017-09-21T10:53:03.340Z|00136|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:04.349Z|00137|dpdk|ERR|PMD: nicvf_eth_dev_init(): Failed to get ready message from PF
2017-09-21T10:53:04.349Z|00138|dpdk|INFO|EAL: Releasing pci mapped resource for 0002:01:01.0
2017-09-21T10:53:04.349Z|00139|dpdk|INFO|EAL: Calling pci_unmap_resource for 0002:01:01.0 at 0xffff8be00000
2017-09-21T10:53:04.349Z|00140|dpdk|INFO|EAL: Calling pci_unmap_resource for 0002:01:01.0 at 0xffff8bc00000
2017-09-21T10:53:04.349Z|00141|dpdk|WARN|EAL: Requested device 0002:01:01.0 cannot be used
2017-09-21T10:53:04.349Z|00142|dpdk|ERR|EAL: Driver cannot attach the device (0002:01:01.0)
2017-09-21T10:53:04.349Z|00143|netdev_dpdk|WARN|Error attaching device '0002:01:01.0' to DPDK
2017-09-21T10:53:04.349Z|00144|netdev|WARN|dpdk0: could not set configuration (Invalid argument)
2017-09-21T10:53:04.349Z|00145|bridge|INFO|bridge ovsdpdkbr0: using datapath ID 0000e699e66bd34d
2017-09-21T10:53:04.350Z|00146|connmgr|INFO|ovsdpdkbr0: added service controller "punix:/var/run/openvswitch/ovsdpdkbr0.mgmt"
2017-09-21T10:53:07.500Z|00002|daemon_unix|ERR|fork child died before signaling startup (killed (Segmentation fault), core dumped)
2017-09-21T10:53:07.500Z|00003|daemon_unix|EMER|could not initiate process monitoring

Dmesg:
[ 9148.473072] vfio-pci 0002:01:00.2: enabling device (0004 -> 0006)
[ 9148.574273] vfio-pci 0002:01:00.3: enabling device (0004 -> 0006)
[ 9148.674919] vfio-pci 0002:01:00.4: enabling device (0004 -> 0006)
[ 9148.775782] vfio-pci 0002:01:00.5: enabling device (0004 -> 0006)
[ 9148.876680] vfio-pci 0002:01:00.6: enabling device (0004 -> 0006)
[ 9148.927253] vfio-pci 0002:01:00.7: enabling device (0004 -> 0006)
[ 9148.977817] vfio-pci 0002:01:01.0: enabling device (0004 -> 0006)
[ 9149.028613] vfio-pci 0002:01:01.1: enabling device (0004 -> 0006)
[ 9149.079179] vfio-pci 0002:01:01.2: enabling device (0004 -> 0006)
[ 9149.129748] vfio-pci 0002:01:01.3: enabling device (0004 -> 0006)
[ 9149.180317] vfio-pci 0002:01:01.4: enabling device (0004 -> 0006)
[ 9149.230885] vfio-pci 0002:01:01.5: enabling device (0004 -> 0006)
[ 9149.281482] vfio-pci 0002:01:01.6: enabling device (0004 -> 0006)
[ 9149.332049] vfio-pci 0002:01:01.7: enabling device (0004 -> 0006)
[ 9149.398419] vfio-pci 0002:01:02.0: enabling device (0004 -> 0006)
[ 9149.448986] vfio-pci 0002:01:02.1: enabling device (0004 -> 0006)
[ 9149.499555] vfio-pci 0002:01:02.2: enabling device (0004 -> 0006)
[ 9149.550134] vfio-pci 0002:01:02.3: enabling device (0004 -> 0006)
[ 9149.600727] vfio-pci 0002:01:02.4: enabling device (0004 -> 0006)
[ 9149.651296] vfio-pci 0002:01:02.5: enabling device (0004 -> 0006)
[ 9149.701863] vfio-pci 0002:01:02.6: enabling device (0004 -> 0006)
[ 9149.752432] vfio-pci 0002:01:02.7: enabling device (0004 -> 0006)
[ 9149.803001] vfio-pci 0002:01:03.0: enabling device (0004 -> 0006)
[ 9149.853567] vfio-pci 0002:01:03.1: enabling device (0004 -> 0006)
[ 9149.904159] vfio-pci 0002:01:03.2: enabling device (0004 -> 0006)
[ 9149.954725] vfio-pci 0002:01:03.3: enabling device (0004 -> 0006)
[ 9150.005295] vfio-pci 0002:01:03.4: enabling device (0004 -> 0006)
[ 9150.055868] vfio-pci 0002:01:03.5: enabling device (0004 -> 0006)
[ 9150.106453] vfio-pci 0002:01:03.6: enabling device (0004 -> 0006)
[ 9151.681855] eal-intr-thread[42086]: unhandled level 2 translation fault (11) at 0xffff8be00200, esr 0x92000006, in librte_pmd_thunderx_nicvf.so.17.05[ffff8ea1d000+11000]
[ 9151.681870] CPU: 0 PID: 42086 Comm: eal-intr-thread Not tainted 4.13.0-11-generic #12-Ubuntu
[ 9151.681872] Hardware name: GIGABYTE R120-T33/MT30-GS1, BIOS T45 06/14/2017
[ 9151.681874] task: ffff800aefc15a00 task.stack: ffff800aeceec000
[ 9151.681879] PC is at 0xffff8ea20014
[ 9151.681881] LR is at 0xffff8ea20d00
[ 9151.681884] pc : [<0000ffff8ea20014>] lr : [<0000ffff8ea20d00>] pstate: 20000000
[ 9151.681885] sp : 0000ffff8e4b93b0
[ 9151.681887] x29: 0000ffff8e4b93d0 x28: 0000ffff8f8a04b0
[ 9151.681893] x27: 0000ffff8e4b94c8 x26: 0000000000000001
[ 9151.681898] x25: 0000000000000001 x24: 0000ffff8e4b9488
[ 9151.681902] x23: 0000ffff8f869000 x22: 0000ffff8f8a0000
[ 9151.681907] x21: 0000000000000001 x20: 0000ffff8c5a7800
[ 9151.681912] x19: 0000ffff8f8a04d0 x18: 0000000000000000
[ 9151.681917] x17: 0000ffff8f4cc8f0 x16: 0000ffff8f86a5c0
[ 9151.681922] x15: 00002d9ad8000000 x14: 0028008430000000
[ 9151.681927] x13: 00000000000023bf x12: 0000000000000018
[ 9151.681932] x11: 00000000282e1f08 x10: 00000000000023bf
[ 9151.681937] x9 : 003b9aca00000000 x8 : 00ffffffffffffff
[ 9151.681941] x7 : 0000000000208616 x6 : 0000ffff8fafd000
[ 9151.681946] x5 : 0000ffff8e4b9dc0 x4 : 0000ffff8faff948
[ 9151.681951] x3 : ea6fa8ce98f91d98 x2 : 00000000282e1f08
[ 9151.681956] x1 : 0000ffff8be00000 x0 : 0000ffff8be00200

In general Openvswitch-DPDK does not segfault - e.g. if I take vfio-pci away from the device I want to initialize, then it works.
Also, the same build of openvswitch and dpdk passed the x86 regression checks just last week, so for now I doubt this is a generic bug.
So to me it really seems to be tied to the initialization of that ThunderX port.

Revision history for this message
dann frazier (dannf) wrote :

Just as a test, could you try the kernel from ppa:yarmouth-team/next? There is a known IOMMU-related issue (LP: #1718734) that would be good to rule out.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Upgraded from 4.13.0-11.12 to 4.13.0-11.12+yarmouth2.1

Nothing changed:
- on initial card probing with kernel driver "Failed to set up IOMMU" messages
- on vfio assign again "Failed to set up IOMMU" messages
- on usage of the PMD by Openvswitch Kernel fault
- on usage of the PMD by Openvswitch OVS crash

[ 1322.211137] device ovsdpdkbr0 entered promiscuous mode
[ 1323.420792] eal-intr-thread[3188]: unhandled level 2 translation fault (11) at 0xfffc17400200, esr 0x92000006, in librte_pmd_thunderx_nicvf.so.17.05[ffffaf0a8000+11000]
[ 1323.424214] CPU: 0 PID: 3188 Comm: eal-intr-thread Not tainted 4.13.0-11-generic #12+yarmouth2.1-Ubuntu
[ 1323.424217] Hardware name: GIGABYTE R120-T33/MT30-GS1, BIOS T45 06/14/2017
[ 1323.424219] task: ffff800fb0692d00 task.stack: ffff800fab60c000
[...]

Since we had issues with secondary queue sets in the past, I picked one of the few that came up as SQS=true and used that as the dpdk port:
$ ovs-vsctl del-port dpdk0
$ ovs-vsctl add-port ovsdpdkbr0 dpdk0 -- set Interface dpdk0 type=dpdk "options:dpdk-devargs=0002:01:00.5"
# To be able to do this with a crashing OVS you need to unbind all devices from vfio-pci so they can not initialize and crash.
# Then, after the config change, bind all of them to vfio-pci again and start OVS.
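Spelled out, that rebind dance looks roughly like this (a sketch; thunder-nicvf is the kernel driver from the devbind output above, and openvswitch-switch is the service name assumed from the Ubuntu packaging):
$ sudo dpdk-devbind --bind=thunder-nicvf 0002:01:01.0    # repeat for every VF on vfio-pci, so OVS can start without crashing
# <now run the two ovs-vsctl commands above>
$ sudo dpdk-devbind --bind=vfio-pci 0002:01:01.0         # rebind every VF to vfio-pci again
$ sudo systemctl restart openvswitch-switch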

It scans all devices but initializes only the one I passed as the dpdk device.
In this case everything works, which confirms that this is an issue with the secondary queues.

P.S. this also breaks our assumptions about the box for our testing - we need to be able to use more ports.
The only current SQS=true are 4 VFs of port 0, but we need the other ports as well.
But I expect that getting those to initialize correctly will need a proper fix.

Revision history for this message
Jan Glauber (jan-glauber-i) wrote :

Comments from Jerin Jacob:

Secondary queue set (>8 queues per port) won't work with the upstream kernel.
Can you please test with <=8 queues per port?

If more queues are needed, use XFI over XAUI to make it 10G * 4 instead of 1 * 40G.
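In OVS terms, staying within that limit means capping the rx queues on the DPDK port, for example (a sketch, assuming the standard n_rxq interface option):
$ ovs-vsctl set Interface dpdk0 options:n_rxq=8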

Revision history for this message
Antonio Rosales (arosales) wrote :

@Christian,

Hello, and thanks for the help on this bug. Were you able to retest the secondary queue set case (>8 queues per port), or do you have a target date to do so? [see comment #3]

-thanks,
Antonio

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks Antonio for the ping; this update got lost during my PTO :-/

But most important - hi Jan!!!
Another case of the small world effect :-)
Feel free to catch "cpaelzer" on IRC all around on Ubuntu/virtualization/dpdk channels.

About the bug:
After sorting out primary and secondary queues and their assignment to the devices (not 100% sure about it, but it works better now) and upgrading to the most recent kernels, I no longer get the translation fault but a saner error message instead.

With that I could use up to 7 rx queues; if I use more, I get an error like this:
  dpdk|EMER|Cannot assign sufficient number of secondary queues to primary VF3

And since none of us can change this HW limitation, that is not a bug - just unfortunate and a bit misleading.

While I'd appreciate a better way to see which of all the VFs to use other than trial and error (why on earth is it 0002:01:00.[45] out of 0002:0[0-3]:00.[0-7]?), I think the bug itself is resolved by understanding more of these limitations.

Therefore setting bug status to invalid.

Changed in dpdk (Ubuntu):
status: New → Invalid
Revision history for this message
winson.lin (winson.lin) wrote :

Hi ALL,

Additional information: the current BIOS version is T48.

http://b2b.gigabyte.com/ARM-Server/R270-T61-rev-100#support-dl

BR, Winson
