Crash on Cavium ThunderX when using Openvswitch-DPDK.
Please ignore the "failed to bind to /var/run/openvswitch/vhost-user-1" error; it only appears because openvswitch keeps restarting after each crash.
The real issue seems to be around:
2017-09-21T10:38:20.785Z|00044|dpdk|ERR|PMD: nicvf_eth_dev_init(): Failed to get ready message from PF
Then, on "the way out" of this failure, it even crashes:
2017-09-21T10:38:20.786Z|00049|dpdk|ERR|EAL: Driver cannot attach the device (0002:01:01.0)
2017-09-21T10:38:20.786Z|00050|netdev_dpdk|WARN|Error attaching device '0002:01:01.0' to DPDK
[...]
2017-09-21T10:38:23.364Z|00002|daemon_unix|ERR|fork child died before signaling startup (killed (Segmentation fault), core dumped)
The kernel log shows:
[ 8268.140181] eal-intr-thread[41505]: unhandled level 2 translation fault (11) at 0xffff92e00200, esr 0x92000006, in librte_pmd_thunderx_nicvf.so.17.05[ffff95281000+11000]
Openvswitch Log:
The log is from starting an OVS instance that has a ThunderX port configured:
2017-09-21T10:38:06.413Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2017-09-21T10:38:06.424Z|00002|ovs_numa|INFO|Discovered 48 CPU cores on NUMA node 0
2017-09-21T10:38:06.424Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 48 CPU cores
2017-09-21T10:38:06.424Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2017-09-21T10:38:06.424Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2017-09-21T10:38:06.433Z|00006|dpdk|INFO|DPDK Enabled - initializing...
2017-09-21T10:38:06.433Z|00007|dpdk|INFO|No vhost-sock-dir provided - defaulting to /var/run/openvswitch
2017-09-21T10:38:06.433Z|00008|dpdk|INFO|EAL ARGS: ovs-vswitchd -m 2048 --pci-whitelist 0002:01:01.0 --vhost-owner libvirt-qemu:kvm --vhost-perm 0666 -c 0x00000001
2017-09-21T10:38:06.440Z|00009|dpdk|INFO|EAL: Detected 48 lcore(s)
2017-09-21T10:38:06.442Z|00010|dpdk|INFO|EAL: socket owner specified as libvirt-qemu:kvm (64055:117)
2017-09-21T10:38:06.442Z|00011|dpdk|INFO|EAL: socket perm specified as '0666' from '0666'
2017-09-21T10:38:06.481Z|00012|dpdk|INFO|EAL: Probing VFIO support...
2017-09-21T10:38:06.481Z|00013|dpdk|INFO|EAL: VFIO support initialized
2017-09-21T10:38:19.208Z|00014|dpdk|INFO|EAL: PCI device 0002:01:01.0 on NUMA socket 0
2017-09-21T10:38:19.208Z|00015|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:38:19.208Z|00016|dpdk|INFO|EAL: using IOMMU type 1 (Type 1)
2017-09-21T10:38:19.458Z|00017|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=7 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:38:19.458Z|00018|dpdk|INFO|DPDK pdump packet capture enabled
2017-09-21T10:38:19.468Z|00019|dpdk|INFO|DPDK Enabled - initialized
2017-09-21T10:38:19.484Z|00020|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
2017-09-21T10:38:19.484Z|00021|ofproto_dpif|INFO|netdev@ovs-netdev: VLAN header stack length probed as 1
2017-09-21T10:38:19.484Z|00022|ofproto_dpif|INFO|netdev@ovs-netdev: MPLS label stack length probed as 3
2017-09-21T10:38:19.484Z|00023|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports truncate action
2017-09-21T10:38:19.484Z|00024|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports unique flow ids
2017-09-21T10:38:19.484Z|00025|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports clone action
2017-09-21T10:38:19.484Z|00026|ofproto_dpif|INFO|netdev@ovs-netdev: Max sample nesting level probed as 10
2017-09-21T10:38:19.484Z|00027|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports eventmask in conntrack action
2017-09-21T10:38:19.484Z|00028|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_state
2017-09-21T10:38:19.485Z|00029|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_zone
2017-09-21T10:38:19.485Z|00030|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_mark
2017-09-21T10:38:19.485Z|00031|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_label
2017-09-21T10:38:19.485Z|00032|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_state_nat
2017-09-21T10:38:19.485Z|00033|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_orig_tuple
2017-09-21T10:38:19.485Z|00034|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_orig_tuple6
2017-09-21T10:38:19.768Z|00035|bridge|INFO|bridge ovsdpdkbr0: added interface ovsdpdkbr0 on port 65534
2017-09-21T10:38:19.773Z|00036|dpdk|INFO|VHOST_CONFIG: vhost-user server: socket created, fd: 68
2017-09-21T10:38:19.773Z|00037|netdev_dpdk|INFO|Socket /var/run/openvswitch/vhost-user-1 created for vhost-user port vhost-user-1
2017-09-21T10:38:19.778Z|00038|dpdk|ERR|VHOST_CONFIG: failed to bind to /var/run/openvswitch/vhost-user-1: Address already in use; remove it and try again
2017-09-21T10:38:19.778Z|00039|netdev_dpdk|ERR|rte_vhost_driver_start failed for vhost user port: vhost-user-1
2017-09-21T10:38:19.778Z|00040|netdev_dpdk|WARN|dpdkvhostuser ports are considered deprecated; please migrate to dpdkvhostuserclient ports.
2017-09-21T10:38:19.778Z|00041|bridge|WARN|could not open network device vhost-user-1 (Unknown error -1)
2017-09-21T10:38:19.779Z|00042|dpdk|INFO|EAL: PCI device 0002:01:01.0 on NUMA socket 0
2017-09-21T10:38:19.779Z|00043|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:38:20.785Z|00044|dpdk|ERR|PMD: nicvf_eth_dev_init(): Failed to get ready message from PF
2017-09-21T10:38:20.785Z|00045|dpdk|INFO|EAL: Releasing pci mapped resource for 0002:01:01.0
2017-09-21T10:38:20.785Z|00046|dpdk|INFO|EAL: Calling pci_unmap_resource for 0002:01:01.0 at 0xffff92e00000
2017-09-21T10:38:20.785Z|00047|dpdk|INFO|EAL: Calling pci_unmap_resource for 0002:01:01.0 at 0xffff93000000
2017-09-21T10:38:20.785Z|00048|dpdk|WARN|EAL: Requested device 0002:01:01.0 cannot be used
2017-09-21T10:38:20.786Z|00049|dpdk|ERR|EAL: Driver cannot attach the device (0002:01:01.0)
2017-09-21T10:38:20.786Z|00050|netdev_dpdk|WARN|Error attaching device '0002:01:01.0' to DPDK
2017-09-21T10:38:20.786Z|00051|netdev|WARN|dpdk0: could not set configuration (Invalid argument)
2017-09-21T10:38:20.786Z|00052|bridge|INFO|bridge ovsdpdkbr0: using datapath ID 0000e699e66bd34d
2017-09-21T10:38:20.786Z|00053|connmgr|INFO|ovsdpdkbr0: added service controller "punix:/var/run/openvswitch/ovsdpdkbr0.mgmt"
2017-09-21T10:38:23.364Z|00002|daemon_unix|ERR|fork child died before signaling startup (killed (Segmentation fault), core dumped)
2017-09-21T10:38:23.364Z|00003|daemon_unix|EMER|could not initiate process monitoring
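To make the timing in the log above easier to see, here is a small sketch that computes the gaps between the second probe of 0002:01:01.0, the "Failed to get ready message from PF" error, and the segfault. The three lines are copied from the log; the helper names `log_time` and `gap_ms` are made up for this illustration. A gap of roughly one second before the error would be consistent with the PMD waiting for the PF and giving up, but the log itself does not confirm that this is a timeout.

```python
from datetime import datetime, timedelta

# Lines copied verbatim from the ovs-vswitchd log above.
PROBE = "2017-09-21T10:38:19.779Z|00043|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx"
FAIL = ("2017-09-21T10:38:20.785Z|00044|dpdk|ERR|PMD: nicvf_eth_dev_init(): "
        "Failed to get ready message from PF")
CRASH = ("2017-09-21T10:38:23.364Z|00002|daemon_unix|ERR|fork child died before "
         "signaling startup (killed (Segmentation fault), core dumped)")


def log_time(line: str) -> datetime:
    """Parse the leading timestamp of an ovs-vswitchd log line."""
    stamp = line.split("|", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def gap_ms(earlier: str, later: str) -> int:
    """Whole milliseconds between two log lines."""
    return (log_time(later) - log_time(earlier)) // timedelta(milliseconds=1)


print("probe -> PF error:", gap_ms(PROBE, FAIL), "ms")
print("PF error -> child died:", gap_ms(FAIL, CRASH), "ms")
```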
Alongside that, the kernel log reports a fault:
[ 8266.735308] vfio-pci 0002:01:01.0: enabling device (0004 -> 0006)
[ 8268.140181] eal-intr-thread[41505]: unhandled level 2 translation fault (11) at 0xffff92e00200, esr 0x92000006, in librte_pmd_thunderx_nicvf.so.17.05[ffff95281000+11000]
[ 8268.140196] CPU: 0 PID: 41505 Comm: eal-intr-thread Not tainted 4.13.0-11-generic #12-Ubuntu
[ 8268.140198] Hardware name: GIGABYTE R120-T33/MT30-GS1, BIOS T45 06/14/2017
[ 8268.140201] task: ffff800aaf4d1e00 task.stack: ffff800aeac18000
[ 8268.140205] PC is at 0xffff95284014
[ 8268.140208] LR is at 0xffff95284d00
[ 8268.140210] pc : [<0000ffff95284014>] lr : [<0000ffff95284d00>] pstate: 20000000
[ 8268.140212] sp : 0000ffff94d1d3b0
[ 8268.140214] x29: 0000ffff94d1d3d0 x28: 0000ffff961044b0
[ 8268.140219] x27: 0000ffff94d1d4c8 x26: 0000000000000001
[ 8268.140224] x25: 0000000000000001 x24: 0000ffff94d1d488
[ 8268.140229] x23: 0000ffff960cd000 x22: 0000ffff96104000
[ 8268.140234] x21: 0000000000000001 x20: 0000ffff92dca980
[ 8268.140239] x19: 0000ffff961044d0 x18: 0000000000000000
[ 8268.140244] x17: 0000ffff95d308f0 x16: 0000ffff960ce5c0
[ 8268.140249] x15: 00003add06000000 x14: 000793bc30000000
[ 8268.140253] x13: 000000000000204c x12: 0000000000000018
[ 8268.140258] x11: 0000000007ce9936 x10: 000000000000204c
[ 8268.140263] x9 : 003b9aca00000000 x8 : 00ffffffffffffff
[ 8268.140268] x7 : 00000000001eb026 x6 : 0000ffff96361000
[ 8268.140273] x5 : 0000ffff94d1ddc0 x4 : 0000ffff96363948
[ 8268.140278] x3 : bb6c0f87d4ebcca4 x2 : 0000000007ce9936
[ 8268.140283] x1 : 0000ffff92e00000 x0 : 0000ffff92e00200
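To help locate the faulting instruction, the PC and LR values from the dump above can be turned into offsets inside librte_pmd_thunderx_nicvf.so.17.05. The sketch below does that arithmetic on lines copied from the dump. The helper names are made up for this illustration, and it assumes the mapping printed in brackets starts at file offset 0 of the library, which is typical for the text segment but is not shown in the log.

```python
import re

# Lines copied from the kernel log above.
FAULT = ("[ 8268.140181] eal-intr-thread[41505]: unhandled level 2 translation "
         "fault (11) at 0xffff92e00200, esr 0x92000006, in "
         "librte_pmd_thunderx_nicvf.so.17.05[ffff95281000+11000]")
PC_LINE = "[ 8268.140205] PC is at 0xffff95284014"
LR_LINE = "[ 8268.140208] LR is at 0xffff95284d00"


def library_mapping(fault_line):
    """Return (name, base, size) from the 'in <lib>[base+size]' suffix."""
    m = re.search(r"in (\S+)\[([0-9a-f]+)\+([0-9a-f]+)\]", fault_line)
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


def address_of(line):
    """Return the address from a 'PC is at 0x...' or 'LR is at 0x...' line."""
    return int(re.search(r"is at (0x[0-9a-f]+)", line).group(1), 16)


name, base, size = library_mapping(FAULT)
for label, line in (("PC", PC_LINE), ("LR", LR_LINE)):
    offset = address_of(line) - base
    # Both addresses should fall inside the mapping the kernel reported.
    assert 0 <= offset < size
    print(f"{label}: {name}+{offset:#x}")
```

With debug symbols for the PMD installed, offsets like these could be passed to a tool such as addr2line to get a function and source line.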
Possibly related: these messages show up when assigning devices to vfio-pci:
[ 7761.499229] Failed to set up IOMMU for device 0002:01:02.0; retaining platform DMA ops
[ 7762.017296] Failed to set up IOMMU for device 0002:01:01.1; retaining platform DMA ops
[...]
After talking with Jerin, I assigned everything except the card carrying my ssh connection to vfio-pci (to avoid secondary queues being unavailable):
$ dpdk-devbind --status
Network devices using DPDK-compatible driver
============================================
0002:01:00.2 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:00.3 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:00.4 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:00.5 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:00.6 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:00.7 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.0 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.1 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.2 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.3 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.4 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.5 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.6 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:01.7 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.0 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.1 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.2 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.3 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.4 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.5 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.6 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:02.7 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.0 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.1 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.2 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.3 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.4 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.5 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
0002:01:03.6 'THUNDERX Network Interface Controller virtual function a034' drv=vfio-pci unused=
Network devices using kernel driver
===================================
0000:01:10.0 'THUNDERX BGX (Common Ethernet Interface) a026' if= drv=thunder-BGX unused=vfio-pci
0000:01:10.1 'THUNDERX BGX (Common Ethernet Interface) a026' if= drv=thunder-BGX unused=vfio-pci
0002:01:00.0 'THUNDERX Network Interface Controller a01e' if= drv=thunder-nic unused=vfio-pci
0002:01:00.1 'THUNDERX Network Interface Controller virtual function a034' if=enP2p1s0f1 drv=thunder-nicvf unused=vfio-pci *Active*
And each of them is in its own IOMMU group:
$ find /sys/kernel/iommu_groups/ -type l | sort -n
/sys/kernel/iommu_groups/0/devices/0000:00:01.0
/sys/kernel/iommu_groups/1/devices/0000:00:09.0
/sys/kernel/iommu_groups/10/devices/0000:01:06.1
/sys/kernel/iommu_groups/11/devices/0000:01:06.2
/sys/kernel/iommu_groups/12/devices/0000:01:06.3
/sys/kernel/iommu_groups/13/devices/0000:01:06.4
/sys/kernel/iommu_groups/14/devices/0000:01:06.5
/sys/kernel/iommu_groups/15/devices/0000:01:06.6
/sys/kernel/iommu_groups/16/devices/0000:01:06.7
/sys/kernel/iommu_groups/17/devices/0000:01:07.0
/sys/kernel/iommu_groups/18/devices/0000:01:07.1
/sys/kernel/iommu_groups/19/devices/0000:01:07.2
/sys/kernel/iommu_groups/2/devices/0000:00:10.0
/sys/kernel/iommu_groups/20/devices/0000:01:07.3
/sys/kernel/iommu_groups/21/devices/0000:01:07.4
/sys/kernel/iommu_groups/22/devices/0000:01:07.5
/sys/kernel/iommu_groups/23/devices/0000:01:07.6
/sys/kernel/iommu_groups/24/devices/0000:01:07.7
/sys/kernel/iommu_groups/25/devices/0000:01:09.2
/sys/kernel/iommu_groups/26/devices/0000:01:09.4
/sys/kernel/iommu_groups/27/devices/0000:01:0a.0
/sys/kernel/iommu_groups/28/devices/0000:01:0a.1
/sys/kernel/iommu_groups/29/devices/0000:01:10.0
/sys/kernel/iommu_groups/3/devices/0000:00:11.0
/sys/kernel/iommu_groups/30/devices/0000:01:10.1
/sys/kernel/iommu_groups/31/devices/0000:02:00.0
/sys/kernel/iommu_groups/32/devices/0001:00:08.0
/sys/kernel/iommu_groups/33/devices/0001:00:09.0
/sys/kernel/iommu_groups/34/devices/0001:00:0a.0
/sys/kernel/iommu_groups/35/devices/0001:00:0b.0
/sys/kernel/iommu_groups/36/devices/0002:00:02.0
/sys/kernel/iommu_groups/37/devices/0002:00:03.0
/sys/kernel/iommu_groups/38/devices/0002:01:00.0
/sys/kernel/iommu_groups/39/devices/0004:1f:00.0
/sys/kernel/iommu_groups/39/devices/0004:20:00.0
/sys/kernel/iommu_groups/39/devices/0004:21:00.0
/sys/kernel/iommu_groups/4/devices/0000:00:14.0
/sys/kernel/iommu_groups/40/devices/0002:01:00.1
/sys/kernel/iommu_groups/41/devices/0002:01:00.2
/sys/kernel/iommu_groups/42/devices/0002:01:00.3
/sys/kernel/iommu_groups/43/devices/0002:01:00.4
/sys/kernel/iommu_groups/44/devices/0002:01:00.5
/sys/kernel/iommu_groups/45/devices/0002:01:00.6
/sys/kernel/iommu_groups/46/devices/0002:01:00.7
/sys/kernel/iommu_groups/47/devices/0002:01:01.0
/sys/kernel/iommu_groups/48/devices/0002:01:01.1
/sys/kernel/iommu_groups/49/devices/0002:01:01.2
/sys/kernel/iommu_groups/5/devices/0000:01:00.0
/sys/kernel/iommu_groups/50/devices/0002:01:01.3
/sys/kernel/iommu_groups/51/devices/0002:01:01.4
/sys/kernel/iommu_groups/52/devices/0002:01:01.5
/sys/kernel/iommu_groups/53/devices/0002:01:01.6
/sys/kernel/iommu_groups/54/devices/0002:01:01.7
/sys/kernel/iommu_groups/55/devices/0002:01:02.0
/sys/kernel/iommu_groups/56/devices/0002:01:02.1
/sys/kernel/iommu_groups/57/devices/0002:01:02.2
/sys/kernel/iommu_groups/58/devices/0002:01:02.3
/sys/kernel/iommu_groups/59/devices/0002:01:02.4
/sys/kernel/iommu_groups/6/devices/0000:01:00.1
/sys/kernel/iommu_groups/60/devices/0002:01:02.5
/sys/kernel/iommu_groups/61/devices/0002:01:02.6
/sys/kernel/iommu_groups/62/devices/0002:01:02.7
/sys/kernel/iommu_groups/63/devices/0002:01:03.0
/sys/kernel/iommu_groups/64/devices/0002:01:03.1
/sys/kernel/iommu_groups/65/devices/0002:01:03.2
/sys/kernel/iommu_groups/66/devices/0002:01:03.3
/sys/kernel/iommu_groups/67/devices/0002:01:03.4
/sys/kernel/iommu_groups/68/devices/0002:01:03.5
/sys/kernel/iommu_groups/69/devices/0002:01:03.6
/sys/kernel/iommu_groups/7/devices/0000:01:01.3
/sys/kernel/iommu_groups/70/devices/0000:00:09.1
/sys/kernel/iommu_groups/8/devices/0000:01:01.4
/sys/kernel/iommu_groups/9/devices/0000:01:06.0
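VFIO hands out devices per IOMMU group, so what matters is whether any device to be passed to DPDK shares a group with another device. The sketch below groups a few paths copied from the listing above and reports the groups holding more than one device. The function names are made up for this illustration, and only a subset of the listing is embedded to keep it short. In the full listing, group 39 is the only shared group, and it contains none of the 0002:01:* VFs.

```python
from collections import defaultdict

# A subset of the paths from the listing above.
PATHS = """\
/sys/kernel/iommu_groups/38/devices/0002:01:00.0
/sys/kernel/iommu_groups/39/devices/0004:1f:00.0
/sys/kernel/iommu_groups/39/devices/0004:20:00.0
/sys/kernel/iommu_groups/39/devices/0004:21:00.0
/sys/kernel/iommu_groups/40/devices/0002:01:00.1
/sys/kernel/iommu_groups/47/devices/0002:01:01.0
/sys/kernel/iommu_groups/48/devices/0002:01:01.1
""".splitlines()


def group_devices(paths):
    """Map IOMMU group number -> list of PCI addresses in that group."""
    groups = defaultdict(list)
    for path in paths:
        # ['', 'sys', 'kernel', 'iommu_groups', '<group>', 'devices', '<pci>']
        parts = path.strip().split("/")
        groups[int(parts[4])].append(parts[6])
    return dict(groups)


def shared_groups(groups):
    """Return only the groups that contain more than one device."""
    return {g: devs for g, devs in groups.items() if len(devs) > 1}


groups = group_devices(PATHS)
print("shared groups:", shared_groups(groups))
print("group of 0002:01:01.0:", groups[47])
```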
Since whitelisting cards in DPDK used to be required, I still had that option set, as seen in the EAL args above: ... --pci-whitelist 0002:01:01.0 ...
With that removed, the logs are slightly longer since it can probe more devices, but the issue is essentially the same:
Openvswitch:
2017-09-21T10:52:48.200Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2017-09-21T10:52:48.207Z|00002|ovs_numa|INFO|Discovered 48 CPU cores on NUMA node 0
2017-09-21T10:52:48.207Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 48 CPU cores
2017-09-21T10:52:48.207Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2017-09-21T10:52:48.207Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2017-09-21T10:52:48.215Z|00006|dpdk|INFO|DPDK Enabled - initializing...
2017-09-21T10:52:48.215Z|00007|dpdk|INFO|No vhost-sock-dir provided - defaulting to /var/run/openvswitch
2017-09-21T10:52:48.216Z|00008|dpdk|INFO|EAL ARGS: ovs-vswitchd -m 2048 --vhost-owner libvirt-qemu:kvm --vhost-perm 0666 -c 0x00000001
2017-09-21T10:52:48.222Z|00009|dpdk|INFO|EAL: Detected 48 lcore(s)
2017-09-21T10:52:48.223Z|00010|dpdk|INFO|EAL: socket owner specified as libvirt-qemu:kvm (64055:117)
2017-09-21T10:52:48.223Z|00011|dpdk|INFO|EAL: socket perm specified as '0666' from '0666'
2017-09-21T10:52:48.263Z|00012|dpdk|INFO|EAL: Probing VFIO support...
2017-09-21T10:52:48.263Z|00013|dpdk|INFO|EAL: VFIO support initialized
2017-09-21T10:53:00.964Z|00014|dpdk|INFO|EAL: PCI device 0002:01:00.1 on NUMA socket 0
2017-09-21T10:53:00.964Z|00015|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:00.964Z|00016|dpdk|INFO|EAL: PCI device 0002:01:00.2 on NUMA socket 0
2017-09-21T10:53:00.964Z|00017|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:00.965Z|00018|dpdk|INFO|EAL: using IOMMU type 1 (Type 1)
2017-09-21T10:53:01.214Z|00019|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=1 mode=tns-bypass sqs=false loopback_supported=true
2017-09-21T10:53:01.264Z|00020|dpdk|INFO|PMD: nicvf_eth_dev_init(): Port 0 (177d:a034) mac=1c:1b:0d:0d:52:d7
2017-09-21T10:53:01.264Z|00021|dpdk|INFO|EAL: PCI device 0002:01:00.3 on NUMA socket 0
2017-09-21T10:53:01.264Z|00022|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.314Z|00023|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=2 mode=tns-bypass sqs=false loopback_supported=true
2017-09-21T10:53:01.364Z|00024|dpdk|INFO|PMD: nicvf_eth_dev_init(): Port 1 (177d:a034) mac=1c:1b:0d:0d:52:d8
2017-09-21T10:53:01.364Z|00025|dpdk|INFO|EAL: PCI device 0002:01:00.4 on NUMA socket 0
2017-09-21T10:53:01.365Z|00026|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.415Z|00027|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=3 mode=tns-bypass sqs=false loopback_supported=true
2017-09-21T10:53:01.465Z|00028|dpdk|INFO|PMD: nicvf_eth_dev_init(): Port 2 (177d:a034) mac=1c:1b:0d:0d:52:d9
2017-09-21T10:53:01.465Z|00029|dpdk|INFO|EAL: PCI device 0002:01:00.5 on NUMA socket 0
2017-09-21T10:53:01.465Z|00030|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.516Z|00031|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=4 mode=tns-bypass sqs=false loopback_supported=true
2017-09-21T10:53:01.566Z|00032|dpdk|INFO|PMD: nicvf_eth_dev_init(): Port 3 (177d:a034) mac=1c:1b:0d:0d:52:da
2017-09-21T10:53:01.566Z|00033|dpdk|INFO|EAL: PCI device 0002:01:00.6 on NUMA socket 0
2017-09-21T10:53:01.566Z|00034|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.617Z|00035|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=5 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.617Z|00036|dpdk|INFO|EAL: PCI device 0002:01:00.7 on NUMA socket 0
2017-09-21T10:53:01.617Z|00037|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.667Z|00038|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=6 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.667Z|00039|dpdk|INFO|EAL: PCI device 0002:01:01.0 on NUMA socket 0
2017-09-21T10:53:01.667Z|00040|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.718Z|00041|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=7 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.718Z|00042|dpdk|INFO|EAL: PCI device 0002:01:01.1 on NUMA socket 0
2017-09-21T10:53:01.718Z|00043|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.769Z|00044|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=8 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.769Z|00045|dpdk|INFO|EAL: PCI device 0002:01:01.2 on NUMA socket 0
2017-09-21T10:53:01.769Z|00046|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.819Z|00047|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=9 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.819Z|00048|dpdk|INFO|EAL: PCI device 0002:01:01.3 on NUMA socket 0
2017-09-21T10:53:01.819Z|00049|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.870Z|00050|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=10 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.870Z|00051|dpdk|INFO|EAL: PCI device 0002:01:01.4 on NUMA socket 0
2017-09-21T10:53:01.870Z|00052|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.920Z|00053|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=11 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.920Z|00054|dpdk|INFO|EAL: PCI device 0002:01:01.5 on NUMA socket 0
2017-09-21T10:53:01.920Z|00055|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:01.971Z|00056|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=12 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:01.971Z|00057|dpdk|INFO|EAL: PCI device 0002:01:01.6 on NUMA socket 0
2017-09-21T10:53:01.971Z|00058|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.022Z|00059|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=13 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.022Z|00060|dpdk|INFO|EAL: PCI device 0002:01:01.7 on NUMA socket 0
2017-09-21T10:53:02.022Z|00061|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.072Z|00062|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=14 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.072Z|00063|dpdk|INFO|EAL: PCI device 0002:01:02.0 on NUMA socket 0
2017-09-21T10:53:02.072Z|00064|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.139Z|00065|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=15 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.139Z|00066|dpdk|INFO|EAL: PCI device 0002:01:02.1 on NUMA socket 0
2017-09-21T10:53:02.139Z|00067|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.189Z|00068|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=16 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.189Z|00069|dpdk|INFO|EAL: PCI device 0002:01:02.2 on NUMA socket 0
2017-09-21T10:53:02.189Z|00070|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.240Z|00071|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=17 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.240Z|00072|dpdk|INFO|EAL: PCI device 0002:01:02.3 on NUMA socket 0
2017-09-21T10:53:02.240Z|00073|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.290Z|00074|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=18 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.290Z|00075|dpdk|INFO|EAL: PCI device 0002:01:02.4 on NUMA socket 0
2017-09-21T10:53:02.290Z|00076|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.341Z|00077|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=19 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.341Z|00078|dpdk|INFO|EAL: PCI device 0002:01:02.5 on NUMA socket 0
2017-09-21T10:53:02.341Z|00079|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.391Z|00080|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=20 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.391Z|00081|dpdk|INFO|EAL: PCI device 0002:01:02.6 on NUMA socket 0
2017-09-21T10:53:02.391Z|00082|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.442Z|00083|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=21 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.442Z|00084|dpdk|INFO|EAL: PCI device 0002:01:02.7 on NUMA socket 0
2017-09-21T10:53:02.442Z|00085|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.493Z|00086|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=22 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.493Z|00087|dpdk|INFO|EAL: PCI device 0002:01:03.0 on NUMA socket 0
2017-09-21T10:53:02.493Z|00088|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.543Z|00089|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=23 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.543Z|00090|dpdk|INFO|EAL: PCI device 0002:01:03.1 on NUMA socket 0
2017-09-21T10:53:02.543Z|00091|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.594Z|00092|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=24 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.594Z|00093|dpdk|INFO|EAL: PCI device 0002:01:03.2 on NUMA socket 0
2017-09-21T10:53:02.594Z|00094|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.644Z|00095|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=25 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.644Z|00096|dpdk|INFO|EAL: PCI device 0002:01:03.3 on NUMA socket 0
2017-09-21T10:53:02.644Z|00097|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.695Z|00098|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=26 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.695Z|00099|dpdk|INFO|EAL: PCI device 0002:01:03.4 on NUMA socket 0
2017-09-21T10:53:02.695Z|00100|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.745Z|00101|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=27 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.745Z|00102|dpdk|INFO|EAL: PCI device 0002:01:03.5 on NUMA socket 0
2017-09-21T10:53:02.745Z|00103|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.796Z|00104|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=28 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.796Z|00105|dpdk|INFO|EAL: PCI device 0002:01:03.6 on NUMA socket 0
2017-09-21T10:53:02.796Z|00106|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:02.847Z|00107|dpdk|INFO|PMD: nicvf_eth_dev_init(): node=0 vf=29 mode=tns-bypass sqs=true loopback_supported=false
2017-09-21T10:53:02.847Z|00108|dpdk|INFO|DPDK pdump packet capture enabled
2017-09-21T10:53:02.857Z|00109|dpdk|INFO|DPDK Enabled - initialized
2017-09-21T10:53:02.873Z|00110|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
2017-09-21T10:53:02.873Z|00111|ofproto_dpif|INFO|netdev@ovs-netdev: VLAN header stack length probed as 1
2017-09-21T10:53:02.873Z|00112|ofproto_dpif|INFO|netdev@ovs-netdev: MPLS label stack length probed as 3
2017-09-21T10:53:02.873Z|00113|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports truncate action
2017-09-21T10:53:02.874Z|00114|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports unique flow ids
2017-09-21T10:53:02.874Z|00115|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports clone action
2017-09-21T10:53:02.874Z|00116|ofproto_dpif|INFO|netdev@ovs-netdev: Max sample nesting level probed as 10
2017-09-21T10:53:02.874Z|00117|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports eventmask in conntrack action
2017-09-21T10:53:02.874Z|00118|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_state
2017-09-21T10:53:02.874Z|00119|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_zone
2017-09-21T10:53:02.874Z|00120|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_mark
2017-09-21T10:53:02.874Z|00121|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_label
2017-09-21T10:53:02.874Z|00122|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_state_nat
2017-09-21T10:53:02.874Z|00123|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_orig_tuple
2017-09-21T10:53:02.874Z|00124|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports ct_orig_tuple6
2017-09-21T10:53:03.146Z|00125|bridge|INFO|bridge ovsdpdkbr0: added interface ovsdpdkbr0 on port 65534
2017-09-21T10:53:03.150Z|00126|dpdk|INFO|VHOST_CONFIG: vhost-user server: socket created, fd: 152
2017-09-21T10:53:03.150Z|00127|netdev_dpdk|INFO|Socket /var/run/openvswitch/vhost-user-1 created for vhost-user port vhost-user-1
2017-09-21T10:53:03.155Z|00128|dpdk|INFO|VHOST_CONFIG: bind to /var/run/openvswitch/vhost-user-1
2017-09-21T10:53:03.155Z|00129|dpdk|INFO|EAL: Socket /var/run/openvswitch/vhost-user-1 changed permissions to 0666
2017-09-21T10:53:03.155Z|00130|dpdk|INFO|EAL: Socket /var/run/openvswitch/vhost-user-1 changed ownership to 64055:117.
2017-09-21T10:53:03.155Z|00131|netdev_dpdk|WARN|dpdkvhostuser ports are considered deprecated; please migrate to dpdkvhostuserclient ports.
2017-09-21T10:53:03.163Z|00132|dpif_netdev|INFO|PMD thread on numa_id: 0, core id: 13 created.
2017-09-21T10:53:03.163Z|00133|dpif_netdev|INFO|There are 1 pmd threads on numa node 0
2017-09-21T10:53:03.339Z|00134|bridge|INFO|bridge ovsdpdkbr0: added interface vhost-user-1 on port 1
2017-09-21T10:53:03.340Z|00135|dpdk|INFO|EAL: PCI device 0002:01:01.0 on NUMA socket 0
2017-09-21T10:53:03.340Z|00136|dpdk|INFO|EAL: probe driver: 177d:a034 net_thunderx
2017-09-21T10:53:04.349Z|00137|dpdk|ERR|PMD: nicvf_eth_dev_init(): Failed to get ready message from PF
2017-09-21T10:53:04.349Z|00138|dpdk|INFO|EAL: Releasing pci mapped resource for 0002:01:01.0
2017-09-21T10:53:04.349Z|00139|dpdk|INFO|EAL: Calling pci_unmap_resource for 0002:01:01.0 at 0xffff8be00000
2017-09-21T10:53:04.349Z|00140|dpdk|INFO|EAL: Calling pci_unmap_resource for 0002:01:01.0 at 0xffff8bc00000
2017-09-21T10:53:04.349Z|00141|dpdk|WARN|EAL: Requested device 0002:01:01.0 cannot be used
2017-09-21T10:53:04.349Z|00142|dpdk|ERR|EAL: Driver cannot attach the device (0002:01:01.0)
2017-09-21T10:53:04.349Z|00143|netdev_dpdk|WARN|Error attaching device '0002:01:01.0' to DPDK
2017-09-21T10:53:04.349Z|00144|netdev|WARN|dpdk0: could not set configuration (Invalid argument)
2017-09-21T10:53:04.349Z|00145|bridge|INFO|bridge ovsdpdkbr0: using datapath ID 0000e699e66bd34d
2017-09-21T10:53:04.350Z|00146|connmgr|INFO|ovsdpdkbr0: added service controller "punix:/var/run/openvswitch/ovsdpdkbr0.mgmt"
2017-09-21T10:53:07.500Z|00002|daemon_unix|ERR|fork child died before signaling startup (killed (Segmentation fault), core dumped)
2017-09-21T10:53:07.500Z|00003|daemon_unix|EMER|could not initiate process monitoring
Dmesg:
[ 9148.473072] vfio-pci 0002:01:00.2: enabling device (0004 -> 0006)
[ 9148.574273] vfio-pci 0002:01:00.3: enabling device (0004 -> 0006)
[ 9148.674919] vfio-pci 0002:01:00.4: enabling device (0004 -> 0006)
[ 9148.775782] vfio-pci 0002:01:00.5: enabling device (0004 -> 0006)
[ 9148.876680] vfio-pci 0002:01:00.6: enabling device (0004 -> 0006)
[ 9148.927253] vfio-pci 0002:01:00.7: enabling device (0004 -> 0006)
[ 9148.977817] vfio-pci 0002:01:01.0: enabling device (0004 -> 0006)
[ 9149.028613] vfio-pci 0002:01:01.1: enabling device (0004 -> 0006)
[ 9149.079179] vfio-pci 0002:01:01.2: enabling device (0004 -> 0006)
[ 9149.129748] vfio-pci 0002:01:01.3: enabling device (0004 -> 0006)
[ 9149.180317] vfio-pci 0002:01:01.4: enabling device (0004 -> 0006)
[ 9149.230885] vfio-pci 0002:01:01.5: enabling device (0004 -> 0006)
[ 9149.281482] vfio-pci 0002:01:01.6: enabling device (0004 -> 0006)
[ 9149.332049] vfio-pci 0002:01:01.7: enabling device (0004 -> 0006)
[ 9149.398419] vfio-pci 0002:01:02.0: enabling device (0004 -> 0006)
[ 9149.448986] vfio-pci 0002:01:02.1: enabling device (0004 -> 0006)
[ 9149.499555] vfio-pci 0002:01:02.2: enabling device (0004 -> 0006)
[ 9149.550134] vfio-pci 0002:01:02.3: enabling device (0004 -> 0006)
[ 9149.600727] vfio-pci 0002:01:02.4: enabling device (0004 -> 0006)
[ 9149.651296] vfio-pci 0002:01:02.5: enabling device (0004 -> 0006)
[ 9149.701863] vfio-pci 0002:01:02.6: enabling device (0004 -> 0006)
[ 9149.752432] vfio-pci 0002:01:02.7: enabling device (0004 -> 0006)
[ 9149.803001] vfio-pci 0002:01:03.0: enabling device (0004 -> 0006)
[ 9149.853567] vfio-pci 0002:01:03.1: enabling device (0004 -> 0006)
[ 9149.904159] vfio-pci 0002:01:03.2: enabling device (0004 -> 0006)
[ 9149.954725] vfio-pci 0002:01:03.3: enabling device (0004 -> 0006)
[ 9150.005295] vfio-pci 0002:01:03.4: enabling device (0004 -> 0006)
[ 9150.055868] vfio-pci 0002:01:03.5: enabling device (0004 -> 0006)
[ 9150.106453] vfio-pci 0002:01:03.6: enabling device (0004 -> 0006)
[ 9151.681855] eal-intr-thread[42086]: unhandled level 2 translation fault (11) at 0xffff8be00200, esr 0x92000006, in librte_pmd_thunderx_nicvf.so.17.05[ffff8ea1d000+11000]
[ 9151.681870] CPU: 0 PID: 42086 Comm: eal-intr-thread Not tainted 4.13.0-11-generic #12-Ubuntu
[ 9151.681872] Hardware name: GIGABYTE R120-T33/MT30-GS1, BIOS T45 06/14/2017
[ 9151.681874] task: ffff800aefc15a00 task.stack: ffff800aeceec000
[ 9151.681879] PC is at 0xffff8ea20014
[ 9151.681881] LR is at 0xffff8ea20d00
[ 9151.681884] pc : [<0000ffff8ea20014>] lr : [<0000ffff8ea20d00>] pstate: 20000000
[ 9151.681885] sp : 0000ffff8e4b93b0
[ 9151.681887] x29: 0000ffff8e4b93d0 x28: 0000ffff8f8a04b0
[ 9151.681893] x27: 0000ffff8e4b94c8 x26: 0000000000000001
[ 9151.681898] x25: 0000000000000001 x24: 0000ffff8e4b9488
[ 9151.681902] x23: 0000ffff8f869000 x22: 0000ffff8f8a0000
[ 9151.681907] x21: 0000000000000001 x20: 0000ffff8c5a7800
[ 9151.681912] x19: 0000ffff8f8a04d0 x18: 0000000000000000
[ 9151.681917] x17: 0000ffff8f4cc8f0 x16: 0000ffff8f86a5c0
[ 9151.681922] x15: 00002d9ad8000000 x14: 0028008430000000
[ 9151.681927] x13: 00000000000023bf x12: 0000000000000018
[ 9151.681932] x11: 00000000282e1f08 x10: 00000000000023bf
[ 9151.681937] x9 : 003b9aca00000000 x8 : 00ffffffffffffff
[ 9151.681941] x7 : 0000000000208616 x6 : 0000ffff8fafd000
[ 9151.681946] x5 : 0000ffff8e4b9dc0 x4 : 0000ffff8faff948
[ 9151.681951] x3 : ea6fa8ce98f91d98 x2 : 00000000282e1f08
[ 9151.681956] x1 : 0000ffff8be00000 x0 : 0000ffff8be00200
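One detail stands out when comparing both runs: the faulting address sits 0x200 bytes above an address that EAL had just passed to pci_unmap_resource, and x1 in each register dump holds that same base. The sketch below checks this for both runs using lines copied from the logs above (log prefixes trimmed). The names `RUNS`, `hex_after_at` and `nearest_unmapped_base` are made up for this illustration. The mapping sizes are not in the logs, so this only shows the distance from the base. It would fit the interrupt thread still touching device registers after the unmap, but that is a hypothesis, not something the logs prove.

```python
import re

# Relevant lines from the two runs above, with the log prefixes trimmed.
RUNS = {
    "10:38 run": {
        "unmap": [
            "EAL: Calling pci_unmap_resource for 0002:01:01.0 at 0xffff92e00000",
            "EAL: Calling pci_unmap_resource for 0002:01:01.0 at 0xffff93000000",
        ],
        "fault": "unhandled level 2 translation fault (11) at 0xffff92e00200, esr 0x92000006",
    },
    "10:52 run": {
        "unmap": [
            "EAL: Calling pci_unmap_resource for 0002:01:01.0 at 0xffff8be00000",
            "EAL: Calling pci_unmap_resource for 0002:01:01.0 at 0xffff8bc00000",
        ],
        "fault": "unhandled level 2 translation fault (11) at 0xffff8be00200, esr 0x92000006",
    },
}


def hex_after_at(line):
    """Return the hex address that follows ' at ' in a log line."""
    return int(re.search(r" at (0x[0-9a-f]+)", line).group(1), 16)


def nearest_unmapped_base(run):
    """Return (base, offset) for the closest unmapped base at or below the fault."""
    fault = hex_after_at(run["fault"])
    bases = [hex_after_at(line) for line in run["unmap"]]
    base = max(b for b in bases if b <= fault)
    return base, fault - base


for label, run in RUNS.items():
    base, offset = nearest_unmapped_base(run)
    print(f"{label}: fault = unmapped base {base:#x} + {offset:#x}")
```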
In general Openvswitch-DPDK does not segfault; for example, if I take vfio away from the device I want to initialize, then it works.
Also, the same build of openvswitch and dpdk passed the x86 regression checks just last week, so for now I doubt that this is a generic bug.
So to me it really seems to be tied to the initialization of that ThunderX port.
Just as a test, could you try the kernel from ppa:yarmouth-team/next? There is a known IOMMU-related issue (LP: #1718734) that would be good to rule out.