OVS+DPDK segfault at the host, after running "ovs-vsctl set interface dpdk0 options:n_rxq=2 " within a KVM Guest

Bug #1577088 reported by Thiago Martins
Affects Status Importance Assigned to Milestone
dpdk (Ubuntu)
Expired
High
Unassigned
openvswitch (Ubuntu)
Expired
High
Unassigned

Bug Description

Guys,

 It is possible to crash the OVS+DPDK running at the host from inside of a KVM guest!

 All you need to do is enable multi-queue; then, from a KVM guest, you can kill OVS running at the host...

 * Hardware requirements (might be exaggerated but this is what I have):

 1 Dell server with 2 dedicated 10G NICs, plus another one or two 1G NICs for management, apt-get, ssh, etc;
 1 IXIA Traffic Generator - 10G in both directions.

 * Steps to reproduce, at a glance:

 1- Deploy Ubuntu at the host;

 a. GRUB options in /etc/default/grub:

---
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=pt intel_iommu=on default_hugepagesz=1GB hugepagesz=1G hugepages=64"
---

 2- Install OVS with DPDK;

 3- Configure DPDK, 1G Hugepages, PCI IDs and create the OVS bridges for a VM:

 a. /etc/default/openvswitch-switch:

---
DPDK_OPTS='--dpdk -c 0x1 -n 4 -m 2048,0 --vhost-owner libvirt-qemu:kvm --vhost-perm 0664'
---

 b. /etc/dpdk/interfaces:

---
pci 0000:06:00.0 uio_pci_generic
pci 0000:06:00.1 uio_pci_generic
---

 NOTE: those PCI devices are located at NUMA Node 0.

 c. DPDK Hugepages /etc/dpdk/dpdk.conf:

---
NR_1G_PAGES=32
---

 d. OVS Bridges:

ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
ovs-vsctl add-port br0 vhost-user1 -- set Interface vhost-user1 type=dpdkvhostuser

ovs-vsctl add-br br1 -- set bridge br1 datapath_type=netdev
ovs-vsctl add-port br1 dpdk1 -- set Interface dpdk1 type=dpdk
ovs-vsctl add-port br1 vhost-user2 -- set Interface vhost-user2 type=dpdkvhostuser

ip link set dev br0 up
ip link set dev br1 up

 4- At the host, enable multi-queue and add more CPU Cores to OVS+DPDK PMD threads:

ovs-vsctl set Open_vSwitch . other_config:n-dpdk-rxqs=4
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=FFFF
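
The pmd-cpu-mask is a hex bitmask of CPU core IDs: 0xFFFF pins PMD threads to cores 0-15 (and, in step 8 below, 0x6 selects cores 1-2 while 0xF selects cores 0-3). A minimal sketch, assuming a POSIX shell and core numbering starting at 0, to decode which cores a given mask selects:

```shell
# Decode an OVS pmd-cpu-mask hex value into the CPU core IDs it selects.
# 0xFFFF -> cores 0..15; 0x6 -> cores 1 and 2; 0xF -> cores 0..3.
mask=0xFFFF
cores=""
i=0
while [ $i -lt 32 ]; do
    if [ $(( (mask >> i) & 1 )) -eq 1 ]; then
        cores="$cores $i"
    fi
    i=$((i + 1))
done
echo "PMD cores:$cores"
```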

 5- Deploy Ubuntu at the VM, full Libvirt XML:

 a. ubuntu-16.01-1 XML:

 https://paste.ubuntu.com/16162857/

 b. /etc/default/grub:

---
GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1GB hugepagesz=1G hugepages=1"
---

 6- Install OVS with DPDK;

 7- Configure DPDK, 1G Hugepages, PCI IDs and create the OVS bridges within the VM:

 NOTE: Do NOT enable multi-queue inside of the VM yet; you'll see that, so far, it still works!

 a. /etc/default/openvswitch-switch:

---
DPDK_OPTS='--dpdk -c 0x1 -n 4 -m 1024 --pci-blacklist 0000:00:03.0 --pci-blacklist 0000:00:04.0'
---

 b. /etc/dpdk/interfaces:

---
pci 0000:00:05.0 uio_pci_generic
pci 0000:00:06.0 uio_pci_generic
---

 c. DPDK Hugepages /etc/dpdk/dpdk.conf:

---
NR_1G_PAGES=1
---

 d. OVS Bridge:

ovs-vsctl add-br ovsbr -- set bridge ovsbr datapath_type=netdev
ovs-vsctl add-port ovsbr dpdk0 -- set Interface dpdk0 type=dpdk
ovs-vsctl add-port ovsbr dpdk1 -- set Interface dpdk1 type=dpdk

ip link set dev ovsbr up

 NOTE 1: So far, so good! But no multi-queue yet!

 NOTE 2: Sometimes, you can crash ovs-vswitchd at the host, right here!!!

 8- At the VM, add more CPU Cores to OVS+DPDK PMD threads:

For 2 cores:

ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=6

or:

For 4 cores:

ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=F

 9- Enable multi-queue before starting up DPDK and OVS; run this inside of the VM:

systemctl disable dpdk
systemctl disable openvswitch-switch

reboot

ethtool -L ens5 combined 4
ethtool -L ens6 combined 4
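
To confirm the ethtool change took effect before restarting the services, `ethtool -l ens5` reports the channel counts. A hedged sketch that parses output of that shape with awk (the sample text and the ens5 name are illustrative assumptions based on this report, not captured from the machine):

```shell
# Extract the "Current hardware settings" Combined count from ethtool -l
# style output. On a live guest you would pipe `ethtool -l ens5` in
# instead of this inline sample.
sample='Channel parameters for ens5:
Pre-set maximums:
Combined:       4
Current hardware settings:
Combined:       4'
current=$(printf '%s\n' "$sample" | awk '/Current hardware settings/{f=1} f && /^Combined:/{print $2; exit}')
echo "combined queues now: $current"
```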

service dpdk start
service openvswitch-switch start

BOOM!!!

 10- Error log at the host (ovs-vswitchd + DPDK crashed):

 https://paste.ubuntu.com/16152614/

 IMPORTANT NOTES:

 * Sometimes, even without enabling multi-queue at the VM, ovs-vswitchd at the host crashes!

 ** Even weirder: I have a proprietary DPDK app (an L2 bridge for DPI) that uses multi-queue automatically, and it does NOT crash the ovs-vswitchd running at the host! I can use my DPDK app with multi-queue, but I can't do the same with OVS+DPDK.

 So, if I replace "ubuntu16.01-1.qcow2" with my own qcow2, where I have the proprietary DPDK app, I can use multi-queue, and OVS+DPDK at the host works just fine (slower than PCI passthrough but acceptable, and much better than plain OVS).

Cheers!
Thiago

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: openvswitch-switch-dpdk 2.5.0-0ubuntu1
ProcVersionSignature: Ubuntu 4.4.0-22.38-generic 4.4.8
Uname: Linux 4.4.0-22-generic x86_64
ApportVersion: 2.20.1-0ubuntu2
Architecture: amd64
Date: Sat Apr 30 18:04:16 2016
SourcePackage: openvswitch
UpgradeStatus: Upgraded to xenial on 2016-04-07 (23 days ago)

Revision history for this message
Thiago Martins (martinx) wrote :
Thiago Martins (martinx)
summary: OVS+DPDK crashes at the host, right after starting another OVS+DPDK
- inside of a KVM Guest, if multi-queue is enabled
+ inside of a KVM Guest, easier to reproduce if multi-queue is enabled.
Revision history for this message
Christian Ehrhardt  (paelzer) wrote : Re: OVS+DPDK crashes at the host, right after starting another OVS+DPDK inside of a KVM Guest, easier to reproduce if multi-queue is enabled.

Very interesting Thiago,
thanks for reporting - also the steps to reproduce are detailed and I should be able to work on that.

As I said in the mail thread, it would be great if you could report that to the upstream DPDK & OVS dev lists and keep me on CC.
Quite often such things turn out to be known issues.

I'll try to reproduce in the meantime ...

Changed in openvswitch (Ubuntu):
status: New → Triaged
importance: Undecided → High
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi,
also, the only "good" thing about that crash is that it should have left you a .crash file for apport.

You might use e.g.
apport-retrace --rebuild-package-info --stdout --confirm /var/crash/_usr_lib_openvswitch-switch-dpdk_ovs-vswitchd-dpdk.0.crash | pastebinit

The filename could be slightly different; the output would make a nice addition to the bug and to the mailing list request, to identify where your host crashes.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I tried to reproduce; to avoid interfering with another setup I'm currently working on, I had just one dpdk interface attached in the host and just one vhost-user into the guest.

The device came up with 1 of 4 queues; I installed DPDK in the guest as well and was able to initialize it in the guest with one queue without hitting the bug.

I followed your testcase of disabling, rebooting, setting multiple queues via ethtool on the guest dev and reenabling openvswitch-switch. It worked just fine in my (slightly different) environment.
In the Host I see this in the Journal when I start the multiq-enabled openvswitch-dpdk in the guest:
http://paste.ubuntu.com/16167772/
Does that look anything like yours? No matter what it looks like, it is probably also worth adding when you report to the upstream mailing lists.

Sidenotes:
Other than the system config, I found a few small differences.
 I don't expect them to matter, but worth a test on your side maybe?
My multiqueue xml doesn't usually have the vhost set:
yours:
 <driver name='vhost' queues='4'/>
mine:
<driver queues='4'/>

Also, I usually never needed to set the bridge itself up:
ip link set dev ovsbr up

You didn't set "ovs-vsctl set Open_vSwitch . other_config:n-dpdk-rxqs=4" in the guest - intentional?

Revision history for this message
Thiago Martins (martinx) wrote :

Hey Chris! Working during the weekend? Me too! It is fun anyway! :-D

Here is the output of:

apport-retrace --rebuild-package-info --stdout --confirm /var/crash/_usr_lib_openvswitch-switch-dpdk_ovs-vswitchd-dpdk.0.crash | pastebinit

https://paste.ubuntu.com/16174276

Double checking the other configs now...

Revision history for this message
Thiago Martins (martinx) wrote :

Chris,

 You're right, I executed:

 ovs-vsctl set Open_vSwitch . other_config:n-dpdk-rxqs=4

 ...Within the KVM Guest and guess what? That is precisely what is crashing OVS+DPDK at the host!

 I removed IXIA out from the equation, not sending traffic now, all stopped.

 * Steps to reproduce (I'll update the original post soon)

 1- Within the KVM Guest (original post reference, steps 7, 8 and 9):

systemctl disable dpdk
systemctl disable openvswitch-switch

reboot

ethtool -L ens5 combined 4
ethtool -L ens6 combined 4

ovs-vsctl add-br ovsbr -- set bridge ovsbr datapath_type=netdev

ovs-vsctl add-port ovsbr dpdk0 -- set Interface dpdk0 type=dpdk

ovs-vsctl add-port ovsbr dpdk1 -- set Interface dpdk1 type=dpdk

So far, so good (slow but, not crashing)...

ovs-vsctl set Open_vSwitch . other_config:n-dpdk-rxqs=4

BOOM! Now it crashes!

Screenshot: http://i.imgur.com/2qREFET.png

Easier to reproduce, no need for traffic (IXIA, for example).

Cheers!
Thiago

summary: - OVS+DPDK crashes at the host, right after starting another OVS+DPDK
- inside of a KVM Guest, easier to reproduce if multi-queue is enabled.
+ OVS+DPDK segfault at the host, after running "ovs-vsctl set Open_vSwitch
+ . other_config:n-dpdk-rxqs=4" within a KVM Guest
Revision history for this message
Christian Ehrhardt  (paelzer) wrote : Re: OVS+DPDK segfault at the host, after running "ovs-vsctl set Open_vSwitch . other_config:n-dpdk-rxqs=4" within a KVM Guest

Hi,
I'm on a business trip and alone with my bugs on the Weekend atm :-)
It is good to have that simplified reproduction - thanks.

Could you also check whether it would be sufficient to have just ONE device in host and guest instead of two; that would make it even easier. Once we have done a few steps of simplification, you can go to the list and ask.

So far it really appears to me to be very similar to the issue we faced a few weeks ago, where we had the device in use by DPDK AND the kernel. Is anything like that happening again on any layer?

Also, finally, your stacktrace could be enhanced by installing the debug packages.
You should be able to get openvswitch-switch-dbgsym, openvswitch-switch-dpdk-dbgsym and libdpdk0-dbgsym installed after following https://wiki.ubuntu.com/Debug%20Symbol%20Packages
After that the trace will have actual function names and parameters - that will be much more helpful when going to the mailing lists.

Changed in openvswitch (Ubuntu):
status: Triaged → Confirmed
Changed in dpdk (Ubuntu):
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Thiago,
I didn't see mailing list activity on this - are you going to report it there?

Revision history for this message
Thiago Martins (martinx) wrote :

Sorry about this delay! Reported on both OVS and DPDK dev mail lists...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks for driving it upstream - that is the right way.
I'll mark the bug as incomplete to properly reflect that neither you nor I can currently really "work" on it to fix it (which triaged would imply).

Changed in dpdk (Ubuntu):
status: Confirmed → Incomplete
Changed in openvswitch (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Thiago Martins (martinx) wrote :

Why not just "Confirmed"? Giving it "Incomplete" means that I have to provide more information about the problem, which isn't the case here... Am I right?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Thiago,
sorry - I did not want to upset you.
In this case it clearly isn't you we are waiting on, but upstream, along the discussion you are driving there.

In fact no state really matches this case perfectly:
- confirmed suggests it is waiting for me/us to finally triage/work on it, which we already did together, to the point that we agreed we want upstream to clarify it.
- incomplete suggests I'm waiting for you, which I'm not in this particular case.

I chose the latter, because "waiting for someone" applies more than the former.
Also, when looking at our bugs, that state makes it clear that no one can work on it until further feedback is provided.

If that is not reasonable to you, let me know and I'll set it back to confirmed - I'd rather have a few bugs "mislabeled" in my perception than upset you :-)

Revision history for this message
Thiago Martins (martinx) wrote :

Oh, no! I'm not upset! I was just curious about the bug status on Launchpad... Thank you for the clarification!

I have to stop writing e-mails after a big day of work, while I'm tired... :-)

Cheers!
Thiago

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for openvswitch (Ubuntu) because there has been no activity for 60 days.]

Changed in openvswitch (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for dpdk (Ubuntu) because there has been no activity for 60 days.]

Changed in dpdk (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Thiago Martins (martinx) wrote :

This is still a bug on Xenial with OpenvSwitch-2.6 and DPDK-16.07, from Ubuntu Cloud Archive.

I just crashed the OVS+DPDK running at the host, right after trying to enable multiqueue inside of a KVM Guest, also running OVS+DPDK.

NOTE: Multiqueue was enabled at the host in advance, on both OVS and Libvirt VM XML.

I also noted that right after enabling 2 queues at the host, the speed improved inside of the running guest, without doing anything to it! But then, after trying to enable multiqueue at the KVM guest, on its OVS+DPDK on top of VirtIO, the host crashed.

After running on the KVM Guest:

---
root@ubuntu-ovs-dpdk-vm-1:~# ovs-vsctl set interface dpdk0 options:n_rxq=2 ; ovs-vsctl set interface dpdk1 options:n_rxq=2
---

At the host, ovs-vswitchd crashed:

---
root@ubuntu-ovs-dpdk-kvm-1:~# tail -F /var/log/openvswitch/ovs-vswitchd.log
......
2016-12-02T04:11:28.578Z|00127|dpdk(vhost_thread2)|INFO|State of queue 0 ( tx_qid 0 ) of vhost device '/var/run/openvswitch/vhost-user1'changed to 'enabled'
2016-12-02T04:11:28.578Z|00128|dpdk(vhost_thread2)|INFO|State of queue 2 ( tx_qid 1 ) of vhost device '/var/run/openvswitch/vhost-user1'changed to 'enabled'
2016-12-02T04:11:28.956Z|00002|daemon_unix(monitor)|ERR|1 crashes: pid 3841 died, killed (Segmentation fault), core dumped, restarting
---
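
When hunting for these restarts in a long ovs-vswitchd.log, the daemon_unix monitor line is the marker. A small sketch that pulls the crash count and pid out of such a line (the sample is taken verbatim from the log above):

```shell
# Detect an ovs-vswitchd monitor-restart line and extract the crash
# count and the pid of the process that segfaulted.
line='2016-12-02T04:11:28.956Z|00002|daemon_unix(monitor)|ERR|1 crashes: pid 3841 died, killed (Segmentation fault), core dumped, restarting'
crashes=$(printf '%s\n' "$line" | sed -n 's/.*ERR|\([0-9]*\) crashes: pid \([0-9]*\).*/\1/p')
pid=$(printf '%s\n' "$line" | sed -n 's/.*ERR|\([0-9]*\) crashes: pid \([0-9]*\).*/\2/p')
echo "crashes=$crashes pid=$pid"
```

On a live host, replace the inline sample with `tail /var/log/openvswitch/ovs-vswitchd.log` piped into the same sed expressions.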

I REALLY want to be able to use DPDK apps on top of OVS+DPDK at the host, but it still looks unstable to me.

:-(

summary: - OVS+DPDK segfault at the host, after running "ovs-vsctl set Open_vSwitch
- . other_config:n-dpdk-rxqs=4" within a KVM Guest
+ OVS+DPDK segfault at the host, after running "ovs-vsctl set interface
+ dpdk0 options:n_rxq=2 " within a KVM Guest
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Thiago,
I have used multiqueue in the guest in the way:
Host+OVS+DPDK
   |
   |
 multiqueue (4)
   |
   v
  Guest

Not sure if you use DPDK in the guest as well in your case.
But the former worked well for me.

But IIRC from mailing list discussions, the DPDK vhost-user code still doesn't like it if you change the number of queues too often.
I could boot and get up to #cpus queues auto-enabled, and that worked.

A while ago I found that they at least fixed the issues I had back then with changing the queue number via ethtool - but I'm not sure if your case still triggers something in there.

You might catch me on IRC - but next week I'm on a business trip - we might sync on our setups and identify whether there is something to "fix in your setup" or "to reproduce in mine".

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Given the summary you updated you used DPDK in the guest as well.
I'll put it on my backlog to look at, but very very likely this ends up upstream anyway.

I hope to get 16.07.2 (for more fixes) and 16.11 in soon, so you might then retest on those.
The former should, I hope, migrate into the cloud archive you use.
