Performance regression from qemu 2.3 to 2.5 for vhost-user with ovs + dpdk

Bug #1668829 reported by Rafael David Tinoco
This bug affects 2 people
Affects Status Importance Assigned to Milestone
qemu (Ubuntu)
Invalid
Medium
Unassigned
Nominated for Xenial by Rafael David Tinoco
Nominated for Yakkety by Rafael David Tinoco
Nominated for Zesty by Rafael David Tinoco

Bug Description

The following situation was brought to my attention:

"""

- Overload not caused by traffic: packet drops with qemu 2.5 are caused by short (1-2 ms) bursts of packets (1-2k packets) that arrive on the vswitch at twice the rate at which the external traffic generator sends them.

- Overload is high enough that the virtio tx queues (256 hard-coded slots <=> 128 packets) run full.

- Fluctuating packet rate: processed packets/ms oscillate strongly over time:

  Avg: ~1020 pkts/ms
  Stdev: ~107 packets/ms
  Min: 720 packets/ms
  Max: 1324 packets/ms

  Every 2-4 seconds there is a queue overrun (causing rates to go high because packets are dropped).
  The measurements above were taken with 1-second captures (multiple times).

More information:

- Traffic received back from the QEMU VM (tx) oscillates considerably, with peaks at queue overruns.
- Reverting to QEMU 2.2, the packet drops disappear (no more overload bursts).
- Can't get statistics on QEMU 2.2 (like those for 2.5) because of the instrumentation.
- QEMU 2.3 was also tested and seemed to be fine.

"""

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

It was also brought to my attention that overriding the virtio queue size mitigates the issue:

http://pastebin.ubuntu.com/24087865/

I believe that using multiple vhost queues, if supported by the OpenvSwitch+DPDK the user is running, would also mitigate the issue with no code change (since it would parallelize the TX flush); a rough command-line sketch follows below.
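
For reference, a minimal sketch of what such a multiqueue vhost-user setup could look like on the QEMU command line (the socket path, ids and queue count are placeholders, not taken from the user's setup):

  -chardev socket,id=char0,path=/var/run/openvswitch/vhost-user-0 \
  -netdev type=vhost-user,id=net0,chardev=char0,queues=4 \
  -device virtio-net-pci,netdev=net0,mq=on,vectors=10

The vectors value follows the usual 2*queues+2 rule, and the guest still has to enable the extra queues (e.g. "ethtool -L eth0 combined 4") to benefit from them.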

Changed in qemu (Ubuntu):
importance: Undecided → Medium
assignee: nobody → Rafael David Tinoco (inaddy)
status: New → In Progress
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (3.4 KiB)

This is my initial code analysis:

Between 2.3 and 2.5 we have about 80 vhost changes (no merges, no tests), ~30 of them for vhost-user.

The most important vhost-user ones are these:

48854f57 vhost-user: fix log size
dc3db6ad vhost-user: start/stop all rings
5421f318 vhost-user: print original request on error
2b8819c6 vhost-user: modify SET_LOG_BASE to pass mmap size and offset
f6f56291 vhost user: add support of live migration
9a78a5dd vhost-user: send log shm fd along with log_base
1be0ac21 vhost-user: add vhost_user_requires_shm_log()
7263a0ad vhost-user: add a new message to disable/enable a specific virt queue.
* b931bfbf vhost-user: add multiple queue support
fc57fd99 vhost: introduce vhost_backend_get_vq_index method
e2051e9e vhost-user: add VHOST_USER_GET_QUEUE_NUM message
dcb10c00 vhost-user: add protocol feature negotiation
7305483a vhost-user: use VHOST_USER_XXX macro for switch statement
d345ed2d Revert "vhost-user: add multi queue support"
830d70db vhost-user: add multi queue support
294ce717 vhost-user: Send VHOST_RESET_OWNER on vhost stop

And these for vhost:

12b8cbac3c8 vhost: don't send RESET_OWNER at stop
25a2a920ddd vhost: set the correct queue index in case of migration with multiqueue
* 15324404f68 vhost: alloc shareable log
2ce68e4cf5b vhost: add vhost_has_free_slot() interface
0cf33fb6b49 virtio-net: correctly drop truncated packets
fc57fd9900d vhost: introduce vhost_backend_get_vq_index method
06c4670ff6d Revert "virtio-net: enable virtio 1.0"
dfb8e184db7 virtio-pci: initial virtio 1.0 support
b1506132001 vhost_net: add version_1 feature
df91055db5c virtio-net: enable virtio 1.0
* 309750fad51 vhost: logs sharing
9718e4ae362 arm_gicv2m: set kvm_gsi_direct_mapping and kvm_msi_via_irqfd_allowed

The starred vhost-user change (b931bfbf) refactors the multiple queue support for vhost-user. I'm not entirely sure this change is related to the problem, since they're not using queues=XX on the "-netdev" command line.

They have changed the virtio device queue size (virtio) - http://pastebin.ubuntu.com/24087865/ - but not the number of queues for the virtio-net-pci device (vhost-user multiqueue, in this example).

Possible causes of such behavior (based on QEMU changes):

- vhost-user multiple queue support refactored
  they are not using "queues=XX" in "-netdev" cmdline
  it could have changed some logic (to check)

- tx queue callbacks scheduling (either timer or qemu aio bottom half)
  this would happen if there wasn't enough context switching
  (for qemu and vhost-user threads). could happen due to lock contention
  or system overload (due to some other change unrelated to virtio).

* By raising the tx queue size we make the flushes longer in time, which is
  possibly causing a bigger throughput (stopping the queue overrun). This
  tells us that either the buffer is too small OR the flush is being called
  fewer times than it should.
* That is why I'm focusing on this part: something either reduced the buffer
  size or is causing a bottleneck in the buffer flush, typical of the
  "burst" behavior, btw.

- There was also a change in the vhost logging system:

* vhost-user, commit: 309750fad51

* For live migration they sta...

Read more...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

As a temporary workaround, I've prepared a QEMU 2.5 with a higher virtio RX/TX queue size - just like the change that mitigates the issue:

https://launchpad.net/~inaddy/+archive/ubuntu/lp1668829work

That contains the workaround (256 -> 1024 change on virtio queue size).

The QEMU pkg version is: 1:2.5+dfsg-5ubuntu10.9~cloud1~lp1668829~work1

I'll use another PPA for possible tests. This one will have the workaround only.
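
To try it, the usual PPA steps should be enough (a sketch, assuming the PPA above; the exact package set depends on what is installed):

  sudo add-apt-repository ppa:inaddy/lp1668829work
  sudo apt-get update
  sudo apt-get install --only-upgrade qemu-system-x86 qemu-system-common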

Revision history for this message
Jan Scheurich (jan-scheurich) wrote :

Hi,

One guest application where this problem is observed is a DPDK application using a pipeline of multiple DPDK logical cores to process packets between Rx from one virtio-net device and Tx to another virtio-net device. The DPDK lcores are coupled through DPDK rings that may buffer a significant amount of packets inside the application.

One possible explanation for the observed changed behavior in Qemu 2.5 compared to Qemu 2.3 could be that for some reason one of these DPDK threads is interrupted so that a significant amount of packets queue up in its ingress ring. These are then forwarded in a burst at larger speed than normal to the Tx virtio queue. This would imply that Qemu 2.5 changes the scheduling of such DPDK threads in the guest.

BR, Jan

Revision history for this message
Jan Scheurich (jan-scheurich) wrote :

The proposal to work around the problem by using multiple vhost-user queues per port cannot solve the problem as it has two prerequisites that are not generally fulfilled:

1. The guest application needs to be able to use multiple queues and spread its Tx traffic across them.
2. The OpenStack environment must support configuration of vhost multi-queue.

The work-around to increase the queue length to 1024, in contrast, is completely transparent for applications and reduces the likelihood of packet drops for all types of sub-ms scale load fluctuations, no matter their cause.

In general we believe that it would be good to dimension the virtio queue size roughly equal to the typical queue sizes of physical interfaces (typically ~1K packets), so that the virtio queues do not become the weakest link in the end-to-end data path.

To this end we do support the idea of making the virtio-net queue size configurable in both directions (Rx and Tx) in upstream Qemu.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote : Re: [Bug 1668829] Re: Performance regression from qemu 2.3 to 2.5 for vhost-user with ovs + dpdk

It seems there is kind of a related bug to this one?
=> 1668931 <https://bugs.launchpad.net/bugs/1668931>

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Hello Christian,

Yes, I believe it's related and I'm not sure why another bug was opened (other than this one). For now, let's concentrate efforts on this one only. I still need feedback from the end user - present in the case discussion - and still need to do some more testing.

The best move here is a bisection and then the problem resolution (since we have a good and a bad version). We can try to guess the problem, but, unfortunately, we can't reproduce it on our side. Keep in mind that their setup is not using Canonical/Ubuntu OVS and DPDK, just QEMU.

Jan, based on your comment:

" One guest application where this problem is observed is a DPDK application using a pipeline of multiple DPDK logical cores to process packets between Rx from one virtio-net device and Tx to another virtio-net device. The DPDK lcores are coupled through DPDK rings that may buffer a significant amount of packets inside the application. "

This is the reason I said:

* Raising the tx queue size makes the flushes longer in time, and that is possibly causing a bigger throughput (stopping the queue overrun). This tells us that either the buffer is too small OR the flush is being called fewer times than it should (<- less scheduling or lock contention).

For this other comment:

" One possible explanation for the observed changed behavior in Qemu 2.5 compared to Qemu 2.3 could be that for some reason one of these DPDK threads is interrupted so that a significant amount of packets queue up in its ingress ring. These are then forwarded in a burst at larger speed than normal to the Tx virtio queue. This would imply that Qemu 2.5 changes the scheduling of such DPDK threads in the guest. "

* Yep, the "some reason" for the interruption of DPDK thread scheduling is the problem I would like to pursue - as said in the first comment. Keep in mind that it might not be related to virtio or vhost at all, but to something that causes the timer function callback to be postponed (for example).

I'll address the vhost queue comment in the next one.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Jan,

You said:

"
The proposal to work around the problem by using multiple vhost-user queues per port cannot solve the problem as it has two prerequisites that are not generally fulfilled:

1. The guest application needs to be able to use multiple queues and spread its Tx traffic across them.
"

But the QEMU vhost multiple queue feature is there, and it could solve your issue if this turns out to be the consequence of a deliberate development decision in QEMU. We still cannot affirm that, since we need to find the cause (bisection is best, since I don't have access to your environment AND you're using packages/patches not generally available to the Ubuntu community - upstream/customized).

About this one:

"
2. The OpenStack environment must support configuration of vhost multi-queue.
"

According to these documents:

https://specs.openstack.org/openstack/nova-specs/specs/liberty/implemented/libvirt-virtiomq.html
https://github.com/openstack/nova/commit/9a09674220a071e51fdca7911b52c0027c01ff64

It is already supported. You have to set hw_vif_multiqueue_enabled=True on the image. The libvirt XML will then be generated - at instantiation time - with vhost queues (one per vCPU).
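
For example (a sketch; the image name is a placeholder):

  # enable vhost multiqueue for instances booted from this image
  openstack image set --property hw_vif_multiqueue_enabled=true my-vnf-image

  # inside the guest, the extra queues still have to be enabled, e.g.:
  ethtool -L eth0 combined 4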

For this one:

"
The work-around to increase the queue length to 1024, in contrast, is completely transparent for applications and reduces the likelihood of packet drops for all types of sub-ms scale load fluctuations, no matter their cause.

In general we believe that it would be good to dimension the virtio queue size roughly equal to the typical queue sizes of physical interfaces (typically ~1K packets) to avoid that the virtio queues are the weakest link in the end-to-end data path.
"

It didn't happen in 2.2 (or 2.3) but it happens in 2.5. The queue size - not for the virtio device but for the virtio-net device (using vhost) - has always been 256 in those versions. We would be mitigating an unknown cause, and that will be hard to get accepted upstream if you want to go there directly.

IMHO we should bisect your tests - about 12 steps - and find the cause. After the cause is found I can fix it (in the best possible way for you) and we can go upstream together, if needed. Sometimes the cause has already been fixed in the development tree.

For this last comment:

"
To this end we do support the idea of making the virtio-net queue size configurable in both directions (Rx and Tx) in upstream Qemu.
"

THAT I do agree with. Changing the default is tricky, but providing a mechanism to configure it - up to a hardcoded 1024 - could be beneficial. Although I still think we are working on hypotheses without doing the tangible thing we can do: bisect the test and find the exact cause.

What do you think? Can I start bisecting QEMU and providing new packages in a PPA for you to test? You provide comments saying #good or #bad based on the test results, I upload another version, you upgrade QEMU, test again, and so on.

How does that sound?
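
On my side the procedure would be roughly the following (a sketch; URL and build/packaging steps are illustrative and depend on the environment):

  git clone https://gitlab.com/qemu-project/qemu.git && cd qemu
  git bisect start
  git bisect bad v2.5.0       # known bad
  git bisect good v2.3.0      # known good
  # build/package the commit that bisect checks out, publish it to the test PPA,
  # wait for the #good/#bad feedback in this bug, then:
  git bisect good             # or: git bisect bad
  # repeat (~12 iterations) until git prints the first bad commit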

Revision history for this message
Jan Scheurich (jan-scheurich) wrote :

Hi Rafael,

Some answers to your questions:

1. You are probably right that OpenStack Mitaka in principle supports assigning one vhost queue per vCPU of an instance, but since this requires support in VNFs we cannot utilize this in general.

Some VNFs we need to support with our NFV Infrastructure are not able to deal with multiple vhost queues or, if they can, may not distribute traffic evenly over multiple TX queues. That is why vhost multiqueue is not a general solution to the problem we see with short virtio queues.

2. The run-time behavior of VNFs on Qemu 2.5 has degraded compared to Qemu 2.2 and 2.3. The increased burstiness of TX traffic is much more likely to overrun the short 256 slot virtio queues, which leads to the increase of packet drops at lower traffic rates.

But even with Qemu 2.0 certain VNFs drop TX packets at traffic rates well below the maximum vSwitch throughput because of too short TX queues. We have seen a 30% increase in throughput at same packet drop level between Qemu 2.5 with 1024 queue slots and Qemu 2.2 with the original 256 queue slots, which indicates that the original queue size is underdimensioned.

3. We agree that your bisection approach is the only way to find the commit between Qemu 2.3 and 2.5 that is responsible for the increased burstiness. Then we can assess if this is a bug, or an avoidable consequence of some new feature implemented in Qemu 2.4 or 2.5, and decide on the right upstreaming strategy for this. With the 1K queue length option in place, fixing this is clearly no longer as critical.

It is not guaranteed that all of the 12 intermediate commits picked by the bisect procedure will be fully working in an end-to-end NFV context, though. We might need to try out more commits than 12 to hunt down the guilty one. When we have the test channel ready for this procedure, we will find out.

4. For the above reasons we really want to make the virtio queue length configurable in rx and tx direction in upstream Qemu up to the limit of 1024 as earlier proposed by Patrick Hermansson (https://patchwork.ozlabs.org/patch/549544/). This can be done per port or by a global default configuration option.

BR, Jan

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote : Re: [Bug 1668829] Performance regression from qemu 2.3 to 2.5 for vhost-user with ovs + dpdk
Download full text (4.4 KiB)

Hello Jan,

> On 3 Mar 2017, at 09:58, Jan Scheurich <email address hidden> wrote:
>
> Hi Rafael,
>
> Some answers to your questions:
>
> 1. You are probably right that OpenStack Mitaka in principle supports
> assigning one vhost queue per vCPU of an instance, but since this
> requires support in VNFs we cannot utilize this in general.

Okay, thanks for letting me know.

> Some VNFs we need to support with our NFV Infrastructure are not able to
> deal with multiple vhost queues or, if they can, may not distribute
> traffic evenly over multiple TX queues. That is why vhost multiqueue is
> not a general solution to the problem we see with short virtio queues.

I see. Just separating the problems and each problem's scope: for the throughput (not the packet loss) it is not improbable that upstream would tell you to use multiqueue for that problem - changing the VNFs - instead of changing the vring buffer size. I totally understand the situation with multiqueue. I'm working (reading/researching) on this part right now, to be honest (since I'm waiting for the testing environment).

> 2. The run-time behavior of VNFs on Qemu 2.5 has degraded compared to Qemu 2.2 and 2.3. The increased burstiness of TX traffic is much more likely to overrun the short 256 slot virtio queues, which leads to the increase of packet drops at lower traffic rates.

Yes, and that is what led us to think about a queue flush scheduling issue caused by some change.

> But even with Qemu 2.0 certain VNFs drop TX packets at traffic rates
> well below the maximum vSwitch throughput because of too short TX
> queues. We have seen a 30% increase in throughput at same packet drop
> level between Qemu 2.5 with 1024 queue slots and Qemu 2.2 with the
> original 256 queue slots, which indicates that the original queue size
> is underdimensioned.

This is the tricky part. You can always maintain your own QEMU package - by patching it with your own changes every release - if upstream doesn't accept this change (because they would enter the TX vring buffer size discussion for virtio). Upstream would likely give you some ways to "solve" this problem and you wouldn't be able to comply, because you might not be able to change your VNFs to take advantage of those "new features" (from the underlying virtio hypervisor/driver code). I still have to study this topic more, I'm reading the code. Will likely need some more days here.

> 3. We agree that your bisection approach is the only way to find the commit between Qemu 2.3 and 2.5 that is responsible for the increased burstiness. Then we can assess if this is a bug, or an avoidable consequence of some new feature implemented in Qemu 2.4 or 2.5, and decide on the right upstreaming strategy for this. With the 1K queue length option in place fixing this is clearly no longer as critical.

That is why I was insisting so much on the bisection. If I just provide you a fix for the throughput - which couldn't even be accepted as an SRU for our package - the original issue - packet drops - would lose importance. I'm glad we are differentiating both and following different paths. The throughput is not guaranteed BUT it might be en...

Read more...

Revision history for this message
Billey O'Mahoney (billey) wrote :

Hi All,

the scenario appears to involve a DPDK application in the guest. Is that correct?

For Linux applications in the guest (iperf3 default TCP throughput test), I have seen much increased performance with indirect descriptors enabled on the dpdkvhostuser ports. This setting is available with DPDK >= 16.11 on the host.

https://mail.openvswitch.org/pipermail/ovs-dev/2017-March/329227.html

I have not yet had the chance to test with dpdk applications in the guest but when I get results with that I will post them here.

/Billy

Revision history for this message
Zoltan Szeder (zoltan-szeder) wrote :

Hi All,

I would like to mention (if it has not been mentioned already) that using the patch provided in Rafael's first comment ( http://pastebin.ubuntu.com/24087865/ ) caused iPXE to fail in our test environment with the following message:

Error: message queue 1024 > 256

It happens, when the guest is set to boot from network.

The test environment was an Ubuntu 14.04 virtualization host updated with the cloudarchive-mitaka PPA and qemu 2.5 patched with the change mentioned in the link.
The same was experienced with a plain Ubuntu 16.04 with the patched qemu package.

Ubuntu 14.04:
qemu: 1:2.5+dfsg-5ubuntu10.9~cloud0
ipxe-qemu: 1.0.0+git-20131111.c3d1e78-2ubuntu1

Ubuntu 16.04:
qemu: 1:2.5+dfsg-5ubuntu10.9
ipxe-qemu: 1.0.0+git-20150424.a25a16d-1ubuntu1

BR, Zoltan Szeder

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Yes, the vhost/virtio implementation in the firmware itself has to be changed as well - since it is likely using a hard-coded value for the vring size. For now, use an emulated interface (ne2k/Intel) for PXE, as sketched below.
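
A minimal sketch of that workaround on the QEMU command line (the netdev backend, names and device model are placeholders only):

  # data traffic stays on the vhost-user virtio-net port; PXE boots from an emulated NIC
  -netdev tap,id=pxe0,ifname=tap-pxe0,script=no,downscript=no \
  -device e1000,netdev=pxe0,bootindex=1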

Revision history for this message
Zoltan Szeder (zoltan-szeder) wrote :

Further investigation revealed that in December 2016 this issue was solved in iPXE, so the ISO image can be used as an alternative to the host-provided ipxe-qemu based network boot.

That means that, with the increased net queue size, there is no need to change the virtio network device model in order to use PXE booting.
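
If anyone wants to go that route, attaching a recent iPXE ISO is straightforward (the ISO path is a placeholder):

  qemu-system-x86_64 ... -cdrom /path/to/ipxe.iso -boot order=d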

Revision history for this message
wangzhike (wangzhike) wrote :

Hi Rafael,

Did you find the root cause for this bug? From the discussion it seems to be a QEMU issue, but I am not sure whether it is fixed or not.

We are using qemu 2.9.1, and also observed this issue.

scenario:
1. VM A and VM B are on different ovs+dpdk nodes.
2. VM A scps a file to VM B. If the delay between VM A and B is large, say 30 ms, we observe packet loss in the VM rx direction (from ovs+dpdk's point of view, it is a tx stats drop). Note that the observed throughput is slow, only about 3 MB/s. We already changed the rx_queue_size to 1024. We cannot understand why the VM cannot handle such low traffic.

Thanks.

Zhike

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (6.9 KiB)

Hello Wangzhike,

I failed to update the case with the latest analysis here. I could isolate the commit with a bisection. If you could provide feedback on this, that would also be good.

#### ANALYSIS

# git bisection log
#
# bad: [a8c40fa2d667e585382080db36ac44e216b37a1c] Update version for v2.5.0 release
# good: [e5b3a24181ea0cebf1c5b20f44d016311b7048f0] Update version for v2.3.0 release
# bad: [6b324b3e5906fd9a9ce7f4f24decd1f1c7afde97] Merge remote-tracking branch 'remotes/stefanha/tags/net-pull-request' into staging
# good: [a3d586f704609a45b6037534cb2f34da5dfd8895] dma/rc4030: create custom DMA address space
# good: [54f3223730736fca1e6e89bb7f99c4f8432fdabb] ahci: factor ncq_finish out of ncq_cb
# good: [e46e1a74ef482f1ef773e750df9654ef4442ca29] target-arm: Fix broken SCTLR_EL3 reset
# bad: [776f87845137a9b300a4815ba6bf6879310795aa] Merge remote-tracking branch 'remotes/mjt/tags/pull-trivial-patches-2015-07-27' into staging
# good: [21a03d17f2edb1e63f7137d97ba355cc6f19d79f] AioContext: fix broken placement of event_notifier_test_and_clear
# bad: [e40db4c6d391419c0039fe274c74df32a6ca1a28] Merge remote-tracking branch 'remotes/jnsnow/tags/cve-2015-5154-pull-request' into staging
# bad: [30fdfae49d53cfc678859095e49ac60b79562d6f] Merge remote-tracking branch 'remotes/rth/tags/pull-tcg-20150723' into staging
# bad: [12e21eb088a51161c78ee39ed54ac56ebcff4243] Merge remote-tracking branch 'remotes/ehabkost/tags/numa-pull-request' into staging
# bad: [dc94bd9166af5236a56bd5bb06845911915a925c] Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging
# good: [b9c46307996856d03ddc1527468ff5401ac03a79] Merge remote-tracking branch 'remotes/mdroth/tags/qga-pull-2015-07-21-tag' into staging
# bad: [05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3] AioContext: optimize clearing the EventNotifier
# first bad commit: [05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3] AioContext: optimize clearing the EventNotifier
#

commit 05e514b1d4d5bd4209e2c8bbc76ff05c85a235f3 (HEAD, refs/bisect/bad)
Author: Paolo Bonzini <email address hidden>
Date: Tue Jul 21 16:07:53 2015 +0200

    AioContext: optimize clearing the EventNotifier

    It is pretty rare for aio_notify to actually set the EventNotifier. It
    can happen with worker threads such as thread-pool.c's, but otherwise it
    should never be set thanks to the ctx->notify_me optimization. The
    previous patch, unfortunately, added an unconditional call to
    event_notifier_test_and_clear; now add a userspace fast path that
    avoids the call.

    Note that it is not possible to do the same with event_notifier_set;
    it would break, as proved (again) by the included formal model.

    This patch survived over 3000 reboots on aarch64 KVM.

    Signed-off-by: Paolo Bonzini <email address hidden>
    Reviewed-by: Fam Zheng <email address hidden>
    Tested-by: Richard W.M. Jones <email address hidden>
    Message-id: <email address hidden>
    Signed-off-by: Stefan Hajnoczi <email address hidden>

## UNDERSTANDING

Basic logic for AioContext before this change was:

QEMU has its own asynchronous IO implementation, which has a master structure referred to as the "AIO Context". This asynchronous IO subsystem is used by different parts of QE...

Read more...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

The following upstream commits mitigate this issue by allowing the user to control the tx queue size (up to 1024). When this is done, the performance drop caused by the commit shown in the previous comment is mitigated.

# QEMU

commit 9b02e1618cf26aa52cf786f215d757506dda14f8
Author: Wei Wang <email address hidden>
Date: Wed Jun 28 10:37:59 2017 +0800

virtio-net: enable configurable tx queue size

commit 2eef278b9e6326707410eed23be40e57f6c331b7
Author: Michael S. Tsirkin <email address hidden>
Date: Mon Jul 3 22:25:24 2017 +0300

virtio-net: fix tx queue size for !vhost-user

# LIBVIRT

commit 2074ef6cd4a2e033813ec091487d027a85f73509
Author: Michal Privoznik <email address hidden>
Date: Wed Jul 12 14:19:26 2017 +0200

Add support for virtio-net.tx_queue_size

# NOVA COMPUTE (pending, not yet accepted)

https://blueprints.launchpad.net/nova/+spec/libvirt-virtio-set-queue-sizes

https://review.openstack.org/#/c/484997/

Revision history for this message
wangzhike (wangzhike) wrote :

Thanks Rafael for this info.

Changed in qemu (Ubuntu):
assignee: Rafael David Tinoco (inaddy) → nobody
status: In Progress → Invalid