qemu vhost-user should ignore irrelevant mem regions because it has limit of 8 regions

Bug #1887525 reported by Dan Streetman on 2020-07-14
Affects                Status         Importance   Assigned to
qemu (Ubuntu)          Fix Released   Undecided    Unassigned
qemu (Ubuntu Bionic)   Fix Released   Medium       Dan Streetman

Bug Description

[impact]

The impact is the same as bug 1886704: the qemu vhost-user driver fails to init. See that bug for more details on the impact.

The vhost-user driver cannot dictate how many mem regions are present in the qemu guest. If it calculates more than 8 regions at driver initialization time, this API limitation causes the qemu instance that is attempting to add/initialize a new vhost-user interface (NIC) to fail, so the instance cannot use the NIC. Typically, this means that a qemu instance that is supposed to connect to DPDK-OVS is unable to do so, has broken/missing networking, and in most cases is unusable.

[test case]

Start a qemu guest with at least one vhost-user interface (e.g. using DPDK-OVS) and more than 8 discontiguous memory regions. This might happen when using multiple PCI passthrough devices in combination with vhost-user interface(s). The vhost-user device will fail to init because it exceeds its max memory region limit.
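
To make "more than 8 discontiguous memory regions" concrete, here is a minimal sketch that counts how many regions remain once touching address ranges are merged. It is not QEMU code: the `(start, size)` tuple layout, the function name `count_discontiguous`, and the sample addresses are assumptions for illustration. Only the limit of 8 is taken from this bug.

```python
VHOST_USER_MAX_REGIONS = 8  # limit discussed in this bug


def count_discontiguous(ranges):
    """Count regions left after merging ranges that touch or overlap.

    `ranges` is an iterable of (start, size) tuples (hypothetical layout).
    """
    count = 0
    current_end = None
    for start, size in sorted(ranges):
        if current_end is None or start > current_end:
            count += 1  # there is a gap before this range, so it is a new region
            current_end = start + size
        else:
            # range touches or overlaps the current region: extend it
            current_end = max(current_end, start + size)
    return count


# Hypothetical guest layout: two adjacent RAM ranges, then nine
# device ranges that are separated from each other by gaps.
layout = [(0x0, 0xA0000), (0xA0000, 0x20000)]
layout += [(0x1_0000_0000 + i * 0x2000_0000, 0x1000_0000) for i in range(9)]

regions = count_discontiguous(layout)
over_limit = regions > VHOST_USER_MAX_REGIONS
```

In this made-up layout the two RAM ranges merge into one region while the nine device ranges stay separate, so the count ends up above the limit.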

As I don't have a DPDK setup to reproduce this, I am relying on the person who reported this bug to me to test and verify the fix.

[regression potential]

As this causes vhost-user to ignore some mem regions, any regression would likely involve problems with the vhost-user interface: possibly failure to init the interface, failure to configure the interface, or problems while using the interface.

[scope]

this is needed for bionic.

this is fixed upstream by commits 9e2a2a3e083fec1e8059b331e3998c0849d779c1 and 988a27754bbbc45698f7acb54352e5a1ae699514, which are first included in v2.12.0 and v3.0.0, respectively, so this is fixed in focal and later.

I am not proposing this for xenial at this time, as there is more context difference and higher regression potential, and nobody has reported to me a need for this fix on xenial.

[other info]

This is closely related to bug 1886704, but that bug is specifically about the 8 mem region limit of the vhost-user API. This bug does not attempt to fix that limitation, which would require a new extension of the vhost-user API to raise the max mem regions. It only backports existing upstream patches that fix the vhost region calculations and allow the vhost-user driver to indicate which mem regions it does not need, so that those are ignored and the total stays under the vhost-user limit.
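
The filtering idea described above can be sketched as follows. This is a simplified Python model, not the QEMU C code from the backported patches: the `Section` class, its `fd` field, and both function names are hypothetical, and the rule "keep only sections backed by a file descriptor" is an assumption about what a vhost-user backend needs. Only the limit of 8 comes from this bug.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

VHOST_USER_MAX_REGIONS = 8  # API limit discussed in this bug


@dataclass
class Section:
    name: str
    fd: Optional[int]  # None models memory the backend cannot map (assumption)


def needs_section(section: Section) -> bool:
    """Hypothetical backend filter: keep only fd-backed sections."""
    return section.fd is not None


def usable_sections(sections: List[Section],
                    keep: Callable[[Section], bool]) -> List[Section]:
    """Drop sections the backend does not need, then enforce the limit."""
    kept = [s for s in sections if keep(s)]
    if len(kept) > VHOST_USER_MAX_REGIONS:
        raise RuntimeError(
            f"{len(kept)} regions exceed the limit of {VHOST_USER_MAX_REGIONS}"
        )
    return kept


# Hypothetical guest: 4 shareable RAM sections plus 6 device sections.
guest = [Section(f"ram{i}", fd=10 + i) for i in range(4)]
guest += [Section(f"dev{i}", fd=None) for i in range(6)]

kept = usable_sections(guest, needs_section)
```

With the filter, only the four RAM sections count against the limit. Passing a filter that keeps everything (for example `lambda s: True`) makes the same guest exceed the limit and raises the error, which models the init failure described in this bug.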

Dan Streetman (ddstreet) on 2020-07-14
Changed in qemu (Ubuntu Bionic):
assignee: nobody → Dan Streetman (ddstreet)
importance: Undecided → Medium
status: New → In Progress
Changed in qemu (Ubuntu):
status: New → Fix Released
Dan Streetman (ddstreet) on 2020-07-14
description: updated

Patches look well reduced to fill the SRU purpose; both are already applied in later releases, so ack to this being for Bionic only.

The new filter function is internal and should not break external things (patch reviewed).

The biggest (if not the only, back at Bionic) user of vhost-user was DPDK-OVS, which is the target of the fix here. I would not expect many other components to regress (ack to regression potential).

I have an OVS-DPDK setup I could test this with, though not with the extra memory segments needed for the crash. But as a regression test it would serve well. I can also run a general qemu regression set against it then. But I'm out next week :-/. If this were in -proposed ~today I should be able to complete the tests in time.

Hello Dan, or anyone else affected,

Accepted qemu into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:2.11+dfsg-1ubuntu7.29 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in qemu (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-bionic
Robie Basak (racb) wrote :

Thank you for working on this. I've accepted this into bionic-proposed as I see no reason to hold it on the following points, but I would like the following cleared up please.

I don't understand why this bug needs to be fixed in Bionic from a user perspective. From the background it's clear to me that it is a problem that qualifies for an SRU, but all I can gather from this bug is that it's a problem with the vhost-user interface when there are more than 8 discontiguous memory regions. The bug you reference makes me none the wiser. Why does this create an impact to qemu users on Bionic? For example, is it that there's particular hardware where this is always the case? What's the actual *user* use case that's broken here, as distinct from a technical explanation of the root cause of the bug?

The second query is on the test case. Can you detail what steps would be carried out to test this even if you can't do it yourself?

We have an obligation to other users of qemu to clear this up as the SRU process is a community one, the results are consumed by the community, and if there is a regression then they deserve to know why we made changes, so I'd appreciate it if you could clear this up. Thanks!

Dan Streetman (ddstreet) on 2020-07-15
description: updated
Dan Streetman (ddstreet) wrote :

> Why does this create an impact to qemu users on Bionic? For example, is it that there's particular hardware where this is always the case? What's the actual *user* use case that's broken here, as distinct from a technical explanation of the root cause of the bug?

Sorry, I've updated the description to clarify that this causes affected qemu instances to fail to set up their networking, making them unusable.

> Can you detail what steps would be carried out to test this even if you can't do it yourself?

Setting up DPDK is complex and certainly outside the scope of a bug test case. I've updated the description to suggest a possible way to increase the number of mem regions.

description: updated

All autopkgtests for the newly accepted qemu (1:2.11+dfsg-1ubuntu7.29) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

vagrant-mutate/1.2.0-3 (armhf)
ubuntu-image/1.9+18.04ubuntu1 (i386)
systemd/237-3ubuntu10.41 (amd64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#qemu

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

@ddstreet:
systemd/ubuntu-image should be flaky - I have restarted them.
vagrant-mutate sometimes breaks due to external changes, which you might need to look into. For now I have restarted it as well.

Two things are concerning in excuses:
qemu (1:2.11+dfsg-1ubuntu7 to 1:2.11+dfsg-1ubuntu7.29)
^^ where is the former .28?

qemu-user-static/amd64 has unsatisfiable Built-Using on gcc-8 8.4.0-1ubuntu1~18.04
qemu-user-static/amd64 has unsatisfiable Built-Using on glib2.0 2.56.4-0ubuntu0.18.04.6
qemu-user-static/amd64 has unsatisfiable Built-Using on glibc 2.27-3ubuntu1.2
I don't see anything about that in http://launchpadlibrarian.net/488558105/qemu_1%3A2.11+dfsg-1ubuntu7.28_1%3A2.11+dfsg-1ubuntu7.29.diff.gz

Comparing
https://launchpadlibrarian.net/486520742/buildlog_ubuntu-bionic-amd64.qemu_1%3A2.11+dfsg-1ubuntu7.28_BUILDING.txt.gz
https://launchpadlibrarian.net/488663141/buildlog_ubuntu-bionic-amd64.qemu_1%3A2.11+dfsg-1ubuntu7.29_BUILDING.txt.gz

Since it is static it correctly collects the Built-Using.
.28: Built-Using: gcc-8 (= 8.4.0-1ubuntu1~18.04), glib2.0 (= 2.56.4-0ubuntu0.18.04.6), glibc (= 2.27-3ubuntu1), zlib (= 1:1.2.11.dfsg-0ubuntu2)
.29: Built-Using: gcc-8 (= 8.4.0-1ubuntu1~18.04), glib2.0 (= 2.56.4-0ubuntu0.18.04.6), glibc (= 2.27-3ubuntu1.2), zlib (= 1:1.2.11.dfsg-0ubuntu2)

All of that is available in bionic-updates; it almost seems that it compares against bionic-release instead of bionic-updates. Might there have been an issue in selecting the target pocket?
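
The two Built-Using fields above can also be compared mechanically. A minimal sketch, with hypothetical helper names, that parses each field and reports only the entries whose version changed:

```python
def parse_built_using(field):
    """Parse 'pkg (= ver), pkg (= ver)' into a {pkg: ver} dict."""
    result = {}
    for entry in field.split(","):
        name, _, rest = entry.strip().partition(" (= ")
        result[name] = rest.rstrip(")")
    return result


def changed_entries(old, new):
    """Return {pkg: (old_ver, new_ver)} for packages whose version differs."""
    a, b = parse_built_using(old), parse_built_using(new)
    return {
        pkg: (a.get(pkg), b.get(pkg))
        for pkg in sorted(a.keys() | b.keys())
        if a.get(pkg) != b.get(pkg)
    }


# Field values copied from the .28 and .29 build logs quoted above.
v28 = ("gcc-8 (= 8.4.0-1ubuntu1~18.04), glib2.0 (= 2.56.4-0ubuntu0.18.04.6), "
       "glibc (= 2.27-3ubuntu1), zlib (= 1:1.2.11.dfsg-0ubuntu2)")
v29 = ("gcc-8 (= 8.4.0-1ubuntu1~18.04), glib2.0 (= 2.56.4-0ubuntu0.18.04.6), "
       "glibc (= 2.27-3ubuntu1.2), zlib (= 1:1.2.11.dfsg-0ubuntu2)")

diff = changed_entries(v28, v29)
```

For these two fields the only differing entry is glibc, which matches the manual comparison.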

>
> > Can you detail what steps would be carried out to test this even if
> you can't do it yourself?
>
> Setting up DPDK is complex and certainly outside the scope of a bug test
> case. I've updated the description to suggest a possible way to increase
> the number of mem regions.
>

As I said above I can help with that, it is built in proposed and I'll
report back here once done with the test.

Dan Streetman (ddstreet) wrote :

> it is built in proposed and I'll report back here once done with the test.

Thanks @paelzer!

Also, the original bug reporter (to me) has tested the package in -proposed and verified it does fix the issue for them.

Dan Streetman (ddstreet) wrote :

> Two things are concerning in excuses:

Also for this: there was some discussion in #ubuntu-release, and it looks like this is unrelated to this bug. It is an issue in the latest update to the code that generates the update-excuses page.

All autopkgtest issues resolved - status green on that front now.

All other queued tests that needed this system to be on >=Focal are done; I am now redeploying it as Bionic for the DPDK test needed for this bug.

There are some more endurance tests going on which repeat things many times, but in terms of general regression testing we are good, and I've checked OVS-DPDK against the .29 version in bionic-proposed:
    3.2.1 (13:20:15): test guest-dpdk-vhost-user-singleq for OVSDPDK => Pass
    3.2.2 (13:39:53): test guest-dpdk-vhost-user-multiq for OVSDPDK-tuned => Pass
    3.2.3 (14:01:00): test guest-dpdk-vhost-user-client-multiq for OVSDPDK-VUC => Pass

Now someone needs to explicitly verify the case that was meant to be fixed.

And yes, I've spoken to Laney about update-excuses; this isn't a problem of this upload.

I think we are overall verified-done then, but I'll leave this decision to you @ddstreet.

Dan Streetman (ddstreet) wrote :

Thanks @paelzer! I rechecked with the original bug reporter again and the package in proposed definitely fixes the issue for them, so marking as verified.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.11+dfsg-1ubuntu7.29

---------------
qemu (1:2.11+dfsg-1ubuntu7.29) bionic; urgency=medium

  * allow vhost-user driver to ignore some unneeded mem regions,
    to stay under its api limit of 8 mem regions (LP: #1887525)
    - d/p/lp1887525/0001-vhost-fix-memslot-limit-check.patch
    - d/p/lp1887525/0002-vhost-allow-backends-to-filter-memory-sections.patch

 -- Dan Streetman <email address hidden> Tue, 14 Jul 2020 09:35:16 -0400

Changed in qemu (Ubuntu Bionic):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for qemu has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
