QEMU memfd_create fallback mechanism change for security drivers

Bug #1626972 reported by Rafael David Tinoco on 2016-09-23
This bug affects 4 people
Affects                 Importance  Assigned to
QEMU                    Undecided   Rafael David Tinoco
Ubuntu Cloud Archive    Undecided   Rafael David Tinoco
  (Declined for Newton by James Page)
  Mitaka                Undecided   Unassigned
qemu (Ubuntu)           Undecided   Rafael David Tinoco
  Xenial                Undecided   Rafael David Tinoco
  Yakkety               Undecided   Rafael David Tinoco
  Zesty                 Undecided   Rafael David Tinoco

Bug Description

[Impact]

 * Live migration with the updated QEMU (from UCA) doesn't work on 3.13 kernels.
 * QEMU's check for whether it can create /tmp/memfd-XXX files is wrong.
 * Apparmor blocks access to /tmp/ and QEMU fails to migrate.

[Test Case]

 * Install 2 Ubuntu Trusty (3.13) + UCA Mitaka + apparmor rules.
 * Try to live-migrate from one to the other.
 * Apparmor will block creation of /tmp/memfd-XXX files.

[Regression Potential]

 Pros:
 * Exhaustively tested this.
 * Worked with upstream on this fix.
 * I'm implementing new vhost log mechanism for upstream.
 * One line change to a blocker that is already broken.

 Cons:
 * Could break live migration in other circumstances.

[Other Info]

 * Christian Ehrhardt has been following this.

ORIGINAL DESCRIPTION:

When libvirt uses apparmor and creates an apparmor profile for every virtual machine on the compute nodes, Mitaka qemu (2.5, and upstream too) uses a fallback mechanism to create shared memory for live migrations. On 3.13 kernels, which don't have the memfd_create() system call, this fallback mechanism tries to create files in the /tmp/ directory and fails, so live migration does not work.

Trusty with kernel 3.13 + Mitaka with qemu 2.5 + apparmor capability = can't live migrate.

In qemu 2.5, the logic is in:

void *qemu_memfd_alloc(const char *name, size_t size, unsigned int seals, int *fd)
{
    if (memfd_create) ...        /* only works with HWE kernels */

    else {                       /* 3.13 kernels: blocked by apparmor */
        tmpdir = g_get_tmp_dir();
        ...
        mfd = mkstemp(fname);
    }
}
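
To make the two paths easier to compare, here is a minimal standalone sketch of the same idea in plain C. It is not QEMU's code: `alloc_shared_fd` and its parameters are made up for illustration, glib is replaced by libc calls, and sealing is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of QEMU): return a descriptor for `size`
 * bytes of shareable memory. Tries memfd_create() first and falls back to
 * an unlinked temporary file under `tmpdir`.
 */
int alloc_shared_fd(const char *name, const char *tmpdir, size_t size)
{
    int fd = -1;

    assert(name != NULL && tmpdir != NULL);

#ifdef SYS_memfd_create
    /* Kernels >= 3.17: anonymous file, nothing appears in the filesystem. */
    fd = (int)syscall(SYS_memfd_create, name, 0u);
#endif

    if (fd < 0) {
        /* Older kernels (e.g. 3.13): this mkstemp() is what apparmor denies. */
        char path[4096];
        int n = snprintf(path, sizeof(path), "%s/memfd-XXXXXX", tmpdir);

        if (n < 0 || (size_t)n >= sizeof(path)) {
            errno = ENAMETOOLONG;
            return -1;
        }
        fd = mkstemp(path);
        if (fd < 0) {
            return -1;
        }
        unlink(path); /* the name goes away, the descriptor stays valid */
    }

    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

On a confined guest with a 3.13 kernel, the `mkstemp()` branch is the one that would return -1, which matches the "failed to allocate shared memory" error below.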

And you can see the errors:

From the host trying to send the virtual machine:

2016-08-15 16:36:26.160 1974 ERROR nova.virt.libvirt.driver [req-0cac612b-8d53-4610-b773-d07ad6bacb91 691a581cfa7046278380ce82b1c38ddd 133ebc3585c041aebaead8c062cd6511 - - -] [instance: 2afa1131-bc8c-43d2-9c4a-962c1bf7723e] Migration operation has aborted
2016-08-15 16:36:26.248 1974 ERROR nova.virt.libvirt.driver [req-0cac612b-8d53-4610-b773-d07ad6bacb91 691a581cfa7046278380ce82b1c38ddd 133ebc3585c041aebaead8c062cd6511 - - -] [instance: 2afa1131-bc8c-43d2-9c4a-962c1bf7723e] Live Migration failure: internal error: unable to execute QEMU command 'migrate': Migration disabled: failed to allocate shared memory

From the host trying to receive the virtual machine:

Aug 15 16:36:19 tkcompute01 kernel: [ 1194.356794] type=1400 audit(1471289779.791:72): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-2afa1131-bc8c-43d2-9c4a-962c1bf7723e" pid=12565 comm="apparmor_parser"
Aug 15 16:36:19 tkcompute01 kernel: [ 1194.357048] type=1400 audit(1471289779.791:73): apparmor="STATUS" operation="profile_load" profile="unconfined" name="qemu_bridge_helper" pid=12565 comm="apparmor_parser"
Aug 15 16:36:20 tkcompute01 kernel: [ 1194.877027] type=1400 audit(1471289780.311:74): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-2afa1131-bc8c-43d2-9c4a-962c1bf7723e" pid=12613 comm="apparmor_parser"
Aug 15 16:36:20 tkcompute01 kernel: [ 1194.904407] type=1400 audit(1471289780.343:75): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="qemu_bridge_helper" pid=12613 comm="apparmor_parser"
Aug 15 16:36:20 tkcompute01 kernel: [ 1194.973064] type=1400 audit(1471289780.407:76): apparmor="DENIED" operation="mknod" profile="libvirt-2afa1131-bc8c-43d2-9c4a-962c1bf7723e" name="/tmp/memfd-tNpKSj" pid=12625 comm="qemu-system-x86" requested_mask="c" denied_mask="c" fsuid=107 ouid=107
Aug 15 16:36:20 tkcompute01 kernel: [ 1194.979871] type=1400 audit(1471289780.411:77): apparmor="DENIED" operation="open" profile="libvirt-2afa1131-bc8c-43d2-9c4a-962c1bf7723e" name="/tmp/" pid=12625 comm="qemu-system-x86" requested_mask="r" denied_mask="r" fsuid=107 ouid=0
Aug 15 16:36:20 tkcompute01 kernel: [ 1194.979881] type=1400 audit(1471289780.411:78): apparmor="DENIED" operation="open" profile="libvirt-2afa1131-bc8c-43d2-9c4a-962c1bf7723e" name="/var/tmp/" pid=12625 comm="qemu-system-x86" requested_mask="r" denied_mask="r" fsuid=107 ouid=0

When libvirt is left without apparmor capabilities (and so does not confine virtual machines on the compute nodes), live migration works as expected, so apparmor is clearly what blocks the live migration. I'm sure that virtual machines have to be confined, and that this isn't the desired behaviour...

I came up with this patch for QEMU:

http://paste.ubuntu.com/23217056/

I'm finishing the libvirt patch so I can propose the QEMU change upstream already knowing that libvirt will benefit from it. Right after that I'll propose the libvirt patch upstream (changing the virt-aa-helper logic).

And later:

Improved it a little bit: http://paste.ubuntu.com/23217333/

And fixed it:

http://paste.ubuntu.com/23219599/
(Probably the version to be suggested to upstream)

Changed in qemu:
status: New → In Progress
assignee: nobody → Rafael David Tinoco (inaddy)

Fixed it according to checkpatch.pl as stated in http://wiki.qemu.org/Contribute/SubmitAPatch.

http://paste.ubuntu.com/23220104/

Will submit to mailing list after testing everything.

Commit: 35f9b6ef3acc9d0546c395a566b04e63ca84e302 added a fallback
mechanism for systems not supporting memfd_create syscall (started
being supported since 3.17).

Backporting memfd_create might not be accepted for distros relying
on older kernels. Currently there is no way for a security driver
to discover the memfd filename to be created: <tmpdir>/memfd-XXXXXX.

It is more appropriate to include UUID and/or VM names in the
temporary filename, allowing security driver rules to be applied
while maintaining the required unpredictability with mkstemp.

This change will allow libvirt to know the exact memfd file to be created
for vhost log AND to create appropriate security rules to allow access
per instance (instead of an open rule like <tmpdir>/memfd-*).

Example of apparmor deny messages with this change:

Per VM UUID (preferred, generated automatically by libvirt):

kernel: [26632.154856] type=1400 audit(1474945148.633:78): apparmor=
"DENIED" operation="mknod" profile="libvirt-0b96011f-0dc0-44a3-92c3-
196de2efab6d" name="/tmp/memfd-0b96011f-0dc0-44a3-92c3-196de2efab6d-
qeHrBV" pid=75161 comm="qemu-system-x86" requested_mask="c" denied_
mask="c" fsuid=107 ouid=107

Per VM name (if no UUID is specified):

kernel: [26447.505653] type=1400 audit(1474944963.985:72): apparmor=
"DENIED" operation="mknod" profile="libvirt-00000000-0000-0000-0000-
000000000000" name="/tmp/memfd-instance-teste-osYpHh" pid=74648
comm="qemu-system-x86" requested_mask="c" denied_mask="c" fsuid=107
ouid=107

Signed-off-by: Rafael David Tinoco <email address hidden>
---
 util/memfd.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/util/memfd.c b/util/memfd.c
index 4571d1a..4b715ac 100644
--- a/util/memfd.c
+++ b/util/memfd.c
@@ -30,6 +30,9 @@
 #include <glib/gprintf.h>

 #include "qemu/memfd.h"
+#include "qmp-commands.h"
+#include "qemu-common.h"
+#include "sysemu/sysemu.h"

 #ifdef CONFIG_MEMFD
 #include <sys/memfd.h>
@@ -94,11 +97,32 @@ void *qemu_memfd_alloc(const char *name, size_t size, unsigned int seals,
             return NULL;
         }
     } else {
+        int ret = 0;
         const char *tmpdir = g_get_tmp_dir();
+        UuidInfo *uinfo;
+        NameInfo *ninfo;
         gchar *fname;

-        fname = g_strdup_printf("%s/memfd-XXXXXX", tmpdir);
+        uinfo = qmp_query_uuid(NULL);
+
+        ret = strcmp(uinfo->UUID, UUID_NONE);
+        if (ret == 0) {
+            ninfo = qmp_query_name(NULL);
+            if (ninfo->has_name) {
+                fname = g_strdup_printf("%s/memfd-%s-XXXXXX", tmpdir,
+                                        ninfo->name);
+            } else {
+                fname = g_strdup_printf("%s/memfd-XXXXXX", tmpdir);
+            }
+            qapi_free_NameInfo(ninfo);
+        } else {
+            fname = g_strdup_printf("%s/memfd-%s-XXXXXX", tmpdir,
+                                    uinfo->UUID);
+        }
+
         mfd = mkstemp(fname);
+
+        qapi_free_UuidInfo(uinfo);
         unlink(fname);
         g_free(fname);

--
2.9.3
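
For anyone who wants to try the naming rule outside of a QEMU tree, the following is a rough standalone sketch of it. `build_memfd_template` is a hypothetical name, the QMP queries are replaced by plain string arguments, and the `UUID_NONE` value is copied in so the block compiles on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* All-zero UUID, as seen in the second deny message above. */
#define UUID_NONE "00000000-0000-0000-0000-000000000000"

/*
 * Hypothetical standalone version of the naming logic in the patch:
 * prefer the UUID, then the VM name, else the old anonymous template.
 * Returns 0 on success, -1 if `buf` is too small.
 */
int build_memfd_template(char *buf, size_t len, const char *tmpdir,
                         const char *uuid, const char *vm_name)
{
    int n;

    assert(buf != NULL && tmpdir != NULL);

    if (uuid != NULL && strcmp(uuid, UUID_NONE) != 0) {
        n = snprintf(buf, len, "%s/memfd-%s-XXXXXX", tmpdir, uuid);
    } else if (vm_name != NULL && vm_name[0] != '\0') {
        n = snprintf(buf, len, "%s/memfd-%s-XXXXXX", tmpdir, vm_name);
    } else {
        n = snprintf(buf, len, "%s/memfd-XXXXXX", tmpdir);
    }
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting buffer is what would be handed to mkstemp(). The sketch does not guard against a VM name that contains `/`, which would break the template.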

I'll follow to see if patch was accepted upstream:

https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg06191.html
https://<email address hidden>/msg400892.html

On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote:
> Commit: 35f9b6ef3acc9d0546c395a566b04e63ca84e302 added a fallback
> mechanism for systems not supporting memfd_create syscall (started
> being supported since 3.17).

This is really dubious code in general and IMHO should just
be reverted.

We have a golden rule that any time QEMU needs to be able to
create a file on disk, then the path should be explicitly
provided as a command line argument so that mgmt apps can
control the location used.

> Backporting memfd_create might not be accepted for distros relying
> on older kernels. Nowadays there is no way for security driver
> to discover memfd filename to be created: <tmpdir>/memfd-XXXXXX.
>
> It is more appropriate to include UUID and/or VM names in the
> temporary filename, allowing security driver rules to be applied
> while maintaining the required unpredictability with mkstemp.

We should not have QEMU creating unpredictable filenames in the
first place - any filenames should be determined by libvirt
explicitly.

> This change will allow libvirt to know exact memfd file to be created
> for vhost log AND to create appropriate security rules to allow access
> per instance (instead of a opened rule like <tmpdir>/memfd-*).

Even with this change it is bad - we don't want driver backends
creating arbitrary files in the shared /tmp directory.

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

> On Sep 27, 2016, at 05:36, Daniel P. Berrange <email address hidden> wrote:
>
> On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote:
>> Commit: 35f9b6ef3acc9d0546c395a566b04e63ca84e302 added a fallback
>> mechanism for systems not supporting memfd_create syscall (started
>> being supported since 3.17).
>
> This is really dubious code in general and IMHO should just
> be reverted.

There are numerous people relying on older kernels in openstack
deployments - sometimes with specific drivers (ovswitch, dpdk,
infiniband) holding kernel upgrades - but still in need of upgrading
userland (e.g. newer releases). Having a fallback mechanism seems
appropriate for those cases.

>
> We have a golden rule that any time QEMU needs to be able to
> create a file on disk, then the path should be explicitly
> provided as a command line argument so that mgmt apps can
> control the location used.
>
>> Backporting memfd_create might not be accepted for distros relying
>> on older kernels. Nowadays there is no way for security driver
>> to discover memfd filename to be created: <tmpdir>/memfd-XXXXXX.
>>
>> It is more appropriate to include UUID and/or VM names in the
>> temporary filename, allowing security driver rules to be applied
>> while maintaining the required unpredictability with mkstemp.
>
> We should not have QEMU creating unpredictable filenames in the
> first place - any filenames should be determined by libvirt
> explicitly.

Note that the filename, per se, is not as important as other files,
since qemu won't provide it for being accessed by external programs, and,
deletes the file, while keeping the descriptor, right after its creation
(due to its nature, that is probably why it was created in /tmp).

Having libvirt define a filename that would not be used on recent
kernels (>= 3.17) and would exist for a fraction of a second doesn't seem
right to me.

>
>> This change will allow libvirt to know exact memfd file to be created
>> for vhost log AND to create appropriate security rules to allow access
>> per instance (instead of a opened rule like <tmpdir>/memfd-*).
>
> Even with this change it is bad - we don't want driver backends
> creating arbitrary files in the shared /tmp directory.

On the other hand, if we are creating a tmp file, like I said, I see
a benefit in keeping the unpredictability (mkstemp) while providing predictable
parts that allow the security driver to apply rules on a per-instance basis
(/tmp/memfd-UUID*, /tmp/memfd-VMname*).

Looking forward to a decision so I can backport correct behaviour
(with or without memfd file).

Thank you!

Best Regards,
Rafael

Hello!

> On Sep 27, 2016, at 08:13, Marc-André Lureau <email address hidden> wrote:
>
>> Note that the filename, per se, is not as important as other files,
>> since qemu won't provide it for being accessed by external programs, and,
>> deletes the file, while keeping the descriptor, right after its creation
>> (due to its nature, that is probably why it was created in /tmp).
>>
>> Having libvirt to define a filename that would not be used for recent
>> kernels (> 3.17) and would exist for a fraction of second doesn't seem
>> right to me.
>>
>
> There are other parts of qemu that rely on creating temporary files, and this seems to lack a bit of uniformity. Would it make sense to define a place where qemu could create those? Or setting TMPDIR should help too. Could libvirt set a per-vm TMPDIR with appropriate security rules?

You've got a point. With a per-VM TMPDIR we don't have to care about filenames in the security driver in future, while still securing them on a per-instance basis. I'll come back to you!

Thank you!

On Tue, Sep 27, 2016 at 07:13:55AM -0400, Marc-André Lureau wrote:
> Hi
>
> ----- Original Message -----
> >
> > > On Sep 27, 2016, at 05:36, Daniel P. Berrange <email address hidden> wrote:
> > >
> > > On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote:
> > > We should not have QEMU creating unpredictable filenames in the
> > > first place - any filenames should be determined by libvirt
> > > explicitly.
> >
> > Note that the filename, per se, is not as important as other files,
> > since qemu won't provide it for being accessed by external programs, and,
> > deletes the file, while keeping the descriptor, right after its creation
> > (due to its nature, that is probably why it was created in /tmp).
> >
> > Having libvirt to define a filename that would not be used for recent
> > kernels (> 3.17) and would exist for a fraction of second doesn't seem
> > right to me.
> >
>
> There are other parts of qemu that rely on creating temporary files, and
> this seems to lack a bit of uniformity. Would it make sense to define a
> place where qemu could create those? Or setting TMPDIR should help too.
> Could libvirt set a per-vm TMPDIR with appropriate security rules?

The other places that use mkstemp are block for snapshot=on, which
libvirt does not support as we want control over the filename. This
needs fixing by allowing a filename to be given. The qemu sockets code
uses it for auto-creating a UNIX domain socket path, but again libvirt
doesn't support that usage. The exec.c file uses it, but that honours
an explicit directory path provided on the command line. So this memfd
code really is the first place which is causing a real

Just setting TMPDIR per VM doesn't magically solve all these cases as
it isn't reasonable to assume that all these files should be in the
same location. Certainly block snapshot file will be somewhere different
from others, due to its size.

Regards,
Daniel

On Tue, Sep 27, 2016 at 11:01:10AM -0000, Rafael David Tinoco wrote:
> > On Sep 27, 2016, at 05:36, Daniel P. Berrange <email address hidden> wrote:
> >
> > On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote:
> >> Commit: 35f9b6ef3acc9d0546c395a566b04e63ca84e302 added a fallback
> >> mechanism for systems not supporting memfd_create syscall (started
> >> being supported since 3.17).
> >
> > This is really dubious code in general and IMHO should just
> > be reverted.
>
> There are numerous people relying on older kernels in openstack
> deployments - sometimes with specific drivers (ovswitch, dpdk,
> infiniband) holding kernel upgrades - but still in need of upgrading
> userland (e.g. newer releases). Having a fallback mechanism seems
> appropriate for those cases.

I'm not against some kind of fallback - just about the way it
silently creates files in /tmp.

>
> Note that the filename, per se, is not as important as other files,
> since qemu won't provide it for being accessed by external programs, and,
> deletes the file, while keeping the descriptor, right after its creation
> (due to its nature, that is probably why it was created in /tmp).

If it isn't shared with other processes, and is deleted immediately,
why does the file need to be on disk at all ?

Regards,
Daniel

Sorry, I was only able to come back to this today.

> On Sep 27, 2016, at 09:18, Daniel Berrange <email address hidden> wrote:
>
>> There are numerous people relying on older kernels in openstack
>> deployments - sometimes with specific drivers (ovswitch, dpdk,
>> infiniband) holding kernel upgrades - but still in need of upgrading
>> userland (e.g. newer releases). Having a fallback mechanism seems
>> appropriate for those cases.
>
> I'm not against some kind of fallback - just about the way it
> silently creates files in /tmp.
>

That is why memfd_create is used here, I suppose: to allow anonymous-backed pages to have a descriptor and to be sealed. When falling back from this mechanism I don't see any way other than creating a temporary file. Of course one way would be something like:

http://paste.ubuntu.com/23270379/

But this is pretty much the same, just solving the "where to place the temporary file" (non configurable for this usage).

>>
>> Note that the filename, per se, is not as important as other files,
>> since qemu won't provide it for being accessed by external programs, and,
>> deletes the file, while keeping the descriptor, right after its creation
>> (due to its nature, that is probably why it was created in /tmp).
>
> If it isn't shared with other processes, and is deleted immediately,
> why does the file need to be on disk at all ?

Well, it unlinks the file, but the references are still there as long as the descriptor isn't closed by this process or by the one that receives the descriptor (that is why the "unlink" happens so early).
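
A small sketch of that behaviour, assuming an ordinary local filesystem: after unlink() the link count drops to zero, yet the descriptor can still be written to and read from. `open_unlinked` and `link_count` are hypothetical helper names, not QEMU functions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a temp file under `tmpdir`, unlink it at once,
 * and return the still-open descriptor (-1 on error).
 */
int open_unlinked(const char *tmpdir)
{
    char path[4096];
    int n, fd;

    assert(tmpdir != NULL);
    n = snprintf(path, sizeof(path), "%s/memfd-XXXXXX", tmpdir);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        return -1;
    }
    fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    if (unlink(path) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Number of directory entries still pointing at the file behind `fd`. */
long link_count(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? (long)st.st_nlink : -1;
}
```

The storage is only released once every process holding a copy of the descriptor has closed it.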

If you check vhost_dev_log_resize(), it gets a *possibly* new vhost log (if a new size is given) and informs the vhost dev driver about the new log base (vhost_ops->vhost_set_log_base).

For vhost_user, this means that the file descriptors for vhost logs are likely going to be passed to the vhost backend (fds[] in vhost_user_set_log_base). This is just one example, not sure about others.

Probably the best approach here, like Marc-André said, is to create some sort of TMPDIR, perhaps set by libvirt?

>
> Regards,
> Daniel

Hello Marc,

> On Sep 27, 2016, at 08:13, Marc-André Lureau <email address hidden> wrote:
>
>>> On Tue, Sep 27, 2016 at 03:06:21AM +0000, Rafael David Tinoco wrote:
>>> We should not have QEMU creating unpredictable filenames in the
>>> first place - any filenames should be determined by libvirt
>>> explicitly.
>>
>> Note that the filename, per se, is not as important as other files,
>> since qemu won't provide it for being accessed by external programs, and,
>> deletes the file, while keeping the descriptor, right after its creation
>> (due to its nature, that is probably why it was created in /tmp).
>>
>> Having libvirt to define a filename that would not be used for recent
>> kernels (> 3.17) and would exist for a fraction of second doesn't seem
>> right to me.
>>
>
> There are other parts of qemu that rely on creating temporary files, and this seems to lack a bit of uniformity. Would it make sense to define a place where qemu could create those? Or setting TMPDIR should help too. Could libvirt set a per-vm TMPDIR with appropriate security rules?

Best move I can see. Only problem is that if we do that, we would have to create a fallback mechanism for when TMPDIR is not set. It would go back to /tmp ?
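
Just to make the question concrete, a sketch of what such a fallback could look like: use $TMPDIR when it is set and non-empty, otherwise go back to /tmp. `tmpdir_template` is a hypothetical helper and the "memfd-XXXXXX" suffix is simply reused from the current code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: write "<dir>/memfd-XXXXXX" into `buf`, where <dir>
 * is $TMPDIR if set and non-empty, else /tmp. Trailing slashes in $TMPDIR
 * are dropped. Returns 0 on success, -1 if `buf` is too small.
 */
int tmpdir_template(char *buf, size_t len)
{
    const char *dir = getenv("TMPDIR");
    size_t dlen;
    int n;

    assert(buf != NULL);

    if (dir == NULL || dir[0] == '\0') {
        dir = "/tmp"; /* fallback when nobody (e.g. libvirt) set TMPDIR */
    }

    dlen = strlen(dir);
    while (dlen > 1 && dir[dlen - 1] == '/') {
        dlen--;
    }

    n = snprintf(buf, len, "%.*s/memfd-XXXXXX", (int)dlen, dir);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With this shape, a management application that exports a per-VM TMPDIR gets per-VM files, and everyone else keeps today's /tmp behaviour.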

In my particular case (for 1 vhost log file):

-netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:5c:10:f2,bus=pci.0,addr=0x3

I could have something similar to:

-netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:5c:10:f2,bus=pci.0,addr=0x3,vhostpath=/var/lib/XXXX/YYYY/

and put mkstemp() files (one per vhost device) in there.

Even so, what to do when "vhostpath" is not provided?

I'm worried that right now there are security drivers either blocking the live migration entirely or allowing all instances to read /tmp/memfd-XXXX.

Don't you think we could push the first patch until we come up with a better approach for the tmp (and default tmp) files & directories? The patch is no worse than what was committed already.

Tks

Rafael


On Mon, Oct 03, 2016 at 03:41:10PM -0000, Rafael David Tinoco wrote:
> Sorry, I was only able to come back to this today.
>
> > On Sep 27, 2016, at 09:18, Daniel Berrange <email address hidden> wrote:
> >
> >> There are numerous people relying on older kernels in openstack
> >> deployments - sometimes with specific drivers (ovswitch, dpdk,
> >> infiniband) holding kernel upgrades - but still in need of upgrading
> >> userland (e.g. newer releases). Having a fallback mechanism seems
> >> appropriate for those cases.
> >
> > I'm not against some kind of fallback - just about the way it
> > silently creates files in /tmp.
> >
>
> That is why memfd_create is used here I suppose: To allow anonymous-
> backed-pages to have a descriptor and to be sealed. When falling back
> this mechanism I don't see any other way other than creating a temporary
> file. Of course one way would be something like:
>
> http://paste.ubuntu.com/23270379/
>
> But this is pretty much the same, just solving the "where to place the
> temporary file" (non configurable for this usage).
>
> >>
> >> Note that the filename, per se, is not as important as other files,
> >> since qemu won't provide it for being accessed by external programs, and,
> >> deletes the file, while keeping the descriptor, right after its creation
> >> (due to its nature, that is probably why it was created in /tmp).
> >
> > > If it isn't shared with other processes, and is deleted immediately,
> > why does the file need to be on disk at all ?
>
> Well, it unlinks the file but the references are still there while the
> descriptor isn't closed by this process, or by the one that receives the
> descriptor (that is why is the "unlink" so early).
>
> If you check vhost_dev_log_resize(), it gets *possible* new vhost log
> (if a new size is given) and informs the vhost dev driver about the new
> log base (vhost_ops->vhost_set_log_base).
>
> For vhost_user, this means that the file descriptors for vhost logs are
> likely going to be passed to vhost backend (fds[] in
> vhost_user_set_log_base). This is just one example, not sure about
> others.
>
> Probably the best approach here, like what Marc-André said, is to create
> some sort of TMPDIR, set by libvirt perhaps ?

So you're saying that the file descriptor here is actually getting
passed to a different process for it to use ?

If so that means we definitely do not want this in TMPDIR. If we
create a generic file in TMPDIR, then its going to have a generic
security label. That means that the other process we're giving the
FD to is going to have to be granted permission to access this FD
and we certainly don't want to grant permission for it to access
any of QEMU's other FDs. So for the SELinux integration, we'll
need this FD to be in a specific directory, so that we can setup
policy such that the file created gets given a specific SELinux
label. We can then grant the other process access to only that
particular file, and not anything else of QEMU's.

This makes me wonder about the memfd_create() code path too - we'll
again not want that external process to be granted access to arbitrary
FDs of QEMU's and I'm not sure of a way to get the m...


Hello Daniel,

> On Oct 03, 2016, at 14:55, Daniel P. Berrange <email address hidden> wrote:
>
>> Well, it unlinks the file but the references are still there while the
>> descriptor isn't closed by this process, or by the one that receives the
>> descriptor (that is why is the "unlink" so early).
>>
>> If you check vhost_dev_log_resize(), it gets *possible* new vhost log
>> (if a new size is given) and informs the vhost dev driver about the new
>> log base (vhost_ops->vhost_set_log_base).
>>
>> For vhost_user, this means that the file descriptors for vhost logs are
>> likely going to be passed to vhost backend (fds[] in
>> vhost_user_set_log_base). This is just one example, not sure about
>> others.
>>
>> Probably the best approach here, like what Marc-André said, is to create
>> some sort of TMPDIR, set by libvirt perhaps ?
>
> So you're saying that the file descriptor here is actually getting
> passed to a different process for it to use ?
>
> If so that means we definitely do not want this in TMPDIR. If we
> create a generic file in TMPDIR, then its going to have a generic
> security label. That means that the other process we're giving the
> FD to is going to have to be granted permission to access this FD
> and we certainly don't want to grant permission for it to access
> any of QEMU's other FDs. So for the SELinux integration, we'll
> need this FD to be in a specific directory, so that we can setup
> policy such that the file created gets given a specific SELinux
> label. We can then grant the other process access to only that
> particular file, and not anything else of QEMU's.
>
> This makes me wonder about the memfd_create() code path too - we'll
> again not want that external process to be granted access to arbitrary
> FDs of QEMU's and I'm not sure of a way to get the memfd FD to have
> a specific label. So I think it is possible that when using libvirt
> we'll want the ability to tell QEMU to *always* use an explicit file
> in a path libvirt specifies, and never use memfd even if available.

Check this execution path:

(vhost_vsock_device_realize)
  vhost_dev_init
  vhost_commit
  |- vhost_get_log_size
  |...
  |- vhost_dev_log_resize

(vhost_dev_log_resize):
  vhost_log_get -> here if the size is bigger, a new log is created
  dev->vhost_ops->vhost_set_log_base() -> kernel or user vhost driver
  vhost_log_put()

----

So,

* In case of the kernel mode, this is just a:

vhost in kernel mode = vhost_kernel_set_log_base
return vhost_kernel_call(dev, VHOST_SET_LOG_BASE, &base);

which makes an ioctl to dev->opaque file descriptor to set a new vhost log base.

* But in the case of user mode:

vhost in user mode = vhost_user_set_log_base

which gets the log file descriptor (log->fd) and gives it to vhost_user_write. vhost_user_write will do a qemu_chr_fe_set_msgfds, passing the log file descriptors to the backend vhost driver (CharDriverState).

If I'm reading this right.. if the backend driver is:

static int tcp_set_msgfds(CharDriverState *chr, int *fds, int num)

it would check for:

!qio_channel_has_feature(s->ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {

and configure s->write_msgfds. This would be sent in:

static int tcp_chr_writ...


Yes, definitely. Check this:

/**
 * @qemu_chr_fe_set_msgfds:
 *
 * For backends capable of fd passing, set an array of fds to be passed with
 * the next send operation.
 * A subsequent call to this function before calling a write function will
 * result in overwriting the fd array with the new value without being send.
 * Upon writing the message the fd array is freed.
 *
 * Returns: -1 if fd passing isn't supported.
 */
int qemu_chr_fe_set_msgfds(CharDriverState *s, int *fds, int num);

----

So, at least for vhost_dev_log_resize, this "interface" is being implemented:

vhost_user_set_log_base -> VhostUserMsg = VHOST_USER_SET_LOG_BASE

vhost_user_write (with the VHOST_USER_SET_LOG_BASE message):

- configures the file descriptors(... , fds, fd_num)
  qemu_chr_fe_set_msgfds

- writes them down the char driver
  qemu_chr_fe_write_all
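
For reference, on a UNIX domain socket this kind of fd passing is done with SCM_RIGHTS ancillary data. The sketch below shows the bare mechanism in plain C; `send_fd` and `recv_fd` are hypothetical helpers, not the QEMU char or channel code, and they pass a single descriptor only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send `fd` over UNIX socket `sock`. 0 on success. */
int send_fd(int sock, int fd)
{
    char byte = 0; /* stream sockets need at least one byte of real data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align; /* forces correct alignment of buf */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd from `sock`. Returns it, or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd; /* a new descriptor referring to the same open file */
}
```

The receiver ends up with its own descriptor for the same open file, so the early unlink on the sender side does not affect it.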

> On Oct 03, 2016, at 15:46, Rafael David Tinoco <email address hidden> wrote:
>
>> So you're saying that the file descriptor here is actually getting
>> passed to a different process for it to use ?

Daniel Berrange (berrange) wrote :

On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote:
> Yes, definitely. Check this:

[snip]

So in that case, I think we must add ability to specify an explicit path
that apps can use *regardless* of whether memfd support exists or not.
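
One possible reading of that rule, written as a tiny decision function. All names here are hypothetical, and refusing when neither a path nor memfd is available is an assumption of this sketch, not something decided in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum log_backing {
    LOG_BACKING_EXPLICIT_FILE, /* file under the path given by the mgmt app */
    LOG_BACKING_MEMFD,         /* memfd_create(), no path involved */
    LOG_BACKING_UNAVAILABLE    /* no path given and no memfd: refuse */
};

/*
 * An explicit directory always wins, even when memfd is supported.
 * Without one, memfd is used if present. There is no silent /tmp fallback.
 */
enum log_backing choose_log_backing(const char *explicit_dir, int have_memfd)
{
    if (explicit_dir != NULL && explicit_dir[0] != '\0') {
        return LOG_BACKING_EXPLICIT_FILE;
    }
    return have_memfd ? LOG_BACKING_MEMFD : LOG_BACKING_UNAVAILABLE;
}
```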

> > On Oct 03, 2016, at 15:46, Rafael David Tinoco <email address hidden> wrote:
> >
> >> So you're saying that the file descriptor here is actually getting
> >> passed to a different process for it to use ?
>

Regards,
Daniel

Let me work on it. I'll get back soon.

Tks Daniel.

> On Oct 04, 2016, at 05:36, Daniel P. Berrange <email address hidden> wrote:
>
> On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote:
>> Yes, definitely. Check this:
>
> [snip]
>
> So in that case, I think we must add ability to specify an explicit path
> that apps can use *regardless* of whether memfd support exists or not.
>
>>> On Oct 03, 2016, at 15:46, Rafael David Tinoco <email address hidden> wrote:
>>>
>>>> So you're saying that the file descriptor here is actually getting
>>>> passed to a different process for it to use ?

elmarco (marcandre-lureau) wrote :

Hi Rafael, Daniel,

On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco <
<email address hidden>> wrote:

> Let me work on it. I'll get back soon.
>
>
thanks for working on it, before that I have a few questions:

Tks Daniel.
>
> > On Oct 04, 2016, at 05:36, Daniel P. Berrange <email address hidden>
> wrote:
> >
> > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote:
> >> Yes, definitely. Check this:
> >
> > [snip]
> >
> > So in that case, I think we must add ability to specify an explicit path
> > that apps can use *regardless* of whether memfd support exists or not.
>

How will this path be used? Is it going to be global to qemu for various
use (kinda like $TMP), or per-device, or for memfd fallback only? Should
the path pre-exist? (I suppose, if not, qemu should clean it up when
leaving)

>
> >>> On Oct 03, 2016, at 15:46, Rafael David Tinoco <
> <email address hidden>> wrote:
> >>>
> >>>> So you're saying that the file descriptor here is actually getting
> >>>> passed to a different process for it to use ?
>
>
> --
Marc-André Lureau

Daniel Berrange (berrange) wrote :

On Tue, Oct 04, 2016 at 12:39:17PM +0000, Marc-André Lureau wrote:
> Hi Rafael, Daniel,
>
> On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco <
> <email address hidden>> wrote:
>
> > Let me work on it. I'll get back soon.
> >
> >
> thanks for working on it, before that I have a few questions:
>
> Tks Daniel.
> >
> > > On Oct 04, 2016, at 05:36, Daniel P. Berrange <email address hidden>
> > wrote:
> > >
> > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote:
> > >> Yes, definitely. Check this:
> > >
> > > [snip]
> > >
> > > So in that case, I think we must add ability to specify an explicit path
> > > > that apps can use *regardless* of whether memfd support exists or not.
> >
>
> How will this path be used? Is it going to be global to qemu for various
> use (kinda like $TMP), or per-device, or for memfd fallback only? Should
> the path pre-exist? (I suppose, if not, qemu should clean it up when
> leaving)

I'd expect it to be an option set against the vhost user backend, since
that's the thing using this.

If other things have similar usage needs wrt memfd in future, they would
also need similar path config option.

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

elmarco (marcandre-lureau) wrote :

Hi

On Tue, Oct 4, 2016 at 4:42 PM Daniel P. Berrange <email address hidden>
wrote:

> On Tue, Oct 04, 2016 at 12:39:17PM +0000, Marc-André Lureau wrote:
> > Hi Rafael, Daniel,
> >
> > On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco <
> > <email address hidden>> wrote:
> >
> > > Let me work on it. I'll get back soon.
> > >
> > >
> > thanks for working on it, before that I have a few questions:
> >
> > Tks Daniel.
> > >
> > > > On Oct 04, 2016, at 05:36, Daniel P. Berrange <email address hidden>
> > > wrote:
> > > >
> > > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote:
> > > >> Yes, definitely. Check this:
> > > >
> > > > [snip]
> > > >
> > > > So in that case, I think we must add ability to specify an explicit
> path
> > > > > that apps can use *regardless* of whether memfd support exists or not.
> > >
> >
> > How will this path be used? Is it going to be global to qemu for various
> > use (kinda like $TMP), or per-device, or for memfd fallback only? Should
> > the path pre-exist? (I suppose, if not, qemu should clean it up when
> > leaving)
>
> I'd expect it to be an option set against the vhost user backend, since
> that's the thing using this.
>
> If other things have similar usage needs wrt memfd in future, they would
> also need similar path config option.
>

The log may be shared if there are several vhost-user (stored in
vhost_log_shm global), so I think it makes more sense to have a global
config path for it, or you may end up duplicating that information per
vhost backend and having files in either of the specified paths.

>
>
> Regards,
> Daniel
> --
> |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/
> :|
> |: http://libvirt.org -o- http://virt-manager.org
> :|
> |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/
> :|
>
--
Marc-André Lureau

Daniel Berrange (berrange) wrote :

On Tue, Oct 04, 2016 at 01:10:17PM +0000, Marc-André Lureau wrote:
> Hi
>
> On Tue, Oct 4, 2016 at 4:42 PM Daniel P. Berrange <email address hidden>
> wrote:
>
> > On Tue, Oct 04, 2016 at 12:39:17PM +0000, Marc-André Lureau wrote:
> > > Hi Rafael, Daniel,
> > >
> > > On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco <
> > > <email address hidden>> wrote:
> > >
> > > > Let me work on it. I'll get back soon.
> > > >
> > > >
> > > thanks for working on it, before that I have a few questions:
> > >
> > > Tks Daniel.
> > > >
> > > > > On Oct 04, 2016, at 05:36, Daniel P. Berrange <email address hidden>
> > > > wrote:
> > > > >
> > > > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco wrote:
> > > > >> Yes, definitely. Check this:
> > > > >
> > > > > [snip]
> > > > >
> > > > > So in that case, I think we must add ability to specify an explicit
> > path
> > > > > > that apps can use *regardless* of whether memfd support exists or not.
> > > >
> > >
> > > How will this path be used? Is it going to be global to qemu for various
> > > use (kinda like $TMP), or per-device, or for memfd fallback only? Should
> > > the path pre-exist? (I suppose, if not, qemu should clean it up when
> > > leaving)
> >
> > I'd expect it to be an option set against the vhost user backend, since
> > that's the thing using this.
> >
> > If other things have similar usage needs wrt memfd in future, they would
> > also need similar path config option.
> >
>
> The log may be shared if there are several vhost-user (stored in
> vhost_log_shm global), so I think it makes more sense to have a global
> config path for it, or you may end up duplicating that information per
> vhost backend and having files in either of the specified paths.

Hmm, is there a reason why it is shared? That seems to make an assumption
that all vhost-user backends would be managed by the same external process.
While that may be the common case today, it doesn't feel like a reasonable
assumption to make long term. IOW it feels wiser to have it set per-NIC
unless I'm missing something important that means it must be shared ?

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

> On Oct 04, 2016, at 10:10, Marc-André Lureau <email address hidden> wrote:
>
> > How will this path be used? Is it going to be global to qemu for various
> > use (kinda like $TMP), or per-device, or for memfd fallback only? Should
> > the path pre-exist? (I suppose, if not, qemu should clean it up when
> > leaving)
>
> I'd expect it to be an option set against the vhost user backend, since
> that's the thing using this.
>
> If other things have similar usage needs wrt memfd in future, they would
> also need similar path config option.

I was going for that approach. I could have something similar to:

-netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:5c:10:f2,bus=pci.0,addr=0x3,vhostpath=/var/lib/XXXX/YYYY/

> The log may be shared if there are several vhost-user (stored in vhost_log_shm global), so I think it makes more sense to have a global config path for it, or you may end up duplicating that information per vhost backend and having files in either of the specified paths.

But yes, vhost_log_shm does make that approach tricky if the same log file is shared by multiple vhost backends. Besides, tools like OpenStack would end up putting all the vhost log files in the same place anyway.

A global config path that must be specified, or else the vhost log isn't created (just as happens today when allocation fails), seems to be the right approach.

True.

What about having a single config parameter naming a place to put all vhost logs for all drives of a single instance? We would remove the memfd implementation, along with the memfd shared_memory option, and replace it with an open+unlink+ftruncate+mmap approach only.

This way every device would get its own log file and vhost-user backends would be able to get their file descriptors (and, of course, the security drivers could do their jobs).
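
To make the open+unlink+ftruncate+mmap idea concrete, here is a minimal sketch. It is not the draft patch: the names log_file_alloc() and log_file_free() and the "vhost-log-XXXXXX" template are assumptions made up for illustration, and a real version would report errors through QEMU's own error API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not QEMU code): back a shared mapping with an
 * unlinked file created inside `dir`. Returns the mapping and stores the
 * descriptor in *fd, or returns NULL on failure.
 */
void *log_file_alloc(const char *dir, size_t size, int *fd)
{
    char path[PATH_MAX];
    void *ptr;
    int n;

    assert(dir != NULL && fd != NULL && size > 0);

    n = snprintf(path, sizeof(path), "%s/vhost-log-XXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        return NULL;
    }

    *fd = mkstemp(path);                    /* open: unique file in dir */
    if (*fd < 0) {
        return NULL;
    }
    unlink(path);                           /* unlink: fd stays valid */

    if (ftruncate(*fd, (off_t)size) < 0) {  /* ftruncate: set the size */
        close(*fd);
        *fd = -1;
        return NULL;
    }

    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, *fd, 0);
    if (ptr == MAP_FAILED) {
        close(*fd);
        *fd = -1;
        return NULL;
    }
    return ptr;
}

/* Release the mapping and the descriptor obtained from log_file_alloc(). */
void log_file_free(void *ptr, size_t size, int fd)
{
    if (ptr != NULL) {
        munmap(ptr, size);
    }
    if (fd >= 0) {
        close(fd);
    }
}
```

Because the file is created under a directory chosen by the caller, a security driver can be given a rule for that directory instead of for /tmp, and the descriptor can still be handed to a vhost-user backend after the name has been unlinked.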

>> On Oct 04, 2016, at 10:25, Daniel P. Berrange <email address hidden> wrote:
>>
>> Hmm, is there a reason why it is shared? That seems to make an assumption
>> that all vhost-user backends would be managed by the same external process.
>> While that may be the common case today, it doesn't feel like a reasonable
>> assumption to make long term. IOW it feels wiser to have it set per-NIC
>> unless I'm missing something important that means it must be shared ?
>

elmarco (marcandre-lureau) wrote :

Hi

On Tue, Oct 4, 2016 at 5:25 PM Daniel P. Berrange <email address hidden>
wrote:

> On Tue, Oct 04, 2016 at 01:10:17PM +0000, Marc-André Lureau wrote:
> > Hi
> >
> > On Tue, Oct 4, 2016 at 4:42 PM Daniel P. Berrange <email address hidden>
> > wrote:
> >
> > > On Tue, Oct 04, 2016 at 12:39:17PM +0000, Marc-André Lureau wrote:
> > > > Hi Rafael, Daniel,
> > > >
> > > > On Tue, Oct 4, 2016 at 4:22 PM Rafael David Tinoco <
> > > > <email address hidden>> wrote:
> > > >
> > > > > Let me work on it. I'll get back soon.
> > > > >
> > > > >
> > > > thanks for working on it, before that I have a few questions:
> > > >
> > > > Tks Daniel.
> > > > >
> > > > > > On Oct 04, 2016, at 05:36, Daniel P. Berrange <
> <email address hidden>>
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Oct 03, 2016 at 04:15:55PM -0300, Rafael David Tinoco
> wrote:
> > > > > >> Yes, definitely. Check this:
> > > > > >
> > > > > > [snip]
> > > > > >
> > > > > > So in that case, I think we must add ability to specify an
> explicit
> > > path
> > > > > > > that apps can use *regardless* of whether memfd support exists or
> not.
> > > > >
> > > >
> > > > How will this path be used? Is it going to be global to qemu for
> various
> > > > use (kinda like $TMP), or per-device, or for memfd fallback only?
> Should
> > > > the path pre-exist? (I suppose, if not, qemu should clean it up when
> > > > leaving)
> > >
> > > I'd expect it to be an option set against the vhost user backend, since
> > > that's the thing using this.
> > >
> > > If other things have similar usage needs wrt memfd in future, they
> would
> > > also need similar path config option.
> > >
> >
> > The log may be shared if there are several vhost-user (stored in
> > vhost_log_shm global), so I think it makes more sense to have a global
> > config path for it, or you may end up duplicating that information per
> > vhost backend and having files in either of the specified paths.
>
> Hmm, is there a reason why it is shared? That seems to make an assumption
> that all vhost-user backends would be managed by the same external process.
> While that may be the common case today, it doesn't feel like a reasonable
> assumption to make long term. IOW it feels wiser to have it set per-NIC
> unless I'm missing something important that means it must be shared ?
>
>
It's a shared log, just like they share the same ram. Duplicating the log
would mostly make migration more difficult to handle and increase a bit
memory usage.

> Regards,
> Daniel
> --
> |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/
> :|
> |: http://libvirt.org -o- http://virt-manager.org
> :|
> |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/
> :|
>
--
Marc-André Lureau

elmarco (marcandre-lureau) wrote :

On Tue, Oct 4, 2016 at 5:34 PM Rafael David Tinoco <
<email address hidden>> wrote:

> True.
>
> What about having a single config parameter as a place to put all vhost
> logs for all drives for a single instance ? Remove the memfd implementation
> with all the memfd shared_memory option ? Replace it with a
> open+unlink+ftruncate+mmap approach only.
>
>
I fail to see your point, memfd is superior to open+unlink and has other
advantages with sealing etc.

Regarding shared log, see my previous reply to Daniel.
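
For readers unfamiliar with sealing, the following is a small sketch of what memfd offers over a plain unlinked file. The helper name sealed_memfd_create() and the choice of seals are assumptions for illustration only; this is not QEMU's qemu_memfd_alloc().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an anonymous memfd of `size` bytes and seal
 * it so that a peer receiving the descriptor cannot resize it. Returns the
 * fd, or -1 on failure (for example ENOSYS on kernels without
 * memfd_create).
 */
int sealed_memfd_create(const char *name, size_t size)
{
    int fd;

    assert(name != NULL && size > 0);

    fd = memfd_create(name, MFD_ALLOW_SEALING | MFD_CLOEXEC);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0 ||
        fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Once the seals are applied, any later ftruncate() on the descriptor fails, which is a guarantee an ordinary unlinked file cannot give to the process on the other end.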

> This way every device would get its own log file and vhost-user backends
> would be able to get its file descriptors. (and, of course, allow the
> security drivers to do their jobs).
>
> >> On Oct 04, 2016, at 10:25, Daniel P. Berrange <email address hidden>
> wrote:
> >>
> >> Hmm, is there a reason why it is shared? That seems to make an
> assumption
> >> that all vhost-user backends would be managed by the same external
> process.
> >> While that may be the common case today, it doesn't feel like a
> reasonable
> >> assumption to make long term. IOW it feels wiser to have it set per-NIC
> >> unless I'm missing something important that means it must be shared ?
> >
>
> --
Marc-André Lureau

> On Oct 04, 2016, at 10:50, Marc-André Lureau <email address hidden> wrote:
>
> What about having a single config parameter as a place to put all vhost logs for all drives for a single instance ? Remove the memfd implementation with all the memfd shared_memory option ? Replace it with a open+unlink+ftruncate+mmap approach only.
>
>
> I fail to see your point, memfd is superior to open+unlink and has other advantages with sealing etc.

I was just summarising needs based on previous statement from Daniel:

> This makes me wonder about the memfd_create() code path too - we'll
> again not want that external process to be granted access to arbitrary
> FDs of QEMU's and I'm not sure of a way to get the memfd FD to have
> a specific label. So I think it is possible that when using libvirt
> we'll want the ability to tell QEMU to *always* use an explicit file
> in a path libvirt specifies, and never use memfd even if available.
>
> Regards,
> Daniel

Hello again. I could finally get back to this, and...

I was finishing a patch that adds the open+truncate+mmap+unlink mechanism for files specified by a "vhostlog" parameter on tap devices. The patch is done; the problem is that "memfd" appears to be used only for shared logs, AND vhost-net (used for tap devices) doesn't use it.

In the following...

(scenario 1)

Linux kvm01 4.8.0-22-generic #24-Ubuntu SMP Sat Oct 8 09:15:00 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

with:
-netdev tap,id=net0,vhost=on
-device virtio-net-pci,netdev=net0,id=net0,mac=52:54:00:20:c5:42,bus=pci.0,addr=0x3

## kvm01

$ ./instance.sh
qemu_memfd_check
qemu_memfd_alloc: enter
qemu_memfd_alloc: memfd_create with no sealing
qemu_memfd_alloc: memfd_create worked, truncating...
qemu_memfd_alloc: mmaping
qemu_memfd_free: enter
qemu_memfd_check: ok
vhost_dev_start: enter
vhost_log_get: enter
vhost_log_alloc: enter
vhost_log_alloc: local
vhost_log_get: not shared
vhost_log_put: enter
vhost_log_put: enter
vhost_log_put: local free

(qemu) migrate -d tcp:kvm02:4444
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compres
Migration status: completed
total time: 14586 milliseconds
downtime: 10 milliseconds
setup: 20 milliseconds
transferred ram: 377224 kbytes
throughput: 212.02 mbps
remaining ram: 0 kbytes
total ram: 4001544 kbytes
duplicate: 908879 pages
skipped: 0 pages
normal: 92129 pages
normal bytes: 368516 kbytes
dirty sync count: 4

## kvm02

$ ./instance.sh
qemu_memfd_check
qemu_memfd_alloc: enter
qemu_memfd_alloc: memfd_create with no sealing
qemu_memfd_alloc: memfd_create worked, truncating...
qemu_memfd_alloc: mmaping
qemu_memfd_free: enter
qemu_memfd_check: ok
vhost_dev_start: enter

(scenario 2)

Linux kvm01 3.13.0-99-generic #146-Ubuntu SMP Wed Oct 12 20:56:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

with:
-netdev tap,id=net0,vhost=on
-device virtio-net-pci,netdev=net0,id=net0,mac=52:54:00:20:c5:42,bus=pci.0,addr=0x3

## kvm01

$ ./instance.sh
qemu_memfd_check
qemu_memfd_alloc: enter
qemu_memfd_alloc: memfd_create with no sealing
qemu_memfd_alloc: memfd_create failed #2
qemu_memfd_alloc: fallback
qemu_memfd_alloc: fname = /tmp/memfd-XXXXXX
qemu_memfd_alloc: fallback truncating
qemu_memfd_alloc: mmaping
qemu_memfd_free
qemu_memfd_check: ok
vhost_dev_start: enter
vhost_log_get: enter
vhost_log_alloc: enter
vhost_log_alloc: local
vhost_log_get: not shared
vhost_log_put: enter
vhost_log_put: enter
vhost_log_put: local free

(qemu) migrate -d tcp:kvm02:4444
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compres
Migration status: completed
total time: 15400 milliseconds
downtime: 9 milliseconds
setup: 5 milliseconds
transferred ram: 375812 kbytes
throughput: 199.99 mbps
remaining ram: 0 kbytes
total ram: 4001544 kbytes
duplicate: 909186 pages
skipped: 0 pages
normal: 91776 pages
normal bytes: 367104 kbytes
dirty sync count: 3

## kvm02

$ ./instance.sh
qemu_memfd_check
qemu_memfd_alloc: enter
qemu_memfd_alloc: memfd_create with no sealing
qemu_memfd_alloc: memfd_create failed #2
qemu_memfd_alloc: fallback
qemu_memfd_alloc: fname = /tmp/memfd-XXXXXX
qemu_memfd_alloc: fallbac...

The correct (and draft) one:
http://pastebin.ubuntu.com/23357210/

I'm passing the vhostlog parameter as "hdev->log_filename" so that it can be accessed both from the net_init_tap()-> functions AND from the vhost_dev_start()-> functions. This way I no longer have to change function prototypes.

> On Oct 21, 2016, at 01:03, Rafael David Tinoco <email address hidden> wrote:
>
> Also, if possible, I would like comments about a draft:
>
> https://pastebin.canonical.com/168579/
> (please disregard printfs and minor problems)

elmarco (marcandre-lureau) wrote :

Hi

On Fri, Oct 21, 2016 at 6:03 AM Rafael David Tinoco <
<email address hidden>> wrote:

> Judging by the outputs above, looks like vhost_dev_log_is_shared is
> returning false, making (2) - vhost_dev_start - to use a different log
> allocation (malloc) than the one that was tested for allowing migrations at
> (1) - vhost_dev_init.
>
>
correct

> Question: Why to check for "memfd" when its not sure - yet - if a shared
> descriptor and memory pointer is going to be needed for the migration to
> happen ? Do you want me to

It's done early enough to disable migration.

> change that ? If memfd fails, but, the guest in question is using regular
> "malloc" for vhost log, we are marking it unable to live migrate by
> mistake. I could check for vhost_requires_shm_log pointer during
> vhost_dev_init (coming from tap).
>
>
Right, it should be done only if vhost_dev_log_is_shared is true. Patch
welcome
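
The agreed condition can be modelled in isolation as below. This is only a sketch: struct blocker_inputs and migration_blocker() are hypothetical stand-ins for the state that vhost_dev_init() consults, and the two message strings are copied from the existing code in hw/virtio/vhost.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of what vhost_dev_init() looks at. */
struct blocker_inputs {
    bool has_log_all;    /* backend offers VHOST_F_LOG_ALL */
    bool log_is_shared;  /* backend needs a shared, fd-backed log */
    bool memfd_ok;       /* shared-memory allocation probe succeeded */
};

/* Returns the blocker message, or NULL when migration may proceed. */
const char *migration_blocker(const struct blocker_inputs *in)
{
    assert(in != NULL);

    if (!in->has_log_all) {
        return "Migration disabled: vhost lacks VHOST_F_LOG_ALL feature.";
    }
    /* Only probe shared memory when the log really has to be shared. */
    if (in->log_is_shared && !in->memfd_ok) {
        return "Migration disabled: failed to allocate shared memory";
    }
    return NULL;
}
```

With this ordering, a backend whose log is allocated locally is no longer blocked just because the shared-memory probe failed.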

> Also, if possible, I would like comments about a draft:
>
> https://pastebin.canonical.com/168579/
> (please disregard printfs and minor problems)
>
> OBS: I'm basically removing fallback mechanism from memfd, creating a
> generic qemu_mmap_XXX implementation, adding a vhostlog parameter in tap
> cmdline AND changing the decision on what to use: if vhostlog is present in
> cmdline, qemu_mmap_XXX on vhostlog is used. If it is a directory, a random
> file is created inside it. If it is a file, the file is used. If no
> vhostlog is given (default while libvirt isn't changed), it tries first to
> use memfd (all newer kernels), and, if not possible, it tries to fallback
> using the qemu_mmap mechanism on "tmp" directory creating random files.
>

Sounds reasonable, but I am not sure so many fallbacks are necessary. I
would just have an optional filename.
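
For reference, the decision tree described in the quoted OBS paragraph could be sketched as follows. The enum and function names are invented for illustration and do not come from the draft; treating every non-directory path as a file name is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical names, for illustration only. */
enum log_backing {
    LOG_BACKING_MEMFD_THEN_TMP,  /* no vhostlog given on the command line */
    LOG_BACKING_RANDOM_IN_DIR,   /* vhostlog names an existing directory */
    LOG_BACKING_NAMED_FILE       /* anything else: treat it as a file path */
};

enum log_backing classify_vhostlog(const char *vhostlog)
{
    struct stat st;

    if (vhostlog == NULL) {
        return LOG_BACKING_MEMFD_THEN_TMP;
    }
    if (stat(vhostlog, &st) == 0 && S_ISDIR(st.st_mode)) {
        return LOG_BACKING_RANDOM_IN_DIR;
    }
    return LOG_BACKING_NAMED_FILE;
}
```

Accepting only a file name, as suggested above, would remove the directory branch and leave a two-way choice.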

>
> PS: Remember that this is because selinux/apparmor labelling on tmp files
> (and because file descriptors can be passed away, like we discussed before).
>
> If that is okay I'll provide a patch asap. Let me know if you prefer
> something else.
>

OK, I hope there will be other comments on the idea, and I'll review your
patch once it is on the ML.

Thanks
--
Marc-André Lureau

Commit 31190ed7 added a migration blocker in vhost_dev_init() to
check if memfd would succeed. It is better if this blocker first
checks if vhost backend requires shared log. This will avoid a
situation where a blocker is added inappropriately (e.g. shared
log allocation fails when vhost backend doesn't support it).

Commit 35f9b6e added a fallback mechanism for systems that do not
support the memfd_create syscall (available since kernel 3.17).

Backporting memfd_create might not be accepted by distros relying
on older kernels. Currently, a security driver has no way to discover
the name of the fallback file to be created: <tmpdir>/memfd-XXXXXX.

Also, because vhost log file descriptors can be passed to other
processes, after discussion, we thought it is best to back mmap by
using files that can be placed into a specific directory: this commit
creates "vhostlog" argv parameter for such purpose. This will allow
security drivers to operate on those files appropriately.

Argv examples:

    -netdev tap,id=net0,vhost=on
    -netdev tap,id=net0,vhost=on,vhostlog=/tmp/guest.log
    -netdev tap,id=net0,vhost=on,vhostlog=/tmp

For vhost backends supporting shared logs, if vhostlog is not given or
names a directory, randomly named files are created in that directory
(or in tmpdir when vhostlog is not given). If vhostlog names a file,
that file path is always used when allocating vhost log files.

Signed-off-by: Rafael David Tinoco <email address hidden>
---
 hw/net/vhost_net.c        |   4 +-
 hw/scsi/vhost-scsi.c      |   2 +-
 hw/virtio/vhost-vsock.c   |   2 +-
 hw/virtio/vhost.c         |  41 +++++++------
 include/hw/virtio/vhost.h |   4 +-
 include/net/vhost_net.h   |   1 +
 include/qemu/mmap-file.h  |  10 +++
 net/tap.c                 |   6 ++
 qapi-schema.json          |   3 +
 qemu-options.hx           |   3 +-
 util/Makefile.objs        |   1 +
 util/mmap-file.c          | 153 ++++++++++++++++++++++++++++++++++++++++++++++
 12 files changed, 207 insertions(+), 23 deletions(-)
 create mode 100644 include/qemu/mmap-file.h
 create mode 100644 util/mmap-file.c

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index f2d49ad..d650c92 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -171,8 +171,8 @@ struct vhost_net *vhost_net_init(VhostNetOptions *options)
         net->dev.vq_index = net->nc->queue_index * net->dev.nvqs;
     }

-    r = vhost_dev_init(&net->dev, options->opaque,
-                       options->backend_type, options->busyloop_timeout);
+    r = vhost_dev_init(&net->dev, options->opaque, options->backend_type,
+                       options->busyloop_timeout, options->vhostlog);
     if (r < 0) {
         goto fail;
     }
diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
index 5b26946..5dc3d30 100644
--- a/hw/scsi/vhost-scsi.c
+++ b/hw/scsi/vhost-scsi.c
@@ -248,7 +248,7 @@ static void vhost_scsi_realize(DeviceState *dev, Error **errp)
     s->dev.backend_features = 0;

     ret = vhost_dev_init(&s->dev, (void *)(uintptr_t)vhostfd,
-                         VHOST_BACKEND_TYPE_KERNEL, 0);
+                         VHOST_BACKEND_TYPE_KERNEL, 0, NULL);
     if (ret < 0) {
         error_setg(errp, "vhost-scsi:...

Commit 31190ed7 added a migration blocker in vhost_dev_init() to
check if memfd would succeed. It is better if this blocker first
checks if vhost backend requires shared log. This will avoid a
situation where a blocker is added inappropriately (e.g. shared
log allocation fails when vhost backend doesn't support it).
---
 hw/virtio/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index bd051ab..742d0aa 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1122,7 +1122,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
         if (!(hdev->features & (0x1ULL << VHOST_F_LOG_ALL))) {
             error_setg(&hdev->migration_blocker,
                        "Migration disabled: vhost lacks VHOST_F_LOG_ALL feature.");
-        } else if (!qemu_memfd_check()) {
+        } else if (vhost_dev_log_is_shared(hdev) && !qemu_memfd_check()) {
             error_setg(&hdev->migration_blocker,
                        "Migration disabled: failed to allocate shared memory");
         }
--
2.9.3

> Begin forwarded message:
>
> From: Marc-André Lureau <email address hidden>
> Subject: Re: [Qemu-devel] [PATCH] vhost: secure vhost shared log files using argv paremeter
> Date: October 22, 2016 at 05:18:02 GMT-2
> To: Rafael David Tinoco <email address hidden>
> Cc: QEMU <email address hidden>
>
> Hi
>
> On Sat, Oct 22, 2016 at 10:01 AM Rafael David Tinoco <<email address hidden> <mailto:<email address hidden>>> wrote:
> Commit 31190ed7 added a migration blocker in vhost_dev_init() to
> check if memfd would succeed. It is better if this blocker first
> checks if vhost backend requires shared log. This will avoid a
> situation where a blocker is added inappropriately (e.g. shared
> log allocation fails when vhost backend doesn't support it).
>
> Could you make this a separate patch?
>
> Commit: 35f9b6e added a fallback mechanism for systems not supporting
> memfd_create syscall (started being supported since 3.17).
>
> Backporting memfd_create might not be accepted for distros relying
> on older kernels. Nowadays there is no way for security driver
> to discover memfd filename to be created: <tmpdir>/memfd-XXXXXX.
>
> Also, because vhost log file descriptors can be passed to other
> processes, after discussion, we thought it is best to back mmap by
> using files that can be placed into a specific directory: this commit
> creates "vhostlog" argv parameter for such purpose. This will allow
> security drivers to operate on those files appropriately.
>
> Argv examples:
>
> -netdev tap,id=net0,vhost=on
> -netdev tap,id=net0,vhost=on,vhostlog=/tmp/guest.log
> -netdev tap,id=net0,vhost=on,vhostlog=/tmp
>
> Could it be only a filename? This would simplify testing.
>
>
> For vhost backends supporting shared logs, if vhostlog is non-existent,
> or a directory, random files are going to be created in the specified
> directory (or, for non-existent, in tmpdir). If vhostlog is specified,
> the filepath is always used when allocating vhost log files.
>
>
> Regarding testing, you add utility code mmap-file, could you make this a separate commit, with unit tests?
>
> thanks
>
> Signed-off-by: Rafael David Tinoco <<email address hidden> <mailto:<email address hidden>>>
> ---
> hw/net/vhost_net.c | 4 +-
> hw/scsi/vhost-scsi.c | 2 +-
> hw/virtio/vhost-vsock.c | 2 +-
> hw/virtio/vhost.c | 41 +++++++------
> include/hw/virtio/vhost.h | 4 +-
> include/net/vhost_net.h | 1 +
> include/qemu/mmap-file.h | 10 +++
> net/tap.c | 6 ++
> qapi-schema.json | 3 +
> qemu-options.hx | 3 +-
> util/Makefile.objs | 1 +
> util/mmap-file.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++
> 12 files changed, 207 insertions(+), 23 deletions(-)
> create mode 100644 include/qemu/mmap-file.h
> create mode 100644 util/mmap-file.c
>
> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> index f2d49ad..d650c92 100644
> --- a/hw/net/vhost_net.c
> +++ b/hw/net/vhost_net.c
> @@ -171,8 +171,8 @@ struct vhost_net *vhost_net_init(VhostNetOptions *options)
> net->dev.vq_index = net->nc->queue_index * net->d...

> Begin forwarded message:
>
> From: Rafael David Tinoco <email address hidden>
> Subject: Re: [Qemu-devel] [PATCH] vhost: secure vhost shared log files using argv paremeter
> Date: October 22, 2016 at 19:52:31 GMT-2
> To: Marc-André Lureau <email address hidden>
> Cc: Rafael David Tinoco <email address hidden>, qemu-devel <email address hidden>
>
> Hello,
>
>> On Oct 22, 2016, at 05:18, Marc-André Lureau <email address hidden> wrote:
>>
>> Hi
>>
>> On Sat, Oct 22, 2016 at 10:01 AM Rafael David Tinoco <email address hidden> wrote:
>> Commit 31190ed7 added a migration blocker in vhost_dev_init() to
>> check if memfd would succeed. It is better if this blocker first
>> checks if vhost backend requires shared log. This will avoid a
>> situation where a blocker is added inappropriately (e.g. shared
>> log allocation fails when vhost backend doesn't support it).
>>
>> Could you make this a separate patch?
>
> Just did, in another e-mail, cc'ing you.
>
>> Argv examples:
>>
>> -netdev tap,id=net0,vhost=on
>> -netdev tap,id=net0,vhost=on,vhostlog=/tmp/guest.log
>> -netdev tap,id=net0,vhost=on,vhostlog=/tmp
>>
>> Could it be only a filename? This would simplify testing.
>
> It could. Should I keep the /tmp/<random> logic if no vhostlog arg is present? Or do you think it should fail if no arg is given? I'm concerned about backward compatibility when backporting this to older qemu versions in stable releases (in my case, I'll backport this to ~3 different versions).
>
>> For vhost backends supporting shared logs, if vhostlog is non-existent,
>> or a directory, random files are going to be created in the specified
>> directory (or, for non-existent, in tmpdir). If vhostlog is specified,
>> the filepath is always used when allocating vhost log files.
>>
>>
>> Regarding testing, you add utility code mmap-file, could you make this a separate commit, with unit tests?
>>
>
> Sure, I'll work on it.
>
>> thanks
>
> Thank u!
>
> -Rafael Tinoco

Commit 31190ed7 added a migration blocker in vhost_dev_init() to
check if memfd would succeed. It is better if this blocker first
checks if vhost backend requires shared log. This will avoid a
situation where a blocker is added inappropriately (e.g. shared
log allocation fails when vhost backend doesn't support it).

Signed-off-by: Rafael David Tinoco <email address hidden>
Reviewed-by: Marc-André Lureau <email address hidden>
---
 hw/virtio/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index bd051ab..742d0aa 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1122,7 +1122,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
         if (!(hdev->features & (0x1ULL << VHOST_F_LOG_ALL))) {
             error_setg(&hdev->migration_blocker,
                        "Migration disabled: vhost lacks VHOST_F_LOG_ALL feature.");
-        } else if (!qemu_memfd_check()) {
+        } else if (vhost_dev_log_is_shared(hdev) && !qemu_memfd_check()) {
             error_setg(&hdev->migration_blocker,
                        "Migration disabled: failed to allocate shared memory");
         }
--
2.9.3

On Sat, Oct 22, 2016 at 07:00:41AM +0000, Rafael David Tinoco wrote:
> Commit 31190ed7 added a migration blocker in vhost_dev_init() to
> check if memfd would succeed. It is better if this blocker first
> checks if vhost backend requires shared log. This will avoid a
> situation where a blocker is added inappropriately (e.g. shared
> log allocation fails when vhost backend doesn't support it).

Sounds like a bugfix but I'm not sure. Can this part be split
out in a patch by itself?

> Commit: 35f9b6e added a fallback mechanism for systems not supporting
> memfd_create syscall (started being supported since 3.17).
>
> Backporting memfd_create might not be accepted for distros relying
> on older kernels. Nowadays there is no way for security driver
> to discover memfd filename to be created: <tmpdir>/memfd-XXXXXX.
>
> Also, because vhost log file descriptors can be passed to other
> processes, after discussion, we thought it is best to back mmap by
> using files that can be placed into a specific directory: this commit
> creates "vhostlog" argv parameter for such purpose. This will allow
> security drivers to operate on those files appropriately.
>
> Argv examples:
>
> -netdev tap,id=net0,vhost=on
> -netdev tap,id=net0,vhost=on,vhostlog=/tmp/guest.log
> -netdev tap,id=net0,vhost=on,vhostlog=/tmp
>
> For vhost backends supporting shared logs, if vhostlog is non-existent,
> or a directory, random files are going to be created in the specified
> directory (or, for non-existent, in tmpdir). If vhostlog is specified,
> the filepath is always used when allocating vhost log files.

When vhostlog is not specified, can we just use memfd as we did?

> Signed-off-by: Rafael David Tinoco <email address hidden>
> ---
> hw/net/vhost_net.c | 4 +-
> hw/scsi/vhost-scsi.c | 2 +-
> hw/virtio/vhost-vsock.c | 2 +-
> hw/virtio/vhost.c | 41 +++++++------
> include/hw/virtio/vhost.h | 4 +-
> include/net/vhost_net.h | 1 +
> include/qemu/mmap-file.h | 10 +++
> net/tap.c | 6 ++
> qapi-schema.json | 3 +
> qemu-options.hx | 3 +-
> util/Makefile.objs | 1 +
> util/mmap-file.c | 153 ++++++++++++++++++++++++++++++++++++++++++++++
> 12 files changed, 207 insertions(+), 23 deletions(-)
> create mode 100644 include/qemu/mmap-file.h
> create mode 100644 util/mmap-file.c
>
> diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
> index f2d49ad..d650c92 100644
> --- a/hw/net/vhost_net.c
> +++ b/hw/net/vhost_net.c
> @@ -171,8 +171,8 @@ struct vhost_net *vhost_net_init(VhostNetOptions *options)
> net->dev.vq_index = net->nc->queue_index * net->dev.nvqs;
> }
>
> - r = vhost_dev_init(&net->dev, options->opaque,
> - options->backend_type, options->busyloop_timeout);
> + r = vhost_dev_init(&net->dev, options->opaque, options->backend_type,
> + options->busyloop_timeout, options->vhostlog);
> if (r < 0) {
> goto fail;
> }
> diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
> index 5b26946..5dc3d30 100644
> --- a/hw/scsi/vhost-scsi.c
> +++ b/hw/scsi/vhost-scsi.c
> @@ -...

On Sun, Oct 30, 2016 at 5:26 PM, Michael S. Tsirkin <email address hidden> wrote:
>
> On Sat, Oct 22, 2016 at 07:00:41AM +0000, Rafael David Tinoco wrote:
> > Commit 31190ed7 added a migration blocker in vhost_dev_init() to
> > check if memfd would succeed. It is better if this blocker first
> > checks if vhost backend requires shared log. This will avoid a
> > situation where a blocker is added inappropriately (e.g. shared
> > log allocation fails when vhost backend doesn't support it).
>
> Sounds like a bugfix but I'm not sure. Can this part be split
> out in a patch by itself?

Already sent some days ago (and pointed out by Marc today).

> > Commit: 35f9b6e added a fallback mechanism for systems not supporting
> > memfd_create syscall (started being supported since 3.17).
> >
> > Backporting memfd_create might not be accepted for distros relying
> > on older kernels. Nowadays there is no way for security driver
> > to discover memfd filename to be created: <tmpdir>/memfd-XXXXXX.
> >
> > Also, because vhost log file descriptors can be passed to other
> > processes, after discussion, we thought it is best to back mmap by
> > using files that can be placed into a specific directory: this commit
> > creates "vhostlog" argv parameter for such purpose. This will allow
> > security drivers to operate on those files appropriately.
> >
> > Argv examples:
> >
> > -netdev tap,id=net0,vhost=on
> > -netdev tap,id=net0,vhost=on,vhostlog=/tmp/guest.log
> > -netdev tap,id=net0,vhost=on,vhostlog=/tmp
> >
> > For vhost backends supporting shared logs, if vhostlog is non-existent,
> > or a directory, random files are going to be created in the specified
> > directory (or, for non-existent, in tmpdir). If vhostlog is specified,
> > the filepath is always used when allocating vhost log files.
>
> When vhostlog is not specified, can we just use memfd as we did?
>

This was my approach on a "pastebin" example before this patch (in the
discussion thread we had). The problem goes back to the case where the
vhost log file descriptor is shared with some vhost-user implementation -
as the interface allows - and to the security driver labelling issue. IMO,
yes, we could let vhostlog specify a log file and, if it is not
specified, assume memfd is ok to be used.

Please let me know if you - and Marc - want me to keep using memfd.
I'll create the mmap-file tests and files in a different commit, like
Marc has asked for, and will propose the patch again by the end of
this week.
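
For readers following along, the memfd fallback under discussion can be sketched roughly as below. This is only an illustration, not the QEMU code: shared_fd_create() is a made-up name, and the sketch assumes the fallback creates a mkstemp() file named <tmpdir>/memfd-XXXXXX and unlinks it right away, as the file name quoted above suggests. The mkstemp() call is the step a confined qemu process gets denied.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper (not QEMU's qemu_memfd_alloc): returns an fd of the
 * given size that can be mmap()ed and shared, or -1 on failure. */
static int shared_fd_create(const char *name, size_t size)
{
    int fd = -1;

    assert(name != NULL);
#ifdef SYS_memfd_create
    /* Preferred path: kernels >= 3.17 implement memfd_create(), so no
     * file name ever shows up in the filesystem. On older kernels the
     * call fails at run time and we fall through to the fallback. */
    fd = (int)syscall(SYS_memfd_create, name, 0u);
#endif

    if (fd < 0) {
        /* Fallback path: a named temporary file, unlinked immediately.
         * Creating it is what a security driver can deny. */
        const char *tmpdir = getenv("TMPDIR");
        char path[4096];
        int n = snprintf(path, sizeof(path), "%s/memfd-XXXXXX",
                         tmpdir ? tmpdir : "/tmp");

        if (n < 0 || (size_t)n >= sizeof(path)) {
            return -1;
        }
        fd = mkstemp(path);
        if (fd < 0) {
            return -1;
        }
        unlink(path);
    }

    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

On a kernel with memfd_create(), shared_fd_create("vhost-log", 4096) never touches the filesystem; on 3.13 it ends up in the mkstemp() branch, which is where the apparmor denial happens.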

mst (mst-0) wrote :

On Mon, Oct 31, 2016 at 08:35:33AM -0200, Rafael David Tinoco wrote:
> On Sun, Oct 30, 2016 at 5:26 PM, Michael S. Tsirkin <email address hidden> wrote:
> >
> > On Sat, Oct 22, 2016 at 07:00:41AM +0000, Rafael David Tinoco wrote:
> > > Commit 31190ed7 added a migration blocker in vhost_dev_init() to
> > > check if memfd would succeed. It is better if this blocker first
> > > checks if vhost backend requires shared log. This will avoid a
> > > situation where a blocker is added inappropriately (e.g. shared
> > > log allocation fails when vhost backend doesn't support it).
> >
> > Sounds like a bugfix but I'm not sure. Can this part be split
> > out in a patch by itself?
>
> Already sent some days ago (and pointed by Marc today).
>
> > > Commit: 35f9b6e added a fallback mechanism for systems not supporting
> > > memfd_create syscall (started being supported since 3.17).
> > >
> > > Backporting memfd_create might not be accepted for distros relying
> > > on older kernels. Nowadays there is no way for security driver
> > > to discover memfd filename to be created: <tmpdir>/memfd-XXXXXX.
> > >
> > > Also, because vhost log file descriptors can be passed to other
> > > processes, after discussion, we thought it is best to back mmap by
> > > using files that can be placed into a specific directory: this commit
> > > creates "vhostlog" argv parameter for such purpose. This will allow
> > > security drivers to operate on those files appropriately.
> > >
> > > Argv examples:
> > >
> > > -netdev tap,id=net0,vhost=on
> > > -netdev tap,id=net0,vhost=on,vhostlog=/tmp/guest.log
> > > -netdev tap,id=net0,vhost=on,vhostlog=/tmp
> > >
> > > For vhost backends supporting shared logs, if vhostlog is non-existent,
> > > or a directory, random files are going to be created in the specified
> > > directory (or, for non-existent, in tmpdir). If vhostlog is specified,
> > > the filepath is always used when allocating vhost log files.
> >
> > When vhostlog is not specified, can we just use memfd as we did?
> >
>
> This was my approach on a "pastebin" example before this patch (in the
> discussion thread we had). Problem goes back to when vhost log file
> descriptor is shared with some vhost-user implementation - like the
> interface allows to - and the security driver labelling issue. IMO,
> yes, we could let vhostlog to specify a log file, and, if not
> specified, assume memfd is ok to be used.
>
> Please let me know if you - and Marc - want me to keep using memfd.
> I'll create the mmap-file tests and files in a different commit, like
> Marc has asked for, and will propose the patch again by the end of
> this week.

I think that the best approach is to allow passing in the fd,
not the file path. If not passed, use memfd.

--
MST

Hello Michael, André,

Could you do a quick review before a final submission ?

http://paste.ubuntu.com/23446279/

- I split the commits into 1) bugfix, 2) new util with test, 3) vhostlog

The unit test is testing passing fds between 2 processes and asserting
contents of mmap buffer coming from the "vhostlog" util (mmap-file).
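
A rough idea of what such a test exercises, as a standalone sketch. This is not the test from the paste above: every name here (fd_msg, send_fd(), recv_fd(), fd_passing_roundtrip(), the /tmp/mmap-file-XXXXXX template and the "dirty-log" pattern) is invented for illustration. It passes a descriptor over a UNIX socket with SCM_RIGHTS and checks that the receiving process sees the sender's bytes through its own mmap().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define LOG_SIZE 4096
#define PATTERN  "dirty-log"

/* One-byte payload plus a control buffer aligned for one descriptor. */
struct fd_msg {
    struct msghdr hdr;
    struct iovec iov;
    char byte;
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } ctrl;
};

static void fd_msg_init(struct fd_msg *m)
{
    memset(m, 0, sizeof(*m));
    m->iov.iov_base = &m->byte;
    m->iov.iov_len = 1;
    m->hdr.msg_iov = &m->iov;
    m->hdr.msg_iovlen = 1;
    m->hdr.msg_control = m->ctrl.buf;
    m->hdr.msg_controllen = sizeof(m->ctrl.buf);
}

/* Send one file descriptor over a UNIX socket using SCM_RIGHTS. */
static int send_fd(int sock, int fd)
{
    struct fd_msg m;
    struct cmsghdr *cmsg;

    fd_msg_init(&m);
    cmsg = CMSG_FIRSTHDR(&m.hdr);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &m.hdr, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns it, or -1 on failure. */
static int recv_fd(int sock)
{
    struct fd_msg m;
    struct cmsghdr *cmsg;
    int fd = -1;

    fd_msg_init(&m);
    if (recvmsg(sock, &m.hdr, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&m.hdr);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    }
    return fd;
}

/* Returns 0 if a child that only received the fd sees the parent's bytes. */
static int fd_passing_roundtrip(void)
{
    char path[] = "/tmp/mmap-file-XXXXXX";
    int sv[2], fd, status = 0, sent = -1;
    char *p;
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0 || (pid = fork()) < 0) {
        return -1;
    }
    if (pid == 0) {   /* child: mmap() fails by itself if no valid fd arrived */
        close(sv[0]);
        fd = recv_fd(sv[1]);
        p = mmap(NULL, LOG_SIZE, PROT_READ, MAP_SHARED, fd, 0);
        _exit(p != MAP_FAILED && !memcmp(p, PATTERN, sizeof(PATTERN)) ? 0 : 1);
    }
    close(sv[1]);
    fd = mkstemp(path);   /* created after fork(), so the child cannot inherit it */
    if (fd >= 0) {
        unlink(path);
        p = ftruncate(fd, LOG_SIZE) ? MAP_FAILED :
            mmap(NULL, LOG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            memcpy(p, PATTERN, sizeof(PATTERN));
            sent = send_fd(sv[0], fd);
            munmap(p, LOG_SIZE);
        }
        close(fd);
    }
    close(sv[0]);
    return (waitpid(pid, &status, 0) == pid && sent == 0 &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The file is unlinked before it is sent, so the receiver never needs the path; only the descriptor travels between the two processes.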

Your final comment on the "vhostlog" was:

>> Argv examples:
>>
>> -netdev tap,id=net0,vhost=on
>> -netdev tap,id=net0,vhost=on,vhostlog=/tmp/guest.log
>> -netdev tap,id=net0,vhost=on,vhostlog=/tmp

(André) > Could it be only a filename? This would simplify testing.
(Michael) > When vhostlog is not specified, can we just use memfd as we did?

I'm going to change this to:

1 - if vhostlog is not provided shared log can't be used. Use memfd.

2 - for shared logs, vhostlog has to be provided as a "file" ?

Should i keep vhostlog being a directory also ? (i know we are unlinking the
file so might not be needed BUT a static file might have a race condition in
between different instances and providing a directory - that creates random
files on it - might be better approach).

Is there anything else ?

Thank you

Rafael Tinoco

Hi

On Tue, Nov 8, 2016 at 4:49 PM Rafael David Tinoco <email address hidden> wrote:

> Hello Michael, André,
>
> Could you do a quick review before a final submission ?
>
> http://paste.ubuntu.com/23446279/
>
> - I split the commits into 1) bugfix, 2) new util with test, 3) vhostlog
>
> The unit test is testing passing fds between 2 processes and asserting
> contents of mmap buffer coming from the "vhostlog" util (mmap-file).
>
> Your final comment on the "vhostlog" was:
>
> >> Argv examples:
> >>
> >> -netdev tap,id=net0,vhost=on
> >> -netdev tap,id=net0,vhost=on,vhostlog=/tmp/guest.log
> >> -netdev tap,id=net0,vhost=on,vhostlog=/tmp
>
> (André) > Could it be only a filename? This would simplify testing.
> (Michael) > When vhostlog is not specified, can we just use memfd as we did?
>

Michael said:
https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg08197.html
I think that the best approach is to allow passing in the fd, not the file
path. If not passed, use memfd.

I do agree :)

> I'm going to change this to:
>
> 1 - if vhostlog is not provided shared log can't be used. Use memfd.
>
> 2 - for shared logs, vhostlog has to be provided as a "file" ?
>
> Should i keep vhostlog being a directory also ? (i know we are unlinking the
> file so might not be needed BUT a static file might have a race condition in
> between different instances and providing a directory - that creates random
> files on it - might be better approach).
>
> Is there anything else ?
>

Do we really need to give a path? (pass fd with -add-fd/qmp add-fd)

Hello,

> On Tue, Nov 8, 2016 at 4:49 PM Rafael David Tinoco <email address hidden> wrote:
> Hello Michael, André,
>
> Could you do a quick review before a final submission ?
>
> http://paste.ubuntu.com/23446279/
> ...
> (André) > Could it be only a filename? This would simplify testing.
> (Michael) > When vhostlog is not specified, can we just use memfd as we did?
>
> Michael said: https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg08197.html
> I think that the best approach is to allow passing in the fd, not the file path. If not passed, use memfd.

Missed this one.

> I do agree :)

Sounds good. As I understand it, the new approach is to let the management library create the files and just pass the file descriptors, so that security rules apply to the library itself and not to the qemu processes.

> Do we really need to give a path? (pass fd with -add-fd/qmp add-fd)

I guess not. So, for shared logs:

- vhostlogfd should be provided.
- if vhostlogfd is not provided, use memfd.
  (we don't want writes in /tmp; should I remove the fallback mechanism from the memfd logic?)
- if memfd fails, the log can't be created/shared and there is a migration blocker.
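
The three rules above, written out as a small decision function. All names here (enum log_backing, choose_log_backing(), needs_migration_blocker()) are hypothetical and only restate the proposal; vhostlogfd < 0 is assumed to mean that no descriptor was passed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; none of these exist in QEMU. */
enum log_backing {
    LOG_BACKING_PASSED_FD,  /* use the descriptor handed in by the manager */
    LOG_BACKING_MEMFD,      /* no descriptor given, memfd_create() worked  */
    LOG_BACKING_NONE        /* nothing usable: the shared log cannot exist */
};

static enum log_backing choose_log_backing(int vhostlogfd, bool memfd_ok)
{
    if (vhostlogfd >= 0) {
        return LOG_BACKING_PASSED_FD;
    }
    if (memfd_ok) {
        return LOG_BACKING_MEMFD;
    }
    /* No /tmp fallback on purpose: failing here is the proposed behaviour. */
    return LOG_BACKING_NONE;
}

static bool needs_migration_blocker(int vhostlogfd, bool memfd_ok)
{
    return choose_log_backing(vhostlogfd, memfd_ok) == LOG_BACKING_NONE;
}
```

With this shape a 3.13 host without a passed descriptor gets a migration blocker instead of a write to /tmp, which is the trade-off being asked about.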

André, Michael,

I'll work on that and send the patches soon. Meanwhile, could you push:

- "vhost: migration blocker only if shared log is used"

so I can backport it to Debian ?

Thank you,
-Rafael Tinoco

Changed in qemu (Ubuntu):
status: New → In Progress
assignee: nobody → Rafael David Tinoco (inaddy)
Changed in qemu (Ubuntu Xenial):
status: New → In Progress
Changed in qemu (Ubuntu Yakkety):
status: New → In Progress
Changed in qemu (Ubuntu Xenial):
assignee: nobody → Rafael David Tinoco (inaddy)
Changed in qemu (Ubuntu Yakkety):
assignee: nobody → Rafael David Tinoco (inaddy)

For Ubuntu Xenial (Mitaka), Yakkety (Newton) and Zesty: commit 0d34fbabc1 fixes the issue for the vhost-net kernel backend. That backend doesn't use a shared log, so the memfd check is skipped and apparmor profiles no longer block the live migration. Customers using vhost-user might still hit migration problems, but those are likely a small minority.

commit 0d34fbabc13891da41582b0823867dc5733fffef
Author: Rafael David Tinoco <email address hidden>
Date: Mon Oct 24 15:35:03 2016 +0000

    vhost: migration blocker only if shared log is used

    Commit 31190ed7 added a migration blocker in vhost_dev_init() to
    check if memfd would succeed. It is better if this blocker first
    checks if vhost backend requires shared log. This will avoid a
    situation where a blocker is added inappropriately (e.g. shared
    log allocation fails when vhost backend doesn't support it).

    Signed-off-by: Rafael David Tinoco <email address hidden>
    Reviewed-by: Marc-André Lureau <email address hidden>
    Reviewed-by: Michael S. Tsirkin <email address hidden>
    Signed-off-by: Michael S. Tsirkin <email address hidden>

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 131f164..25bf67f 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1122,7 +1122,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
         if (!(hdev->features & (0x1ULL << VHOST_F_LOG_ALL))) {
             error_setg(&hdev->migration_blocker,
                        "Migration disabled: vhost lacks VHOST_F_LOG_ALL feature.");
-        } else if (!qemu_memfd_check()) {
+        } else if (vhost_dev_log_is_shared(hdev) && !qemu_memfd_check()) {
             error_setg(&hdev->migration_blocker,
                        "Migration disabled: failed to allocate shared memory");
         }
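
The effect of this one-line change can be modelled with two small predicates. This is a simplified sketch, not code from hw/virtio/vhost.c: struct vhost_state and the blocker_before()/blocker_after() names are invented here, and the three booleans stand in for the VHOST_F_LOG_ALL feature bit, vhost_dev_log_is_shared() and qemu_memfd_check().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the pieces of state the check looks at. */
struct vhost_state {
    bool has_log_all;    /* backend offers VHOST_F_LOG_ALL           */
    bool log_is_shared;  /* backend needs a shared (fd-backed) log   */
    bool memfd_ok;       /* the memfd probe, or its fallback, worked */
};

/* Before the fix: a failed memfd probe always blocks migration. */
static const char *blocker_before(const struct vhost_state *s)
{
    if (!s->has_log_all) {
        return "Migration disabled: vhost lacks VHOST_F_LOG_ALL feature.";
    }
    if (!s->memfd_ok) {
        return "Migration disabled: failed to allocate shared memory";
    }
    return NULL;
}

/* After the fix: the probe only matters when the log is really shared. */
static const char *blocker_after(const struct vhost_state *s)
{
    if (!s->has_log_all) {
        return "Migration disabled: vhost lacks VHOST_F_LOG_ALL feature.";
    }
    if (s->log_is_shared && !s->memfd_ok) {
        return "Migration disabled: failed to allocate shared memory";
    }
    return NULL;
}
```

For a vhost-net kernel backend on a 3.13 host under apparmor (log not shared, memfd probe failing), blocker_before() returns the "failed to allocate shared memory" message while blocker_after() returns NULL; a backend with a shared log is still blocked in both versions.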

I am finishing the "final" upstream fix, but it might not be suitable for an SRU since it adds features to qemu (and likely to libvirt) so that the vhost log file can be passed in as an already opened file descriptor. This will require changes in libvirt and nova-compute, but it will finally allow the security driver to apply rules to the vhost log file for shared logs (mostly for vhost-user drivers).

On Fri, Nov 18, 2016 at 11:21 AM, Rafael David Tinoco <email address hidden> wrote:

> With customers using vhost-user that might
> still cause migration problems, but, likely, those are the vast
> minority.
>

It is, and it has migration issues in general at the moment anyway - see:
https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg03026.html
https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg03223.html

So that needs more work and is not in your current scope IMHO.

Changed in cloud-archive:
status: New → In Progress
assignee: nobody → Rafael David Tinoco (inaddy)

Thanks Christian,

Then I'll finish this SRU first. Will work on the vhost mmap log file right after.

Took some more time here because of LP: #1621269.

Right now Zesty is behind Yakkety because of a Security Update. Not sure you need me to attach a debdiff for Zesty as well. Let me know.

description: updated

On Tue, Nov 22, 2016 at 1:02 PM, Rafael David Tinoco <email address hidden> wrote:

> Right now Zesty is behind Yakkety because of a Security Update. Not sure
> you need me to attach a debdiff for Zesty as well. Let me know.
>

Arr - bad timing. It got an upload about 5 minutes ago.
So yes a Zesty debdiff would be nice.

--
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd

Thanks Rafael - the upstream work on this is excellent!

I already built all those fine and I'm now looking into some regression checks before considering/doing an upload to Dev-Release & SRU-queue

Some other stages of my extra tests are currently WIP, but those that have run worked fine on the PPA I built from your debdiffs.

That covers:
- migration with various workloads
- different types of migrations (live, offline, postcopy)
- upgrading onto the new qemu version
- migration into the upgraded version

I'll attach the log and upload your changes, thanks for your work.
I see you already set the SRU Template for the SRU Team to review - thanks.

Uploaded into Zesty. Per SRU policy (and the experience that something always happens at the last minute in LP builds/tests), I am holding the SRU uploads until that has fully migrated.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.6.1+dfsg-0ubuntu7

---------------
qemu (1:2.6.1+dfsg-0ubuntu7) zesty; urgency=medium

  [ Rafael David Tinoco ]
  * Fixed wrong migration blocker when vhost is used (LP: #1626972)
    - d/p/vhost_migration-blocker-only-if-shared-log-is-used.patch

 -- Christian Ehrhardt <email address hidden> Tue, 22 Nov 2016 13:45:52 +0100

Changed in qemu (Ubuntu Zesty):
status: In Progress → Fix Released

Ok, update into Zesty has passed and you already supplied the SRU Template.
Uploaded to Xenial and Yakkety queues for the SRU Team to consider your Fix.

Hello Rafael, or anyone else affected,

Accepted qemu into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:2.5+dfsg-5ubuntu10.7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in qemu (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed
Thomas Huth (th-huth) wrote :

Commit 0d34fbabc13 is upstream, so setting this to "Fix committed", too.

Changed in qemu:
status: In Progress → Fix Committed
James Page (james-page) on 2016-11-28
Changed in cloud-archive:
status: In Progress → Fix Committed
James Page (james-page) wrote :

This bug was fixed in the package qemu - 1:2.6.1+dfsg-0ubuntu7~cloud0
---------------

 qemu (1:2.6.1+dfsg-0ubuntu7~cloud0) xenial-ocata; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 qemu (1:2.6.1+dfsg-0ubuntu7) zesty; urgency=medium
 .
   [ Rafael David Tinoco ]
   * Fixed wrong migration blocker when vhost is used (LP: #1626972)
     - d/p/vhost_migration-blocker-only-if-shared-log-is-used.patch

Changed in cloud-archive:
status: Fix Committed → Fix Released
Brian Murray (brian-murray) wrote :

Hello Rafael, or anyone else affected,

Accepted qemu into yakkety-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:2.6.1+dfsg-0ubuntu5.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in qemu (Ubuntu Yakkety):
status: In Progress → Fix Committed
Antonio Messina (arcimboldo) wrote :

Hi all,

I am facing this issue too, and although I can confirm the patch can be easily backported to Trusty (we run Mitaka on Trusty), some of our customers have VMs started with the old qemu, and I can no longer live migrate them or update qemu without stopping and starting the VMs.

Do you have any suggestion on how to allow the live migration of VMs currently running with qemu pre-patch and kernel 3.13?

Thank you in advance

James Page (james-page) on 2016-12-08
Changed in cloud-archive:
status: Fix Released → Invalid

Hello Antonio (@arcimboldo)

The fix only makes sense for newer QEMUs (>= Xenial, like the one from Mitaka Ubuntu Cloud Archive).

OBS:

The "migration check" is done in the VHOST initialization functions, when the devices are virtually attached to the virtual machine. If you are using kernel 3.13 and have apparmor enabled, then all the running instances have the "migration blocker" ON - because of this buggy migration check - and won't be able to live migrate.

Unfortunately there is an "in-memory" linked list telling qemu that it has a blocker (with the reason). This blocker was added during instance startup and is checked/used only when the instance is live-migrated.

Check this: http://pastebin.ubuntu.com/23517175/

If you started the instance on a host not running apparmor (or not having the libvirt profile loaded, for example), nothing blocks the creation of /tmp/memfd-XXX files during instance initialization. That won't trigger the "blocker flag" inside the running program and, if/when needed, the live migration will be able to occur.

This means that, after installing the new package, if you're using apparmor, yes, you would have to RESTART the running instances that were affected by this bug in order to live migrate them. Sorry for the bad news! Even if you remove the apparmor rules, the migration blocker is already set.

Hacking the process's virtual memory to clear the blocker would jeopardize its contents (which could be catastrophic, especially for a virtual machine).
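
A toy model of why restarting is the only way out, under the assumption (taken from the explanation above) that the blocker is recorded once at startup and only consulted at migration time. The struct instance type and the instance_start()/instance_migrate() functions are hypothetical and only mimic that behaviour; they are not QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a running instance: only the blocker matters. */
struct instance {
    const char *migration_blocker;  /* NULL means "may migrate" */
};

/* Startup: the (buggy) check runs once and its result is remembered. */
static void instance_start(struct instance *vm, bool tmp_create_allowed)
{
    vm->migration_blocker = tmp_create_allowed
        ? NULL
        : "Migration disabled: failed to allocate shared memory";
}

/* Migration: only the remembered blocker is consulted. The current state
 * of the host (second argument) is deliberately ignored, as in the bug. */
static int instance_migrate(const struct instance *vm,
                            bool tmp_create_allowed_now)
{
    (void)tmp_create_allowed_now;
    return vm->migration_blocker ? -1 : 0;
}
```

In this model, an instance started while /tmp creation was denied keeps failing instance_migrate() even when the second argument says the denial is gone; only calling instance_start() again (the restart) clears the blocker.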

@jamespage, @cpaelzer,

I'll verify this fix in a couple of days so it can be released.

Thank you!

Rafael

Xenial Verification (with the 3.13 kernel from Trusty, since a < 3.17 kernel is needed). This verifies that the Ubuntu Cloud Archive repositories will be alright with these new packages (from Xenial / Yakkety).

## CURRENT

inaddy@(xkvm01):~$ apt-cache policy qemu-kvm
qemu-kvm:
  Installed: 1:2.5+dfsg-5ubuntu10.6
  Candidate: 1:2.5+dfsg-5ubuntu10.6

xkvm01 (sender):

Jan 11 01:07:54 xkvm01 kernel: type=1400 audit(1484104074.014:13): apparmor="DENIED" operation="mknod" profile="libvirt-7cdcb6c0-f85e-4639-912b-c785bd5992d9" name="/tmp/memfd-Jh5UhR" pid=2535 comm="qemu-system-x86" requested_mask="c" denied_mask="c" fsuid=112 ouid=112

$ sudo virsh migrate --live guest qemu+ssh://xkvm02/system
error: internal error: unable to execute QEMU command 'migrate': Migration disabled: failed to allocate shared memory

xkvm02 (receiver):

Jan 11 01:08:23 xkvm02 kernel: type=1400 audit(1484104103.888:53): apparmor="DENIED" operation="mknod" profile="libvirt-7cdcb6c0-f85e-4639-912b-c785bd5992d9" name="/tmp/memfd-fc9rij" pid=2000 comm="qemu-system-x86" requested_mask="c" denied_mask="c" fsuid=112 ouid=112

OBS: The check was being done in the wrong place AND situation, like I showed in this bug.

## PROPOSED

inaddy@(xkvm01):~$ apt-cache policy qemu-kvm
qemu-kvm:
  Installed: 1:2.5+dfsg-5ubuntu10.7
  Candidate: 1:2.5+dfsg-5ubuntu10.7

xkvm01 (sender):

<nothing related to /tmp/memfd>

xkvm02 (receiver):

inaddy@(xkvm02):~$ virsh list
 Id    Name                           State
----------------------------------------------------
 1     guest                          running

<nothing related to /tmp/memfd>

It's all good.

verification-xenial-done

Yakkety Verification (with the 3.13 kernel from Trusty, since a < 3.17 kernel is needed). This verifies that the Ubuntu Cloud Archive repositories will be alright with these new packages (from Xenial / Yakkety).

## CURRENT

inaddy@(ykvm01):~$ apt-cache policy qemu-kvm
qemu-kvm:
  Installed: 1:2.6.1+dfsg-0ubuntu5.1
  Candidate: 1:2.6.1+dfsg-0ubuntu5.1

ykvm01 (sender):

Jan 11 11:34:35 ykvm01 kernel: type=1400 audit(1484141675.962:53): apparmor="DENIED" operation="mknod" profile="libvirt-7cdcb6c0-f85e-4639-912b-c785bd5992d9" name="/tmp/memfd-bF8new" pid=1934 comm="qemu-system-x86" requested_mask="c" denied_mask="c" fsuid=111 ouid=111

inaddy@(ykvm01):~$ sudo virsh migrate --live guest qemu+ssh://ykvm02/system
error: internal error: unable to execute QEMU command 'migrate': Migration disabled: failed to allocate shared memory

ykvm02 (receiver):

Jan 11 11:39:31 ykvm02 kernel: type=1400 audit(1484141971.526:53): apparmor="DENIED" operation="mknod" profile="libvirt-7cdcb6c0-f85e-4639-912b-c785bd5992d9" name="/tmp/memfd-JZ6L9T" pid=2177 comm="qemu-system-x86" requested_mask="c" denied_mask="c" fsuid=111 ouid=111

OBS: The check was being done in the wrong place AND situation, like I showed in this bug.

## PROPOSED

inaddy@(ykvm01):~$ apt-cache policy qemu-kvm
qemu-kvm:
  Installed: 1:2.6.1+dfsg-0ubuntu5.2
  Candidate: 1:2.6.1+dfsg-0ubuntu5.2

ykvm01 (sender):

<nothing related to /tmp/memfd>

ykvm02 (receiver):

inaddy@(ykvm02):~$ virsh list
 Id    Name                           State
----------------------------------------------------
 1     guest                          running

<nothing related to /tmp/memfd>

It's all good.

verification-yakkety-done

tags: added: verification-done
removed: verification-needed
Thomas Huth (th-huth) wrote :

Commit 0d34fbabc13 has been released with QEMU v2.8

Changed in qemu:
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.6.1+dfsg-0ubuntu5.2

---------------
qemu (1:2.6.1+dfsg-0ubuntu5.2) yakkety; urgency=medium

  [ Rafael David Tinoco ]
  * Fixed wrong migration blocker when vhost is used (LP: #1626972)
    - d/p/vhost_migration-blocker-only-if-shared-log-is-used.patch

 -- Christian Ehrhardt <email address hidden> Tue, 22 Nov 2016 13:45:46 +0100

Changed in qemu (Ubuntu Yakkety):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for qemu has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Ping - we have the next fix for Xenial in the queue - all others are released now, has this one "baked" enough for Xenial SRU to migrate?

For me we had enough tests already. Upstream development/tests, Zesty, Yakkety. Christian, could you please move Xenial for me ? I have some end users waiting for this. Thank you very much.

On Tue, Jan 24, 2017 at 1:52 AM, Rafael David Tinoco <email address hidden> wrote:

> Christian, could you please move Xenial for me ? I have some
> end users waiting for this. Thank you very much.
>

I can't - IIRC that is up to the SRU Team. I pinged the #ubuntu-release
channel to ask if someone could take a look.
You could do so again today if you want.

Thanks Christian! Will do!!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:2.5+dfsg-5ubuntu10.7

---------------
qemu (1:2.5+dfsg-5ubuntu10.7) xenial; urgency=medium

  [ Rafael David Tinoco ]
  * Fixed wrong migration blocker when vhost is used (LP: #1626972)
    - d/p/vhost_migration-blocker-only-if-shared-log-is-used.patch

 -- Christian Ehrhardt <email address hidden> Tue, 22 Nov 2016 13:45:39 +0100

Changed in qemu (Ubuntu Xenial):
status: Fix Committed → Fix Released

For Mitaka, the fix for this bug will be included in UCA together with the fix for:

https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1656480

When it becomes available.
