vm startup broken when interface definition has script tag

Bug #1620407 reported by Matthias Ferdinand on 2016-09-05
20
This bug affects 2 people
Affects Status Importance Assigned to Milestone
libvirt (Ubuntu)
Medium
Unassigned
Xenial
Medium
Unassigned

Bug Description

Regression restricted to set script which is rare as it is a massive security drop

[Impact]

 * A user using a type "ethernet" device with a custom script as working
   in Trusty runs into an error on Xenial.

 * Essentially Xenials libvirt "only" stumbles over a check&add on an
   index for that network device. But since it might be externally created
   later (up until on qemu start) it might not be available at the time it
   is checking. So the fix skips the check index in those cases.

 * The fix as-is is not upstream because upstream opted for a fare bigger
   rework of the section in which after 9c17d665fdc5f "autocreate tap
   device for ethernet network type" libvirt calls the script. That also
   allows different security levels (permission of libvirt instead of
   qemu) but these changes are huge and not sufficient for an SRU fix.

[Test Case]

 * Create a guest your usual way (e.g. uvtool-libvirt
 * Add a script based type network ethernet device (there is no need to
   create a script, one can use the default of qemu). The xml snippet
   looks like this:
    <interface type='ethernet'>
      <mac address='52:54:00:18:0d:a3'/>
      <script path='/etc/qemu-ifup'/>
      <target dev='newdevname'/>
      <model type='virtio'/>
    </interface>

 * without the fix this runs into:
   error: Failed to start domain <guestname>
   error: Unable to get index for interface <devicename>: No such device

[Regression Potential]

 * There could be a cornercase around nicindexes which due to the skip in
   case net->skript is set that was not covered in our experiments now
   failing. But since as of today setting the script tag in generally
   fails those people should not exist.

 * Another fact that limits the potential regression is that type ethernet
   devices come at a huge security disadvantage - it needs to run
   privileged as root and also disable certain security features (not
   clearing capabilities for example).
   That said the number of users still using that feature should be low
   and shrinking further.

[Other Info]

 * n/a

----- original report -----

Ubuntu 16.04.1 LTS (amd64)
libvirt-bin 1.3.1-1ubuntu10.1

We use external scripts to setup tap interfaces, e.g.

    <interface type='ethernet'>
      <mac address='52:54:00:18:0d:a3'/>
      <script path='/etc/libvirt/14v/mf_testet.sh'/>
      <target dev='mf_testet'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>

Starting the VM throws an error message ("interface not found" or something like that).

IIUC, the script invocation is done by qemu, so the interface
may not yet exist when libvirt is constructing the qemu cmd args.
Checking for that interface in advance therefore is a bug.

Attached patch skips the check if a <script> parameter is provided.

Regards
Matthias Ferdinand

The attachment "interface-type-ethernet-with-script.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Joshua Powers (powersj) on 2016-09-06
Changed in libvirt (Ubuntu):
importance: Undecided → Medium
status: New → Triaged

Hi,
sorry for the Delay, I got subscribed by a Friend yesterday to look into it but never saw the bug before.

The spec of all of this is at https://libvirt.org/formatdomain.html#elementsNICSEthernet and the default handling script for that usually is /etc/qemu-ifup

I tested on Xenial (libvirt 1.3) as well as on Zesty (libvirt 2.5) by setting the path explicitly to the default script.
It then looks like:

    <interface type='ethernet'>
      <mac address='52:54:00:18:0d:a3'/>
      <script path='/etc/qemu-ifup'/>
      <target dev='mf_testet'/>
      <model type='virtio'/>
      <alias name='net1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </interface>

That just worked fine on libvirt 2.5 for me.
The guest device shows up as ens7 just as it should (pic slot 7) and no error when starting the guest.

On libvirt 1.3 I got:
error: Failed to start domain testguest-X-on-X
error: Unable to get index for interface mf_testet: No such device

I think you only specify that if that is for an existing device your script shall use, you can drop the "target" statement and get further.

Now I see:
error: internal error: process exited while connecting to monitor: 2017-01-12T13:14:56.978535Z qemu-system-x86_64: -netdev tap,script=/etc/qemu-ifup,id=hostnet1,vhost=on,vhostfd=29: could not open /dev/net/tun: Operation not permitted

Here it rang a bell - IIRC That is kind of ok.
If I remember correctly there was quite a lot of default security that you had to drop to get ethernet type networking.
There is some help of this being known at http://wiki.libvirt.org/page/Troubleshooting under "Guest_won't_start_-_warning:_could_not_open_/dev/net/tun"
As well as (other distro but same applies) https://fedoraproject.org/wiki/How_to_debug_Virtualization_problems?rd=Tools/Virtualization/BugReporting#Errors_using_.3Cinterface_type.3D.27ethernet.27.2F.3E

I'll set the bug to incomplete for now until you provide more info.
ATM I'm not so sure this is a real bug or more a combination of default security features blocking you from this less recommended way.
Also did this work on a former version?

Changed in libvirt (Ubuntu):
status: Triaged → Incomplete

On Thu, Jan 12, 2017 at 01:25:28PM -0000, ChristianEhrhardt wrote:
> On libvirt 1.3 I got:
> error: Failed to start domain testguest-X-on-X
> error: Unable to get index for interface mf_testet: No such device

on xenial, that is the exact error message we get, while on "precise"
kvm hosts (libvirt 0.9.8) the same setup works.

> As well as (other distro but same applies) https://fedoraproject.org/wiki/How_to_debug_Virtualization_problems?rd=Tools/Virtualization/BugReporting#Errors_using_.3Cinterface_type.3D.27ethernet.27.2F.3E
Yes, "ethernet" type networking is made difficult by standard security
settings, and we do have our own infrastructure installed to overcome
these restrictions. One such measure is having a daemon running as root
that does the actual interface creation, triggered by the interface
setup script.

This error message "Unable to get index..." is generated by libvirt,
before qemu is even invoked. While it is perfectly ok for interfaces
with setup scripts to not yet exist before qemu can invoke the script,
libvirt 1.3 nonetheless pre-checks for its existence.

This is what the bug report and the patch is about. With the patch
applied (removal of existence check for scripted interfaces), libvirt
1.3 again does the right thing.

Regards
Matthias

On Thu, Jan 12, 2017 at 5:55 PM, Matthias Ferdinand <mf+ubuntu1@14v.de>
wrote:

> This error message "Unable to get index..." is generated by libvirt,
> before qemu is even invoked. While it is perfectly ok for interfaces
> with setup scripts to not yet exist before qemu can invoke the script,
> libvirt 1.3 nonetheless pre-checks for its existence.
>
> This is what the bug report and the patch is about. With the patch
> applied (removal of existence check for scripted interfaces), libvirt
> 1.3 again does the right thing.
>

Thank you so much for the explanation, that confirmed all that I saw.
Yet I wonder since newer libvirt works (see my experiments before) they
seem to have fixed that "a different way".
I'd pretty much prefer to fix it by backporting that if applicable.

Do you, based on your work on this, happen to know what the (maybe more
complex) upstream fix for this was?

Changed in libvirt (Ubuntu):
status: Incomplete → Confirmed

Info is here, we just have to sort out "what patch" to consider applying - thereby setting confirmed.

Matthias Ferdinand (mf+ubuntu1) wrote :

On Fri, Jan 13, 2017 at 04:40:55AM -0000, ChristianEhrhardt wrote:
> Yet I wonder since newer libvirt works (see my experiments before) they
> seem to have fixed that "a different way".
> I'd pretty much prefer to fix it by backporting that if applicable.
>
> Do you, based on your work on this, happen to know what the (maybe more
> complex) upstream fix for this was?

Sorry, but I don't know. Haven't looked beyond 16.04 so far.

Matthias

Ok, thanks.

While I like your fix being small and simple (always good), I'm afraid of potential side effects for an SRU e.g. systems that crash in a more severe way because they then pass that check.
That is why I'd like to know what the upstream fix was.

What we need is somebody who-just-knows what the change was or to do a bisect.

I'll put it on my backlog, but it is a bit down the list for now.
If anybody else is hit by that please chime in here so we can adjust severity and priorities.

Marking Triaged and for Xenial SRU and flagging DEV as resolved as shown in my Test,

tags: added: needs-bisect
Changed in libvirt (Ubuntu):
status: Confirmed → Triaged
Changed in libvirt (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → Medium
Changed in libvirt (Ubuntu):
status: Triaged → Fix Released

I set up an automatic bisecting env and was able to run locally in a way that I can reproduce from the local dir it built. That is still running, so I was moving to a more beefy Server now and feeding some time and electric power into it to see if it can find the upstream fix that resolved it.

By parsing through changes my gut feeling expects:
9c17d665fdc5f "autocreate tap device for ethernet network type"

I'll update what it hopefully found tomorrow.

On Thu, Feb 02, 2017 at 05:36:33PM -0000, ChristianEhrhardt wrote:
> By parsing through changes my gut feeling expects:
> 9c17d665fdc5f "autocreate tap device for ethernet network type"

Sounds very interesting, might even remove the need for the separate
daemon we run for interface creation.

Matthias

Yep, the bisect hit the change that I assumed 9c17d665fdc5f "autocreate tap device for ethernet network type".

The upstream "fix" now known and understood I agree way more with your fix now. To backport the upstream version is much more of a feature and large scale impact than the fix you provided.

I'm planning to include that in the next Xenial SRU I'm doing for libvirt after I had the chance to give it some extra tests to confirm there is no unexpected side-effect.

Changed in libvirt (Ubuntu Xenial):
status: Triaged → In Progress
description: updated

On ppa identical to SRU upload:
- Automated Tests passed (to check for regressions)
- Manual Test passed (to check for regressions)
- Explicit tests on this case worked

Uploaded to SRU queue now for SRU Team to check.

Hello Matthias, or anyone else affected,

Accepted libvirt into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/1.3.1-1ubuntu10.8 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in libvirt (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed

I could verify that I pass the device checks now, yet I'd ask the reporter to verify as well if possible.

The reason is that the type ethernet setup required a lot of permission tweaks as well that are in place at the reporters environment. So I'd really like to hear from them as well if possible.

@mf+ubuntu1 - could you test proposed for us in your setup?

I tested a few more which all worked, while I'd have hoped for a reporters check as well it is effectively 100% his suggested change which he already announced as fixing his case.
By that setting verification-done now.

tags: added: verification-done
removed: verification-needed

On Tue, Feb 14, 2017 at 07:52:09AM -0000, ChristianEhrhardt wrote:
> @mf+ubuntu1 - could you test proposed for us in your setup?

Sorry for the delay, I was held back by $VIRUS (the bio variant).

Tested on our machine, and it solves the problem.

Thank you very much for working on this.

Matthias

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 1.3.1-1ubuntu10.8

---------------
libvirt (1.3.1-1ubuntu10.8) xenial; urgency=medium

  * fix virsh nodecpumap output (LP: #1659769)
  * fix using type ethernet interfaces with user scripts (LP: #1620407)
  * add new block device types to virt-aa-helpers profile (LP: #1641618)

 -- Christian Ehrhardt <email address hidden> Mon, 06 Feb 2017 14:30:46 +0100

Changed in libvirt (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for libvirt has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers