Unable to start VMs under an LXC container

Bug #1486296 reported by Adam Stokes
This bug affects 1 person
Affects             Status        Importance  Assigned to  Milestone
libvirt (Ubuntu)    Invalid       High        Unassigned
  Vivid             Fix Released  High        Unassigned
  Wily              Invalid       High        Unassigned

Bug Description

====================================================
1. Impact: cannot start VMs inside a container
2. Devel solution: drop the cgmanager patches
3. Stable solution: same as the devel solution
4. Test case: see comment #7 below for a very detailed and easy-to-use example
5. Regression potential: no regressions are expected, and the lp:qa-regression-tests suite passed.
====================================================

Our installer starts VMs inside an LXC container. This worked on trusty; however, with the latest vivid and wily releases those VMs fail to start. This also affects running juju-local inside a container.

We think it has something to do with cgmanager/cgproxy; however, we can run and launch qemu manually just fine within the container. This could be an issue with libvirt's attempt to set cgroup values for the qemu processes.

The current workaround to get a VM to start within a vivid or wily container running LXC 1.1.2+ is to disable cgroups in /etc/libvirt/qemu.conf.
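A minimal sketch of that workaround, assuming the stock qemu.conf syntax; an empty controllers list tells libvirtd not to manage cgroups for qemu processes. Restart libvirt-bin afterwards.

```
# /etc/libvirt/qemu.conf: sketch of the workaround, assuming the stock
# cgroup_controllers setting; an empty list disables libvirt's cgroup use
cgroup_controllers = [ ]
```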

I'm filing this bug and will ask Ryan Harper to provide a more in-depth account of his debugging, but I wanted to have this on the radar.

Specs:

Host:
Vivid, kernel: 3.19.0-15-generic, lxc: 1.1.3+stable~20150815-0227-0ubuntu1~vivid

Container:
Vivid, kernel: 3.19.0-15-generic, lxc: 1.1.2-0ubuntu3.1, libvirt 1.2.12-0ubuntu14.1

tags: added: cloud-installer
Revision history for this message
Adam Stokes (adam-stokes) wrote :

Here is the code we use to create a container; it shells out to lxc-create, passing in a custom userdata file:

```
    @classmethod
    def create(cls, name, userdata):
        """ creates a container from ubuntu-cloud template
        """
        # NOTE: the -F template arg is a workaround. it flushes the lxc
        # ubuntu template's image cache and forces a re-download. It
        # should be removed after https://github.com/lxc/lxc/issues/381 is
        # resolved.
        flushflag = "-F"
        if os.getenv("USE_LXC_IMAGE_CACHE"):
            log.debug("USE_LXC_IMAGE_CACHE set, so not flushing in lxc-create")
            flushflag = ""
        out = utils.get_command_output(
            'sudo -E lxc-create -t ubuntu-cloud '
            '-n {name} -- {flushflag} '
            '-u {userdatafilename}'.format(name=name,
                                           flushflag=flushflag,
                                           userdatafilename=userdata))
        if out['status'] > 0:
            raise Exception("Unable to create container: "
                            "{0}".format(out['output']))
        return out['status']
```
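For context, a hypothetical invocation (the container name and userdata path are made up for illustration):

```
# hypothetical usage; assumes the userdata file has already been written out
status = Container.create("openstack-single", "/tmp/userdata.yaml")
```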

We also set up a custom LXC config file:

```
    def create_container_and_wait(self):
        """ Creates container and waits for cloud-init to finish
        """
        self.tasker.start_task("Creating Container",
                               self.read_container_status)

        Container.create(self.container_name, self.userdata)

        with open(os.path.join(self.container_abspath, 'fstab'), 'w') as f:
            f.write("{0} {1} none bind,create=dir\n".format(
                self.config.cfg_path,
                'home/ubuntu/.cloud-install'))
            f.write("/var/cache/lxc var/cache/lxc none bind,create=dir\n")
            # Detect additional charm plugins and make available to the
            # container.
            charm_plugin_dir = self.config.getopt('charm_plugin_dir')
            if charm_plugin_dir \
               and self.config.cfg_path not in charm_plugin_dir:
                plug_dir = os.path.abspath(
                    self.config.getopt('charm_plugin_dir'))
                plug_base = os.path.basename(plug_dir)
                f.write("{d} home/ubuntu/{m} "
                        "none bind,create=dir\n".format(d=plug_dir,
                                                        m=plug_base))

            extra_mounts = os.getenv("EXTRA_BIND_DIRS", None)
            if extra_mounts:
                for d in extra_mounts.split(','):
                    mountpoint = os.path.basename(d)
                    f.write("{d} home/ubuntu/{m} "
                            "none bind,create=dir\n".format(d=d,
                                                            m=mountpoint))

        # update container config
        with open(os.path.join(self.container_abspath, 'config'), 'a') as f:
            f.write("lxc.mount.auto = cgroup:mixed\n"
                    "lxc.start.auto = 1\n"
                    "lxc.start.delay = 5\n"
                    "lxc.mount = {}/fstab\n".format(self.container_abspath))
```
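To make the appended config concrete: for a container rooted at, say, /var/lib/lxc/openstack-single (a path assumed for illustration), the lines written to its config would be:

```
lxc.mount.auto = cgroup:mixed
lxc.start.auto = 1
lxc.start.delay = 5
lxc.mount = /var/lib/lxc/openstack-single/fstab
```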

Here is the userdata.yaml file we output and pass into the container:
...


Changed in libvirt (Ubuntu Vivid):
importance: Undecided → High
Changed in libvirt (Ubuntu Wily):
importance: Undecided → High
Revision history for this message
Ryan Harper (raharper) wrote :

I can recreate the issue with a Vivid container running inside a Vivid VM, using the supplied user-data and container config.

Changed in libvirt (Ubuntu Vivid):
status: New → Confirmed
Revision history for this message
Ryan Harper (raharper) wrote :

A Wily VM with a Wily container does not reproduce this issue.

Changed in libvirt (Ubuntu Wily):
status: New → Invalid
Revision history for this message
Ryan Harper (raharper) wrote :

It appears to be unrelated to cgmanager. If I upgrade the vivid container to libvirt 1.2.16 from wily, we can launch the VMs.

Next I need to bisect the changes between vivid's libvirt 1.2.12 and wily's 1.2.16 to see what cgroup changes, if any, affected VM launching.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 1486296] Re: Unable to start VMs under an LXC container

Quoting Ryan Harper (<email address hidden>):
> It appears to be unrelated to cgmanager. If I upgrade the vivid container to libvirt 1.2.16 from wily, we can launch the VMs.

However, that's because libvirt in wily no longer uses cgmanager. So the vivid issue may still be due to cgmanager.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

1. When you say vivid VM and vivid container, do you mean a vivid VM launched inside a vivid container, regardless of the host? Or a VM, regardless of guest OS, launched inside a vivid container running inside a vivid VM?

2. Can you set cgmanager in the parent host/VM of the container to run with --debug, then show the 'journalctl -u cgmanager' output?

3. Perhaps also set log_level=1 in /etc/libvirt/libvirtd.conf, restart libvirt, and show the 'journalctl -u libvirt-bin' output. (A rough shell sketch of steps 2 and 3 follows.)
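In shell terms, those two steps look roughly like this; the libvirt-bin service name is assumed from the package name used in this thread:

```
# On the host/VM: run cgmanager with --debug (e.g. by editing its service
# configuration), then collect its journal:
sudo journalctl -u cgmanager

# In the container: raise libvirt's log level and collect its journal:
echo 'log_level = 1' | sudo tee -a /etc/libvirt/libvirtd.conf
sudo systemctl restart libvirt-bin   # service name assumed
sudo journalctl -u libvirt-bin
```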

Revision history for this message
Ryan Harper (raharper) wrote :

On Wed, Aug 19, 2015 at 2:33 PM, Serge Hallyn <email address hidden> wrote:

> 1. When you say vivid VM and vivid container, do you mean a vivid VM launched inside a vivid container, regardless of the host? Or a VM, regardless of guest OS, launched inside a vivid container running inside a vivid VM?

Wily host, launching a Vivid cloud image, running a Vivid ubuntu-cloud template inside the Vivid VM.
Wily host, launching a Wily cloud image, running a Wily ubuntu-cloud template inside the Wily VM.

uvt-simplestreams-libvirt --verbose sync release="~(vivid|wily)" arch=amd64
uvt-kvm create --developer --memory 1024 --cpu 2 --disk 10 --unsafe-caching v1 release=vivid arch=amd64
uvt-kvm create --developer --memory 1024 --cpu 2 --disk 10 --unsafe-caching w1 release=wily arch=amd64

Get the userdata.txt file from: http://paste.ubuntu.com/12130368/plain/

In each VM:
  scp userdata.txt $VM:
  sudo apt-get install libvirt-bin lxc
  sudo lxc-create -t ubuntu-cloud -n $release-v1 -- -u userdata.txt
  sudo lxc-start -n $release-v1
  sudo lxc-attach -n $release-v1
    sudo apt-get -y install libvirt-bin qemu-system-x86 uvtool ubuntu-cloud-keyring
    exit
  sudo lxc-ls --fancy  # look for the container IP
  ssh ubuntu@10.0.3.XXX
    uvt-simplestreams-libvirt sync release=trusty arch=amd64
    uvt-kvm create --disk 5 testme release=trusty arch=amd64

With Vivid on Vivid, this fails: uvt-kvm reports an unknown libvirt error, and /var/log/libvirt/qemu/testme.log contains a handshake failure and an input/output error message.

On Wily, this passes.

> 2. Can you set cgmanager in the parent host/VM of the container to run with --debug, then show the 'journalctl -u cgmanager' output?

Yes, I'll collect and update. I've also tested upgrading cgmanager to the wily level (0.37) in the VM/host only, and in both the VM/host and the container, but that did not fix the issue on Vivid.

> 3. Perhaps also set log_level=1 in /etc/libvirt/libvirtd.conf, restart libvirt, and show the 'journalctl -u libvirt-bin' output.

We have this, but unfortunately it doesn't have anything useful w.r.t. which calls in libvirt are failing. At best, we get a syslog error when libvirtd attempts to write to cgroups on behalf of the VM:

 Aug 18 20:18:59 openstack-single-ubuntu libvirtd[11203]: cgmanager: cgm_set for controller=devices, cgroup_path=machine/ubuntu-local-machine-1.libvirt-qemu failed: invalid request

I'll collect this as well in case I missed something important.

Ryan

Revision history for this message
Ryan Harper (raharper) wrote :

Serge,

Here's the complete libvirtd log during the run with full debug:

http://paste.ubuntu.com/12136138/

It's mostly noise, but I do see this:

% grep cgroup libvirtd-cgm-log.txt
Aug 20 15:20:15 vivid-v1 libvirtd[13114]: cgmanager: cgm_set for controller=devices, cgroup_path=machine/testme.libvirt-qemu failed: invalid request

I think this coincides with the cgmanager error on the host:

http://paste.ubuntu.com/12136201/

Aug 20 15:12:33 v1 cgmanager[19873]: cgmanager: Failed to chown procs file /run/cgmanager/fs/none,name=systemd/lxc/vivid-v1/user.slice/user-1000.slice/user@1000.service/cgroup.procs/cgroup.procs: Not a directory
Aug 20 15:12:33 v1 cgmanager[19873]: cgmanager: Failed to chown tasks file /run/cgmanager/fs/none,name=systemd/lxc/vivid-v1/user.slice/user-1000.slice/user@1000.service/cgroup.procs/tasks: Not a directory
Aug 20 15:12:33 v1 cgmanager[19873]: cgmanager: Failed to chown procs file /run/cgmanager/fs/none,name=systemd/lxc/vivid-v1/user.slice/user-1000.slice/user@1000.service/tasks/cgroup.procs: Not a directory
Aug 20 15:12:33 v1 cgmanager[19873]: cgmanager: Failed to chown tasks file /run/cgmanager/fs/none,name=systemd/lxc/vivid-v1/user.slice/user-1000.slice/user@1000.service/tasks/tasks: Not a directory
Aug 20 15:13:00 v1 cgmanager[19873]: cgmanager: Failed to chown procs file /run/cgmanager/fs/none,name=systemd/lxc/vivid-v1/user.slice/user-1000.slice/user@1000.service/cgroup.procs/cgroup.procs: Not a directory
Aug 20 15:13:00 v1 cgmanager[19873]: cgmanager: Failed to chown tasks file /run/cgmanager/fs/none,name=systemd/lxc/vivid-v1/user.slice/user-1000.slice/user@1000.service/cgroup.procs/tasks: Not a directory
Aug 20 15:13:00 v1 cgmanager[19873]: cgmanager: Failed to chown procs file /run/cgmanager/fs/none,name=systemd/lxc/vivid-v1/user.slice/user-1000.slice/user@1000.service/tasks/cgroup.procs: Not a directory
Aug 20 15:13:00 v1 cgmanager[19873]: cgmanager: Failed to chown tasks file /run/cgmanager/fs/none,name=systemd/lxc/vivid-v1/user.slice/user-1000.slice/user@1000.service/tasks/tasks: Not a directory

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

> Aug 20 15:12:33 v1 cgmanager[19873]: cgmanager: Failed to chown procs file /run/cgmanager/fs/none,name=systemd/lxc/vivid-v1/user.slice/user-1000.slice/user@1000.service/cgroup.procs/cgroup.procs: Not a directory

Egads. I'll try to take a look at this after lunch.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

I think the key is

Created /run/cgmanager/fs/blkio
cgmanager:set_value_main: target cgroup is not below r (20360)'s

This should have been fixed by

cgmanager: make exception for proxys placed in systemd unit

Digging...

Revision history for this message
Ryan Harper (raharper) wrote :

On Fri, Aug 21, 2015 at 12:51 PM, Serge Hallyn <email address hidden> wrote:

> I think the key is
>
> Created /run/cgmanager/fs/blkio
> cgmanager:set_value_main: target cgroup is not below r (20360)'s
>
> This should have been fixed by
>
> cgmanager: make exception for proxys placed in systemd unit
>
> Digging...

Right, I saw that commit and thought that *should* have fixed it. I've run 0.37-1 on the vivid system and container, but it did not affect the outcome.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

OK, so this is a cop-out, but simply dropping the three cgmanager patches fixes this for me.

I think I'm going to recommend doing so in the package. This is the same thing we did in wily. We needed the patches before in order to run libvirt in containers, but now that we have lxcfs by default, including in vivid, they are no longer needed. (A packaging sketch follows.)
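As a rough sketch, dropping quilt patches from the packaging looks like the following; the actual cgmanager patch filenames are not given in this thread, so the globs below are placeholders:

```
# placeholder patch names; the real filenames are not listed in this thread
sed -i '/cgmanager/d' debian/patches/series
rm debian/patches/*cgmanager*
dch -i 'Drop the cgmanager patches (LP: #1486296)'
```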

description: updated
Revision history for this message
Chris J Arges (arges) wrote : Please test proposed package

Hello Adam, or anyone else affected,

Accepted libvirt into vivid-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/1.2.12-0ubuntu14.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!
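For convenience, a sketch of enabling -proposed on vivid; the pocket line follows the standard sources.list format, and the wiki page above describes the full procedure:

```
# sketch: enable the vivid-proposed pocket, then pull in the updated libvirt
echo 'deb http://archive.ubuntu.com/ubuntu vivid-proposed main universe' | \
  sudo tee /etc/apt/sources.list.d/proposed.list
sudo apt-get update
sudo apt-get install libvirt-bin/vivid-proposed   # apt target-release syntax
```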

Changed in libvirt (Ubuntu Vivid):
status: Confirmed → Fix Committed
tags: added: verification-needed
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote : [libvirt/vivid] possible regression found

As a part of the Stable Release Updates quality process a search for Launchpad bug reports using the version of libvirt from vivid-proposed was performed and bug 1489751 was found. Please investigate this bug report to ensure that a regression will not be created by this SRU. In the event that this is not a regression remove the "verification-failed" tag from this bug report and tag 1489751 "bot-stop-nagging". Thanks!

tags: added: verification-failed
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

I can't reproduce that error with this package (i.e. using vivid-proposed).

But I'm waiting for the cause of that bug to become evident before removing the 'verification-failed' tag.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

I failed to reproduce the upgrade failure, so I removed the verification-failed tag.

tags: removed: verification-failed
Revision history for this message
Adam Stokes (adam-stokes) wrote :

I was able to start VMs within the container without any issues now.

Thanks!

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 1.2.12-0ubuntu14.2

---------------
libvirt (1.2.12-0ubuntu14.2) vivid; urgency=medium

  * Drop the cgmanager patches - they break under systemd, and are no longer
    needed in containers due to lxcfs. (LP: #1486296)

 -- Serge Hallyn <email address hidden> Fri, 21 Aug 2015 15:16:17 -0700

Changed in libvirt (Ubuntu Vivid):
status: Fix Committed → Fix Released
Revision history for this message
Chris J Arges (arges) wrote : Update Released

The verification of the Stable Release Update for libvirt has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
