pre-installed lxc in cloud image produces broken lxc (and later lxd) containers

Bug #1509414 reported by Mike McCracken
This bug affects 5 people
Affects: lxc (Ubuntu)
Status: Fix Released
Importance: Critical
Assigned to: Stéphane Graber

Bug Description

[Problem]
The released wily image preinstalls lxc, which breaks the assumption that lxc's preinst packaging script makes:

It inspects the network to try to pick a 10.0.N.0 network that isn't being used, with N starting at 3, so this appears to have picked 10.0.3.0 when it was installed on whatever system was generating the image.

When a container is started, it will dhcp on eth0 and get 10.0.3.X as expected. The problem comes when the lxc-net service that is already installed in that container starts and configures *its* lxcbr0 with 10.0.3.X. The networking inside the container is broken at that point.
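A rough sketch of the selection logic described above, assuming the list of in-use addresses is passed in as text. The function name pick_lxc_subnet is hypothetical and this is not the actual preinst code:

```shell
#!/bin/sh
# Hypothetical sketch: pick the first 10.0.N.0/24 (N starting at 3) whose
# prefix does not appear in the supplied list of in-use addresses/routes.
# Illustrative only; not the real lxc packaging script.
pick_lxc_subnet() {
    used="$1"    # e.g. the output of: ip -4 -o addr show; ip -4 route
    n=3
    while [ "$n" -le 254 ]; do
        # A literal "10.0.N." prefix anywhere in the list marks N as taken.
        if ! printf '%s\n' "$used" | grep -Eq "(^|[^0-9.])10\.0\.$n\."; then
            echo "10.0.$n.0/24"
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}

pick_lxc_subnet "inet 10.0.3.15/24 scope global eth0"
```

Run inside a container whose eth0 already holds a 10.0.3.X lease, a check like this would skip 10.0.3.0/24 and print 10.0.4.0/24. A value chosen on the image build machine never gets that chance, which is the failure described here.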

This affects LXC containers, and should affect LXD containers but doesn't currently, as the metadata used for lxd images is still pointing to a beta2 release (bug 1509390).

The easiest way to reproduce this is to use the ubuntu-cloud lxc template on a wily host.

[Test Case]

1.) Verify expectation for each image
   - -disk1.img cloud image, check for file
   - -root.tar.xz image (used by lxd) and check for file
   - -root.tar.gz image (used by lxc)

   For each of those images, verify:
   a.) A cloud image should not have /etc/default/lxc-net
   b.) lxd should be installed (dpkg-query --show | grep lxd)
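Expectation (a) can be checked mechanically once an image has been unpacked or mounted. A minimal sketch, assuming the image root is available as a directory; the helper name is made up for illustration and it does not cover check (b):

```shell
#!/bin/sh
# Hypothetical helper: verify expectation (a) against an unpacked image root.
# ROOT is the directory where the -root.tar.xz / -root.tar.gz was extracted,
# or where the -disk1.img filesystem is mounted.
check_no_lxc_net_default() {
    root="$1"
    if [ -e "$root/etc/default/lxc-net" ]; then
        echo "FAIL: $root/etc/default/lxc-net exists"
        return 1
    fi
    echo "OK: no /etc/default/lxc-net in $root"
}
```

Usage would be something like: check_no_lxc_net_default /mnt/wily-root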

2.) Start instance from updated image and start instance in lxc inside
   launch instance on openstack or kvm or other
   verify lxcbr0 bridge exists
   lxc-create -t ubuntu-cloud -n bugcheck -- --release=wily --stream=daily
   # wait until lxc-ls --fancy shows 'running'
   lxc-attach -n bugcheck wget http://ubuntu.com

3.) Start instance from updated image and start instance in lxd inside
   launch instance on openstack or kvm or other
   verify lxcbr0 bridge exists
   lxd import-images ubuntu wily
   lxc launch ubuntu
   # wait some amount
   lxc attach bugcheck wget http://ubuntu.com

[Regression Potential]
The highest chance for fallout is a change in the /24 network that is chosen conflicting with some existing service.

[Other Info]
Default apt install of lxc has always picked some 10.0.X.0/24 network to use for its lxcbr0 bridge. That network (often 10.0.3.0/24) would then be unreachable from the host. The same behavior occurs with libvirt-bin and many other such services.

We are moving that logic to happen the first time that 'lxc-net' service starts.

This means first boot for a cloud instance rather than cloud-image build time.
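As an illustration of "configure at first service start", the sketch below writes the defaults file only when it does not exist yet. The variable names follow /etc/default/lxc-net; the function name and its arguments are assumptions, not the code that was uploaded:

```shell
#!/bin/sh
# Hypothetical sketch: create the lxc-net defaults file on first start,
# using a third octet chosen at that moment. An existing file is never
# overwritten, so an admin's own settings survive.
write_lxc_net_defaults() {
    conf="$1"    # path to the defaults file
    n="$2"       # third octet picked at first service start
    [ -e "$conf" ] && return 0
    cat > "$conf" <<EOF
USE_LXC_BRIDGE="true"
LXC_BRIDGE="lxcbr0"
LXC_ADDR="10.0.$n.1"
LXC_NETMASK="255.255.255.0"
LXC_NETWORK="10.0.$n.0/24"
LXC_DHCP_RANGE="10.0.$n.2,10.0.$n.254"
LXC_DHCP_MAX="253"
EOF
}
```

Because the file is generated on the running instance rather than on the image builder, the subnet reflects the network the instance actually sees.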

[Work around]
To patch / fix an existing cloud image so that lxc and lxd guests start, simply change the config in /etc/default/lxc-net to use something other than 10.0.3.0.

sudo sed -i '/^LXC.*10[.]0[.][0-9][.]/s/10[.]0[.][0-9][.]/10.0.4./g' /etc/default/lxc-net &&
    sudo service lxc-net stop &&
    sudo service lxc-net start
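To preview what the workaround changes before touching the real file, the same substitution (written here with the dots escaped so they match literally) can be run without -i on sample text. The sample lines are an assumed typical /etc/default/lxc-net, not copied from a real image:

```shell
#!/bin/sh
# Dry run of the workaround's substitution: reads stdin, writes stdout,
# and modifies nothing on disk.
rewrite_subnet() {
    sed '/^LXC.*10[.]0[.][0-9][.]/s/10[.]0[.][0-9][.]/10.0.4./g'
}

printf '%s\n' \
    'LXC_ADDR="10.0.3.1"' \
    'LXC_NETWORK="10.0.3.0/24"' \
    'LXC_DHCP_RANGE="10.0.3.2,10.0.3.254"' | rewrite_subnet
```

Every 10.0.3. prefix on the LXC_* lines becomes 10.0.4., including both ends of the DHCP range.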

=== End SRU Report ===

Related bugs:
 * bug 1510108: pre-installed lxc in cloud-image means loss of access to some 10.0.X.0/24

tags: added: cloud-installer
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in lxc (Ubuntu):
status: New → Confirmed
Daniel Westervelt (danwest) wrote :

Does not affect LXD only because the LXD metadata for simple-streams is out of date. We are going to hold off updating it until this bug is fixed and sru'ed.

Changed in lxc (Ubuntu):
importance: Undecided → Critical
Serge Hallyn (serge-hallyn) wrote :

Debdiff which works for me.

I tested this by creating a cloud container, temporarily setting USE_LXC_BRIDGE=false, rebooting, building the package, setting USE_LXC_BRIDGE=true (leaving 10.0.3 as the lxcbr0 subnet), rebooting. lxcbr0 comes up with 10.0.4.1 as expected. A nested trusty container works fine.

The package is targeted at wily, should be at xenial presumably. That should be the only needed update.

description: updated
description: updated
Stéphane Graber (stgraber) wrote :

Not sure I like this approach. An init script should never change a system config, so this is a packaging policy violation...

To be fair, anything we come up with which picks a random/unused subnet will still break users who may have this subnet in use behind a router, so that's not really an option either.

For wily, I'd say we simply turn lxc-net off completely. That will add an extra step for any user who wants to use LXD, but it will also guarantee we don't regress anyone in the process.

Doing so would require the CPC team to update /etc/default/lxc-net, setting USE_LXC_BRIDGE to false.

Serge Hallyn (serge-hallyn) wrote :

I don't like disabling lxc-net, because it's simpler to tell a user to

apt-get install lxd

than to

systemctl enable lxc-net

or

echo "USE_LXC_BRIDGE=true" | sudo tee -a /etc/default/lxc-net
systemctl restart lxc-net

Serge Hallyn (serge-hallyn) wrote :

Updated debdiff, which

1. stops creation of /etc/default/lxc-net on package install
2. removes that file only if upgrading from the 1.0.4ubuntu4 version with an unmodified /etc/default/lxc-net file
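One way to express "remove only if unmodified" is a checksum comparison against the known packaged default. This is a sketch under that assumption; the function name and how the reference checksum is obtained are hypothetical and not taken from the debdiff:

```shell
#!/bin/sh
# Hypothetical sketch: delete a config file only when its md5 checksum
# matches a known default, i.e. the admin never edited it.
remove_if_unmodified() {
    file="$1"
    known_sum="$2"
    [ -f "$file" ] || return 0
    actual_sum=$(md5sum "$file" | cut -d' ' -f1)
    if [ "$actual_sum" = "$known_sum" ]; then
        rm -f "$file"
    fi
}
```

A file that has been edited by hand has a different checksum and is left in place.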

Scott Moser (smoser)
description: updated
description: updated
description: updated
Scott Moser (smoser)
summary: - lxc postinst script checks available interfaces, can choose
+ pre-installed lxc in cloud image produces broken lxc (and later lxd)
+ containers
description: updated
Serge Hallyn (serge-hallyn) wrote :

new patch.

It upgrades a broken container fine, but lxc-net is not properly started until I manually call

/usr/lib/x86_64-linux-gnu/lxc/lxc-net stop
/usr/lib/x86_64-linux-gnu/lxc/lxc-net start

or reboot

Robert C Jennings (rcj) wrote :

Action plan:
Stage 1 - Configure lxc-net at boot rather than at install.
 * This addresses the network failure for 15.10 containers started on 15.10 hosts (patch above in comment #6)
Stage 2 - Start lxc-net through systemd on the first launch of an LXC container.
 * This mitigates the unroutable 10.0.x.0/24 network issue for general cloud image users. With this step we’re back to Trusty behavior.
 * This will work for privileged users, like Juju, without any interaction. Unprivileged users will be directed to start the service on the first container launch.

Next action:
 - serge-hallyn working on patch (last update in comment #7) and code in ppa:serge-hallyn/lxc-natty. Patch is not yet ready for upload and serge will update as he progresses.
 - Once ready, uploaded, and published in -proposed, email ring rcj/utlemming and we'll ensure cloud image builder picks this up ASAP to build -proposed images

Serge Hallyn (serge-hallyn) wrote :

Final proposed patch for now. Uploaded to ppa:serge-hallyn/lxc-natty for wily.

Installing this on a fresh ubuntu-cloud wily container (i.e. a broken one) results in working lxcbr0 on new subnet.

Changed in lxc (Ubuntu):
assignee: nobody → Stéphane Graber (stgraber)
Serge Hallyn (serge-hallyn) wrote :

Handle one more corner case

Serge Hallyn (serge-hallyn) wrote :
Stéphane Graber (stgraber) wrote : Please test proposed package

Hello Mike, or anyone else affected,

Accepted lxc into wily-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/lxc/1.1.4-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-needed
Robert C Jennings (rcj) wrote :

The cloud-image builder picked up the change and is building images. They are with the LP buildds now. I will update this bug once publication completes.

Robert C Jennings (rcj) wrote :

Cloud image downloads for amd64, i386, and ppc64el are available @ http://cloud-images.ubuntu.com/proposed/wily/20151024/

The amd64 image is also available in canonistack lcy02 region as lp1509414/wily-proposed

Robert C Jennings (rcj) wrote :

Cloud images built from proposed are available.

Next action:
 - Verification of proposed package.

Serge Hallyn (serge-hallyn) wrote :

New image works for me in lxc:

lxcbr0 Link encap:Ethernet HWaddr 76:79:3e:90:1c:88
    inet addr:10.0.4.1 Bcast:0.0.0.0 Mask:255.255.255.0

Robert C Jennings (rcj) wrote :

Be aware in your testing that the lxc-net service can come up slowly depending on the speed of your cloud instance. Until that service starts and creates the bridge (lxcbr0), the container's networking will still function; watch out for this false positive in your testing.
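To avoid that false positive, a test can wait for the bridge to appear before checking connectivity. A minimal sketch; the helper name is made up, and SYS_NET_DIR is an overridable stand-in added only so the logic can be tried without a real bridge:

```shell
#!/bin/sh
# Hypothetical helper: wait until a network interface exists, polling once
# per second, then report success or failure via the exit status.
wait_for_iface() {
    iface="$1"
    timeout="${2:-60}"    # seconds to wait before giving up
    dir="${SYS_NET_DIR:-/sys/class/net}"
    while [ "$timeout" -gt 0 ]; do
        [ -e "$dir/$iface" ] && return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    [ -e "$dir/$iface" ]
}
```

Inside the container under test this could be used as: wait_for_iface lxcbr0 120 && wget http://ubuntu.com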

Cheryl Jennings (cherylj) wrote :

Kicked off instances for testing with Juju. Will update with results once my testing is complete.

Tycho Andersen (tycho-s) wrote :

The lxc in wily-proposed also works for me: inet addr:10.0.4.1

Thanks all.

Jon Grimm (jgrimm) wrote :

Tested wily-proposed cloud-image running in local kvm. Started wily-proposed container via lxc (using ubuntu-cloud template), verified the container's lxcbr0 looks fine (10.0.4.1), verified networking works (via www.google.com).

Serge Hallyn (serge-hallyn) wrote :

For stage two, at least with systemd, I changed /lib/systemd/system/lxd-startup.service to:

[Unit]
Description=Container hypervisor based on LXC - boot time check
After=cgmanager.service lxd-unix.socket
Requires=cgmanager.service lxd-unix.socket

[Service]
Type=oneshot
ExecStart=/usr/bin/lxd activateifneeded
TimeoutSec=30s

[Install]
WantedBy=multi-user.target

while removing the files

    /etc/systemd/system/multi-user.target.wants/lxc-net.service
    /etc/systemd/system/multi-user.target.wants/lxc.service

(i.e. in packaging we would remove

[Install]
WantedBy=multi-user.target

from /lib/systemd/system/lxc{,-net}.service)

Now when the system comes up, lxcbr0 is not there. When I do 'lxc list', it comes up.

A user who wants to use non-lxd lxc can do

sudo systemctl add-wants multi-user.target lxc.service

to make lxc and lxc-net start back up at boot.

The lxd-startup.service also still works, since it works by activating lxd if the database shows containers need to be started.

Adam Stokes (adam-stokes) wrote :

I've also tested manually and via our OpenStack installer and lxcbr0 is assigned (10.0.4.1) and the network is able to reach out to the internet as before.

Serge Hallyn (serge-hallyn) wrote :

This lxc debdiff (not appropriate for upstream lxc) and a pull request against lxd-pkg-ubuntu (https://github.com/lxc/lxd-pkg-ubuntu/pull/7) combined should implement stage 2 of the fix.

Note I've tested these when separately implemented by hand, but have not built packages with this debdiff+pull-request.

Cheryl Jennings (cherylj) wrote :

Verified that Juju can start lxc containers with the proposed changes. The containers came up and were able to communicate with the state server, even when on a separate machine.

Robert C Jennings (rcj) wrote :

My testing with the cloud image containing the proposed package has been successful.

Just a note again that the test case detailed in the description is fine with the understanding that network testing needs to ensure the lxc-net service has started lxcbr0 or a false positive is possible (per comment #18).

Cheryl Jennings (cherylj) wrote :

Serge - your changes from comment #22 will break juju with lxc. juju will need to be modified to call systemctl add-wants multi-user.target lxc.service

Scott Kitterman (kitterman) wrote : Re: [Bug 1509414] Re: pre-installed lxc in cloud image produces broken lxc (and later lxd) containers

On Sunday, October 25, 2015 03:12:44 AM you wrote:
> Serge - your changes from comment #22 will break juju with lxc. juju
> will need to be modified to call systemctl add-wants multi-user.target
> lxc.service

If that's the case, this approach for a fix probably isn't appropriate for an
SRU.

Stéphane Graber (stgraber) wrote :

I agree, the stage 2 fix for this issue concerns me with regard to regressing current use cases.

As much as I'd like to get rid of the rest of this issue (any user of 10.0.4.0/24 behind a router loses connectivity to that subnet), we must make sure we do not regress anyone who's been relying on "apt-get install lxc" providing something that can immediately be used both by root and by unprivileged users.

Serge: We may be able to provide a hook, added to /usr/share/lxc/config/common.conf.d which will bring the bridge up automatically at first lxc container start. Such a hook would unfortunately need to be setuid so that it also works for unprivileged users. We'd also need to make sure that the current lxc hooks are sufficient from a timing point of view to do so (run before lxc checks whether the requested bridge exists).

Robert C Jennings (rcj) wrote :

Stéphane,

When this was added to the cloud seed we changed this from "users that install LXC will lose connectivity to a 10.0.x.0/24 network" to "all cloud users do not have connectivity to a 10.0.x.0/24 network at boot" and the cause/effect will not be as clear to an end user. This had come up in conversation but had not been documented in the bug. So let us continue to search for a solution like what you have outlined in the last paragraph of comment #29.

Serge Hallyn (serge-hallyn) wrote :

Quoting Stéphane Graber (<email address hidden>):
> I agree, the stage 2 fix for this issue concerns me with regard to
> regressing current use cases.
>
> As much as I'd like to get rid of the rest of this issue (any user of
> 10.0.4.0/24 behind a router loses connectivity to that subnet), we must
> make sure we do not regress anyone who's been relying on "apt-get
> install lxc" providing something that can immediately be used both by
> root and for unprivileged users.
>
> Serge: We may be able to provide a hook, added to
> /usr/share/lxc/config/common.conf.d which will bring the bridge up
> automatically at first lxc container start. Such a hook would
> unfortunately need to be setuid so that it also works for unprivileged
> users. We'd also need to make sure that the current lxc hooks are
> sufficient from a timing point of view to do so (run before lxc checks
> whether the requested bridge exists).

smoser and I had considered creating a new lxc-base (I'm making that
name up) package which is the current lxc without the multi-user.target
wants symlink for lxc, and making lxd depend on that package. Regular
lxc then would add the multiuser.target wants symlink for lxc.

Juju would not regress, regular cloud users would not have lxcbr0 until
they used lxd.

Stéphane Graber (stgraber) wrote :

Such shuffling around as an SRU seems pretty risky to me. Having the main lxc package be essentially empty except for the systemd postinst also feels weird.

This would also further complicate things when I then break lxc into lxc and lxc-common this cycle (lxc-common will include the apparmor profiles and the binaries that are used by liblxc1).

Robert C Jennings (rcj) wrote :

I agree that this shuffling around is not pretty, but we need a solution that makes 10.0.0.0/16 routable in cloud images where lxc/lxd are not in use, as was the case prior to http://bazaar.launchpad.net/~ubuntu-core-dev/ubuntu-seeds/ubuntu.wily/revision/2360

The current situation conflicts with how clouds are instructing users to set up private networks. [1] [2]

[1] http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Subnets.html
[2] https://azure.microsoft.com/en-us/documentation/articles/virtual-networks-reserved-private-ip/

Stéphane Graber (stgraber) wrote :

A pre-start lxc hook with sufficient privileges to start lxc-net would cover all use cases as far as I can tell and would only require the addition of two files to the lxc package.

Such a hook would also cover LXD as LXD does exec all LXC hooks, so we wouldn't even have to mess with those init scripts at all. Just ship such a hook and have the cloud-images ship with the lxc-net job disabled. At container startup time, the hook fires and if the job isn't running, it gets started.
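The hook idea could look roughly like the sketch below: check for the bridge and start lxc-net only if it is missing. BRIDGE_CHECK and START_CMD are overridable stand-ins so the logic can be exercised without root; the function name and defaults are assumptions, not code shipped in any package:

```shell
#!/bin/sh
# Hypothetical pre-start hook body: bring up the bridge on demand.
ensure_bridge() {
    bridge="${1:-lxcbr0}"
    check="${BRIDGE_CHECK:-ip link show}"
    start="${START_CMD:-service lxc-net start}"
    # Intentionally unquoted so the command strings split into words.
    if $check "$bridge" >/dev/null 2>&1; then
        return 0    # bridge already present, nothing to do
    fi
    $start
}
```

Such a script would be wired in through an lxc.hook.pre-start entry in a file under /usr/share/lxc/config/common.conf.d. The sketch does not address how the hook gains enough privilege to start the service for unprivileged users, nor whether pre-start runs early enough relative to lxc's bridge check.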

Robie Basak (racb) wrote :

> Doing so would require the CPC team to update /etc/default/lxc-net, setting USE_LXC_BRIDGE to false.

Note that this would cause a conffile prompt for all users using cloud images who dist-upgrade to pick up the latest updates after another lxc SRU, which breaks people doing automatic deployments (even though they could override, many don't). I think you're not planning to do this now anyway? Anyway, I think it's important for everyone to understand that altering conffiles as part of cloud image builds causes future problems and so needs to be avoided - especially for default cases.

Scott Moser (smoser)
description: updated
Scott Moser (smoser) wrote :
tags: added: verification-done
removed: verification-needed
description: updated
Scott Moser (smoser) wrote :

I've opened bug 1510108 to address 'Stage 2' of this fix.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc - 1.1.4-0ubuntu1.1

---------------
lxc (1.1.4-0ubuntu1.1) wily-proposed; urgency=medium

  * lxc-net init script: update to select the default lxc bridge network
    at first service start time rather than install time. (LP: #1509414)
  * lxc-net init script: also move cleanup() definition as it was undefined
    when called after failed dnsmasq.
  * lxc.preinst:
    - remove code for writing /etc/default/lxc-net (moved to lxc-net service)
    - add code removing just the known-potentially-bad /etc/default/lxc-net
    - if user had deleted /etc/default/lxc-net (intending to disable lxcbr0),
       honor that by creating one which says not to use lxcbr0.

 -- Serge Hallyn <email address hidden> Fri, 23 Oct 2015 19:29:23 -0500

Changed in lxc (Ubuntu):
status: Confirmed → Fix Released
Stéphane Graber (stgraber) wrote : Update Released

The verification of the Stable Release Update for lxc has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
