juju 2.1-beta5 - juju 2.1rc2 - localhost failing to allocate a nested container with an ip

Bug #1664409 reported by Adam Stokes
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
| Canonical Juju | Fix Released | High | John A Meinel | |
| 2.1 | Fix Released | Critical | John A Meinel | |

Bug Description

To reproduce:

sudo snap install conjure-up --classic --edge
conjure-up kubernetes-core localhost

My status output and machine-0 log from a localhost deployment:

http://paste.ubuntu.com/23991798/
http://paste.ubuntu.com/23991794/

On the host system my lxc list output:

+---------------+---------+--------------------------------+------+------------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+---------------+---------+--------------------------------+------+------------+-----------+
| juju-677a75-0 | RUNNING | 10.0.8.28 (eth0) | | PERSISTENT | 0 |
+---------------+---------+--------------------------------+------+------------+-----------+
| juju-ba567e-0 | RUNNING | 10.0.8.251 (eth0) | | PERSISTENT | 0 |
| | | 10.0.9.1 (lxdbr0) | | | |
+---------------+---------+--------------------------------+------+------------+-----------+
| juju-ba567e-1 | RUNNING | 172.17.0.1 (docker0) | | PERSISTENT | 0 |
| | | 10.0.8.181 (eth0) | | | |
+---------------+---------+--------------------------------+------+------------+-----------+

We've also reproduced this on GCE and those logs will be coming soon.

This issue is not seen on Juju 2.0.3 or on later versions up to Juju 2.1-beta4. Additionally, Rick Harding tested Juju 2.1 rc2 on a MAAS deployment, and the colocated LXD containers there did obtain an IP address successfully.

tags: added: conjure
description: updated
tags: added: regression
Changed in juju:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 2.1.0
Adam Stokes (adam-stokes) wrote :

Some more information:

```
ubuntu@tupac:~⟫ juju ssh 0
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-62-generic x86_64)

 * Documentation: https://help.ubuntu.com
 * Management: https://landscape.canonical.com
 * Support: https://ubuntu.com/advantage

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

1 package can be updated.
0 updates are security updates.

Last login: Tue Feb 14 00:30:10 2017 from 10.0.8.1
ubuntu@juju-ba567e-0:~$ lxc list
+---------------------+---------+-------------------+------+------------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+---------------------+---------+-------------------+------+------------+-----------+
| juju-ba567e-0-lxd-0 | RUNNING | | | PERSISTENT | 0 |
+---------------------+---------+-------------------+------+------------+-----------+
ubuntu@juju-ba567e-0:~$ lxc exec juju-ba567e-0-lxd-0 bash
root@juju-ba567e-0-lxd-0:~# dhclient eth0

root@juju-ba567e-0-lxd-0:~# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:16:3e:b3:f0:94
          inet addr:10.0.9.197 Bcast:10.0.9.255 Mask:255.255.255.0
          inet6 addr: fe80::216:3eff:feb3:f094/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:55557 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30577 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:110153722 (110.1 MB) TX bytes:2556853 (2.5 MB)

```

And now that will resolve the deployment:

http://paste.ubuntu.com/23992013/
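
A minimal sketch, in Python, of spotting the symptom shown above: a nested container that is RUNNING but has nothing in the IPV4 column of `lxc list`. The function name `containers_without_ipv4` is hypothetical, and the parser assumes the table layout of `lxc list` as pasted in this report.

```python
def containers_without_ipv4(lxc_list_output):
    """Return names of RUNNING containers whose IPV4 column is empty."""
    missing = []
    for line in lxc_list_output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +-----+ border rows
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) < 3 or cells[0] in ("", "NAME"):
            continue  # header row, or continuation row listing extra addresses
        name, state, ipv4 = cells[0], cells[1], cells[2]
        if state == "RUNNING" and not ipv4:
            missing.append(name)
    return missing


# Listing copied from inside machine 0 in the comment above.
nested_listing = """
+---------------------+---------+-------------------+------+------------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+---------------------+---------+-------------------+------+------------+-----------+
| juju-ba567e-0-lxd-0 | RUNNING | | | PERSISTENT | 0 |
+---------------------+---------+-------------------+------+------------+-----------+
"""

print(containers_without_ipv4(nested_listing))  # prints ['juju-ba567e-0-lxd-0']
```

Any name this returns is a candidate for the manual `dhclient eth0` workaround shown above.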

Charles Butler (lazypower) wrote :

attached juju-crashdump output for the failed GCE deployment

description: updated
Menno Finlay-Smits (menno.smits) wrote :

I can't deploy at all with the snap version of conjure-up. With no other Juju or conjure-up installed:

$ snap install conjure-up --classic --edge
...
$ conjure-up kubernetes-core localhost
[info] Summoning kubernetes-core to localhost
[info] Bootstrapping Juju controller "conjure-up-localhost-bde" with deployment "conjure-up-kubernetes-core-bb9"
[info] Running pre deployment tasks.
[error] Failed to run pre deploy task: Expecting value: line 1 column 1 (char 0)

This seems like a different problem from what is being reported, though.

John A Meinel (jameinel) wrote :

For nested LXD, are we supposed to be bridging to the outer LXD's network interface and getting IP addresses from the host bridge? Or are we supposed to be using local-only IP addresses like we do in AWS/Azure? How are we getting those IP addresses? Just DHCP on the inner bridge, which is wired up to the outer bridge?
I don't believe this was ever "clearly defined to work" in previous versions of Juju, but it "might have happened to work".

John
=:->

summary: - juju 2.1-beta5 - juju 2.1rc2 - localhost failing to allocate a colocated
+ juju 2.1-beta5 - juju 2.1rc2 - localhost failing to allocate a nested
container with an ip
Anastasia (anastasia-macmood) wrote :

Downgrading Importance as per comment #4: this is new functionality that has not been previously defined.
Let us know if this is Critical for 2.1.0, as we are in the final stretch before the release.

Changed in juju:
milestone: 2.1.0 → 2.2.0-alpha1
importance: Critical → High
Menno Finlay-Smits (menno.smits) wrote :

FWIW, the issue described in #3 has been reported here: https://github.com/conjure-up/conjure-up/issues/674.

Adam Stokes (adam-stokes) wrote :

This is functionality we've depended on since the beginning of Juju 2. Also, conjure-up is advertised at https://www.ubuntu.com/cloud as being able to do deployments like these on a single machine.

This should be considered a huge blocker and Critical, as Kubernetes under the localhost provider does not work.

John A Meinel (jameinel) wrote :

Adam, I'm trying to address this, but I'd really like to understand how it was concretely working so that I can make it a purposeful fix. In 2.0 there is a bit of code that does "I'll try to make this work, but if that fails I'll fall back to something else", and I'm trying hard to make it a clear "right here, this is how things work" instead.

John A Meinel (jameinel) wrote :

So it looks like this worked because Conjure-up completely carved up the networking for the containers, and assumed Juju would just ignore that and proceed.

Specifically, conjure-up is forcing this LXD configuration:

name: juju-##MODEL##
config:
  boot.autostart: "true"
  linux.kernel_modules: ip_tables,ip6_tables,netlink_diag,nf_nat,overlay
  raw.lxc: |
    lxc.aa_profile=unconfined
    lxc.mount.auto=proc:rw sys:rw
  security.nesting: "true"
  security.privileged: "true"
description: ""
devices:
  aadisable:
    path: /sys/module/nf_conntrack/parameters/hashsize
    source: /dev/null
    type: disk
  aadisable1:
    path: /sys/module/apparmor/parameters/enabled
    source: /dev/null
    type: disk
  eth0:
    mtu: "9000"
    name: eth0
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth1:
    mtu: "9000"
    name: eth1
    nictype: bridged
    parent: conjureup0
    type: nic
  root:
    path: /
    type: disk

Notice that this is forcing the container to have 2 network interfaces, and Juju is going to then be trying to set its own network configuration for each container, rather than just inheriting the values from the 'default' profile.

So it never worked because Juju made it work; it only worked because conjure-up worked around what Juju wasn't doing (spaces support on LXD).
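
As a rough illustration of the point about two network interfaces, here is a minimal Python sketch that lists the bridged NICs in a profile. The helper `bridged_nics` is hypothetical, and the dict is a hand-trimmed copy of the `devices` section quoted above rather than anything read from LXD.

```python
def bridged_nics(profile):
    """Return {device name: parent bridge} for each bridged NIC in a profile dict."""
    return {
        name: device.get("parent")
        for name, device in profile.get("devices", {}).items()
        if device.get("type") == "nic" and device.get("nictype") == "bridged"
    }


# Hand-trimmed copy of the devices section of the profile quoted above.
conjure_up_profile = {
    "devices": {
        "eth0": {"mtu": "9000", "name": "eth0", "nictype": "bridged",
                 "parent": "lxdbr0", "type": "nic"},
        "eth1": {"mtu": "9000", "name": "eth1", "nictype": "bridged",
                 "parent": "conjureup0", "type": "nic"},
        "root": {"path": "/", "type": "disk"},
    }
}

print(bridged_nics(conjure_up_profile))
# prints {'eth0': 'lxdbr0', 'eth1': 'conjureup0'}
```

Two entries here means every container built from the profile gets both `eth0` and `eth1`, which is the situation described above.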

Adam Stokes (adam-stokes) wrote :

I've tried removing those bridges and the problem is still reproducible.

John A Meinel (jameinel) wrote :

https://github.com/juju/juju/pull/6985

When we saw an 'lxdbr0' we would attempt to use it for the container, but if it didn't have an address yet, we failed to set DHCP inside the container on eth0.

I'm not sure the fixes are all there for Trusty machines (where lxd isn't installed by default, so 'lxdbr0' doesn't exist at all, even without an address). But for Xenial machines, I've tested 2.1 with my patch and 2.1 without my patch, and running:
 juju deploy cs:~jameinel/ubuntu-lite-nested

succeeds with the patch and fails without it.
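
The failure mode described in this comment can be modelled roughly as follows. This is a hypothetical Python sketch of the behaviour as stated here, not code from the PR or from Juju. The function name, the `patched` flag and the use of ifupdown-style `/etc/network/interfaces` text are all assumptions made for illustration.

```python
LOOPBACK_STANZA = "auto lo\niface lo inet loopback\n"
DHCP_ETH0_STANZA = "auto eth0\niface eth0 inet dhcp\n"


def render_nested_interfaces(bridge_has_address, patched=True):
    """Model the interfaces text written for a nested container on lxdbr0.

    Hypothetical model only: it encodes the description in the comment,
    not the actual Juju implementation.
    """
    if bridge_has_address or patched:
        # eth0 is told to ask for a lease over DHCP.
        return LOOPBACK_STANZA + "\n" + DHCP_ETH0_STANZA
    # Behaviour described for the unpatched code: lxdbr0 exists but has no
    # address yet, so no DHCP stanza is written and eth0 stays unconfigured.
    return LOOPBACK_STANZA


for patched in (False, True):
    text = render_nested_interfaces(bridge_has_address=False, patched=patched)
    print("patched" if patched else "unpatched", "->",
          "dhcp on eth0" if "inet dhcp" in text else "eth0 unconfigured")
```

In this model the unpatched case matches the symptom reported at the top of the bug, where the container is RUNNING and only gets an address once `dhclient eth0` is run by hand.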

John A Meinel (jameinel) wrote :

https://github.com/lxc/lxd/issues/2885 was opened around "security.nesting=true" not being enough to actually get nested containers. Though I'm testing on Xenial with the backported LXD 2.8, not sure if that matters.

Anastasia (anastasia-macmood) wrote :

Marking as Fix Committed: to the best of our knowledge, with the referenced PR landed, the issue is fixed.

Anastasia (anastasia-macmood) wrote :

I'll leave the bug open for 2.2 should we decide to follow a different approach there.

tags: added: eda
John A Meinel (jameinel) wrote :

The outcome of the LXD issue is that you might only need to set "nested=true" on the outer container, but then you're likely to need to set "privileged=true" for the inner containers. That might be OK from a security perspective, as you don't have root on the outermost host.
Otherwise we have to configure the outer container with a huge range of UIDs so that it can slice those up for the nested containers.
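
To give a feel for the "huge range of UIDs" arithmetic, here is a minimal Python sketch. It assumes each unprivileged container needs its own block of 65536 IDs and that the outer container keeps the first block for its own users. Both the block size and the helper `nested_uid_ranges` are assumptions for illustration, not values taken from Juju or from the LXD issue.

```python
IDS_PER_CONTAINER = 65536  # assumption: one full 16-bit ID block per container


def nested_uid_ranges(outer_size, nested_count, per_container=IDS_PER_CONTAINER):
    """Slice an outer container's ID space into one block per nested container.

    Returns (start, size) pairs expressed in the outer container's own ID
    space. The first block is left for the outer container itself.
    """
    needed = (nested_count + 1) * per_container
    if needed > outer_size:
        raise ValueError(
            "outer container needs at least %d IDs, has %d" % (needed, outer_size)
        )
    return [((i + 1) * per_container, per_container) for i in range(nested_count)]


# An outer container given three blocks can host two unprivileged nested containers.
print(nested_uid_ranges(3 * IDS_PER_CONTAINER, 2))
# prints [(65536, 65536), (131072, 65536)]
```

Under these assumptions an outer container with only a single 65536-ID block has nothing left to hand out, which is why the alternative is to make the inner containers privileged.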

Anastasia (anastasia-macmood) wrote :

This has been forward-ported into 2.2 (develop) as part of a larger commit.

Changed in juju:
status: Triaged → Fix Committed
assignee: nobody → John A Meinel (jameinel)
Changed in juju:
status: Fix Committed → Fix Released