Juju creates a container on the wrong bridge

Bug #1656326 reported by Aymen Frikha
This bug affects 8 people
Affects         Status        Importance  Assigned to    Milestone
Canonical Juju  Fix Released  High        John A Meinel
2.1             Fix Released  High        John A Meinel

Bug Description

When creating a new unit for a service, the unit was created in a container that was attached to the bridge lxdbr0 instead of the correct bridge, br-eno1. This behaviour is not consistent: sometimes the container is created on the right bridge.

Juju 2.0.2
MAAS 2.1.1

Joey Stanford (joey) wrote :

Aymen upgraded to MAAS 2.1.2 to verify that this was NOT a dup of #1643057. Problem remains.

tags: added: canonical-bootstack
Andrew McDermott (frobware) wrote :

Please could you attach the Juju logs from the machine hosting the container.

And also be aware of: https://bugs.launchpad.net/juju/+bug/1656217

Aymen Frikha (aym-frikha) wrote :
Aymen Frikha (aym-frikha) wrote :

In those logs I found:

2017-01-13 17:17:41 WARNING juju.provisioner lxd-broker.go:62 failed to prepare container "27/lxd/4" network config: {"hostname": ["Node with this Hostname already exists."]}
2017-01-13 17:17:41 WARNING juju.provisioner broker.go:97 incomplete DNS config found, discovering host's DNS config
2017-01-13 17:17:41 DEBUG juju.provisioner broker.go:110 setting DNS servers [local-cloud:10.182.254.50] and domains [maas] on container interface "eth0"

I don't know what that means.

Andrew McDermott (frobware) wrote :

Please could you describe your MAAS network setup; perhaps the easiest thing to do is capture some screenshots of the subnets and network node configurations.

What applications were you deploying? Was this the 2nd unit, or way more than that? Is there a bundle I could try, etc.

Andrew McDermott (frobware) wrote :

Ah... do the interface names on the host come close to 15 characters? I noticed this in the logs: "qbr420ddec0-7a". When Juju creates the bridges for the container it adds a prefix of 'br-', which will take the names to >15 characters; 15 characters plus the terminating NUL is a hard limitation in the kernel, and I suspect the bridge does not get activated and, due to a different Juju bug[1], we fall back to using lxdbr0.

[1] - the bug in question is similar to https://bugs.launchpad.net/juju/+bug/1656217 - the code path currently ignores errors when it should return with an error. By ignoring the error we default to using lxdbr0.
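
To make the name-length arithmetic above concrete, here is a minimal Go sketch. It is not taken from the Juju source: `bridgeNameFor` and `fitsKernelLimit` are hypothetical helpers, and the only facts assumed are the 'br-' prefix mentioned above and the kernel's 15-character limit on interface names.

```go
package main

import "fmt"

// maxIfaceNameLen is the longest interface name Linux accepts:
// IFNAMSIZ is 16 bytes, including the terminating NUL.
const maxIfaceNameLen = 15

// bridgePrefix is the prefix described in the comment above.
const bridgePrefix = "br-"

// bridgeNameFor returns the bridge name derived from a host interface.
// Hypothetical helper for illustration; it does not reproduce Juju's code.
func bridgeNameFor(iface string) string {
	return bridgePrefix + iface
}

// fitsKernelLimit reports whether name is short enough to be used
// as a Linux network interface name.
func fitsKernelLimit(name string) bool {
	return len(name) > 0 && len(name) <= maxIfaceNameLen
}

func main() {
	// "eno1" comes from the bug description, "qbr420ddec0-7a" from the logs.
	for _, iface := range []string{"eno1", "qbr420ddec0-7a"} {
		bridge := bridgeNameFor(iface)
		fmt.Printf("%s -> %s (len %d, fits: %v)\n",
			iface, bridge, len(bridge), fitsKernelLimit(bridge))
	}
}
```

With these inputs, "br-eno1" stays well under the limit, while prefixing the 14-character "qbr420ddec0-7a" yields a 17-character name, which the kernel would reject.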

Changed in juju:
status: New → Incomplete
John A Meinel (jameinel) wrote :

machine-0.log for a test run that demonstrates the failure.
Oddly enough, in exactly the same second, we get a line that says we successfully got the container networking config, and then a few lines later in the log we have a line that says we fail because the hostname already exists.

I have the feeling either we have 2 threads that are accidentally trying to create the container concurrently, or something very weird is happening.

Note that in this test run, it is unable to create the container because I accidentally broke outbound networking. However, that should only mean that the provisioner tries a few times to create it. I wonder if the second time it tries fails because it doesn't recognize any of the work done the first time we tried.

Aymen Frikha (aym-frikha) wrote :

We hit the same issue in another environment with a different network configuration.

John A Meinel (jameinel) wrote :

2017-01-18 00:40:24 WARNING juju.provisioner provisioner_task.go:713 starting instance: Error calling 'lxd forkstart juju-9867db-1-lxd-10 /var/lib/lxd/containers /var/log/lxd/juju-9867db-1-lxd-10/lxc.conf': err='exit status 1'
  lxc 20170118004018.993 ERROR lxc_apparmor - lsm/apparmor.c:apparmor_process_label_set:234 - No such file or directory - failed to change apparmor profile to lxd-juju-9867db-1-lxd-10_</var/lib/lxd>//&:lxd-juju-9867db-1-lxd-10_<var-lib-lxd>:
  lxc 20170118004018.993 ERROR lxc_sync - sync.c:__sync_wait:57 - An error occurred in another process (expected sequence number 5)
  lxc 20170118004018.993 ERROR lxc_start - start.c:__lxc_start:1338 - Failed to spawn container "juju-9867db-1-lxd-10".
  lxc 20170118004019.521 ERROR lxc_conf - conf.c:run_buffer:347 - Script exited with status 1
  lxc 20170118004019.521 ERROR lxc_start - start.c:lxc_fini:546 - Failed to run lxc.hook.post-stop for container "juju-9867db-1-lxd-10".

2017-01-18 00:40:35 WARNING juju.provisioner lxd-broker.go:62 failed to prepare container "1/lxd/10" network config: {"hostname": ["Node with this Hostname already exists."]}

seems to be an interesting item.

I don't specifically know why that one would be triggered.

John A Meinel (jameinel) wrote :

Regardless, falling back to 'lxdbr0' is clouding the issue, and I'd like to have a fix in the 2.1 series that would prevent that, and instead just report a provisioning failure for this container. That way we can both debug it, and make it easier for people to notice and fix something.
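
As a rough illustration of the difference between the two behaviours, here is a minimal Go sketch. It is not the Juju implementation and not the fix that eventually landed: `prepareFunc`, `chooseBridgeWithFallback`, `chooseBridgeStrict`, and `errNoBridge` are all hypothetical names introduced only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoBridge is a hypothetical sentinel error used only in this sketch.
var errNoBridge = errors.New("no usable host bridge for container")

// prepareFunc stands in for whatever prepares a container's network
// config and reports the bridge to attach to (hypothetical type).
type prepareFunc func(containerID string) (bridge string, err error)

// chooseBridgeWithFallback models the problematic behaviour: any error
// is swallowed and the container silently lands on lxdbr0.
func chooseBridgeWithFallback(prepare prepareFunc, id string) string {
	bridge, err := prepare(id)
	if err != nil || bridge == "" {
		return "lxdbr0"
	}
	return bridge
}

// chooseBridgeStrict models the proposed behaviour: the error is
// returned so that provisioning fails visibly.
func chooseBridgeStrict(prepare prepareFunc, id string) (string, error) {
	bridge, err := prepare(id)
	if err != nil {
		return "", fmt.Errorf("preparing network config for container %q: %w", id, err)
	}
	if bridge == "" {
		return "", fmt.Errorf("container %q: %w", id, errNoBridge)
	}
	return bridge, nil
}

func main() {
	// Simulated failure, loosely modelled on the warning in the logs.
	failing := func(string) (string, error) {
		return "", errors.New("node with this hostname already exists")
	}
	fmt.Println("fallback:", chooseBridgeWithFallback(failing, "27/lxd/4"))
	_, err := chooseBridgeStrict(failing, "27/lxd/4")
	fmt.Println("strict:", err)
}
```

The strict variant trades a container that "works" on the wrong network for an explicit provisioning error, which is easier to notice and debug.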

Changed in juju:
milestone: none → 2.1.0
Changed in juju:
status: Incomplete → Triaged
importance: Undecided → High
John A Meinel (jameinel)
Changed in juju:
assignee: nobody → John A Meinel (jameinel)
status: Triaged → In Progress
Simon Monette (simon-monette) wrote :

Another occurrence of this behavior. I have attached logs from the host machine and the failed container. Andrew asked about the length of the interface names, so I have also attached a dump of all interface names.

Simon Monette (simon-monette) wrote :
Changed in juju:
milestone: 2.1.0 → 2.2.0-alpha1
John A Meinel (jameinel) wrote :
John A Meinel (jameinel) wrote :

The original PR is being reverted by https://github.com/juju/juju/pull/6945
It worked for MAAS and AWS, but apparently Azure/GCE still rely on the fallback behavior to actually work.

John A Meinel (jameinel) wrote :

New PR: https://github.com/juju/juju/pull/7062

This seems to address all the issues I'm aware of that caused it to be pulled out in the first attempt.

John A Meinel (jameinel) wrote :

The part of this about "Node with this Hostname already exists" is being tracked in bug #1670873

Anastasia (anastasia-macmood) wrote :

The PR referenced in comment #15 has landed and has been forward-ported as part of a larger commit.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released