applications deployed to lxd on aws instances failing

Bug #1684143 reported by james beedy
This bug affects 1 person
Affects: Canonical Juju
Status: In Progress
Importance: Critical
Assigned to: John A Meinel

Bug Description

Juju-deployed LXD is failing across the board for me right now on AWS instances using the JAAS controller.

I created a new model this morning (first model I've created on 2.1.2), and went to deploy some charms, and my lxd status just shows "down": http://paste.ubuntu.com/24413976/

I initially thought it might be a 2.1.2 thing, so I verified one of my older 2.0.3 models.

You can see here (2.0.3) http://paste.ubuntu.com/24414148/ that this use case was working for me, while machine 10, which was deployed this morning, exhibits the bug.

To reproduce:

1) deploy an instance to aws `juju add-machine`
2) deploy something to lxd on the newly created machine `juju deploy redis --to lxd:#`
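
For example (a minimal reproduction; machine number 5 below is just a placeholder for whatever `juju add-machine` reports):

 juju add-machine              # prints e.g. "created machine 5"
 juju deploy redis --to lxd:5  # place a redis unit in an LXD container on machine 5
 juju status                   # the new container's agent then shows as "down"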

james beedy (jamesbeedy)
description: updated
Revision history for this message
james beedy (jamesbeedy) wrote :

@anastasia-macmood will you put some priority on this for me please?

james beedy (jamesbeedy)
description: updated
description: updated
Revision history for this message
james beedy (jamesbeedy) wrote :

I can get past this issue when using a different aws account/creds. See http://paste.ubuntu.com/24414510/

What is interesting about the paste above is that those instances are not landing in 1a. In my failing example above, I'm deploying instances to a space whose subnets are only in 1a, so those instances can only land in 1a.

Revision history for this message
james beedy (jamesbeedy) wrote :

The issue possibly has something to do with instances not deploying to 1a: http://paste.ubuntu.com/24414962/

Revision history for this message
james beedy (jamesbeedy) wrote :

After some digging, I realized some instances were getting an interface named ens3, and some were getting an interface named eth0. The instances getting ens3 are the instances on which LXD fails.

https://bugs.launchpad.net/juju/+bug/1684248

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.2.0
Changed in juju:
milestone: 2.2.0 → 2.2-rc1
Revision history for this message
John A Meinel (jameinel) wrote :

Given all of the containers in the "working" scenario are only using private lxd bridges behind NAT anyway (they all show up as 10.0.0.X), I'm not sure why eth0 vs ens3 would be a problem for us.

Is it possible to set the logging config to DEBUG level and include the output of both the controller machine and the host machine that is trying to start the lxd bridge?
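
That could look something like this (a sketch with the 2.x CLI; the model name and machine number are illustrative):

 juju model-config -m mymodel logging-config="<root>=DEBUG"
 juju debug-log -m mymodel --include machine-10 --replay > machine-10.log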

There are a couple of things that *could* be happening:

1) bug #1668547, if something happened where a new Xenial image is suddenly including a newer LXD version (previously that version of LXD was only available via 'xenial-backports' if people explicitly requested it).

That *might* explain the ens3 thing, as in "the images were updated and now use the normal Xenial naming convention" (not a problem for us) "and also ship a newer version of lxd" (a problem, as we would see that 'lxdbr0' exists but isn't configured correctly).

2) Something else about ens3 that is causing us to not route containers correctly.

Another way to determine if (1) is true is if you could run something like:
 juju run --machine=<machine-id> "lxd --version"
And see if there is a difference in the version reported by the machines that *can* create containers vs the ones that *cannot*.
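
For example (machine numbers are illustrative):

 juju run --machine=8 "lxd --version"     # a host where containers come up fine
 juju run --machine=10 "lxd --version"    # a host where the container stays "down"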

3) It doesn't quite follow why it would be a *credentials* thing, unless visibility of Ubuntu images is somehow different in different regions. So it's entirely possible that the image idea is a red herring.

bug #1668547 is still my best guess as to what would be causing this.

The particular fix for that one shouldn't be terribly hard, but it hasn't been a priority and needs a bit of digging to make sure it's the right fix. Obviously, if Ubuntu images are going to suddenly ship with a different set of packages out-of-the-box, the importance goes up significantly.

4) Lots of other possibilities that I wouldn't be able to predict, which is why the logs would be useful. They would help pin down what might be triggering the failures (e.g. do we detect that there is an lxdbr0 but not find an address for it, or do we configure containers with lxdbr0 as their bridge but they never get IP addresses?).
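
One way to gather some of that from an affected host (a sketch; the machine number is illustrative, and `lxc network` only exists on LXD >= 2.3):

 juju ssh 10
 ip addr show lxdbr0           # does the bridge exist, and does it have an IPv4 address?
 lxc network show lxdbr0       # bridge configuration on LXD >= 2.3
 cat /etc/default/lxd-bridge   # bridge configuration on the older packaged LXD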

Changed in juju:
status: Triaged → Incomplete
Changed in juju:
importance: High → Undecided
milestone: 2.2-rc1 → none
Revision history for this message
james beedy (jamesbeedy) wrote :

@jmeinel

(1 & 2) The LXD version seems to be consistent between an instance with working LXD and eth0 and a non-working one with ens3: http://paste.ubuntu.com/24420624/. The kernel versions also seem to be similar: http://paste.ubuntu.com/24420635/.

(3) As far as credentials/accounts go: I was getting successful LXD deploys on AWS instances (with the eth0 interface name) all last week, and at any point before yesterday. Now it seems I can't get an instance to come up with an ethX interface name at all.

(4) I can SSH into affected instances with the ens3 device, run `lxc launch ubuntu:16.04 u1`, and have the container come up with addressability ... but it seems the LXD bridge auto-configures itself onto a 172.x.x.x subnet instead of a 10.0.0.x one. Even after doing this, I still cannot `juju deploy` to LXD on the affected machine.
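
A quick way to see which subnet the bridge actually picked on the affected machine (a sketch, assuming LXD >= 2.3 so `lxc network` is available):

 lxc network show lxdbr0    # check ipv4.address on the bridge
 lxc list u1                # the test container's address should be on that same subnet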

Revision history for this message
james beedy (jamesbeedy) wrote :

Update: I just deployed some instances and they have eth0 interface names.

Revision history for this message
james beedy (jamesbeedy) wrote :

I think I have just disproved any connection between the interface name and LXD working; see http://paste.ubuntu.com/24420934/

Machine 13 was deployed yesterday and has interface ens3. Machine 15 was deployed today (5 minutes ago) and has interface eth0. Both have failing LXD.

Changed in juju:
status: Incomplete → New
Revision history for this message
james beedy (jamesbeedy) wrote :

LXD deploy to instance still failing on my creativedrive aws account -> http://paste.ubuntu.com/24427051/

LXD deploy to instance working on my charm-dev aws account -> http://paste.ubuntu.com/24427079/

What is the difference?

Revision history for this message
james beedy (jamesbeedy) wrote :

@jmeinel, @rharding lxd failing on machine 9 -> http://paste.ubuntu.com/24427795/

juju show-machine 9 -> http://paste.ubuntu.com/24427794/

It seems the constraints are being passed through to the container ....

Revision history for this message
james beedy (jamesbeedy) wrote :

"It seems the constraints are being passed through to the container" - this explains why I can get successful lxd deploys on aws instances on my charm dev aws account; because there are no spaces or subnets defined/used there.

What is still unexplained here is that I am only just now affected by this. I have been doing these ops for quite a while now with no sign of this issue. See http://paste.ubuntu.com/24427822/
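
One way to confirm the difference between the two accounts noted above would be to compare what spaces and subnets each model knows about, and what constraints ended up on the host and its containers, e.g.:

 juju spaces                          # spaces defined in the model
 juju subnets                         # subnets and the space/zone they belong to
 juju show-machine 9 --format yaml    # constraints recorded for the host machine and its containers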

Tim Penhey (thumper)
tags: added: intermittent-failure
John A Meinel (jameinel)
Changed in juju:
status: New → Triaged
status: Triaged → In Progress
importance: Undecided → Critical
assignee: nobody → John A Meinel (jameinel)
milestone: none → 2.2-beta4
Changed in juju:
milestone: 2.2-beta4 → 2.2-rc1