applications deployed to lxd on aws instances failing

Bug #1684143 reported by james beedy
This bug affects 1 person
Affects: Canonical Juju
Status: In Progress
Importance: Critical
Assigned to: John A Meinel

Bug Description

Juju-deployed LXD is failing across the board for me right now on AWS instances using the JAAS controller.

I created a new model this morning (first model I've created on 2.1.2), and went to deploy some charms, and my lxd status just shows "down": http://paste.ubuntu.com/24413976/

I initially thought it might be a 2.1.2 thing, so I verified one of my older 2.0.3 models.

You can see here (2.0.3) http://paste.ubuntu.com/24414148/ that this use case was working for me, while machine 10, which was deployed this morning, exhibits the bug.

To reproduce:

1) deploy an instance to aws `juju add-machine`
2) deploy something to lxd on the newly created machine `juju deploy redis --to lxd:#`
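
For example (a minimal reproduction; machine number 5 below is just a placeholder for whatever `juju add-machine` reports):

 juju add-machine              # prints e.g. "created machine 5"
 juju deploy redis --to lxd:5  # place a redis unit in an LXD container on machine 5
 juju status                   # the new container's agent then shows as "down"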

james beedy (jamesbeedy)
description: updated
Revision history for this message
james beedy (jamesbeedy) wrote :

@anastasia-macmood will you put some priority on this for me please?

james beedy (jamesbeedy)
description: updated
description: updated
Revision history for this message
james beedy (jamesbeedy) wrote :

I can get past this issue when using a different aws account/creds. See http://paste.ubuntu.com/24414510/

What is interesting about the paste above is that those instances are not landing in 1a. In my failing example above, I'm deploying instances to a space whose subnets are only in 1a, so those instances can only land in 1a.

Revision history for this message
james beedy (jamesbeedy) wrote :

The issue possibly has something to do with instances not deploying to 1a: http://paste.ubuntu.com/24414962/

Revision history for this message
james beedy (jamesbeedy) wrote :

After some digging, I realized some instances were getting an interface named ens3, and some were getting an interface named eth0. The instances getting ens3 are the instances on which LXD fails.

https://bugs.launchpad.net/juju/+bug/1684248

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.2.0
Changed in juju:
milestone: 2.2.0 → 2.2-rc1
Revision history for this message
John A Meinel (jameinel) wrote :

Given all of the containers in the "working" scenario are only using private lxd bridges behind NAT anyway (they all show up as 10.0.0.X), I'm not sure why eth0 vs ens3 would be a problem for us.

Is it possible to set the logging config to DEBUG level and include the output of both the controller machine and the host machine that is trying to start the lxd bridge?
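
That could look something like this (a sketch with the 2.x CLI; the model name and machine number are illustrative):

 juju model-config -m mymodel logging-config="<root>=DEBUG"
 juju debug-log -m mymodel --include machine-10 --replay > machine-10.log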

There are a couple of things that *could* be happening:

1) bug #1668547, if something happened where a new Xenial image is suddenly including a newer LXD version (previously that version of LXD was only available via 'xenial-backports' if people explicitly requested it).

That *might* explain the ens3 thing, as in "the images were updated and now use the normal Xenial naming convention" (not a problem for us) "and also ship a newer version of lxd" (a problem, as we would see that 'lxdbr0' exists but isn't configured correctly).

2) Something else about ens3 that is causing us to not route containers correctly.

Another way to determine if (1) is true is if you could run something like:
 juju run --machine=<machine-id> "lxd --version"
And see if there is a difference in the version reported by the machines that *can* create containers vs the ones that *cannot*.
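
For example (machine numbers are illustrative):

 juju run --machine=8 "lxd --version"     # a host where containers come up fine
 juju run --machine=10 "lxd --version"    # a host where the container stays "down"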

3) It doesn't quite follow why it would be a *credentials* thing, unless visibility of Ubuntu images is somehow different in different regions. So it's entirely possible that the image idea is a red herring.

bug #1668547 is still my best guess as to what would be causing this.

The particular fix for that one shouldn't be terribly hard, but it hasn't been a priority and needs a bit of digging to make sure it's the right fix. Obviously, if Ubuntu images are going to suddenly ship with a different set of packages out-of-the-box, the importance goes up significantly.

4) Lots of other possibilities that I wouldn't be able to predict, which is why the logs would be useful. They would help pin down what might be triggering the failures (e.g. do we detect that there is an lxdbr0 but not find an address for it, or do we configure containers with lxdbr0 as their bridge but they never get IP addresses?).
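
One way to gather some of that from an affected host (a sketch; the machine number is illustrative, and `lxc network` only exists on LXD >= 2.3):

 juju ssh 10
 ip addr show lxdbr0           # does the bridge exist, and does it have an IPv4 address?
 lxc network show lxdbr0       # bridge configuration on LXD >= 2.3
 cat /etc/default/lxd-bridge   # bridge configuration on the older packaged LXD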

Changed in juju:
status: Triaged → Incomplete
Changed in juju:
importance: High → Undecided
milestone: 2.2-rc1 → none
Revision history for this message
james beedy (jamesbeedy) wrote :

@jmeinel

(1 & 2) The LXD version seems to be consistent between an instance with working LXD and eth0 and a non-working one with ens3: http://paste.ubuntu.com/24420624/. The kernel versions also seem to be similar: http://paste.ubuntu.com/24420635/.

(3) As far as credentials/accounts go: I was getting successful LXD deploys on AWS instances (with the eth0 interface name) all last week, and at any point before yesterday. Now it seems I can't get an instance to come up with an ethX interface name at all.

(4) I can SSH into affected instances with the ens3 device, run `lxc launch ubuntu:16.04 u1`, and have the container come up with addressability ... but it seems the LXD bridge auto-configures itself onto a 172.x.x.x subnet instead of a 10.0.0.x one. Even after doing this, I still cannot `juju deploy` to LXD on the affected machine.
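
A quick way to see which subnet the bridge actually picked on the affected machine (a sketch, assuming LXD >= 2.3 so `lxc network` is available):

 lxc network show lxdbr0    # check ipv4.address on the bridge
 lxc list u1                # the test container's address should be on that same subnet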

Revision history for this message
james beedy (jamesbeedy) wrote :

Update: I just deployed some instances and they have eth0 interface names.

Revision history for this message
james beedy (jamesbeedy) wrote :

I think I have just disproved any connection between the interface name and LXD working; see http://paste.ubuntu.com/24420934/

Machine 13 was deployed yesterday and has interface ens3. Machine 15 was deployed today (5 minutes ago) and has interface eth0. Both have failing LXD.

Changed in juju:
status: Incomplete → New
Revision history for this message
james beedy (jamesbeedy) wrote :

LXD deploy to instance still failing on my creativedrive aws account -> http://paste.ubuntu.com/24427051/

LXD deploy to instance working on my charm-dev aws account -> http://paste.ubuntu.com/24427079/

What is the difference?

Revision history for this message
james beedy (jamesbeedy) wrote :

@jmeinel, @rharding lxd failing on machine 9 -> http://paste.ubuntu.com/24427795/

juju show-machine 9 -> http://paste.ubuntu.com/24427794/

It seems the constraints are being passed through to the container ....

Revision history for this message
james beedy (jamesbeedy) wrote :

"It seems the constraints are being passed through to the container" - this explains why I can get successful lxd deploys on aws instances on my charm dev aws account; because there are no spaces or subnets defined/used there.

What is still unexplained here is that I am only just now affected by this. I have been doing these ops for quite a while now with no sign of this issue. See http://paste.ubuntu.com/24427822/
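
One way to confirm the difference between the two accounts noted above would be to compare what spaces and subnets each model knows about, and what constraints ended up on the host and its containers, e.g.:

 juju spaces                          # spaces defined in the model
 juju subnets                         # subnets and the space/zone they belong to
 juju show-machine 9 --format yaml    # constraints recorded for the host machine and its containers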

Tim Penhey (thumper)
tags: added: intermittent-failure
John A Meinel (jameinel)
Changed in juju:
status: New → Triaged
status: Triaged → In Progress
importance: Undecided → Critical
assignee: nobody → John A Meinel (jameinel)
milestone: none → 2.2-beta4
Changed in juju:
milestone: 2.2-beta4 → 2.2-rc1