LXD with juju storage and local loopbacks

Bug #1605241 reported by Adam Stokes
This bug affects 4 people
Affects: Canonical Juju
Status: Expired
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

I am doing a bunch of deployments/teardowns in two separate forms:

The first is,

$ juju bootstrap yung localhost --upload-tools --config image-stream=daily --config enable-os-upgrade=false --bootstrap-series=xenial
$ juju deploy <bundle>
$ juju destroy-model default
$ juju add-model dino
$ juju deploy <bundle>

The second is,

$ juju bootstrap yung localhost --upload-tools --config image-stream=daily --config enable-os-upgrade=false --bootstrap-series=xenial
$ juju deploy <bundle>
$ juju destroy-controller --destroy-all-models
$ juju bootstrap yung localhost --upload-tools --config image-stream=daily --config enable-os-upgrade=false --bootstrap-series=xenial
$ juju deploy <bundle>

In both cases, machines will sometimes stay stuck in the pending/allocating stage:
https://paste.ubuntu.com/20304642/

Digging through the machine log I spotted:

2016-07-21 12:31:18 WARNING juju.network network.go:430 cannot get "lxdbr0" addresses: route ip+net: no such network interface (ignoring)

Even though my network bridge is alive and well:
eth0 Link encap:Ethernet HWaddr b8:ae:ed:74:0b:07
          inet addr:172.16.0.29 Bcast:172.16.0.255 Mask:255.255.255.0
          inet6 addr: fe80::baae:edff:fe74:b07/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:12111864 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4906790 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:17525256923 (17.5 GB) TX bytes:374279915 (374.2 MB)
          Interrupt:20 Memory:f7000000-f7020000

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:2389 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2389 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:1665510 (1.6 MB) TX bytes:1665510 (1.6 MB)

lxdbr0 Link encap:Ethernet HWaddr fe:f3:21:3a:72:73
          inet addr:10.1.78.1 Bcast:0.0.0.0 Mask:255.255.255.0
          inet6 addr: fe80::50bb:10ff:fe3e:53c3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:4619775 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9949792 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:443834026 (443.8 MB) TX bytes:17423911337 (17.4 GB)

openstack0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
          inet addr:10.99.0.1 Bcast:0.0.0.0 Mask:255.255.255.0
          inet6 addr: fe80::7492:c9ff:fee9:43f/64 Scope:Link
          UP BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:1038 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:69592 (69.5 KB) TX bytes:648 (648.0 B)

veth8BTVT4 Link encap:Ethernet HWaddr fe:f3:21:3a:72:73
          inet6 addr: fe80::fcf3:21ff:fe3a:7273/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:44491 errors:0 dropped:0 overruns:0 frame:0
          TX packets:57014 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:60053248 (60.0 MB) TX bytes:94542924 (94.5 MB)
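As a quick sanity check (my own sketch, not what juju does internally), the kernel's view of an interface can be read straight from /sys/class/net, which is enough to tell whether the "no such network interface" warning reflects reality:

```shell
# Minimal sketch: check whether the kernel actually exposes an interface,
# which is what the "no such network interface" warning above suggests
# juju could not do for lxdbr0.
iface_exists() {
    [ -e "/sys/class/net/$1" ]
}

if iface_exists lxdbr0; then
    echo "lxdbr0 present"
else
    echo "lxdbr0 missing (try: sudo dpkg-reconfigure -p medium lxd)"
fi
```

On xenial-era LXD, `dpkg-reconfigure -p medium lxd` was the usual way to (re)create the lxdbr0 bridge if it was missing.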

Another thing I saw is:

2016-07-21 12:40:15 ERROR juju.worker.environ wait.go:58 loaded invalid environment configuration: Error inserting juju-dino into database: UNIQUE constraint failed: profiles.name
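The UNIQUE constraint on profiles.name suggests a leftover LXD profile named after the destroyed model ("juju-dino"). A hypothetical manual recovery (not an official juju workaround) is to pull the name out of the log line and delete the stale profile by hand:

```shell
# Hypothetical recovery sketch: extract the stale profile name from the
# error line, then remove it manually with lxc.
err='Error inserting juju-dino into database: UNIQUE constraint failed: profiles.name'
profile=$(printf '%s\n' "$err" | sed -n 's/^Error inserting \([^ ]*\) into database.*/\1/p')
echo "stale profile: $profile"
# lxc profile delete "$profile"   # only once no containers still use it
```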

Another interesting bit is that you can no longer destroy the controller:

juju destroy-controller yung --destroy-all-models
WARNING! This command will destroy the "yung" controller.
This includes all machines, applications, data and other resources.

Continue [y/N]? y
Destroying controller
Waiting for hosted model resources to be reclaimed
Waiting on 2 models, 3 machines, 3 applications
Waiting on 2 models, 3 machines, 3 applications
Waiting on 1 model, 3 machines, 3 applications
Waiting on 1 model, 3 machines, 3 applications
Waiting on 1 model, 3 machines, 3 applications
..on it goes

This was filed as a separate bug: https://bugs.launchpad.net/juju-core/+bug/1604931

Machine-0 log: http://paste.ubuntu.com/20304298/

tags: added: conjure
description: updated
Revision history for this message
Tim Penhey (thumper) wrote :

Wild stab in the dark, but perhaps this needs to be checked:

2016-07-21 12:31:18 WARNING juju.network network.go:430 cannot get "lxdbr0" addresses: route ip+net: no such network interface (ignoring)

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.0-beta14
Changed in juju-core:
assignee: nobody → Richard Harding (rharding)
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote : Re: lxd instances not starting

I've seen this too when deploying a bundle which results in about 10 machines - 2 didn't come out of pending. "lxc list" showed them as running but they didn't even have ssh up. I got a shell using "lxc exec" and found that both containers were idle, only running init, systemd-udevd and dhclient. It looked like cloud-init had run to some degree but hadn't done most of the work it would normally do.

I will attach the cloud init logs and the output from ps.

When I manually rebooted the containers, cloud-init ran again and they came up correctly.

summary: - juju2beta12 lxd instances not starting
+ lxd instances not starting
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta14 → 2.0-beta15
Changed in juju-core:
milestone: 2.0-beta15 → 2.0-beta16
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta16 → none
milestone: none → 2.0-beta16
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta16 → 2.0-beta17
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta17 → 2.0-beta18
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta18 → 2.0-beta19
Changed in juju:
milestone: 2.0-beta19 → 2.0-rc1
Changed in juju:
milestone: 2.0-rc1 → 2.0.0
Changed in juju:
milestone: 2.0.0 → 2.1.0
Changed in juju:
assignee: Richard Harding (rharding) → nobody
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Since this report was originally filed, several fixes have gone into both Juju 2.x (especially around networking) and LXD.

This must have been fixed as part of other work. I can no longer reproduce this behavior with Juju 2.1 (tip) and LXD 2.4.1. I have used apache-processing-mapreduce. All machines came out of 'pending' into 'started' and I can ssh into all of them.

If you are still experiencing the problem, maybe a particular bundle is causing the issue? Feel free to re-open and specify which bundle you were using.

I am marking this as Fix Committed against our current active milestone.

Changed in juju:
status: Triaged → Fix Committed
Revision history for this message
Chris Holcombe (xfactor973) wrote :

I'm hitting this too on juju 2.0.3 and lxd 2.8. I'll try upgrading to juju 2.1.

Revision history for this message
Chris Holcombe (xfactor973) wrote :

I found out what my issue was. LXD was having a problem with juju storage and local loopbacks. I was getting errors like:

gluster/6 brick/6 pending creating volume: could not create block file: allocating loop backing file "/var/lib/juju/storage/loop/volume-8-6": fallocate: fallocate failed: Operation not supported: exit status 1

The problem is that juju debug-log and juju status don't tell you anything happened, other than showing the unit stuck in allocating. I became suspicious when I deployed an ubuntu unit and it worked fine. I dug through the logs and found that the next log entry after my gluster units got stuck was a line about juju storage in the ubuntu unit. So I ran `juju storage` on a hunch and found out that my block devices were stuck. It would really help to have this logged in the regular logs :)
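A small probe (my own sketch, not a juju tool) can confirm whether the filesystem behind the loop directory supports fallocate at all; older ZFS, for instance, rejected it with exactly this "Operation not supported" error:

```shell
# Sketch: probe whether a directory's filesystem supports fallocate(2).
# Older ZFS (a common LXD backing store) returned "Operation not supported",
# matching the volume-creation failure above. JUJU_LOOP_DIR is an assumed
# override; the default is juju's loop storage directory.
probe_fallocate() {
    dir="$1"
    f="$dir/.fallocate-probe.$$"
    if fallocate -l 1M "$f" 2>/dev/null; then
        rm -f "$f"
        echo supported
    else
        rm -f "$f" 2>/dev/null
        echo unsupported
    fi
}

probe_fallocate "${JUJU_LOOP_DIR:-/var/lib/juju/storage/loop}"
```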

Revision history for this message
Chris Holcombe (xfactor973) wrote :

I should also add that I dropped back down to juju 2.0.3 because I thought maybe 2.1 wasn't solving my problem.

summary: - lxd instances not starting
+ LXD with juju storage and local loopbacks
Changed in juju:
status: Fix Committed → Triaged
milestone: 2.1.0 → 2.2.0
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.2-beta1 → 2.2-beta2
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.2-beta2 → 2.2-beta3
Changed in juju:
milestone: 2.2-beta3 → 2.2-beta4
Changed in juju:
milestone: 2.2-beta4 → 2.2-rc1
Revision history for this message
Tim Penhey (thumper) wrote :

Taking the milestone off as no one is targeted to address this just now.

Adam, how often is this a real issue for you?

tags: added: provisioner
Changed in juju:
importance: High → Medium
milestone: 2.2-rc1 → none
Revision history for this message
Adam Stokes (adam-stokes) wrote :

Tim, I haven't seen this issue happen since I filed the report. We could probably mark this incomplete, and if it bubbles up again I'll file a new bug.

Changed in juju:
status: Triaged → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired