1.20.2: lxc/2 of 9 not starting stuck in pending state

Bug #1348813 reported by Chad Smith
This bug affects 1 person
Affects    Status   Importance  Assigned to  Milestone
juju-core  Triaged  High        Unassigned

Bug Description

In test runs of juju 1.20.2 for our deployment, landscape deploys 9 services to machine 0 lxcs.

In a recent deployment, lxc-2 never came up, but the other eight containers (lxc-0, lxc-1 and lxc-3 through lxc-8) came up and obtained IPs.
All lxcs were created at 18:52-18:53 machine time, and most of the cloud-inits completed within those times.

Juju reports the 0/lxc/2 in pending state indefinitely.

I've attached juju status, machine-0.log, syslog and a tar of all logs from machine 0.

We haven't found logs that indicate what the specific problem could be, though there are multiple curious br0 "entered disabled state" messages for port 2 in syslog/kern.log during lxc veth interface configuration.

Jul 25 18:52:20 node0vm1 kernel: [ 285.876851] br0: port 2(vethNT30P6) entered disabled state
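
For anyone going through the attached syslog, a minimal sketch for pulling out bridge port state changes like the line above. The regex is derived only from that one sample line, and the function name is made up for illustration; other kernel versions may word the message differently.

```python
import re

# Matches kernel bridge messages shaped like the sample in this report:
#   br0: port 2(vethNT30P6) entered disabled state
# Assumption: the format is inferred from that single line only.
BRIDGE_EVENT = re.compile(
    r"(?P<bridge>\w+): port (?P<port>\d+)\((?P<iface>\w+)\) "
    r"entered (?P<state>\w+) state"
)


def bridge_port_events(lines):
    """Return (bridge, port, iface, state) for each bridge state-change line."""
    events = []
    for line in lines:
        match = BRIDGE_EVENT.search(line)
        if match:
            events.append(
                (
                    match.group("bridge"),
                    int(match.group("port")),
                    match.group("iface"),
                    match.group("state"),
                )
            )
    return events


sample = [
    "Jul 25 18:52:20 node0vm1 kernel: [ 285.876851] "
    "br0: port 2(vethNT30P6) entered disabled state",
]
print(bridge_port_events(sample))
```

Grouping the results by veth name and comparing them against the interfaces of the running containers might show whether vethNT30P6 belonged to the missing lxc-2.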

On machine 0:
Juju reports 0/lxc/2 in pending state, but sudo lxc-ls --fancy on machine 0 shows no juju-machine-0-lxc-2 container, and there is no directory for it under /var/lib/lxc/:

root@node0vm1:~# lxc-ls --fancy
NAME                  STATE    IPV4           IPV6  AUTOSTART
-------------------------------------------------------------
juju-machine-0-lxc-0  RUNNING  10.14.100.161  -     YES
juju-machine-0-lxc-1  RUNNING  10.14.100.162  -     YES
juju-machine-0-lxc-3  RUNNING  10.14.100.163  -     YES
juju-machine-0-lxc-4  RUNNING  10.14.100.164  -     YES
juju-machine-0-lxc-5  RUNNING  10.14.100.165  -     YES
juju-machine-0-lxc-6  RUNNING  10.14.100.166  -     YES
juju-machine-0-lxc-7  RUNNING  10.14.100.167  -     YES
juju-machine-0-lxc-8  RUNNING  10.14.100.168  -     YES
juju-trusty-template  STOPPED  -              -     NO

root@node0vm1:~# ls -ltr /var/lib/lxc/
total 36
drwxr-xr-x 3 root root 4096 Jul 25 18:52 juju-trusty-template
drwxr-xr-x 3 root root 4096 Jul 25 18:52 juju-machine-0-lxc-0
drwxr-xr-x 3 root root 4096 Jul 25 18:52 juju-machine-0-lxc-1
drwxr-xr-x 3 root root 4096 Jul 25 18:52 juju-machine-0-lxc-3
drwxr-xr-x 3 root root 4096 Jul 25 18:53 juju-machine-0-lxc-4
drwxr-xr-x 3 root root 4096 Jul 25 18:53 juju-machine-0-lxc-5
drwxr-xr-x 3 root root 4096 Jul 25 18:53 juju-machine-0-lxc-6
drwxr-xr-x 3 root root 4096 Jul 25 18:53 juju-machine-0-lxc-7
drwxr-xr-x 3 root root 4096 Jul 25 18:53 juju-machine-0-lxc-8
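
The gap in the listings above can be spotted mechanically. A rough sketch, assuming the juju-machine-N-lxc-M naming seen in this report and that indices 0 through 8 are expected (9 services on machine 0); the function name is hypothetical and it works on any pasted listing text.

```python
import re

# Container names as they appear in the listings in this report.
CONTAINER = re.compile(r"juju-machine-(\d+)-lxc-(\d+)")


def missing_containers(listing, machine=0, expected=range(9)):
    """Return the expected lxc indices for `machine` that are absent from `listing`."""
    present = {
        int(index)
        for machine_id, index in CONTAINER.findall(listing)
        if int(machine_id) == machine
    }
    return sorted(set(expected) - present)


listing = """
juju-machine-0-lxc-0 RUNNING 10.14.100.161 - YES
juju-machine-0-lxc-1 RUNNING 10.14.100.162 - YES
juju-machine-0-lxc-3 RUNNING 10.14.100.163 - YES
juju-machine-0-lxc-4 RUNNING 10.14.100.164 - YES
juju-machine-0-lxc-5 RUNNING 10.14.100.165 - YES
juju-machine-0-lxc-6 RUNNING 10.14.100.166 - YES
juju-machine-0-lxc-7 RUNNING 10.14.100.167 - YES
juju-machine-0-lxc-8 RUNNING 10.14.100.168 - YES
juju-trusty-template STOPPED - - NO
"""
print(missing_containers(listing))
```

The same function can be fed the output of ls /var/lib/lxc/ to confirm that the container directory is missing as well, not just the running instance.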

Chad Smith (chad.smith) wrote :

juju status showing deployment service placement and stuck pending lxc/2

description: updated
Chad Smith (chad.smith) wrote :

syslog from machine 0 (separated from logtar for quick reference)

Chad Smith (chad.smith) wrote :

machine-0.log (separated from the log tarfile for quick reference)

Chad Smith (chad.smith) wrote :

Switched bug to private as I'm adding the link where I've pulled tools from https://juju-dist.s3.amazonaws.com/rc-testing/tools/releases/juju-1.20.2-trusty-amd64.tgz

information type: Public → Private
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.20.2
Curtis Hovey (sinzui)
information type: Private → Public
Curtis Hovey (sinzui) wrote :

This issue may be a duplicate of bug 1311668. A stale lock might be blocking the container from starting.

When the containers are stopped, as shown in previous messages, these locks should not exist:
    /var/lib/juju/locks/juju-trusty-template
    /var/lib/juju/locks/juju-precise-template
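
A quick way to check for those two paths on machine 0, as a sketch only. It just reports whether the paths exist; whether a lock that is present is actually stale still depends on the containers being stopped, and the function name is invented for illustration.

```python
import os

# Lock paths quoted in the comment above.
TEMPLATE_LOCKS = [
    "/var/lib/juju/locks/juju-trusty-template",
    "/var/lib/juju/locks/juju-precise-template",
]


def leftover_locks(paths=TEMPLATE_LOCKS, exists=os.path.exists):
    """Return the lock paths that currently exist on disk."""
    return [path for path in paths if exists(path)]


if __name__ == "__main__":
    found = leftover_locks()
    if found:
        for path in found:
            print("lock still present:", path)
    else:
        print("no template locks found")
```

The exists parameter is only there so the check can be exercised without touching the real filesystem.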

Adam Collard (adam-collard) wrote :

If it were a stale template lock, surely all of the containers would be missing, and not just one?

Ian Booth (wallyworld) wrote :

Adam, yes our current thinking is that the issue in this bug is not a template lock issue but rather some other lxc start up issue. We do not understand the Juju involvement (if any) with the one container not starting. Any additional insight from an lxc perspective would be most appreciated.

Adam Collard (adam-collard) wrote :

Could https://bugs.launchpad.net/juju-core/+bug/1350008 be the mystery bug behind the curtain?

Ian Booth (wallyworld) wrote :

@Adam, bug 1350008 looks like it could be implicated. We'll fix that and hopefully it will have a positive effect.

Chad Smith (chad.smith) wrote :

While I do think the retries mentioned in bug 1350008 would be good in general, I don't think this bug shows the same "texture" as the problems we are solving in 1350008, as I don't have any cloud-init files at all in /var/lib/lxc/juju-machine-0-lxc-2/ which would have reported the inability to download tools due to timeout.

David just ran into this problem again on 1.21dev, so we have a bit more heat and a reproducible issue. Since this bug is already dupe'd, let's take the conversation over to lp:1354027.

Thanks guys.
