quickstart deployment fails to add relations when bootstrap goes "down"
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-deployer | | High | Unassigned | |
Bug Description
I'm attempting to deploy a big data bundle [1] to canonistack with the following:
$ juju quickstart u/bigdata-
The deployment takes about three hours in this environment, but after it's done, there are no relations set. About an hour into the deployment, my bootstrap and juju-gui units look like this:
machines:
"0":
agent-state: down
agent-
agent-version: 1.24-beta5
dns-name: 10.55.60.13
instance-id: bb9913f5-
instance-state: ACTIVE
series: trusty
hardware: arch=amd64 cpu-cores=1 mem=2048M root-disk=10240M availability-
state-
services:
juju-gui:
charm: cs:trusty/
exposed: true
service-status:
current: unknown
since: 28 May 2015 23:30:23 UTC
units:
juju-gui/0:
current: unknown
message: agent is lost, sorry! See 'juju status-history juju-gui/0'
since: 28 May 2015 23:30:23 UTC
current: lost
message: agent is not communicating with the server
since: 28 May 2015 23:30:28 UTC
version: 1.24-beta5
machine: "0"
open-ports:
- 80/tcp
- 443/tcp
Per the juju-gui message, the status-history looks like this:
$ juju status-history juju-gui/0
TIME TYPE STATUS MESSAGE
28 May 2015 23:18:42 UTC workload unknown Waiting for agent initialization to finish
28 May 2015 23:18:42 UTC agent allocating
28 May 2015 23:19:04 UTC workload maintenance installing charm software
28 May 2015 23:19:06 UTC agent executing running install hook
28 May 2015 23:27:47 UTC agent executing running leader-elected hook
28 May 2015 23:27:52 UTC agent executing running config-changed hook
28 May 2015 23:30:16 UTC agent executing running start hook
28 May 2015 23:30:23 UTC workload unknown
28 May 2015 23:30:28 UTC agent idle
The other units continue installing, but as I mentioned earlier, no relations are added to my deployment after all units finish and change state to "started". I'm running juju from a Vivid vagrant image with the following:
vagrant@
Ubuntu 15.04 \n \l
vagrant@
ii juju 1.24-beta5-
ii juju-core 1.24-beta5-
ii juju-local 1.24-beta5-
ii juju-mongodb 2.4.10-0ubuntu2 amd64 MongoDB object/
ii python-jujuclient 0.18.5-0ubuntu1 all Python API client for juju
vagrant@
juju-deployer (0.4.3)
juju-quickstart (2.1.1)
jujubundlelib (0.1.7)
jujuclient (0.18.5)
Refs:
[1] https:/
| Kevin W Monroe (kwmonroe) wrote : | #1 |
| Curtis Hovey (sinzui) wrote : | #2 |
| Changed in juju-core: | |
| status: | New → Triaged |
| importance: | Undecided → High |
| milestone: | none → 1.25.0 |
| tags: | added: deploy quickstart |
| Kevin W Monroe (kwmonroe) wrote : | #3 |
Thanks for the triage, Curtis! Just another data point here: I fired up a Trusty vagrant image with ppa:juju/stable and see the same problem. The bootstrap/juju-gui unit (0) state goes "down", but all units continue deploying and their states eventually become "started". However, no relations are created. Same deployment as before:
$ juju quickstart u/bigdata-
I see similar output with 1.23.3 (though less informative than 1.24):
machines:
"0":
agent-state: down
agent-
agent-version: 1.23.3
dns-name: 10.55.60.13
instance-id: c3608547-
instance-state: ACTIVE
series: trusty
hardware: arch=amd64 cpu-cores=1 mem=2048M root-disk=10240M availability-
state-
services:
juju-gui:
charm: cs:trusty/
exposed: true
units:
juju-gui/0:
machine: "0"
open-ports:
- 80/tcp
- 443/tcp
Versions, etc:
vagrant@
Ubuntu 14.04.2 LTS \n \l
vagrant@
ii juju 1.23.3-
ii juju-core 1.23.3-
ii juju-local 1.23.3-
ii juju-mongodb 2.4.9-0ubuntu3 amd64 MongoDB object/
ii juju-quickstart 2.1.1+bzr133+
ii jujubundlelib 0.1.8-1 all A Python library for working with Juju bundles.
ii python-jujuclient 0.50.1-2 amd64 Python API client for juju-core
vagrant@
juju-quickstart (2.1.1)
jujubundlelib (0.1.8)
jujuclient (0.50.1)
FWIW, I have *only* seen this with canonistack, where it takes 3+ hours before the units in my deployment go "started". I have no problem deploying the same bundle to Azure or locally, though deployments to those substrates complete in under an hour. I'm not sure whether this means my problem is related to the OpenStack provider or is due to the length of time the deployment takes.
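The symptom described above (machine 0 reporting "down" while services end up with no relations) can be checked mechanically. The following is a minimal Python sketch, not part of any juju tool. It assumes the output of `juju status --format=json` has already been parsed into a dict, and that related services carry a `relations` key as in juju 1.x status output. The function name `summarize_status` is hypothetical.

```python
def summarize_status(status):
    """Return (down_machines, unrelated_services) from a parsed status dict.

    Assumption: `status` mirrors the layout of `juju status --format=json`
    in juju 1.x, with top-level "machines" and "services" mappings.
    """
    # Machines whose agent the state server currently reports as down.
    down_machines = sorted(
        machine_id
        for machine_id, machine in status.get("machines", {}).items()
        if machine.get("agent-state") == "down"
    )
    # Services with no "relations" entry at all. Some services (for
    # example juju-gui) legitimately have none, so treat this list as a
    # hint to investigate rather than as proof of a failure.
    unrelated_services = sorted(
        name
        for name, service in status.get("services", {}).items()
        if not service.get("relations")
    )
    return down_machines, unrelated_services
```

To use it, load the JSON with `json.load` from the standard library and compare the second list against the services the bundle is expected to relate.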
| Ian Booth (wallyworld) wrote : | #4 |
tl;dr: canonistack sucks for real work
IIRC canonistack is *way* overcommitted and the fact that the deployment takes 3-4 times longer on canonistack compared with other clouds just reinforces that. It may be that due to severe I/O or CPU contention the agent pingers simply cannot get their message through to the state server before timing out, hence the "agent is lost" message. Also, it may be that the relations are still being started but the deployment is taking a long time to do so. The agent state going to started doesn't necessarily mean that everything is finished being wired up yet.
| amir sanjar (asanjar) wrote : | #5 |
I am not convinced this is a canonistack issue. I have encountered similar problems (missing relations) on multiple occasions on local deployments. It was also reported by a customer two weeks ago.
| Antonio Rosales (arosales) wrote : | #6 |
@Ian,
Thanks for the comments.
Agreed, Canonistack is over-committed. But given Amir's comment #5 that we are seeing this outside of Canonistack, how can we make the state server more reliable in handling this situation, make the problem more visible to the Juju admin, and/or suggest possible actions to try to resolve it?
-thanks,
Antonio
| no longer affects: | juju-core/1.24 |
| Ian Booth (wallyworld) wrote : | #7 |
@Antonio
Agree 100% with the need to handle this situation better. My motivation in the comment was to provide justification for moving this bug off 1.24. Solving the problem is likely to involve non-trivial engineering effort, so it would need implementation during a development iteration rather than during pre-release stabilisation.
If we can reproduce this issue, I'd also like to gather information on the various deployment scenarios, along with log files, so we can diagnose the root cause.
| amir sanjar (asanjar) wrote : | #8 |
@Ian,
I just had another customer reporting a similar issue. Sadly this time the issue was reported by a customer interested in signing CPP. Forwarding the email to you.
| amir sanjar (asanjar) wrote : | #9 |
I was able to reproduce an issue similar to what has been reported. I hope the attached log files help.
| amir sanjar (asanjar) wrote : | #10 |
Yet again we had to manually add relations to avoid a potentially embarrassing moment during a big data demo by Sam.
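For anyone applying the same manual workaround, the relations can be re-derived from the bundle rather than typed by hand. This is a rough Python sketch under stated assumptions: the bundle's relations section is a list of endpoint pairs, where the second element may itself be a list (the juju-deployer style). The helper name `add_relation_commands` and the service names in the usage note are hypothetical. It only builds `juju add-relation` command strings and does not run them.

```python
def add_relation_commands(relations):
    """Build `juju add-relation` command lines from bundle relation pairs.

    Assumption: each entry is [left, right], where `right` is either one
    endpoint string or a list of endpoint strings.
    """
    commands = []
    for left, right in relations:
        # Normalise the right-hand side to a list of endpoints.
        targets = right if isinstance(right, list) else [right]
        for target in targets:
            commands.append("juju add-relation {} {}".format(left, target))
    return commands
```

For example, `add_relation_commands([["svc-a:db", "svc-b:db"]])` yields a single `juju add-relation svc-a:db svc-b:db` line, which can be reviewed before being run against the environment.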
| affects: | juju-core → juju-deployer |
| Changed in juju-deployer: | |
| milestone: | 1.25.0 → none |

This issue may relate to bug 1455260.