quickstart deployment fails to add relations when bootstrap goes "down"

Bug #1460087 reported by Kevin W Monroe
Affects        Status   Importance  Assigned to  Milestone
juju-deployer  Triaged  High        Unassigned

Bug Description

I'm attempting to deploy a big data bundle [1] to canonistack with the following:

$ juju quickstart u/bigdata-dev/apache-core-batch-processing --debug

The deployment takes about three hours in this environment, but after it's done, there are no relations set. About an hour into the deployment, my bootstrap and juju-gui units look like this:

machines:
  "0":
    agent-state: down
    agent-state-info: (started)
    agent-version: 1.24-beta5
    dns-name: 10.55.60.13
    instance-id: bb9913f5-a5ba-450e-8639-7f36f43bd980
    instance-state: ACTIVE
    series: trusty
    hardware: arch=amd64 cpu-cores=1 mem=2048M root-disk=10240M availability-zone=nova
    state-server-member-status: has-vote
services:
  juju-gui:
    charm: cs:trusty/juju-gui-27
    exposed: true
    service-status:
      current: unknown
      since: 28 May 2015 23:30:23 UTC
    units:
      juju-gui/0:
        workload-status:
          current: unknown
          message: agent is lost, sorry! See 'juju status-history juju-gui/0'
          since: 28 May 2015 23:30:23 UTC
        agent-status:
          current: lost
          message: agent is not communicating with the server
          since: 28 May 2015 23:30:28 UTC
          version: 1.24-beta5
        agent-state: started
        agent-version: 1.24-beta5
        machine: "0"
        open-ports:
        - 80/tcp
        - 443/tcp
        public-address: 10.55.60.13

Per the juju-gui message, the status-history looks like this:

$ juju status-history juju-gui/0
TIME                      TYPE      STATUS       MESSAGE
28 May 2015 23:18:42 UTC  workload  unknown      Waiting for agent initialization to finish
28 May 2015 23:18:42 UTC  agent     allocating
28 May 2015 23:19:04 UTC  workload  maintenance  installing charm software
28 May 2015 23:19:06 UTC  agent     executing    running install hook
28 May 2015 23:27:47 UTC  agent     executing    running leader-elected hook
28 May 2015 23:27:52 UTC  agent     executing    running config-changed hook
28 May 2015 23:30:16 UTC  agent     executing    running start hook
28 May 2015 23:30:23 UTC  workload  unknown
28 May 2015 23:30:28 UTC  agent     idle

The other units continue installing, but as I mentioned earlier, no relations are added to my deployment after all units finish and change state to "started". I'm running juju from a Vivid vagrant image with the following:

vagrant@vagrant-ubuntu-vivid-64:~$ cat /etc/issue
Ubuntu 15.04 \n \l

vagrant@vagrant-ubuntu-vivid-64:~$ dpkg -l | grep juju
ii juju 1.24-beta5-0ubuntu1~15.04.1~juju1 all next generation service orchestration system
ii juju-core 1.24-beta5-0ubuntu1~15.04.1~juju1 amd64 Juju is devops distilled - client
ii juju-local 1.24-beta5-0ubuntu1~15.04.1~juju1 all dependency package for the Juju local provider
ii juju-mongodb 2.4.10-0ubuntu2 amd64 MongoDB object/document-oriented database for Juju
ii python-jujuclient 0.18.5-0ubuntu1 all Python API client for juju

vagrant@vagrant-ubuntu-vivid-64:~$ pip list | grep juju
juju-deployer (0.4.3)
juju-quickstart (2.1.1)
jujubundlelib (0.1.7)
jujuclient (0.18.5)

Refs:
[1] https://jujucharms.com/u/bigdata-dev/apache-core-batch-processing/

Kevin W Monroe (kwmonroe) wrote :
Curtis Hovey (sinzui) wrote :

This issue may relate to bug 1455260.

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.25.0
Curtis Hovey (sinzui)
tags: added: deploy quickstart
Kevin W Monroe (kwmonroe) wrote :

Thanks for the triage, Curtis! Just another data point here: I fired up a Trusty vagrant image with ppa:juju/stable and see the same problem -- the bootstrap/juju-gui unit (0) state goes "down", but all units continue deploying and their states eventually become "started". However, no relations are created. Same deployment as before:

$ juju quickstart u/bigdata-dev/apache-core-batch-processing --debug

I see similar output with 1.23.3 (though less informative than 1.24):

machines:
  "0":
    agent-state: down
    agent-state-info: (started)
    agent-version: 1.23.3
    dns-name: 10.55.60.13
    instance-id: c3608547-1d59-4c6e-92db-03fb01f4c06c
    instance-state: ACTIVE
    series: trusty
    hardware: arch=amd64 cpu-cores=1 mem=2048M root-disk=10240M availability-zone=nova
    state-server-member-status: has-vote
services:
  juju-gui:
    charm: cs:trusty/juju-gui-27
    exposed: true
    units:
      juju-gui/0:
        agent-state: down
        agent-state-info: (started)
        agent-version: 1.23.3
        machine: "0"
        open-ports:
        - 80/tcp
        - 443/tcp
        public-address: 10.55.60.13

Versions, etc:

vagrant@vagrant-ubuntu-trusty-64:~$ cat /etc/issue
Ubuntu 14.04.2 LTS \n \l

vagrant@vagrant-ubuntu-trusty-64:~$ dpkg -l | grep juju
ii juju 1.23.3-0ubuntu1~14.04.1~juju1 all next generation service orchestration system
ii juju-core 1.23.3-0ubuntu1~14.04.1~juju1 amd64 Juju is devops distilled - client
ii juju-local 1.23.3-0ubuntu1~14.04.1~juju1 all dependency package for the Juju local provider
ii juju-mongodb 2.4.9-0ubuntu3 amd64 MongoDB object/document-oriented database for Juju
ii juju-quickstart 2.1.1+bzr133+ppa36~ubuntu14.04.1 all Easy configuration of Juju environments
ii jujubundlelib 0.1.8-1 all A Python library for working with Juju bundles.
ii python-jujuclient 0.50.1-2 amd64 Python API client for juju-core

vagrant@vagrant-ubuntu-trusty-64:~$ pip list | grep juju
juju-quickstart (2.1.1)
jujubundlelib (0.1.8)
jujuclient (0.50.1)

Fwiw, I have *only* seen this with canonistack, and it takes 3+ hours before the units in my deployment go "started". I have no problem deploying the same bundle to Azure or locally, though deployments to those substrates complete in < 1 hour. I'm not sure if this means my problem is related to the openstack provider or if it's due to the length of time for the deployment.

Ian Booth (wallyworld) wrote :

tl;dr; canonistack sucks for real work

IIRC canonistack is *way* overcommitted and the fact that the deployment takes 3-4 times longer on canonistack compared with other clouds just reinforces that. It may be that due to severe I/O or CPU contention the agent pingers simply cannot get their message through to the state server before timing out, hence the "agent is lost" message. Also, it may be that the relations are still being started but the deployment is taking a long time to do so. The agent state going to started doesn't necessarily mean that everything is finished being wired up yet.
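To make the "agent is lost" condition easier to spot while a long deployment is still running, something like the following could be run against the status output. This is only a minimal sketch: it assumes the output of `juju status --format=json` has already been loaded into a dict with the same shape as the YAML pasted in this report, and the function name `find_down_agents` is hypothetical, not part of any Juju tool.

```python
def find_down_agents(status):
    """Return machines and units whose agent looks down or lost.

    `status` is assumed to be a dict shaped like the status output
    quoted in this bug (machines / services / units).
    """
    down = []
    for machine_id, machine in status.get("machines", {}).items():
        if machine.get("agent-state") == "down":
            down.append("machine " + machine_id)
    for service in status.get("services", {}).values():
        for unit_name, unit in service.get("units", {}).items():
            # 1.24 reports a separate agent-status block; 1.23 only agent-state.
            lost = unit.get("agent-status", {}).get("current") == "lost"
            if unit.get("agent-state") == "down" or lost:
                down.append("unit " + unit_name)
    return down


# Trimmed copy of the 1.24-beta5 status from the bug description.
sample_status = {
    "machines": {"0": {"agent-state": "down", "agent-state-info": "(started)"}},
    "services": {
        "juju-gui": {
            "units": {
                "juju-gui/0": {
                    "agent-status": {"current": "lost"},
                    "agent-state": "started",
                }
            }
        }
    },
}

print(find_down_agents(sample_status))
```

A check like this only reports what the state server currently believes; it cannot tell whether the agent is really gone or whether its pings are just timing out under load.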

amir sanjar (asanjar) wrote :

I am not convinced this is a canonistack issue. I have encountered similar problems (missing relations) on multiple occasions on the local provider. It was also reported by a customer two weeks ago.

Antonio Rosales (arosales) wrote :

@Ian,

Thanks for the comments.
Agreed, Canonistack is over-committed. But given Amir's comment 5 that we are seeing this outside of Canonistack, how can we make the state server more reliable in handling this situation, make the problem more visible to the Juju admin, and/or offer possible actions to try to resolve it?

-thanks,
Antonio

Curtis Hovey (sinzui)
no longer affects: juju-core/1.24
Ian Booth (wallyworld) wrote :

@Antonio

Agree 100% with the need to handle this situation better. My motivation in the comment was to provide justification for moving this bug off 1.24. Solving the problem is likely to involve non-trivial engineering effort, so it would need to be implemented during a development iteration rather than during pre-release stabilisation.

If we can reproduce this issue, I'd also like to gather information on the various deployment scenarios, along with log files, so we can diagnose the root cause.

amir sanjar (asanjar) wrote :

@Ian,
Another customer just reported a similar issue. Sadly, this time it came from a customer interested in signing CPP. I am forwarding the email to you.

amir sanjar (asanjar) wrote :

I was able to reproduce an issue similar to what has been reported. I hope the attached log files help.

amir sanjar (asanjar) wrote :

Yet again we had to manually add relations to avoid a potentially embarrassing moment during a big data demo by Sam.
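For anyone else stuck adding relations by hand, the manual step can be roughly scripted. The sketch below is an illustration under assumptions, not what juju-deployer does: it assumes the bundle's relations are available as pairs of endpoints, and that each service in the status output carries a `relations` mapping of relation name to related service names. The service names `svc-a`, `svc-b` and `svc-c` and both function names are hypothetical.

```python
def endpoint_service(endpoint):
    """Strip an optional ':relation-name' suffix from a bundle endpoint."""
    return endpoint.split(":", 1)[0]


def missing_relations(bundle_relations, status):
    """Return the bundle relation pairs that do not appear in status.

    Assumption: status["services"][name]["relations"] maps a relation
    name to a list of related service names.
    """
    services = status.get("services", {})
    missing = []
    for left, right in bundle_relations:
        a, b = endpoint_service(left), endpoint_service(right)
        related = set()
        for peers in (services.get(a, {}).get("relations") or {}).values():
            related.update(peers)
        if b not in related:
            missing.append((left, right))
    return missing


def add_relation_commands(pairs):
    """Build the `juju add-relation` command lines for the missing pairs."""
    return ["juju add-relation {} {}".format(l, r) for l, r in pairs]


# Hypothetical bundle and status, for illustration only.
bundle = [("svc-a:db", "svc-b:db"), ("svc-b", "svc-c")]
status = {
    "services": {
        "svc-a": {"relations": {"db": ["svc-b"]}},
        "svc-b": {"relations": {"db": ["svc-a"]}},
        "svc-c": {},
    }
}

for command in add_relation_commands(missing_relations(bundle, status)):
    print(command)
```

The comparison is per service, not per endpoint, so two services that should be joined by more than one relation would need a closer look. The commands are only printed so they can be reviewed before being run.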

Curtis Hovey (sinzui)
affects: juju-core → juju-deployer
Changed in juju-deployer:
milestone: 1.25.0 → none