Mojo: Continuous Delivery for Juju

juju-check-wait waits forever if juju doesn't reach stable state

Bug #1694745 reported by Daniel Manrique on 2017-05-31

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
Mojo: Continuous Delivery for Juju	Triaged	Medium	Unassigned

Bug Description

Several times we've observed that if a charm gets "stuck" in an intermediate state (e.g. allocating or waiting for machine), a spec using juju-check-wait will wait forever; we've seen runs last 6-10 hours before we killed them manually. This happens even though the juju-check-wait phase should have a default timeout of 30 minutes.

Example juju output of a "stuck" run:

https://pastebin.canonical.com/189657/ (Apologies for the private link).

The stuck units are listed below. These are subordinates which ran into trouble, but we've also seen this with an ordinary charm which was stuck waiting to contact an NTP server or something similar:

telegraf/2 waiting allocating 10.25.61.190 waiting for machine
telegraf/5 waiting allocating 10.25.61.192 waiting for machine

The spec in question does something like:

deploy config=go-telegraf/services target=go-telegraf delay=0
juju-check-wait
script config=go-telegraf/add-relations
juju-check-wait

The run log for the spec shows the following (some possibly sensitive information has been obfuscated):

2017-05-31 08:35:42 [INFO] deployer.import: Deploying applications...
2017-05-31 08:35:43 [INFO] deployer.import: Deploying application telegraf using /some/charm/telegraf
2017-05-31 08:35:51 [DEBUG] deployer.import: Adding units...
2017-05-31 08:35:51 [WARNING] deployer.import: Config specifies num units for subordinate: telegraf
2017-05-31 08:35:51 [DEBUG] deployer.import: Waiting for units before adding relations
2017-05-31 08:35:51 [DEBUG] deployer.env: Connecting to my-juju-controller...
2017-05-31 08:35:52 [DEBUG] deployer.env: Connected.
2017-05-31 08:35:52 [INFO] deployer.import: Adding relations...
2017-05-31 08:35:52 [INFO] deployer.cli: Deployment complete in 10.52 seconds
2017-05-31 08:35:52 [INFO] Checking Juju status (timeout=1800)
2017-05-31 08:35:56 [INFO] Running script go-telegraf/add-relations
2017-05-31 08:36:01 [INFO] Adding relation for telegraf with the following services:
my-app-lb
my-cache-lb
my-rabbitmq
my-dp-fe
my-app
my-memcached
my-prometheus
my-cache

2017-05-31 08:36:01 [INFO] Checking Juju status
2017-05-31 08:36:05 [INFO] Waiting for environment to reach steady state

The spec has been in the above state for over 6 hours.

This has also been observed with a script that ends by running "mojo juju-check-wait", so the problem appears to lie in the way mojo handles the juju-check-wait phase.
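For reference, the behaviour we expected from the 30-minute default can be sketched as a polling loop with a hard deadline. This is only an illustration, not mojo's actual code: `wait_for_steady_state`, `WaitTimeout`, the `is_steady` callback and the 5-second poll interval are all hypothetical names and values chosen for the example.

```python
import time


class WaitTimeout(Exception):
    """Raised when the environment does not settle before the deadline."""


def wait_for_steady_state(is_steady, timeout=1800, interval=5,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll is_steady() until it returns True or `timeout` seconds elapse.

    is_steady is a hypothetical callback that inspects juju status and
    returns True once no unit is in an intermediate state.
    """
    # Fix the deadline once, up front, so a unit that never settles
    # cannot extend the wait.
    deadline = clock() + timeout
    while True:
        if is_steady():
            return
        remaining = deadline - clock()
        if remaining <= 0:
            raise WaitTimeout(f"not steady after {timeout} seconds")
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

With a loop of this shape, a unit stuck in "waiting for machine" would make the phase fail after roughly 30 minutes instead of blocking the spec indefinitely.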

Daniel Manrique (roadmr) wrote on 2017-05-31:

Just observed this with a juju2 environment on which agents were in a "lost" state.

Running mojo juju-check-wait at 2017-05-31 09:46:06.011687
mojo juju-check-wait completed successfully at 2017-05-31 18:53:02.976279

The check completed after I went in and kicked all the agents, but the wait time was over 9 hours and I guarantee we have no timeout=36000 anywhere in our spec :)
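As a quick check of the numbers, the elapsed time can be computed from the two timestamps above and compared with the 1800-second timeout shown in the run log. This is a small sketch; the variable names are only illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the mojo output above.
started = datetime.strptime("2017-05-31 09:46:06.011687", FMT)
finished = datetime.strptime("2017-05-31 18:53:02.976279", FMT)

elapsed = finished - started
default_timeout = 1800  # seconds, as logged by "Checking Juju status (timeout=1800)"

print(elapsed)  # 9:06:56.964592
print(elapsed.total_seconds() / default_timeout)  # how many timeouts' worth of waiting
```

That is a little over 9 hours, or roughly 18 times the expected 30-minute limit.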

Junien F (axino) on 2018-03-02

Changed in mojo:
status:	New → Triaged
importance:	Undecided → Medium
