juju-check-wait waits forever if juju doesn't reach stable state
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Mojo: Continuous Delivery for Juju | Triaged | Medium | Unassigned |
Bug Description
Several times we've observed that if a charm gets "stuck" in an intermediate state (e.g. allocating/waiting for machine), a spec using juju-check-wait will wait forever; we've seen runs of 6-10 hours before manually killing things. This happens even though the juju-check-wait phase should have a default timeout of 30 minutes.
Example juju output of a "stuck" run:
https:/
The stuck units are listed below. These are subordinates which ran into trouble, but we've also seen this with an ordinary charm which was stuck waiting to contact an NTP server or something similar:
telegraf/2 waiting allocating 10.25.61.190 waiting for machine
telegraf/5 waiting allocating 10.25.61.192 waiting for machine
The spec in question does something like:
deploy config=
juju-check-wait
script config=
juju-check-wait
The run log for the spec shows (some possibly sensitive information obfuscated):
2017-05-31 08:35:42 [INFO] deployer.import: Deploying applications...
2017-05-31 08:35:43 [INFO] deployer.import: Deploying application telegraf using /some/charm/
2017-05-31 08:35:51 [DEBUG] deployer.import: Adding units...
2017-05-31 08:35:51 [WARNING] deployer.import: Config specifies num units for subordinate: telegraf
2017-05-31 08:35:51 [DEBUG] deployer.import: Waiting for units before adding relations
2017-05-31 08:35:51 [DEBUG] deployer.env: Connecting to my-juju-
2017-05-31 08:35:52 [DEBUG] deployer.env: Connected.
2017-05-31 08:35:52 [INFO] deployer.import: Adding relations...
2017-05-31 08:35:52 [INFO] deployer.cli: Deployment complete in 10.52 seconds
2017-05-31 08:35:52 [INFO] Checking Juju status (timeout=1800)
2017-05-31 08:35:56 [INFO] Running script go-telegraf/
2017-05-31 08:36:01 [INFO] Adding relation for telegraf with the following services:
my-app-lb
my-cache-lb
my-rabbitmq
my-dp-fe
my-app
my-memcached
my-prometheus
my-cache
2017-05-31 08:36:01 [INFO] Checking Juju status
2017-05-31 08:36:05 [INFO] Waiting for environment to reach steady state
The spec has been in the above state for over 6 hours.
This has also been observed with a script that does, at the end, "mojo juju-check-wait" - so it would appear to be some trouble in the way mojo handles the juju-check-wait phase.
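For reference, the behaviour we'd expect from a 30-minute timeout can be sketched as a polling loop with a hard deadline. This is only a minimal sketch under stated assumptions, not mojo's actual implementation: `wait_until_steady`, `WaitTimeout` and the `is_steady` callback are hypothetical names, and the 5-second poll interval is an assumption. Only the 1800-second default comes from the log above.

```python
import time


class WaitTimeout(Exception):
    """Raised when the environment is not steady before the deadline."""


def wait_until_steady(is_steady, timeout=1800, interval=5,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll is_steady() until it returns True or `timeout` seconds pass.

    is_steady is a hypothetical callback standing in for whatever
    inspects the Juju status; it is not part of mojo.
    """
    # Use a monotonic clock so wall-clock adjustments cannot extend the wait.
    deadline = clock() + timeout
    while True:
        if is_steady():
            return
        remaining = deadline - clock()
        if remaining <= 0:
            raise WaitTimeout(
                "environment not steady after {} seconds".format(timeout))
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

With a loop shaped like this, a unit stuck in "waiting for machine" would make the phase fail after 1800 seconds instead of blocking for hours.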
Changed in mojo:
status: New → Triaged
importance: Undecided → Medium
Just observed this with a juju2 environment on which agents were in a "lost" state.
Running mojo juju-check-wait at 2017-05-31 09:46:06.011687
mojo juju-check-wait completed successfully at 2017-05-31 18:53:02.976279
The check completed after I went in and kicked all the agents, but the wait time was over 9 hours and I guarantee we have no timeout=36000 anywhere in our spec :)
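To put a number on it, the elapsed time can be worked out from the two timestamps above. This is just a quick sketch: the timestamp format string is inferred from the output shown, and the 1800-second figure is the default timeout mentioned in the bug description.

```python
from datetime import datetime

# Format inferred from the "Running mojo juju-check-wait at ..." lines.
FMT = "%Y-%m-%d %H:%M:%S.%f"

started = datetime.strptime("2017-05-31 09:46:06.011687", FMT)
finished = datetime.strptime("2017-05-31 18:53:02.976279", FMT)

elapsed = (finished - started).total_seconds()
default_timeout = 1800  # seconds, i.e. the 30-minute default

print("elapsed seconds:", elapsed)
print("multiple of default timeout:", elapsed / default_timeout)
```

That works out to a little over 9 hours, roughly 18 times the 1800-second default.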