Canonical Juju

Add a wait command

Bug #1488777 reported by Stuart Bishop on 2015-08-26

This bug affects 10 people

Affects	Status	Importance	Assigned to	Milestone
Canonical Juju	Expired	Wishlist	Unassigned

Bug Description

Charm developers writing tests need to wait for a freshly deployed environment to settle before running tests.

Operations need to wait for an environment to settle after making changes, before running smoke tests.

For the purposes of these use cases, an environment is 'settled' when no hooks are running and no hooks are scheduled to run. At that moment, the environment is in a stable state. It will remain in the stable state until further juju operations are performed (by the operator, or scheduled).

The proposed command name is 'juju wait'.

With Juju 1.24, it may be possible to simply inspect the unit and agent statuses. Whether that works depends on whether the approach is racy and, if so, whether the races can be worked around to make it reliable.
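
As a rough illustration of the status-inspection idea, the sketch below checks whether every unit agent reports idle in a single status snapshot. It assumes the snapshot has already been parsed into a dict. The key names (`services`, `units`, `subordinates`, `agent-status`, `current`) and the `idle` value are assumptions about the status layout and would need checking against the Juju version in use. The function names are hypothetical.

```python
def iter_units(status):
    """Yield (name, unit) for every unit, including subordinates.

    The dict layout is an assumption about parsed status output.
    """
    for service in (status.get("services") or {}).values():
        for name, unit in (service.get("units") or {}).items():
            yield name, unit
            # Subordinates are assumed to be nested under their principal unit.
            for sub_name, sub in (unit.get("subordinates") or {}).items():
                yield sub_name, sub


def all_agents_idle(status):
    """Return True if every unit agent reports 'idle' in this snapshot."""
    for _name, unit in iter_units(status):
        agent = unit.get("agent-status") or {}
        if agent.get("current") != "idle":
            return False
    return True
```

A single idle snapshot does not prove that no hooks are scheduled to run, which is exactly the raciness concern raised here, so this is a starting point rather than a reliable wait.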

If the statuses are unsuitable, a plugin adding this command can be found at https://launchpad.net/juju-wait. As far as I'm aware, it is the only implementation of the only algorithm that works for older versions of Juju.

Please add a 'juju wait' command with these semantics to juju-core, so developers and operations no longer need to install a plugin from a PPA to write reliable integration tests and deployment scripts.

Curtis Hovey (sinzui) on 2015-08-26

Changed in juju-core:
status:	New → Triaged
importance:	Undecided → Low
tags:	added: charmers feature

Merlijn Sebrechts (merlijn-sebrechts) wrote on 2015-12-08:

Is it possible this is already present internally? What does the bundle deployer use?

Aaron Bentley (abentley) on 2016-04-01

tags:	added: jujuqa

Stuart Bishop (stub) on 2016-04-02

tags:	added: canonical-is

Ryan Beisner (1chb1n) wrote on 2016-04-02:

Opinionated assertions follow. :-)

I think this is important in at least two arenas:

- Production deployment automation where post-deployment activities need to take place; and

- Test automation, where the deployed model needs to be inspected post-deployment.

The impact of not having some sort of approach in place is that testing begins too early, introducing race conditions where a test may pass when the weather is good, but may start to fail when the substrate is under load, or the internet is slower, or any other variable impacts the timing of things.

We too (OpenStack Engineering) have struggled over time, and believe we have conquered the art of systematically detecting when a Juju deployment is actually "done." We did this by implementing extended status messaging into the charms, allowing the charm to declare itself "ready." When all units in a model do so, we proceed. We've found that all other approaches leave varying degrees of raciness.

We've found that in cases where not all charms possess intentional extended status advertising, the juju-wait plugin almost always reliably waits the necessary amount of time for the deployment to complete. Even with that, we've found a couple of gaps and currently have a juju-wait fork and corresponding merge proposal to address those.

Unfortunately, at this time there is no global, uniform extended status message to watch across all charms. We've implemented a predictable extended status message in the OpenStack charms, but that message may differ from what other charms use. This means that currently, one approach may not translate well to another type of workload.
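
To illustrate the "proceed when all units declare themselves ready" approach, here is a minimal sketch that lists the units whose workload status does not yet pass a caller-supplied readiness test. The status layout (`services`, `units`, `workload-status`, `current`) and the names `units_not_ready` and `workload_is_active` are assumptions for illustration, not part of any Juju or charm API. Because there is no uniform ready message, the readiness test is passed in rather than hard-coded.

```python
def units_not_ready(status, is_ready):
    """Return sorted names of units whose workload status fails is_ready.

    `status` is assumed to be parsed status output; `is_ready` receives
    each unit's workload-status dict (empty if the unit reports none).
    """
    pending = []
    for service in (status.get("services") or {}).values():
        for name, unit in (service.get("units") or {}).items():
            if not is_ready(unit.get("workload-status") or {}):
                pending.append(name)
    return sorted(pending)


def workload_is_active(workload_status):
    """Example readiness test: treat 'active' as ready (an assumption)."""
    return workload_status.get("current") == "active"
```

An empty result from `units_not_ready(status, workload_is_active)` would mean every unit passed the test. Charms that never set an extended status would show up as pending forever, which is why a fallback such as juju-wait is still needed for them.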

Some observations, opinions, mixed with some facts:

`juju deploy foo` exits 0 nearly immediately, so that cannot be used to block/wait.

juju-deployer exits 0 many minutes (2 to 35 minutes in practice) before the OpenStack deployments are actually done.

Amulet's wait logic has grown to be closer to perfect in waiting for things to wrap up, but we still observe test races when using that alone, especially when subordinates are involved.

Amulet's "wait for extended status" logic is solid, if the charm is written to declare itself ready.

For legacy charms which do not / will not have extended status, juju-wait is nearly perfect in predicting readiness.

...

To sum up:

1. juju-wait is the closest thing to a global, generically usable way to block/wait for deployment readiness that I have seen and used.

2. Extended status is the only guaranteed way to block/wait for a deployment to complete, presuming that the charm author's logic checks that the required relations are met, deployed services are up, processes are running and any expected network sockets are bound, listening and responsive -- before declaring itself ready.

3. Since a typical OpenStack deployment includes two charms which do not have extended status (mysql and mongodb), we do both [1] and [2], and that successfully avoids the "Am-I-really-ready?" races we used to battle. Now we are free to chase more meaningful races in the deployed workloads. ;-)

4. Sleep not. If you find yourself about to add time.sleep() anywhere in anything outside of a retry loop, you probably shouldn't. It will eventually race.
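
The "retry loop" in point 4 can be sketched as a poll with a deadline, where the sleep only spaces out repeated checks instead of standing in for one. This is a generic pattern, not code from juju-wait or Amulet. The name `wait_until` and its parameters are hypothetical.

```python
import time


def wait_until(check, timeout=600.0, interval=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns true or the timeout elapses.

    Returns True as soon as check() passes, False once the deadline is
    reached. `clock` and `sleep` are injectable so the loop can be tested
    without real delays.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The check always runs before any sleep, so a deployment that is already ready returns immediately, and a slow one fails with an explicit timeout rather than racing past a fixed delay.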

Note: this is all in the context of Juju 1.x and related tooling as of this date. We're still evaluating how it changes, if at all, in the new and exciting Juju 2.x world.

tags:	added: uosci

Anastasia (anastasia-macmood) on 2016-08-03

Changed in juju-core:
importance:	Low → Wishlist

Paul Gear (paulgear) wrote on 2016-08-19:

Anastasia asked me to comment on this from the Canonical IS perspective, but I don't think I could do so any better than Ryan has already done.

We need a reliable method of determining when it is 100% safe to run post-deployment steps. This is all but essential for any CI infrastructure depending on juju, and I think it should be considered much higher priority than wishlist - probably at least a medium.

Canonical Juju QA Bot (juju-qa-bot) on 2016-08-23

affects:	juju-core → juju

Adam Stokes (adam-stokes) on 2017-02-10

tags:	added: conjure

Canonical Juju QA Bot (juju-qa-bot) wrote on 2022-11-03:

This bug has not been updated in 5 years, so we're marking it Expired. If you believe this is incorrect, please update the status.

Changed in juju:
status:	Triaged → Expired
tags:	added: expirebugs-bot
