Deployments fail when juju implicitly upgrades after bootstrap

Bug #1455260 reported by Curtis Hovey
This bug affects 9 people
Affects            Status        Importance  Assigned to  Milestone
juju-core          Fix Released  High        Ian Booth
juju-quickstart    New           Undecided   Unassigned
python-jujuclient  Confirmed     Undecided   Unassigned

Bug Description

This is a meta bug that describes a problem with many symptoms and many unadvisable workarounds.

Enterprises commonly automate the deployment of services. They can use juju-quickstart, juju-deployer, landscape autopilot, or their own bespoke script to bootstrap an environment and deploy services. It works repeatedly and reliably for many weeks or months, until a new micro version of juju is placed in the streams. The enterprise then sees failures in many ways; commonly the deployment fails because the script lost connection to its watcher, or because juju failed to upgrade at the same moment that charms were installing and configuring services.

When a juju state-server is bootstrapped, one of its first actions is to query streams and start an upgrade to the highest micro version for its major.minor version. For example, if the juju client installed 1.22.1 and the current version in streams is 1.22.3, the state-server starts upgrading. This upgrade will complete in less than a minute. A savvy script bootstrapping an env would wait for the upgrade to complete, because upgrading a single state-server is faster and more reliable than upgrading services too.
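The selection rule described above can be illustrated with a minimal Python sketch. This is not juju's code; the function names are hypothetical, and it assumes plain "major.minor.micro" version strings (no alpha or beta tags).

```python
def parse_version(text):
    """Split a 'major.minor.micro' string into a tuple of ints."""
    major, minor, micro = (int(part) for part in text.split("."))
    return major, minor, micro


def pick_upgrade_target(client_version, stream_versions):
    """Return the highest stream version sharing the client's major.minor.

    Returns client_version unchanged when streams hold nothing newer
    in the same major.minor series.
    """
    current = parse_version(client_version)
    candidates = [
        v for v in stream_versions
        if parse_version(v)[:2] == current[:2] and parse_version(v) > current
    ]
    if not candidates:
        return client_version
    # Compare numerically so that 1.22.10 sorts above 1.22.9.
    return max(candidates, key=parse_version)
```

With a 1.22.1 client and streams holding 1.22.2, 1.22.3 and 1.23.0, the sketch picks 1.22.3 and ignores 1.23.0 because it belongs to a different minor series.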

Enterprises do not like this default behaviour, however, and their tools were not written to account for this "surprising" behaviour. There are several strategies employed to ensure the state-server is exactly the version that was tested previously:

A. Juju CI and a few others set "agent-version: 1.22.1" to ensure the state-server matches the version under test. But many parties do not like this method because environments.yaml must change each time they upgrade to a new juju client (and server).

B. Canonical IS and many customers use --upload-tools to force the state-server to be a known version. This however makes explicit upgrades VERY unpredictable because the juju-client selects agents based on the localhost's arch, series, and $PATH (which might include development jujus). There are several bugs about failed upgrades, and --upload-tools was a factor.

C. The company never uses current juju. They choose to use a version that is not getting updates, like 1.22.x which is not 1.23.x, except that we have delivered updates to their surprise. They are also not getting the benefits of current juju.

Juju chooses to implicitly upgrade because it is a way to deliver compatibility fixes. Azure, AWS, and HP have changed their clouds, and we delivered a new micro version of juju a few days later to ensure juju "Just worked".

Several changes are needed to ensure enterprises have a reliable and repeatable experience:

1. All API clients must reconnect watchers when they are disconnected. They will be disconnected because of network issues as well as explicit disconnects during upgrade-juju. The clients need to resume their work.

2.A If juju must upgrade, it needs to prevent clients from starting work until the upgrade is complete. This might mean bootstrap doesn't let go until the state-server is upgraded.

2.B Or juju stops implicit upgrades. No enterprise uses this feature; they work hard to disable it. Juju could instead inform the party that an upgrade is available (as is done for charms).
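As an illustration of point 1, a client-side reconnect loop might look like the following Python sketch. It is not taken from python-jujuclient or any other juju client: `connect` and `handle` are hypothetical callables supplied by the caller, and the sketch assumes a dropped connection surfaces as Python's built-in ConnectionError.

```python
import time


def watch_with_reconnect(connect, handle, max_attempts=5, base_delay=1.0,
                         sleep=time.sleep):
    """Consume watcher events, reconnecting when the connection drops.

    connect: hypothetical callable returning an iterable of events.
    handle: callable invoked once for each event.
    Re-raises after max_attempts consecutive failures.
    """
    failures = 0
    while True:
        try:
            for event in connect():
                failures = 0  # an event arrived, so the connection is healthy
                handle(event)
            return  # the event stream ended cleanly
        except ConnectionError:
            failures += 1
            if failures >= max_attempts:
                raise
            # Exponential backoff between attempts: 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** (failures - 1))
```

A real client would also need to re-read the environment state after reconnecting, since events emitted while it was disconnected may have been missed; the sketch only covers re-establishing the watcher.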

David Britton (dpb)
no longer affects: cloud-installer
Curtis Hovey (sinzui) wrote :

These issues involved --upload-tools and upgrade-juju:
bug 1392810, bug 1307643, bug 1325040.

Curtis Hovey (sinzui)
description: updated
description: updated
Changed in juju-deployer:
status: New → Confirmed
David Britton (dpb)
no longer affects: juju-deployer
Changed in python-jujuclient:
status: New → Confirmed
Ian Booth (wallyworld) wrote :

This bug is really 2 parts:
1. auto upgrade
2. enable the API to allow deployer etc. to connect even if upgrades are pending

A fix is done for part 2
https://github.com/juju/juju/pull/2441

When the machine agent starts, if there are upgrades pending, it will not enable the API to allow clients to connect until the upgrade is done.

Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Ian Booth (wallyworld) wrote :

Opened bug 1459912 for the above fix.
That bug will be marked as Fix Committed and this one will be left open to track the auto upgrade fix.

no longer affects: juju-core/1.24
Ian Booth (wallyworld)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
David Britton (dpb)
tags: added: kanban-cross-team
tags: removed: kanban-cross-team