Transitive failure in units during upgrade/refresh

Bug #2053242 reported by Enrico Deusebio
This bug affects 2 people
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Heather Lanigan

Bug Description

I'm currently testing an upgrade process on a deployment with a machine charm (zookeeper, from rev114 to latest), running on Juju 3.1.6 (I have also tested on 3.1.7).

First, I deploy rev114 with:

```
juju deploy zookeeper --channel 3/edge --revision 114 -n 3
```

When I refresh the charm with `juju refresh zookeeper`, the units fail at first. This is shown both in the `juju status` output and in the debug log, which contains a message like the following:

```
juju.worker.dependency "uniter" manifold worker returned unexpected error: preparing operation "upgrade to ch:amd64/jammy/zookeeper-121" for zookeeper/0: failed to download charm "ch:amd64/jammy/zookeeper-121" from API server: Get https://10.178.146.204:17070/model/de79be2c-6cc3-4401-8d85-2a27c9c80c7e/charms?file=%2A&url=ch%3Aamd64%2Fjammy%2Fzookeeper-121: cannot retrieve charm: ch:amd64/jammy/zookeeper-121
```

After some time (one or two minutes), the failure self-heals and resolves. However, the messages can be misleading, especially during upgrade processes, which are very delicate.

Although this was observed on the zookeeper charm, it is not specific to that charm: the behavior is general and applies to all other charms as well.

Heather Lanigan (hmlanigan) wrote :

This issue can also happen with deploy, and has been seen in large models.

Due to the async charm download feature introduced in Juju 3.0, it's possible that the uniter tries to get the charm from the controller before the controller has finished downloading it. At that point the uniter errors and retries later.

Instead, the controller could return a "download pending, try again later" error, so the uniter knows the charm is queued for download or that the download is in progress.
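The idea above can be illustrated with a minimal, self-contained Go sketch. It is not Juju's actual code: `errDownloadPending`, `charmStore`, `fakeStore`, and `fetchWithRetry` are hypothetical names invented for this example. It only shows the general pattern of a distinguishable "pending" error that the caller treats as a reason to wait quietly rather than as a failure.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDownloadPending is a hypothetical sentinel error (not a real Juju error)
// meaning "the charm is queued or still downloading, try again later".
var errDownloadPending = errors.New("charm download pending, try again later")

// charmStore stands in for the controller side that serves charms.
type charmStore interface {
	Get(url string) ([]byte, error)
}

// fakeStore reports the charm as pending for the first readyAfter calls,
// then serves it. It simulates an asynchronous download finishing.
type fakeStore struct {
	readyAfter int
	calls      int
}

func (s *fakeStore) Get(url string) ([]byte, error) {
	s.calls++
	if s.calls <= s.readyAfter {
		return nil, fmt.Errorf("charm %q: %w", url, errDownloadPending)
	}
	return []byte("charm-bytes"), nil
}

// fetchWithRetry retries only while the store reports a pending download.
// Any other error is returned immediately as a genuine failure.
func fetchWithRetry(store charmStore, url string, attempts int, delay time.Duration) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		data, err := store.Get(url)
		if err == nil {
			return data, nil
		}
		if !errors.Is(err, errDownloadPending) {
			return nil, err
		}
		lastErr = err
		time.Sleep(delay)
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
	store := &fakeStore{readyAfter: 2}
	data, err := fetchWithRetry(store, "ch:amd64/jammy/zookeeper-121", 5, 10*time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("fetched %d bytes after %d calls\n", len(data), store.calls)
}
```

With a distinct error like this, the caller can tell "not ready yet" apart from "cannot retrieve charm", so a waiting state could be reported instead of a failed one.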

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 3.3.3
Alex Lutay (taurus) wrote :

The failed agent status also affects all SQL charms in automated tests on Juju 3 (it works well on 2.9). Example: https://github.com/canonical/pgbouncer-operator/pull/153/files#r1491728006

I confirm the issue is cosmetic and self-heals quickly, but it has high visibility and affects automated tests.

P.S. We noticed it also scares new charm users a lot.
The `agent=failed` output in `juju status` for all units after `juju refresh` is really scary the first time.

Thanks for the possible quick fix here!

tags: added: canonical-data-platform-eng
Harry Pidcock (hpidcock)
summary: - Transiently failure in units during upgrade/refresh
+ Transitive failure in units during upgrade/refresh
Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
Ian Booth (wallyworld)
Changed in juju:
milestone: 3.3.3 → 3.3.4
Changed in juju:
milestone: 3.3.4 → 3.3.5
Changed in juju:
milestone: 3.3.5 → 3.3.6