Transitive failure in units during upgrade/refresh

Bug #2053242 reported by Enrico Deusebio
This bug affects 2 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Canonical Juju | Triaged | Medium | Heather Lanigan | |

Bug Description

I'm currently testing an upgrade process on a deployment with a machine charm (zookeeper, from rev 114 to latest), running on Juju 3.1.6 (I have also tested on 3.1.7).

First, I deploy rev114 with

```
juju deploy zookeeper --channel 3/edge --revision 114 -n 3
```

When I refresh the charm with `juju refresh zookeeper`, the units fail at first. This is visible both in the `juju status` output and in the debug log, which contains a message like the following:

```
juju.worker.dependency "uniter" manifold worker returned unexpected error: preparing operation "upgrade to ch:amd64/jammy/zookeeper-121" for zookeeper/0: failed to download charm "ch:amd64/jammy/zookeeper-121" from API server: Get https://10.178.146.204:17070/model/de79be2c-6cc3-4401-8d85-2a27c9c80c7e/charms?file=%2A&url=ch%3Aamd64%2Fjammy%2Fzookeeper-121: cannot retrieve charm: ch:amd64/jammy/zookeeper-121
```
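The query string of the failing request is percent-encoded, which makes the log line harder to read. As a small aid, the following Python sketch decodes it using only the standard library. The URL is copied from the message above; nothing here depends on Juju itself.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied verbatim from the debug-log message above.
failing_url = (
    "https://10.178.146.204:17070/model/"
    "de79be2c-6cc3-4401-8d85-2a27c9c80c7e/charms"
    "?file=%2A&url=ch%3Aamd64%2Fjammy%2Fzookeeper-121"
)

parts = urlsplit(failing_url)
query = parse_qs(parts.query)  # parse_qs also decodes the %XX escapes

charm_url = query["url"][0]
requested_file = query["file"][0]

print(charm_url)       # ch:amd64/jammy/zookeeper-121
print(requested_file)  # *
```

The decoded `url` parameter matches the charm named in the "upgrade to" operation, so the unit is requesting the new revision from the controller.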

After some time (one or two minutes), the failure self-heals and resolves. However, the messages can be misleading, especially during upgrade processes, which are very delicate.

Although this was observed on the zookeeper charm, it is not specific to this charm: it has proven to be general and to apply to all others.

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

This issue can happen with deploy as well, and has been seen in large models.

Due to the async charm download feature introduced in Juju 3.0, it's possible that the uniter tries to get the charm from the controller before the controller has finished downloading it, at which point the uniter errors and retries later.

Instead, the controller could return a "download pending, try again later" error, so the uniter knows the charm is queued for download or that the download is in progress.
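The suggestion above can be illustrated with a minimal Python sketch. This is not Juju's implementation (Juju is written in Go), and every name in it (`CharmDownloadPending`, `CharmNotFound`, `fetch_charm_with_retry`) is hypothetical. It only shows the idea: a distinct "pending" error lets the caller treat an in-flight download as an expected state and retry quietly, while any other error still surfaces immediately.

```python
import time
from typing import Callable


class CharmDownloadPending(Exception):
    """Hypothetical error: the charm is queued or still downloading."""


class CharmNotFound(Exception):
    """Hypothetical error: the controller does not know this charm."""


def fetch_charm_with_retry(
    fetch: Callable[[str], bytes],
    charm_url: str,
    attempts: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(charm_url), retrying only while the download is pending."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(charm_url)
        except CharmDownloadPending:
            if attempt == attempts:
                raise  # still pending after the last attempt: surface it
            sleep(delay_seconds)  # expected state: wait, then try again
    raise ValueError("attempts must be at least 1")
```

With a distinction like this, only a download that stays pending for too long, or a genuinely unknown charm, would need to be reported as a failure.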

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 3.3.3
Alex Lutay (taurus) wrote :

The failed agent status is also affecting all SQL charms in automated tests on Juju 3 (while they work well on 2.9). Example: https://github.com/canonical/pgbouncer-operator/pull/153/files#r1491728006

I confirm the issue is cosmetic and self-heals quickly, but it has high visibility and affects automated tests.

P.S. we noticed it also scares new charm users a lot.
The `agent=failed` output on `juju status` for all units after `juju refresh` is really scary the first time.

Tnx for the possible quick fix here!
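Until a fix lands, a test suite could tolerate the transient failure by polling until no unit agent reports `failed`. The Python sketch below is one possible approach, not an official recommendation. It assumes that `juju status --format json` exposes each unit's agent status at `applications.<app>.units.<unit>.juju-status.current`; the function names and the timeout values are arbitrary choices made for illustration.

```python
import json
import subprocess
import time
from typing import Callable


def failed_agents(status: dict) -> list[str]:
    """Return the names of units whose agent status is 'failed'."""
    failed = []
    for app in status.get("applications", {}).values():
        for unit_name, unit in app.get("units", {}).items():
            # Assumed layout: unit["juju-status"]["current"] is the agent status.
            if unit.get("juju-status", {}).get("current") == "failed":
                failed.append(unit_name)
    return sorted(failed)


def juju_status() -> dict:
    """Read the current model status through the juju CLI."""
    out = subprocess.run(
        ["juju", "status", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def wait_for_agents_to_recover(
    get_status: Callable[[], dict] = juju_status,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until no agent is 'failed'; return False if the timeout expires."""
    deadline = clock() + timeout
    while True:
        if not failed_agents(get_status()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test would call `wait_for_agents_to_recover()` right after `juju refresh` and fail only if it returns `False`, so a failure that persists past the timeout is still caught.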

tags: added: canonical-data-platform-eng
Harry Pidcock (hpidcock)
summary: - Transiently failure in units during upgrade/refresh
+ Transitive failure in units during upgrade/refresh
Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
Ian Booth (wallyworld)
Changed in juju:
milestone: 3.3.3 → 3.3.4
Changed in juju:
milestone: 3.3.4 → 3.3.5
Changed in juju:
milestone: 3.3.5 → 3.3.6
Changed in juju:
assignee: Heather Lanigan (hmlanigan) → Caner Derici (cderici)
milestone: 3.3.6 → 3.4.4
Caner Derici (cderici)
Changed in juju:
status: Triaged → In Progress
Caner Derici (cderici) wrote :

When I reproduce this, the error I observe (with Juju 3.4) is slightly different from the one reported.

```
unit-zookeeper-1: 00:28:40 ERROR juju.worker.uniter resolver loop error: preparing operation "upgrade to ch:amd64/jammy/zookeeper-134" for zookeeper/1: failed to download charm "ch:amd64/jammy/zookeeper-134" from API server: download request with archiveSha256 length 0 not valid
```

The fix for it is up: https://github.com/juju/juju/pull/17504

With that change, I no longer observe the error above, nor the one that's reported initially.

Caner Derici (cderici) wrote :

Transferring to Heather as instructed, for a more holistic approach to straighten up the async charm downloads.

Changed in juju:
assignee: Caner Derici (cderici) → Heather Lanigan (hmlanigan)
Changed in juju:
milestone: 3.4.4 → 3.4.5
Changed in juju:
milestone: 3.4.5 → 3.4.6
Changed in juju:
importance: High → Medium
milestone: 3.4.6 → none
status: In Progress → Triaged