juju unit agent starts trying to install before the charm is downloaded

Bug #2041263 reported by John A Meinel
This bug affects 6 people
Affects: Canonical Juju · Status: Triaged · Importance: Medium · Assigned to: Unassigned

Bug Description

I tried doing a `juju deploy --to 0 postgresql`, and it seems that the machine agent notices the new unit and starts a unit agent before the charm has even finished downloading, i.e. before we know what content it is going to run.

Specifically:
```
2023-10-26 15:18:13 INFO juju unit_agent.go:289 Starting unit workers for "postgresql/0"
2023-10-26 15:18:13 INFO juju.worker.apicaller connect.go:163 [98b178] "unit-postgresql-0" successfully connected to "10.10.30.109:17070"
2023-10-26 15:18:13 INFO juju.worker.apicaller connect.go:260 [98b178] password changed for "unit-postgresql-0"
2023-10-26 15:18:13 INFO juju.worker.apicaller connect.go:163 [98b178] "unit-postgresql-0" successfully connected to "10.10.30.109:17070"
2023-10-26 15:18:13 INFO juju.worker.upgrader upgrader.go:141 no waiter, upgrader is done
2023-10-26 15:18:13 INFO juju.worker.migrationminion worker.go:142 migration phase is now: NONE
2023-10-26 15:18:13 INFO juju.worker.logger logger.go:120 logger worker started

2023-10-26 15:18:13 ERROR juju.worker.meterstatus runner.go:93 error running "meter-status-changed": charm missing from disk
2023-10-26 15:18:13 INFO juju.worker.uniter uniter.go:363 unit "postgresql/0" started
2023-10-26 15:18:13 INFO juju.worker.uniter uniter.go:689 resuming charm install
2023-10-26 15:18:13 INFO juju.worker.uniter.charm bundles.go:82 downloading ch:amd64/jammy/postgresql-336 from API server
2023-10-26 15:18:13 INFO juju.worker.uniter uniter.go:347 unit "postgresql/0" shutting down: preparing operation "install ch:amd64/jammy/postgresql-336" for postgresql/0: failed to download charm "ch:amd64/jammy/postgresql-336" from API server: download request with archiveSha256 length 0 not valid
2023-10-26 15:18:13 ERROR juju.worker.dependency engine.go:695 "uniter" manifold worker returned unexpected error: preparing operation "install ch:amd64/jammy/postgresql-336" for postgresql/0: failed to download charm "ch:amd64/jammy/postgresql-336" from API server: download request with archiveSha256 length 0 not valid
2023-10-26 15:18:17 INFO juju.worker.uniter uniter.go:363 unit "postgresql/0" started
2023-10-26 15:18:17 INFO juju.worker.uniter uniter.go:689 resuming charm install
2023-10-26 15:18:17 INFO juju.worker.uniter.charm bundles.go:82 downloading ch:amd64/jammy/postgresql-336 from API server
```

I think we don't typically notice this because the machine usually takes long enough to start that the download is complete by the time the agent looks for the charm. But it does seem that we are basing 'do I have a unit that I should start' on the wrong bit of data if we try to start it before the controller actually has that information, and we are depending on the ERROR and restart of the unit agent worker to keep bouncing until the charm has finished downloading.

Note that this isn't high priority, because we do eventually bounce until the controller has finished downloading, and then we progress normally.

Similarly, though, 'meter-status-changed' shouldn't be running before we have unpacked the charm on disk for it to run anything.

Revision history for this message
Marcelo Henrique Neppel (neppel) wrote (last edit ):

This also happens when trying to refresh a charm in a LXD cloud, such as the data platform charms. In that situation, we see the unit in a failed state for some seconds before the upgrade proceeds.

Below are the logs from another `juju refresh` call on LXD, with the easyrsa charm, which also enters a failed state for a few seconds before being upgraded.

```
unit-easyrsa-0: 18:09:52 ERROR juju.worker.uniter resolver loop error: preparing operation "upgrade to ch:amd64/jammy/easyrsa-48" for easyrsa/0: failed to download charm "ch:amd64/jammy/easyrsa-48" from API server: Get https://10.28.44.219:17070/model/db3cad23-9bc3-4762-8939-9e774e0451b1/charms?file=%2A&url=ch%3Aamd64%2Fjammy%2Feasyrsa-48: cannot retrieve charm: ch:amd64/jammy/easyrsa-48
unit-easyrsa-0: 18:09:52 INFO juju.worker.uniter unit "easyrsa/0" shutting down: preparing operation "upgrade to ch:amd64/jammy/easyrsa-48" for easyrsa/0: failed to download charm "ch:amd64/jammy/easyrsa-48" from API server: Get https://10.28.44.219:17070/model/db3cad23-9bc3-4762-8939-9e774e0451b1/charms?file=%2A&url=ch%3Aamd64%2Fjammy%2Feasyrsa-48: cannot retrieve charm: ch:amd64/jammy/easyrsa-48
unit-easyrsa-0: 18:09:52 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: preparing operation "upgrade to ch:amd64/jammy/easyrsa-48" for easyrsa/0: failed to download charm "ch:amd64/jammy/easyrsa-48" from API server: Get https://10.28.44.219:17070/model/db3cad23-9bc3-4762-8939-9e774e0451b1/charms?file=%2A&url=ch%3Aamd64%2Fjammy%2Feasyrsa-48: cannot retrieve charm: ch:amd64/jammy/easyrsa-48
```

It's important not to surface this failed status, as it may lead users to think that something is broken in the upgrade process.

Revision history for this message
Carl Csaposs (carlcsaposs) wrote :

Somewhat related, we also see these errors on every deployment:
```
machine-0: 14:39:04 ERROR juju.worker.dependency "kvm-container-provisioner" manifold worker returned unexpected error: container types
machine-0: 14:39:04 ERROR juju.worker.dependency "broker-tracker" manifold worker returned unexpected error: no container types determined
```

example: https://github.com/canonical/mysql-operator/actions/runs/6891527706/job/18748446787#step:13:231

tags: added: canonical-data-platform-eng