k8s: unable to fetch OCI resources - empty id is not valid

Bug #1999060 reported by James Page
This bug affects 1 person
Affects          Status        Importance  Assigned to      Milestone
Canonical Juju   Fix Released  High        Heather Lanigan
3.1              Fix Released  High        Heather Lanigan

Bug Description

juju: 3.0/stable
substrate: microk8s

On occasion (so not 100% of the time) deployed charmed operators get stuck in waiting/allocating state with the following error message in the debug log:

controller-0: 13:54:18 INFO juju.worker.caasapplicationprovisioner.runner stopped "neutron-mysql", err: getting OCI image resources: unable to fetch OCI image resources for neutron-mysql: empty id not valid
controller-0: 13:54:18 ERROR juju.worker.caasapplicationprovisioner.runner exited "neutron-mysql": getting OCI image resources: unable to fetch OCI image resources for neutron-mysql: empty id not valid
controller-0: 13:54:18 INFO juju.worker.caasapplicationprovisioner.runner restarting "neutron-mysql" in 3s
controller-0: 13:54:18 INFO juju.worker.caasapplicationprovisioner.runner start "glance-mysql"
controller-0: 13:54:21 INFO juju.worker.caasapplicationprovisioner.runner start "neutron-mysql"
controller-0: 13:54:21 INFO juju.worker.caasapplicationprovisioner.runner stopped "traefik-internal", err: getting OCI image resources: unable to fetch OCI image resources for traefik-internal: empty id not valid
controller-0: 13:54:21 ERROR juju.worker.caasapplicationprovisioner.runner exited "traefik-internal": getting OCI image resources: unable to fetch OCI image resources for traefik-internal: empty id not valid
controller-0: 13:54:21 INFO juju.worker.caasapplicationprovisioner.runner restarting "traefik-internal" in 3s
controller-0: 13:54:23 INFO juju.worker.caasapplicationprovisioner.runner stopped "glance-mysql", err: getting OCI image resources: unable to fetch OCI image resources for glance-mysql: empty id not valid
controller-0: 13:54:23 ERROR juju.worker.caasapplicationprovisioner.runner exited "glance-mysql": getting OCI image resources: unable to fetch OCI image resources for glance-mysql: empty id not valid
controller-0: 13:54:23 INFO juju.worker.caasapplicationprovisioner.runner restarting "glance-mysql" in 3s
controller-0: 13:54:24 INFO juju.worker.caasapplicationprovisioner.runner start "traefik-internal"
controller-0: 13:54:25 INFO juju.worker.caasapplicationprovisioner.runner stopped "neutron-mysql", err: getting OCI image resources: unable to fetch OCI image resources for neutron-mysql: empty id not valid
controller-0: 13:54:25 ERROR juju.worker.caasapplicationprovisioner.runner exited "neutron-mysql": getting OCI image resources: unable to fetch OCI image resources for neutron-mysql: empty id not valid
controller-0: 13:54:25 INFO juju.worker.caasapplicationprovisioner.runner restarting "neutron-mysql" in 3s
controller-0: 13:54:26 INFO juju.worker.caasapplicationprovisioner.runner start "glance-mysql"

Oddly, other applications deployed in the same model using exactly the same charm tracks and versions (mysql and traefik in this case) work fine.

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

What are the steps to reproduce this issue? Will it reproduce with microk8s as a substrate?

It could be related to async charm download, though in that case it should have resolved itself; or it could be something else. Reproducing the issue will help track down the exact cause.

Changed in juju:
status: New → Triaged
importance: Undecided → High
Revision history for this message
James Page (james-page) wrote (last edit ):

Hi Heather

microk8s as a substrate is fine for reproduction.

I've attached the terraform configuration I'm using for deployment - this is where I've seen this issue most often.

I left one running over the weekend, but the issue never resolved.

Revision history for this message
James Page (james-page) wrote :

I think this is something related to high concurrency of operations.

All of the charms that went into this state had multiple deployed instances in the same model (mysql-k8s, traefik-k8s).

When I pushed the concurrency that terraform uses down to 1 concurrent operation (rather than 10), I did not experience the same issue.
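
For reference (assuming the deployment is driven directly by terraform apply), the concurrency mentioned above corresponds to Terraform's -parallelism flag, which defaults to 10 concurrent operations. A workaround sketch based on that observation:

terraform apply -parallelism=1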

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

I was able to reproduce this. There is a timing window with bundles and the terraform provider where deploying the same charm under different application names can leave some of the applications without the charm ID in their charm origin, preventing resource download. This would also prevent application refresh.

Reproduced with lxd and the following bundle:

applications:
  juju-qa-test:
    charm: juju-qa-test
    num_units: 3
  juju-qa-test3:
    charm: juju-qa-test
    num_units: 2
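
For reference, a bundle like the one above can be deployed straight from a local file (assuming it is saved as, say, repro-bundle.yaml):

juju deploy ./repro-bundle.yaml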

juju:PRIMARY> db.applications.find({},{"charm-origin.id":1}).pretty()
{
 "_id" : "864917d8-73f3-4ec7-8a38-90db7f68a348:juju-qa-test",
 "charm-origin" : {
  "id" : "Hw30RWzpUBnJLGtO71SX8VDWvd3WrjaJ"
 }
}
{
 "_id" : "864917d8-73f3-4ec7-8a38-90db7f68a348:juju-qa-test3",
 "charm-origin" : {
  "id" : ""
 }
}

A bug in the async charm download code.

Revision history for this message
Heather Lanigan (hmlanigan) wrote (last edit ):

Correction: the bug is in the bundle deploy code, not the async charm download.

When deploying a charm individually, AddCharm is called multiple times (once per application), and the later calls hit the already-downloaded path of the async charm download. Thus the issue doesn't reproduce outside of a bundle (and the terraform provider).

When deploying a bundle, the changes are reviewed, a common charm used under different application names is found, and AddCharm is called only once. Previously, with synchronous charm download, the charm's ID could be put into all of the applications' charm-origin because the ID was known before the applications were added to juju by the bundle.
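
To make the ordering concrete, below is a minimal, self-contained Go sketch of the window described above. It is not Juju code: the types and helpers (origin, resolveAndDownloadCharm, deployBundleSync, deployBundleAsync) are made up purely for illustration, and the async variant simply mirrors the DB dump above, where only one of the two applications ends up with the charm ID.

package main

import "fmt"

type origin struct{ ID string }

// Stand-in for resolving the charm on charmhub and downloading it; returns
// the charm ID seen in the DB dump above.
func resolveAndDownloadCharm(name string) string {
    return "Hw30RWzpUBnJLGtO71SX8VDWvd3WrjaJ"
}

// Synchronous download: the charm ID is known before any application is
// added, so every application's charm-origin carries it.
func deployBundleSync(apps []string) map[string]origin {
    id := resolveAndDownloadCharm("juju-qa-test")
    out := make(map[string]origin)
    for _, app := range apps {
        out[app] = origin{ID: id}
    }
    return out
}

// Asynchronous download: the applications are added first with an empty ID.
// In this sketch only the first application is updated once the download
// completes, mirroring the empty charm-origin.id seen for juju-qa-test3.
func deployBundleAsync(apps []string) map[string]origin {
    out := make(map[string]origin)
    for _, app := range apps {
        out[app] = origin{ID: ""}
    }
    id := resolveAndDownloadCharm("juju-qa-test")
    out[apps[0]] = origin{ID: id}
    return out
}

func main() {
    fmt.Println(deployBundleSync([]string{"juju-qa-test", "juju-qa-test3"}))
    fmt.Println(deployBundleAsync([]string{"juju-qa-test", "juju-qa-test3"}))
}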

Changed in juju:
milestone: none → 3.0.3
assignee: nobody → Heather Lanigan (hmlanigan)
status: Triaged → In Progress
Revision history for this message
Heather Lanigan (hmlanigan) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released