Podspec charms stuck in waiting for container when deployed with juju 3.4

Bug #2060943 reported by Daniela Plascencia
Affects             Status          Importance   Assigned to   Milestone
Canonical Juju      Fix Committed   High         Ian Booth
Canonical Juju 3.3  Fix Committed   High         Ian Booth

Bug Description

When deployed in an environment with juju 3.4.1, podspec charms get stuck in waiting state with message "waiting for container".

Steps to reproduce:
1. Deploy a podspec charm from Charmhub (kubeflow-volumes 1.8/stable, minio ckf-1.8/stable, mlmd 1.14/stable)
2. Run juju status and observe the units go into the described state
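
For concreteness, a minimal reproduction along these lines (the model name is arbitrary, and --trust may or may not be required by these charms):

juju add-model kubeflow
juju deploy kubeflow-volumes --channel 1.8/stable --trust
juju deploy minio --channel ckf-1.8/stable
juju deploy mlmd --channel 1.14/stable
juju status    # units stay in waiting with "waiting for container"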

Environment:
* microk8s 1.25-strict/stable
* juju 3.4.1 (from 3.4/stable channel)
* microk8s addons dns hostpath-storage rbac metallb:10.64.140.43-10.64.140.49

Observations:

* The issue is not present when the mentioned charms are built and deployed locally in the same environment. For instance, when kubeflow-volumes is built from branch gh:track/1.8 and then deployed with juju deploy ./path-to.charm --resource oci-image=docker.io/kubeflownotebookswg/volumes-web-app:v1.8.0, the issue is no longer reproducible (see the sketch after this list).

* The issue is intermittent, but I have been able to reproduce it by deploying the aforementioned charms; some of our CIs are also reporting this behaviour:
1. https://github.com/canonical/envoy-operator/actions/runs/8613060321/job/23603612172#step:4:731 and https://github.com/canonical/envoy-operator/actions/runs/8613060321/job/23603612172#step:6:16
2. https://github.com/canonical/argo-operators/actions/runs/8613058980/job/23603607202#step:4:721 and https://github.com/canonical/argo-operators/actions/runs/8613058980/job/23603607202#step:4:739

* This behaviour only applies to podspec charms: of all our charms (30+), only the podspec ones seem to be failing.
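
For reference, a sketch of the local build-and-deploy flow from the first observation (the repository URL and the packed .charm filename below are illustrative, not the exact ones used):

# build the charm from the track/1.8 branch (illustrative repository URL)
git clone -b track/1.8 https://github.com/canonical/kubeflow-volumes-operator.git
cd kubeflow-volumes-operator
charmcraft pack
# deploy the local artifact, supplying the upstream image as the oci-image resource
juju deploy ./kubeflow-volumes_ubuntu-20.04-amd64.charm \
    --resource oci-image=docker.io/kubeflownotebookswg/volumes-web-app:v1.8.0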

Tags: k8s
Revision history for this message
Ian Booth (wallyworld) wrote :

Thanks for the bug report.

It is weird that there's no failure when deploying locally built charms.
Have you downloaded the charmhub charm (e.g. using juju download) and compared it with the locally built one?
One additional thought: the locally built charm pulls from docker.io/kubeflownotebookswg/volumes-web-app:v1.8.0, whereas I suspect the charmhub one would be pulling the oci image from the charmhub registry. Maybe that is unreliable and occasionally fails.
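
For concreteness, such a comparison could look roughly like this (archive filenames are illustrative; juju download fetches the .charm archive from Charmhub):

juju download kubeflow-volumes --channel 1.8/stable
unzip -q kubeflow-volumes_*.charm -d ch-charm       # the downloaded archive
unzip -q ./path-to-local.charm -d local-charm       # the locally built one
diff -r ch-charm local-charm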

We'll look at the logs when we can, but if you have some info on the above questions, that would help.

Revision history for this message
Daniela Plascencia (dnplas) wrote :

Thanks for the follow up!

I did some more investigation:

# ---- Comparing the charms

I have downloaded the charm from CH and compared it to a locally built one, but there does not seem to be a significant difference. I only see that some packages were updated (setuptools, pip and wheel); other than that they are pretty much identical.

# ---- juju debug-logs

I have noticed, though, that I am getting this message every time I deploy from CH:

application-kubeflow-volumes: 12:50:37 WARNING juju.worker.caasoperator operator "kubeflow-volumes" is shutting down, err: failed to initialize caasoperator for "kubeflow-volumes": failed to download charm "ch:amd64/focal/kubeflow-volumes-260" from API server: Get https://10.152.183.207:17070/model/3477bc3a-fc6a-4714-82fa-ec3383e85505/charms?file=%2A&url=ch%3Aamd64%2Ffocal%2Fkubeflow-volumes-260: cannot retrieve charm: ch:amd64/focal/kubeflow-volumes-260
application-kubeflow-volumes: 12:50:37 INFO juju.worker.caasoperator.runner runner is dying
application-kubeflow-volumes: 12:50:37 ERROR juju.worker.dependency "operator" manifold worker returned unexpected error: failed to initialize caasoperator for "kubeflow-volumes": failed to download charm "ch:amd64/focal/kubeflow-volumes-260" from API server: Get https://10.152.183.207:17070/model/3477bc3a-fc6a-4714-82fa-ec3383e85505/charms?file=%2A&url=ch%3Aamd64%2Ffocal%2Fkubeflow-volumes-260: cannot retrieve charm: ch:amd64/focal/kubeflow-volumes-260

I'm not sure if that is an actual issue, as after that I get the following lines:

application-kubeflow-volumes: 12:50:40 INFO juju.worker.caasoperator.charm downloading ch:amd64/focal/kubeflow-volumes-260 from API server
application-kubeflow-volumes: 12:50:40 INFO juju.downloader downloading from ch:amd64/focal/kubeflow-volumes-260
application-kubeflow-volumes: 12:50:40 INFO juju.downloader download complete ("ch:amd64/focal/kubeflow-volumes-260")
application-kubeflow-volumes: 12:50:40 INFO juju.downloader download verified ("ch:amd64/focal/kubeflow-volumes-260")
application-kubeflow-volumes: 12:50:44 INFO juju.worker.caasoperator operator "kubeflow-volumes" started
application-kubeflow-volumes: 12:50:44 INFO juju.worker.caasoperator.runner start "kubeflow-volumes/5"
application-kubeflow-volumes: 12:50:44 INFO juju.worker.leadership kubeflow-volumes/5 promoted to leadership of kubeflow-volumes
application-kubeflow-volumes: 12:50:44 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-kubeflow-volumes-5
application-kubeflow-volumes: 12:50:44 INFO juju.worker.caasoperator.uniter.kubeflow-volumes/5 unit "kubeflow-volumes/5" started ...


Revision history for this message
Ian Booth (wallyworld) wrote :

Thanks for the logging. We do have some known issues in 3.x due to the change to use async download for charms sourced from charmhub. What might be happening is that the unit agent comes up and tries to download from the controller the charm it needs. But at the moment it does that, the controller has not yet fully finished downloading the charm blob from charmhub (the controller acts as a man in the middle to cache downloaded charms). This causes the agent to see an error. It bounces and by the time it comes back up, the charm is in the controller and the next attempt to download works. But....

The above might explain the following, which indicates there's an issue that will break things (perhaps the aborted download triggered it):

controller-0: 12:24:18 INFO juju.apiserver.connection agent login: application-kubeflow-volumes for 4fde2d04-61f5-4c0d-8806-e7a17ec28548
controller-0: 12:24:30 ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: panic resulted in: assignment to entry in nil map
controller-0: 12:24:36 ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: panic resulted in: assignment to entry in nil map
controller-0: 12:24:41 ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: panic resulted in: assignment to entry in nil map
controller-0: 12:24:42 INFO juju.state LogTailer starting oplog tailing: recent id count=10, lastTime=2024-04-11 10:24:33.097989553 +0000 UTC, minOplogTs=2024-04-11 10:23:33.097989553 +0000 UTC
controller-0: 12:24:48 ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: panic resulted in: assignment to entry in nil map
controller-0: 12:24:48 INFO juju.state LogTailer starting oplog tailing: recent id count=106, lastTime=2024-04-11 10:24:33.097989553 +0000 UTC, minOplogTs=2024-04-11 10:23:33.097989553 +0000 UTC
controller-0: 12:24:56 ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: panic resulted in: assignment to entry in nil map

Is it possible to use kubectl logs to get the stdout/stderr from the api-server container? We need to get to the source of the panic, i.e. a log file with the stack trace in it. If kubectl logs comes up empty, maybe you could exec into the container and poke around /var/log/juju.
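
Something along these lines should do it (replace <controller-name> with your controller's name; the controller-0 pod and api-server container names assume a standard k8s controller, and on microk8s you may need to prefix kubectl with "microk8s"):

kubectl -n controller-<controller-name> logs controller-0 -c api-server
# if that comes up empty, list the log files inside the container
kubectl -n controller-<controller-name> exec controller-0 -c api-server -- ls /var/log/juju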

Revision history for this message
Daniela Plascencia (dnplas) wrote :

Thanks for the follow up!

While I don't see anything very obvious in the controller's api-server container, I do see the following from `juju debug-log -l TRACE --replay` (and it also appears in the logs of the api-server container). I am attaching the full log.

application-kubeflow-volumes: 10:06:56 DEBUG juju.worker.dependency "api-caller" manifold worker stopped: [946658] "application-kubeflow-volumes" cannot open api: unable to connect to API: dial tcp 10.152.183.39:17070: connect: connection refused
stack trace:
dial tcp 10.152.183.39:17070: connect: connection refused
github.com/juju/juju/api.gorillaDialWebsocket:756:
github.com/juju/juju/api.dialer.dial1:1176:
github.com/juju/juju/api.dialer.dial:1151: unable to connect to API
github.com/juju/juju/api.dialWebsocketMulti:1047:
github.com/juju/juju/api.dialAPI:705:
github.com/juju/juju/api.Open:218:
github.com/juju/juju/worker/apicaller.connectFallback:161:
github.com/juju/juju/worker/apicaller.OnlyConnect:58:
github.com/juju/juju/worker/apicaller.ManifoldConfig.startFunc.func1:97: [946658] "application-kubeflow-volumes" cannot open api
application-kubeflow-volumes: 10:06:56 DEBUG juju.worker.apicaller connecting with old password
application-kubeflow-volumes: 10:06:56 DEBUG juju.api looked up controller-service.controller-uk8s.svc.cluster.local -> [10.152.183.39]
application-kubeflow-volumes: 10:06:57 DEBUG juju.worker.apicaller [946658] failed to connect
application-kubeflow-volumes: 10:06:57 DEBUG juju.worker.dependency "api-caller" manifold worker stopped: [946658] "application-kubeflow-volumes" cannot open api: unable to connect to API: dial tcp 10.152.183.39:17070: connect: connection refused
stack trace:
dial tcp 10.152.183.39:17070: connect: connection refused
github.com/juju/juju/api.gorillaDialWebsocket:756:
github.com/juju/juju/api.dialer.dial1:1176:
github.com/juju/juju/api.dialer.dial:1151: unable to connect to API
github.com/juju/juju/api.dialWebsocketMulti:1047:
github.com/juju/juju/api.dialAPI:705:
github.com/juju/juju/api.Open:218:
github.com/juju/juju/worker/apicaller.connectFallback:161:
github.com/juju/juju/worker/apicaller.OnlyConnect:58:
github.com/juju/juju/worker/apicaller.ManifoldConfig.startFunc.func1:97: [946658] "application-kubeflow-volumes" cannot open api
application-kubeflow-volumes: 10:06:57 ERROR juju.worker.dependency "api-caller" manifold worker returned unexpected error: [946658] "application-kubeflow-volumes" cannot open api: unable to connect to API: dial tcp 10.152.183.39:17070: connect: connection refused
model-946658c9-f947-48cb-89e1-cfc55ab360ac: 10:06:58 ERROR juju.worker.dependency "api-caller" manifold worker returned unexpected error: [946658] "model-946658c9-f947-48cb-89e1-cfc55ab360ac" cannot open api: unable to connect to API: dial tcp 10.152.183.39:17070: connect: connection refused
application-kubeflow-volumes: 10:07:01 DEBUG juju.worker.apicaller connecting with old password
application-kubeflow-volumes: 10:07:01 DEBUG juju.api looked up controller-service.controller-uk8s.svc.cluster.local -> [10.152.183.39]
application-kubeflow-volumes: 10:07:01 DEBUG juju.worker.apicaller [946658] failed to connect
application-kubeflow-volumes: 10:07:01 DE...


Revision history for this message
Ian Booth (wallyworld) wrote :

We still only have lines like:

2024-04-16T08:07:09.109Z [jujud] 2024-04-16 08:07:09 ERROR juju.worker.dependency engine.go:695 "caas-unit-provisioner" manifold worker returned unexpected error: panic resulted in: assignment to entry in nil map
2024-04-16T08:07:09.110Z [jujud] a0a72c11-233c-4cb9-898f-69d9b2f0537c: controller-0 2024-04-16 08:07:09 ERROR juju.worker.dependency engine.go:695 "caas-unit-provisioner" manifold worker returned unexpected error: panic resulted in: assignment to entry in nil map

We're missing the cause of the panic. I've looked through the code and there's nothing that jumps out as an obvious mistake.

If neither the container logs nor any log files on disk contain the panic and stack trace, then to fix this we really need a reliable way to reproduce it. We'd need to hack in a lot of extra logging etc. We may be able to artificially induce a charm download delay to trigger the issue.

As I understand it, this is very intermittent and showing up in CI rather than at a customer site. We probably won't have a resolution in the very short term - this will likely require some non-trivial time to investigate further.

tags: added: k8s
Changed in juju:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Daniela Plascencia (dnplas) wrote (last edit ):

You should be able to reproduce the issue 100% of the time if you follow the steps I added in the description. I am seeing this even on my machine, so you should be able to reproduce it locally as well. The reason I mentioned it is intermittent is that sometimes (rarely) I am able to deploy podspec charms from Charmhub, but that is not usually the case.

Revision history for this message
Ian Booth (wallyworld) wrote (last edit ):

Ah I see, thank you for the clarification and apologies for my misunderstanding of the nature of it. I was under the mistaken impression it only failed occasionally. But if it fails almost every time, then we obviously need to treat it as much higher importance.

I've assigned this to the next 3.4 release.

Changed in juju:
importance: Medium → High
milestone: none → 3.4.3
Harry Pidcock (hpidcock)
Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
Revision history for this message
Ian Booth (wallyworld) wrote :
Changed in juju:
status: Triaged → Fix Committed