Sporadic hang on installing agent

Bug #1946382 reported by Balbir Thomas
This bug affects 4 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Yang Kelvin Liu
Milestone: 2.9.22

Bug Description

It has been observed multiple times that deploying charms hangs sporadically. This is hard to reproduce and happens with many different charms. When it happens, juju status typically shows something like this:

Model Controller Cloud/Region Version SLA Timestamp
lma microk8s microk8s/localhost 2.9.15 unsupported 14:06:05+01:00

App Version Status Scale Charm Store Channel Rev OS Address Message
prometheus-k8s active 1 prometheus-k8s local 0 kubernetes 10.152.183.100
prometheus-tester waiting 0/1 prometheus-tester local 7 kubernetes installing agent

Unit Workload Agent Address Ports Message
prometheus-k8s/0* active idle 10.1.252.97
prometheus-tester/6 waiting allocating installing agent

In this case the prometheus-tester charm failed to deploy. In fact, no pod was created at all for the prometheus-tester charm, as can be seen below:

$ kubectl get all --all-namespaces | grep prometheus
lma pod/prometheus-k8s-0 2/2 Running 0 95m
lma service/prometheus-k8s-endpoints ClusterIP None <none> <none> 95m
lma service/prometheus-k8s ClusterIP 10.152.183.100 <none> 9090/TCP 94m
lma statefulset.apps/prometheus-k8s 1/1 95m
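
As a rough sketch, the absence of any workload objects for the tester can be double-checked like this (assuming the namespace matches the model name "lma", as above):

```
# Confirm that no StatefulSet or pod was ever created for prometheus-tester.
kubectl get statefulset,pods -n lma | grep prometheus-tester

# Recent events in the namespace, newest last, to spot scheduling or image errors.
kubectl get events -n lma --sort-by=.lastTimestamp | tail -n 20
```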

Attached to this bug report is the output of "juju debug-log -m controller --replay".

Revision history for this message
Balbir Thomas (balbir-thomas) wrote :
Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.9.17
status: New → Triaged
importance: Undecided → High
Revision history for this message
Joseph Phillips (manadart) wrote :

Based on observed log output, we seem not to be getting our worker started here:
https://github.com/juju/juju/blob/ff8bae62fd08e8fe3c5ea42162d1873363179cca/cmd/containeragent/unit/agent.go#L342

This looks similar to the failure observed here:
https://bugs.launchpad.net/juju/+bug/1945999

Revision history for this message
Joseph Phillips (manadart) wrote :

This does not actually appear to be related to
https://bugs.launchpad.net/juju/+bug/1945999

We are not exercising the same logic, i.e. using a runner to start a worker with the same ID as one we have just killed.

Changed in juju:
milestone: 2.9.17 → 2.9.18
Changed in juju:
milestone: 2.9.18 → 2.9.19
Revision history for this message
Joseph Phillips (manadart) wrote :

2.9.18 will have enhanced logging output for these deployments.
See the QA steps in this patch for the logging level to utilise:
https://github.com/juju/juju/pull/13461
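
As an illustrative sketch (not the exact QA steps from the patch), something along these lines can be used to raise the logging level and capture the output:

```
# Illustrative only - see the PR's QA steps for the exact logging configuration.
# Raises logging for the CAAS application provisioner worker on the controller model.
juju model-config -m controller logging-config="<root>=INFO;juju.worker.caasapplicationprovisioner=TRACE"
juju debug-log -m controller --replay
```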

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Joseph Phillips (manadart)
Changed in juju:
milestone: 2.9.19 → 2.9.20
Revision history for this message
Jon Seager (jnsgruk) wrote :

I have a controller doing this now; logs are here: https://pastebin.ubuntu.com/p/QkVV2jv5WY/

I got to this state by deploying and force-removing mongodb-k8s a couple of times, nothing else.

It has been stuck at "installing agent" for about 25 minutes now, with no logging happening.

Revision history for this message
Joseph Phillips (manadart) wrote :

Same controller with greater logging detail, and with reproduction using hello-kubecon:
https://pastebin.canonical.com/p/2SRj5hnVMR/

Interestingly, it exhibits the same storage life permission error seen here:
https://bugs.launchpad.net/juju/+bug/1950781

It is easy to get into this state by repeating the following steps:

juju deploy <sidecar-charm>
juju remove-application --destroy-storage --force --no-wait <sidecar-charm>
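
As a sketch, the reproduction can be scripted roughly like this (the charm name, iteration count and sleep are placeholders; hello-kubecon was used for the reproduction above):

```
# Repeatedly deploy and force-remove a sidecar charm until a unit gets stuck
# in "installing agent".
charm=hello-kubecon
for i in $(seq 1 5); do
    juju deploy "$charm"
    juju remove-application --destroy-storage --force --no-wait "$charm"
    sleep 30   # give the forced removal a moment to settle before redeploying
done
juju deploy "$charm"
juju status   # watch for a unit stuck at "waiting / installing agent"
```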

Changed in juju:
status: In Progress → Triaged
assignee: Joseph Phillips (manadart) → nobody
Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote (last edit ):

It seems like we have a bug in the storage provisioner worker somewhere: https://pastebin.ubuntu.com/p/CM2VvS87xJ/

Revision history for this message
Ian Booth (wallyworld) wrote :

The life permission warning turns out to be bogus - the provisioner worker was not properly accounting for the fact that storage attachments go from dying -> removed. This PR fixes that, plus an issue observed along the way: storage attachment was meant to be idempotent, but a logic error caused a spurious error to be printed when using the CLI.

https://github.com/juju/juju/pull/13516

Revision history for this message
Ian Booth (wallyworld) wrote :

This is an issue with sidecar charms only - v1 operator charms (podspec) work properly.

The issue is that when an application is removed and storage is detached (but not removed), the PVCs created for the volume mounts remain in the k8s namespace so that the associated PVs also stay bound. When a new application is deployed, these existing PVCs should be left alone and new ones created. But for sidecar charms, that's not happening and there's an attempt to reuse one of the old PVCs from the first deploy.

There's an issue, then, with setting up the PVCs for sidecar charms. E.g. here are the PVCs after removing and redeploying a v1 mariadb k8s charm, and also hello-kubecon:

https://pastebin.ubuntu.com/p/dFYvVPMjQg/

You can see that there are new PVCs for mariadb (with a new random prefix in the PVC name).
But for the sidecar charm, an existing PVC is being reused (I only deployed one new unit), when instead the existing PVC should have been left alone.
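
As a rough sketch of how to check which behaviour a redeploy produced (the namespace name below is a placeholder):

```
# Compare PVC names and creation timestamps after the redeploy: a correct redeploy
# creates a brand-new PVC (new random prefix), while the buggy sidecar path reuses
# the PVC left over from the first deploy.
namespace=testing   # placeholder: the k8s namespace matching the Juju model name
kubectl get pvc -n "$namespace" \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,STATUS:.status.phase
```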

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote (last edit ):

Found two issues:
1. Juju is not shutting down the CAAS storage provisioner worker even after the application has been removed.
2. On the second deploy, two units (unit-0 and unit-1) were created even though scale == 1, and unit-1 was then deleted immediately (I am investigating why Juju does this; see the sketch below for one way to observe it).
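
```
# Watch pod churn in the model's namespace during the second deploy; with scale == 1
# only <app>-0 is expected, so a short-lived <app>-1 pod shows the extra unit being
# created and then deleted.
namespace=testing   # placeholder: the k8s namespace matching the Juju model name
kubectl get pods -n "$namespace" -w
```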

Revision history for this message
Ian Booth (wallyworld) wrote :

Here's a fix for a storage issue which appears to be contributing to the problem.

https://github.com/juju/juju/pull/13519

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

These two PRs fixed some related issues:
https://github.com/juju/juju/pull/13527
https://github.com/juju/juju/pull/13517
But there is still a weird bug: units sometimes get terminated by the cloud immediately after being deployed. I will keep investigating.

Changed in juju:
milestone: 2.9.20 → 2.9.21
Changed in juju:
milestone: 2.9.21 → 2.9.22
Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

https://github.com/juju/juju/pull/13545 - this PR, together with the PRs above, should fix all of the related problems.

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Pietro Pasotti (ppasotti) wrote :

I think I have just experienced the same issue. I was working on a model called 'char' and deployed and removed (remove-application) a charm I'm developing several times, until at some point this issue started occurring and I could not deploy anything else in that model.

Attempting to destroy the model also left it hanging in an unresponsive 'destroying' state.

juju debug-log --replay:
https://pastebin.canonical.com/p/Hk2yzRfkkJ/

application source (for the container image) is at:
https://github.com/PietroPasotti/char
charm source (don't judge :D) is at:
https://github.com/PietroPasotti/char-operator

Disclaimer: the version of the charm source that caused the last error you see in the logs is NOT the one you see on GitHub. I did not commit that specific version because it was broken, so it is impossible to get the exact source on which that error occurred.

Revision history for this message
Joseph Phillips (manadart) wrote (last edit ):

Pietro, what version of Juju was this on?
Edit: Nevermind; 2.9.22 according to the pastebin.

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

Hi Pietro
From the log, I can only see the charm error.
You said `I couldn't deploy anything anymore in that model.` - what happened when you tried to deploy a different application?
Could you provide the juju status output, or the error Juju gave when rejecting the new deployment?

When destroying the model, did you use `--force`?

Revision history for this message
Balbir Thomas (balbir-thomas) wrote :

Hi, I am still seeing this issue on Juju 2.9.22 (actually twice today). Attached is a log from today. The model is still responsive, though, since I was able to scale and remove the affected application. I hope these logs help a bit more; let me know if I can help in any other way.

It may or may not be pertinent to note that in this case the charm has no workload. The first (leader) unit was not hung in the "installing agent" state; however, when scaling the application, the subsequent units were hung in that state. Since this is a workload-less charm, the subsequent units add no value and hence are set to WaitingStatus with a message. I do not see any reason why just setting a Waiting status should cause such an issue with Pebble initialization. Besides, given that the model was still responsive, I wonder if the "hang" is just a red herring and what I observed today is only a matter of the "installing agent" message not being cleared. Some additional information:

1. The kubectl status of the affected unit's pod was "Running".
2. The kubectl description of all the containers in the pod looked normal:

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal Scheduled 8m13s default-scheduler Successfully assigned lma/prometheus-scrape-config-k8s-1 to knuth
  Normal Pulled 8m13s kubelet Container image "jujusolutions/jujud-operator:2.9.22" already present on machine
  Normal Created 8m13s kubelet Created container charm-init
  Normal Started 8m13s kubelet Started container charm-init
  Normal Pulled 8m12s kubelet Container image "jujusolutions/charm-base:ubuntu-20.04" already present on machine
  Normal Created 8m12s kubelet Created container charm
  Normal Started 8m12s kubelet Started container charm

3. There were no errors in the logs of the affected charm container ("kubectl -n lma logs prometheus-scrape-config-k8s-1 -c charm | grep ERROR")
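
As a rough sketch, one more way to distinguish a stale status message from a genuinely stuck agent (using the unit above):

```
# Status history for the stuck unit; if the agent did start, entries should
# appear after the initial "installing agent" message.
juju show-status-log prometheus-scrape-config-k8s/1

# Per-container state that Kubernetes reports for the pod, for comparison.
kubectl -n lma get pod prometheus-scrape-config-k8s-1 \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state}{"\n"}{end}'
```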

Revision history for this message
John A Meinel (jameinel) wrote :

@Balbir, how are you defining "hang"? Were the pods not running workloads, or was it just that you saw that message in 'juju status'? (From your message it sounds like the latter.)
I believe the original part of the bug was about applications not actually starting correctly, but it doesn't sound like you were running into that. Can you confirm whether the container and the application actually started correctly?

Revision history for this message
Balbir Thomas (balbir-thomas) wrote :

@jameinel You are indeed correct. In my latest report the "hang" referred only to the fact that the status message did not refresh and remained "installing agent", while the model was responsive. However, since the additional units of the workload-less charm did not perform any useful function, I could not say for sure whether the units were "operational". If I recall correctly, the "charm" container in the units was indeed operational.

Revision history for this message
John A Meinel (jameinel) wrote :

We encountered something like this in PS5-beta where we have a model 'stg-harbor' that is running on Kubernetes, and it is trying to deploy 3 different applications:
```
stg-harbor@launchpad-bastion-ps5:~$ jsft
Model Controller Cloud/Region Version SLA Timestamp
stg-harbor prodstack-is-beta k8s-stg-lp/default 2.9.18 unsupported 16:23:12Z

App Version Status Scale Charm Store Channel Rev OS Address Message
harbor-registry waiting 0/1 harbor-registry charmhub edge 8 kubernetes installing agent
harbor-registry-three waiting 0/1 harbor-registry charmhub edge 8 kubernetes installing agent
harbor-registry-two waiting 0/1 harbor-registry charmhub edge 8 kubernetes installing agent

Unit Workload Agent Address Ports Message
harbor-registry-three/0 waiting allocating installing agent
harbor-registry-two/0 waiting allocating installing agent
harbor-registry/0 waiting allocating installing agent
```

Inside the Juju logs we see a lot of lines about:
```
2022-02-24 15:29:16 INFO juju.apiserver.common password.go:100 setting password for "application-harbor-registry-three"
2022-02-24 15:29:19 INFO juju.apiserver.common password.go:100 setting password for "application-harbor-registry-three"
2022-02-24 15:29:22 INFO juju.apiserver.common password.go:100 setting password for "application-harbor-registry-three"
```

That should be the Juju provisioner creating a password for the charm that is about to spin up, and then using it to tell Kubernetes to provision a pod with that password.
However, it appears to be spinning, and 'kubectl' on that namespace shows only the model-operator pod.
This would indicate that we are getting some sort of error while trying to provision, but not actually logging anything associated with it (maybe the logs are in one of the 'model' logs).
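
As a sketch, something like this should surface those errors if they are going to the model logs (module name taken from the log lines quoted later in this bug):

```
# Tail the model's logs and filter for the application provisioner worker,
# which is what (re)starts the "installing agent" deployments.
juju debug-log -m stg-harbor --replay \
  --include-module juju.worker.caasapplicationprovisioner \
  --level INFO
```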

Revision history for this message
John A Meinel (jameinel) wrote :

We are seeing this failure:
ERROR "caas-broker-tracker" manifold worker returned unexpected error: cannot create caas broker: unable to determine legacy status for namespace stg-harbor: etcdserver: request timed out

This would hint towards the credential/cloud definition for that model being incorrect, causing our requests against Kubernetes to fail.

Revision history for this message
John A Meinel (jameinel) wrote :

That was in 'machine-0.log'. Looking in the model logs, we find:
2022-02-24 16:40:16 INFO juju.worker.caasapplicationprovisioner.runner runner.go:546 restarting "harbor-registry-three" in 3s
2022-02-24 16:40:19 INFO juju.worker.caasapplicationprovisioner.runner runner.go:556 start "harbor-registry-three"
2022-02-24 16:40:22 INFO juju.worker.caasapplicationprovisioner.runner runner.go:587 stopped "harbor-registry-three", err: getting OCI image resources: unable to fetch OCI image resources for harbor-registry-three: while getting resource from the charm store: resource "registry-image" not found
2022-02-24 16:40:22 ERROR juju.worker.caasapplicationprovisioner.runner runner.go:470 exited "harbor-registry-three": getting OCI image resources: unable to fetch OCI image resources for harbor-registry-three: while getting resource from the charm store: resource "registry-image" not found
2022-02-24 16:40:22 INFO juju.worker.caasapplicationprovisioner.runner runner.go:546 restarting "harbor-registry-three" in 3s

So, at the very least, the code that reports "installing agent" for the application should *definitely* be reporting "could not find the resource" as a hard provisioning failure, rather than bouncing around it.
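
As a sketch, in the meantime the missing resource can be confirmed directly from the CLI:

```
# Resources attached to the deployed application; "registry-image" should
# appear here with a usable revision or origin...
juju resources harbor-registry-three

# ...versus the resources the charm declares in the store, for comparison.
juju charm-resources harbor-registry --channel edge
```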
