Sporadic hang on installing agent

Bug #1946382 reported by Balbir Thomas
This bug affects 4 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Yang Kelvin Liu
Milestone: 2.9.22

Bug Description

It has been observed multiple times that deploying charms hangs sporadically. This is hard to reproduce and happens with many different charms. When it happens, juju status typically shows something like this:

Model Controller Cloud/Region Version SLA Timestamp
lma microk8s microk8s/localhost 2.9.15 unsupported 14:06:05+01:00

App Version Status Scale Charm Store Channel Rev OS Address Message
prometheus-k8s active 1 prometheus-k8s local 0 kubernetes 10.152.183.100
prometheus-tester waiting 0/1 prometheus-tester local 7 kubernetes installing agent

Unit Workload Agent Address Ports Message
prometheus-k8s/0* active idle 10.1.252.97
prometheus-tester/6 waiting allocating installing agent

In this case the prometheus-tester charm failed to deploy. In fact, no pod was created at all for the prometheus-tester charm, as can be seen below:

$ kubectl get all --all-namespaces | grep prometheus
lma pod/prometheus-k8s-0 2/2 Running 0 95m
lma service/prometheus-k8s-endpoints ClusterIP None <none> <none> 95m
lma service/prometheus-k8s ClusterIP 10.152.183.100 <none> 9090/TCP 94m
lma statefulset.apps/prometheus-k8s 1/1 95m
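
As a rough sketch, the absence of any workload objects for the tester can be double-checked like this (assuming the namespace matches the model name "lma", as above):

```
# Confirm that no StatefulSet or pod was ever created for prometheus-tester.
kubectl get statefulset,pods -n lma | grep prometheus-tester

# Recent events in the namespace, newest last, to spot scheduling or image errors.
kubectl get events -n lma --sort-by=.lastTimestamp | tail -n 20
```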

Attached to this bug report is the output of "juju debug-log -m controller --replay".

Revision history for this message
Balbir Thomas (balbir-thomas) wrote :
Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.9.17
status: New → Triaged
importance: Undecided → High
Revision history for this message
Joseph Phillips (manadart) wrote :

Based on observed log output, we seem not to be getting our worker started here:
https://github.com/juju/juju/blob/ff8bae62fd08e8fe3c5ea42162d1873363179cca/cmd/containeragent/unit/agent.go#L342

This looks similar to the failure observed here:
https://bugs.launchpad.net/juju/+bug/1945999

Revision history for this message
Joseph Phillips (manadart) wrote :

This does not actually appear to be related to
https://bugs.launchpad.net/juju/+bug/1945999

We are not exercising the same logic, i.e. using a runner to start a worker with the same ID as one we have just killed.

Changed in juju:
milestone: 2.9.17 → 2.9.18
Changed in juju:
milestone: 2.9.18 → 2.9.19
Revision history for this message
Joseph Phillips (manadart) wrote :

2.9.18 will have enhanced logging output for these deployments.
See the QA steps in this patch for the logging level to utilise:
https://github.com/juju/juju/pull/13461
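
As an illustrative sketch (not the exact QA steps from the patch), something along these lines can be used to raise the logging level and capture the output:

```
# Illustrative only - see the PR's QA steps for the exact logging configuration.
# Raises logging for the CAAS application provisioner worker on the controller model.
juju model-config -m controller logging-config="<root>=INFO;juju.worker.caasapplicationprovisioner=TRACE"
juju debug-log -m controller --replay
```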

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Joseph Phillips (manadart)
Changed in juju:
milestone: 2.9.19 → 2.9.20
Revision history for this message
Jon Seager (jnsgruk) wrote :

I have a controller doing this now; logs are here: https://pastebin.ubuntu.com/p/QkVV2jv5WY/

I got to this state by deploying and force-removing mongodb-k8s a couple of times, nothing else.

It has been stuck at "installing agent" for about 25 minutes now, with no logging happening.

Revision history for this message
Joseph Phillips (manadart) wrote :

Same controller with greater logging detail, and with reproduction using hello-kubecon:
https://pastebin.canonical.com/p/2SRj5hnVMR/

Interestingly, it exhibits the same storage life permission error seen here:
https://bugs.launchpad.net/juju/+bug/1950781

It is easy to get into this state by repeating the following steps:

juju deploy <sidecar-charm>
juju remove-application --destroy-storage --force --no-wait <sidecar-charm>
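
As a sketch, the reproduction can be scripted roughly like this (the charm name, iteration count and sleep are placeholders; hello-kubecon was used for the reproduction above):

```
# Repeatedly deploy and force-remove a sidecar charm until a unit gets stuck
# in "installing agent".
charm=hello-kubecon
for i in $(seq 1 5); do
    juju deploy "$charm"
    juju remove-application --destroy-storage --force --no-wait "$charm"
    sleep 30   # give the forced removal a moment to settle before redeploying
done
juju deploy "$charm"
juju status   # watch for a unit stuck at "waiting / installing agent"
```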

Changed in juju:
status: In Progress → Triaged
assignee: Joseph Phillips (manadart) → nobody
Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote (last edit ):

It seems like we have a bug in the storage provisioner worker somewhere: https://pastebin.ubuntu.com/p/CM2VvS87xJ/

Revision history for this message
Ian Booth (wallyworld) wrote :

The life permission warning turns out to be bogus - the provisioner worker was not properly accounting for the fact that storage attachments go from dying -> removed. This PR fixes that, plus an issue observed along the way: storage attachment was meant to be idempotent, but a logic error caused a spurious error to be printed when using the CLI.

https://github.com/juju/juju/pull/13516

Revision history for this message
Ian Booth (wallyworld) wrote :

This is an issue with sidecar charms only - v1 operator charms (podspec) work properly.

The issue is that when an application is removed and storage is detached (but not removed), the PVCs created for the volume mounts remain in the k8s namespace so that the associated PVs also stay bound. When a new application is deployed, these existing PVCs should be left alone and new ones created. But for sidecar charms, that's not happening and there's an attempt to reuse one of the old PVCs from the first deploy.

There's an issue, then, with setting up the PVCs for sidecar charms. E.g. here are the PVCs after removing and redeploying a v1 mariadb k8s charm, and also hello-kubecon:

https://pastebin.ubuntu.com/p/dFYvVPMjQg/

You can see that there are new PVCs for mariadb (with a new random prefix in the PVC name).
But for the sidecar charm, an existing PVC is being reused (I only deployed one new unit), when instead the existing PVC should have been left alone.
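
As a rough sketch of how to check which behaviour a redeploy produced (the namespace name below is a placeholder):

```
# Compare PVC names and creation timestamps after the redeploy: a correct redeploy
# creates a brand-new PVC (new random prefix), while the buggy sidecar path reuses
# the PVC left over from the first deploy.
namespace=testing   # placeholder: the k8s namespace matching the Juju model name
kubectl get pvc -n "$namespace" \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,STATUS:.status.phase
```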

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote (last edit ):

Found two issues:
1. Juju is not shutting down the CAAS storage provisioner worker even after the application has been removed.
2. On the second deploy, two units (unit-0 and unit-1) were created even though scale == 1, and unit-1 was then deleted immediately (I am investigating why Juju does this; see the sketch below for one way to observe it).
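
```
# Watch pod churn in the model's namespace during the second deploy; with scale == 1
# only <app>-0 is expected, so a short-lived <app>-1 pod shows the extra unit being
# created and then deleted.
namespace=testing   # placeholder: the k8s namespace matching the Juju model name
kubectl get pods -n "$namespace" -w
```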

Revision history for this message
Ian Booth (wallyworld) wrote :

Here's a fix for a storage issue which appears to be contributing to the problem.

https://github.com/juju/juju/pull/13519

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

These two PRs fixed some related issues:
https://github.com/juju/juju/pull/13527
https://github.com/juju/juju/pull/13517
But there is still a weird bug: units sometimes get terminated by the cloud immediately after being deployed. I will keep investigating.

Changed in juju:
milestone: 2.9.20 → 2.9.21
Changed in juju:
milestone: 2.9.21 → 2.9.22
Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

https://github.com/juju/juju/pull/13545 - this PR, together with the PRs above, should fix all of the related problems.

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Pietro Pasotti (ppasotti) wrote :

I think I have just experienced the same issue. I was working on a model called 'char' and deployed and removed (remove-application) a charm I'm developing several times, until at some point this issue started occurring and I could not deploy anything else in that model.

Attempting to destroy the model also left it hanging in an unresponsive 'destroying' state.

juju debug-log --replay:
https://pastebin.canonical.com/p/Hk2yzRfkkJ/

application source (for the container image) is at:
https://github.com/PietroPasotti/char
charm source (don't judge :D) is at:
https://github.com/PietroPasotti/char-operator

Disclaimer: the version of the charm source that caused the last error you see in the logs is NOT the one you see on GitHub. I did not commit that specific version because it was broken, so it is impossible to get the exact source on which that error occurred.

Revision history for this message
Joseph Phillips (manadart) wrote (last edit ):

Pietro, what version of Juju was this on?
Edit: Nevermind; 2.9.22 according to the pastebin.

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

Hi Pietro
From the log, I can only see the charm error.
You said `I couldn't deploy anything anymore in that model.` - what happened when you tried to deploy a different application?
Could you provide the juju status output, or the error Juju gave when rejecting the new deployment?

When destroying the model, did you use `--force`?

Revision history for this message
Balbir Thomas (balbir-thomas) wrote :

Hi, I am still seeing this issue on Juju 2.9.22 (actually twice today). Attached is a log from today. The model is still responsive, though, since I was able to scale and remove the affected application. I hope these logs help a bit more; let me know if I can help in any other way.

It may or may not be pertinent to note that in this case the charm has no workload. The first (leader) unit was not hung in the "installing agent" state; however, when scaling the application, the subsequent units were hung in that state. Since this is a workload-less charm, the subsequent units add no value and hence are set to WaitingStatus with a message. I do not see any reason why just setting a Waiting status should cause such an issue with Pebble initialization. Besides, given that the model was still responsive, I wonder if the "hang" is just a red herring and what I observed today is only a matter of the "installing agent" message not being cleared. Some additional information:

1. The kubectl status of the affected unit's pod was "Running".
2. The kubectl description of all the containers in the pod looked normal:

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal Scheduled 8m13s default-scheduler Successfully assigned lma/prometheus-scrape-config-k8s-1 to knuth
  Normal Pulled 8m13s kubelet Container image "jujusolutions/jujud-operator:2.9.22" already present on machine
  Normal Created 8m13s kubelet Created container charm-init
  Normal Started 8m13s kubelet Started container charm-init
  Normal Pulled 8m12s kubelet Container image "jujusolutions/charm-base:ubuntu-20.04" already present on machine
  Normal Created 8m12s kubelet Created container charm
  Normal Started 8m12s kubelet Started container charm

3. There were no errors in the logs of the affected charm container ("kubectl -n lma logs prometheus-scrape-config-k8s-1 -c charm | grep ERROR")
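
As a rough sketch, one more way to distinguish a stale status message from a genuinely stuck agent (using the unit above):

```
# Status history for the stuck unit; if the agent did start, entries should
# appear after the initial "installing agent" message.
juju show-status-log prometheus-scrape-config-k8s/1

# Per-container state that Kubernetes reports for the pod, for comparison.
kubectl -n lma get pod prometheus-scrape-config-k8s-1 \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state}{"\n"}{end}'
```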

Revision history for this message
John A Meinel (jameinel) wrote :

@Balbir, how are you defining "hang"? Were the pods not running workloads, or was it just that you saw that message in 'juju status'? (From your message it sounds like the latter.)
I believe the original part of the bug was about applications not actually starting correctly, but it doesn't sound like you were running into that. Can you confirm whether the container and the application actually started correctly?

Revision history for this message
Balbir Thomas (balbir-thomas) wrote :

@jameinel You are indeed correct. In my latest report the "hang" referred only to the fact that the status message did not refresh and remained "installing agent", while the model was responsive. However, since the additional units of the workload-less charm did not perform any useful function, I could not say for sure whether the units were "operational". If I recall correctly, the "charm" container in the units was indeed operational.

Revision history for this message
John A Meinel (jameinel) wrote :

We encountered something like this in PS5-beta where we have a model 'stg-harbor' that is running on Kubernetes, and it is trying to deploy 3 different applications:
```
stg-harbor@launchpad-bastion-ps5:~$ jsft
Model Controller Cloud/Region Version SLA Timestamp
stg-harbor prodstack-is-beta k8s-stg-lp/default 2.9.18 unsupported 16:23:12Z

App Version Status Scale Charm Store Channel Rev OS Address Message
harbor-registry waiting 0/1 harbor-registry charmhub edge 8 kubernetes installing agent
harbor-registry-three waiting 0/1 harbor-registry charmhub edge 8 kubernetes installing agent
harbor-registry-two waiting 0/1 harbor-registry charmhub edge 8 kubernetes installing agent

Unit Workload Agent Address Ports Message
harbor-registry-three/0 waiting allocating installing agent
harbor-registry-two/0 waiting allocating installing agent
harbor-registry/0 waiting allocating installing agent
```

Inside the Juju logs we see a lot of lines about:
```
2022-02-24 15:29:16 INFO juju.apiserver.common password.go:100 setting password for "application-harbor-registry-three"
2022-02-24 15:29:19 INFO juju.apiserver.common password.go:100 setting password for "application-harbor-registry-three"
2022-02-24 15:29:22 INFO juju.apiserver.common password.go:100 setting password for "application-harbor-registry-three"
```

That should be the Juju provisioner creating a password for the charm that is about to spin up, and then using it to tell Kubernetes to provision a pod with that password.
However, it appears to be spinning, and 'kubectl' on that namespace shows only the model-operator pod.
This would indicate that we are getting some sort of error while trying to provision, but not actually logging anything associated with it (maybe the logs are in one of the 'model' logs).
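
As a sketch, something like this should surface those errors if they are going to the model logs (module name taken from the log lines quoted later in this bug):

```
# Tail the model's logs and filter for the application provisioner worker,
# which is what (re)starts the "installing agent" deployments.
juju debug-log -m stg-harbor --replay \
  --include-module juju.worker.caasapplicationprovisioner \
  --level INFO
```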

Revision history for this message
John A Meinel (jameinel) wrote :

We are seeing this failure:
ERROR "caas-broker-tracker" manifold worker returned unexpected error: cannot create caas broker: unable to determine legacy status for namespace stg-harbor: etcdserver: request timed out

This would hint towards the credential/cloud definition for that model being incorrect, causing our requests against Kubernetes to fail.

Revision history for this message
John A Meinel (jameinel) wrote :

That was in 'machine-0.log'. Looking in the model logs, we find:
2022-02-24 16:40:16 INFO juju.worker.caasapplicationprovisioner.runner runner.go:546 restarting "harbor-registry-three" in 3s
2022-02-24 16:40:19 INFO juju.worker.caasapplicationprovisioner.runner runner.go:556 start "harbor-registry-three"
2022-02-24 16:40:22 INFO juju.worker.caasapplicationprovisioner.runner runner.go:587 stopped "harbor-registry-three", err: getting OCI image resources: unable to fetch OCI image resources for harbor-registry-three: while getting resource from the charm store: resource "registry-image" not found
2022-02-24 16:40:22 ERROR juju.worker.caasapplicationprovisioner.runner runner.go:470 exited "harbor-registry-three": getting OCI image resources: unable to fetch OCI image resources for harbor-registry-three: while getting resource from the charm store: resource "registry-image" not found
2022-02-24 16:40:22 INFO juju.worker.caasapplicationprovisioner.runner runner.go:546 restarting "harbor-registry-three" in 3s

So, at the very least, the code that reports "installing agent" for the application should *definitely* be reporting "could not find the resource" as a hard provisioning failure, rather than bouncing around it.
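
As a sketch, in the meantime the missing resource can be confirmed directly from the CLI:

```
# Resources attached to the deployed application; "registry-image" should
# appear here with a usable revision or origin...
juju resources harbor-registry-three

# ...versus the resources the charm declares in the store, for comparison.
juju charm-resources harbor-registry --channel edge
```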
