Upgrading Juju Controller and Model from 2.9.34 to 2.9.37 breaks deployed charms
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | High | Thomas Miller | 2.9.38
Bug Description
Recently, we upgraded the Juju Controller and Model from 2.9.34 to 2.9.37, and all of the charms went down:
```
Model Controller Cloud/Region Version SLA Timestamp
finos-legend finos-legend finos-legend 2.9.37 unsupported 14:14:10Z
App Version Status Scale Charm Channel Rev Address Exposed Message
certbot-k8s waiting 0/1 certbot-k8s edge 12 10.100.168.45 no waiting for units to settle down
gitlab-integrator waiting 0/1 finos-legend-
legend-db waiting 0/1 finos-legend-db-k8s latest/edge 13 10.100.215.55 no waiting for units to settle down
legend-engine waiting 0/1 finos-legend-
legend-ingress waiting 0/1 nginx-ingress-
legend-sdlc waiting 0/1 finos-legend-
legend-studio waiting 0/1 finos-legend-
mongodb active 1 mongodb-k8s latest/edge 16 10.100.31.68 no
Unit Workload Agent Address Ports Message
certbot-k8s/0 error lost 192.168.23.1 crash loop backoff: back-off 5m0s restarting failed container=
gitlab-integrator/0 error lost 192.168.81.251 crash loop backoff: back-off 5m0s restarting failed container=charm pod=gitlab-
legend-db/0 error lost 192.168.56.214 crash loop backoff: back-off 5m0s restarting failed container=
legend-engine/0 error lost 192.168.77.249 crash loop backoff: back-off 5m0s restarting failed container=charm pod=legend-
legend-ingress/0* error lost 192.168.73.143 crash loop backoff: back-off 5m0s restarting failed container=charm pod=legend-
legend-sdlc/0 error lost 192.168.24.28 crash loop backoff: back-off 5m0s restarting failed container=charm pod=legend-
legend-studio/0* error lost 192.168.30.70 crash loop backoff: back-off 5m0s restarting failed container=charm pod=legend-
```
Running ``juju debug-log`` or ``juju show-status-log`` does not reveal any useful information:
```
juju debug-log --include certbot-k8s/0
juju show-status-log certbot-k8s/0
Time Type Status Message
21 Nov 2022 13:39:33Z juju-unit error crash loop backoff: back-off 5m0s restarting failed container=
21 Nov 2022 13:44:26Z juju-unit error container error:
21 Nov 2022 13:44:39Z juju-unit error crash loop backoff: back-off 5m0s restarting failed container=
21 Nov 2022 13:54:41Z juju-unit error container error:
```
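For completeness, the unit's full log history can be replayed at a lower level to double-check (a sketch using standard ``juju debug-log`` flags):
```
# Replay the unit's entire log history at DEBUG level
juju debug-log --include certbot-k8s/0 --replay --level DEBUG
```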
But, if we look at the Kubernetes pod logs, we can see some useful information:
```
kubectl logs pod/certbot-k8s-0 -n finos-legend charm-init
ERROR option provided but not defined: --containeragen
```
It seems an additional argument was added to the Kubernetes StatefulSet; however, the ``charm-init`` container's image was **not** updated, and it still points at the old 2.9.34 ``jujusolutions`` image:
```
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    app.
    controller.
    juju.
    model.
  creationTimestamp:
  generation: 15
  labels:
    app.
    app.
  name: certbot-k8s
  namespace: finos-legend
  resourceVersion: "63436553"
  uid: 4c6aee14-
spec:
  podManagementPolicy:
  replicas: 1
  revisionHistoryLimit:
  selector:
    matchLabels:
      app.
  serviceName: certbot-
  template:
    metadata:
      annotations:
        creationTimestamp:
      labels:
    spec:
      automountServiceAccountToken:
      containers:
      - args:
        - run
        - --http
        - :38812
        - --verbose
        command:
        - /charm/bin/pebble
        env:
        - name: JUJU_CONTAINER_
          value: certbot-nginx
        - name: HTTP_PROBE_PORT
          value: "3856"
        image: jujusolutions/
        livenessProbe:
          httpGet:
            path: /v1/health?
            port: 38812
            scheme: HTTP
        name: charm
        readinessProbe:
          httpGet:
            path: /v1/health?
            port: 38812
            scheme: HTTP
        resources: {}
        startupProbe:
          httpGet:
            path: /startup
            port: 3856
            scheme: HTTP
        volumeMounts:
        - mountPath: /var/lib/
          name: charm-data
          subPath: containeragent/
        - mountPath: /charm/bin
          name: charm-data
          readOnly: true
          subPath: charm/bin
        - mountPath: /var/lib/juju
          name: charm-data
          subPath: var/lib/juju
        - mountPath: /charm/containers
          name: charm-data
          subPath: charm/containers
        workingDir: /var/lib/juju
      - args:
        - run
        - --create-dirs
        - --hold
        - --http
        - :38813
        - --verbose
        command:
        - /charm/bin/pebble
        env:
        - name: JUJU_CONTAINER_NAME
          value: certbot-nginx
        - name: PEBBLE_SOCKET
          value: /charm/
        image: registry.
        livenessProbe:
          httpGet:
            path: /v1/health?
            port: 38813
            scheme: HTTP
        name: certbot-nginx
        readinessProbe:
          httpGet:
            path: /v1/health?
            port: 38813
            scheme: HTTP
        resources: {}
        volumeMounts:
        - mountPath: /charm/bin/pebble
          name: charm-data
          readOnly: true
          subPath: charm/bin/pebble
        - mountPath: /charm/container
          name: charm-data
          subPath: charm/container
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: certbot-
      initContainers:
      - args:
        - init
        - --containeragen
        - /containeragent
        - --charm-
        - "3"
        - --data-dir
        - /var/lib/juju
        - --bin-dir
        - /charm/bin
        command:
        - /opt/containeragent
        env:
        - name: JUJU_CONTAINER_
          value: certbot-nginx
        - name: JUJU_K8S_POD_NAME
        - name: JUJU_K8S_POD_UUID
        envFrom:
        - secretRef:
            name: certbot-
        image: jujusolutions/
        name: charm-init
        resources: {}
        volumeMounts:
        - mountPath: /var/lib/juju
          name: charm-data
          subPath: var/lib/juju
        - mountPath: /containeragent
          name: charm-data
          subPath: containeragent/
        - mountPath: /charm/bin
          name: charm-data
          subPath: charm/bin
        - mountPath: /charm/containers
          name: charm-data
          subPath: charm/containers
        workingDir: /var/lib/juju
      restartPolicy:
      schedulerName:
      securityContext:
      serviceAccount:
      serviceAccountName:
      terminationGracePeriodSeconds:
      volumes:
      - emptyDir: {}
        name: charm-data
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
status:
  collisionCount: 0
  currentRevision: certbot-
  observedGeneration:
  replicas: 1
  updateRevision: certbot-
  updatedReplicas: 1
```
NOTE: There are a few ``juju.is/version: 2.9.34`` labels in the statefulset spec as well (not included above).
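To spot the stale image without dumping the whole spec, a one-liner like this works (a sketch; it assumes ``charm-init`` is the first entry in ``initContainers``):
```
# Print only the charm-init image currently set on the StatefulSet
kubectl get statefulset certbot-k8s -n finos-legend \
  -o jsonpath='{.spec.template.spec.initContainers[0].image}'
```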
This issue **can** be resolved by simply upgrading the charm through ``juju refresh``; however, that will not work if the charm is already at the latest revision:
```
juju refresh certbot-k8s
charm "certbot-k8s": already up-to-date
```
A workaround is to update the StatefulSet itself, setting the ``charm-init`` container image to the ``jujusolutions`` image that matches the upgraded controller version.
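For example, a JSON patch along these lines (a sketch, not verified here; it assumes ``charm-init`` is the first init container, and ``jujusolutions/jujud-operator:2.9.37`` is an assumed target tag matching the controller version):
```
# Point charm-init at the operator image matching the upgraded controller
# (the image tag below is an assumption; adjust to your controller version)
kubectl patch statefulset certbot-k8s -n finos-legend --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/initContainers/0/image", "value": "jujusolutions/jujud-operator:2.9.37"}]'
```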
However, that is not the only problem.
The other charms were refreshed and did not have this issue: their ``charm-init`` container image was up to date. However, those charms periodically become ``Active``, and after about a minute they fall back into an ``Error`` state. ``juju debug-log`` doesn't reveal much information in this case either:
```
unit-certbot-k8s-0: 14:47:51 INFO juju.cmd running containerAgent [2.9.37 fd867c0a2675913
unit-certbot-k8s-0: 14:47:51 INFO juju.cmd.
unit-certbot-k8s-0: 14:47:51 INFO juju.worker.
unit-certbot-k8s-0: 14:47:51 INFO juju.worker.
unit-certbot-k8s-0: 14:47:51 INFO juju.api cannot resolve "controller-
unit-certbot-k8s-0: 14:47:51 INFO juju.api connection established to "wss://
unit-certbot-k8s-0: 14:47:51 INFO juju.worker.
unit-certbot-k8s-0: 14:47:51 INFO juju.worker.
unit-certbot-k8s-0: 14:47:51 INFO juju.worker.logger logger worker started
unit-certbot-k8s-0: 14:47:51 WARNING juju.worker.
```
``juju show-status-log`` doesn't show too much information either:
```
21 Nov 2022 14:48:40Z juju-unit error crash loop backoff: back-off 2m40s restarting failed container=charm pod=certbot-
21 Nov 2022 14:48:40Z workload maintenance
21 Nov 2022 14:51:28Z juju-unit executing running start hook
21 Nov 2022 14:51:29Z juju-unit idle
21 Nov 2022 14:51:33Z juju-unit executing running certbot-
21 Nov 2022 14:51:33Z workload active
21 Nov 2022 14:51:33Z juju-unit idle
21 Nov 2022 14:52:08Z workload maintenance stopping charm software
21 Nov 2022 14:52:08Z juju-unit executing running stop hook
21 Nov 2022 14:52:10Z juju-unit error crash loop backoff: back-off 5m0s restarting failed container=charm pod=certbot-
21 Nov 2022 14:52:10Z workload maintenance
```
Looking at the Kubernetes pod logs, we can actually see a bit more information, including the fact that the pod was terminated / received a termination signal:
```
kubectl logs pod/certbot-k8s-0 -n finos-legend
Defaulted container "charm" out of: charm, certbot-nginx, charm-init (init)
[2022-11-21T... log lines truncated in the original report]
```
By running ``kubectl describe`` on the pod, we can see the events:
```
Normal Created 13m kubelet Created container charm-init
Normal Started 13m kubelet Started container charm-init
Normal Started 13m kubelet Started container certbot-nginx
Normal Pulled 13m kubelet Container image "registry.
Normal Created 13m kubelet Created container certbot-nginx
Normal Created 11m (x3 over 13m) kubelet Created container charm
Normal Started 11m (x3 over 13m) kubelet Started container charm
Normal Pulled 11m (x4 over 13m) kubelet Container image "jujusolutions/
Normal Killing 11m (x3 over 12m) kubelet Container charm failed startup probe, will be restarted
Warning BackOff 8m35s kubelet Back-off restarting failed container
Warning Unhealthy 3m6s (x15 over 12m) kubelet Startup probe failed: Get "http://
```
Because the startup probe keeps failing, the pod eventually gets restarted by Kubernetes.
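The leftover probe can be inspected directly on the StatefulSet (a sketch; it assumes the ``charm`` container is the first entry in ``containers``):
```
# Show the startup probe still present on the charm container
kubectl get statefulset certbot-k8s -n finos-legend \
  -o jsonpath='{.spec.template.spec.containers[0].startupProbe}'
```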
In parallel, I deployed a new instance of that charm in a different model, and it had no issues. I then compared the Pod specs between the old (now broken) Pod and the new Pod, and saw that the new Pod does not have any startup probe (see the old Pod spec above). If we update the Kubernetes StatefulSet for the charm and remove the startup probe, the charm becomes and stays ``Active``.
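One way to apply that change is a JSON patch along these lines (again a sketch, with the same first-container assumption as above):
```
# Drop the stale startup probe from the charm container;
# the StatefulSet controller then rolls the pod with the updated spec
kubectl patch statefulset certbot-k8s -n finos-legend --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/startupProbe"}]'
```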
tl;dr: Upgrading a Juju Controller and Model results in the charms' Kubernetes StatefulSets being updated (an additional ``--containeragen...`` argument is passed to the ``charm-init`` container). If the ``charm-init`` image is not updated to match, the init container fails with an unknown-option error; and even for refreshed charms, the stale startup probe left in the StatefulSet causes Kubernetes to keep restarting the charm container, leaving the units in an error state.
tags: added: canonical-is-upgrades
Changed in juju:
  status: New → Triaged
  importance: Undecided → High
  milestone: none → 2.9.38
  assignee: nobody → Thomas Miller (tlmiller)
Changed in juju:
  status: Fix Committed → Fix Released
This follows: https://github.com/juju/juju/pull/14635