Upgrading Juju Controller and Model from 2.9.34 to 2.9.37 breaks deployed charms

Bug #1997253 reported by Claudiu Belu
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Thomas Miller
Milestone: 2.9.38

Bug Description

We recently upgraded the Juju Controller and Model from 2.9.34 to 2.9.37, and all of the deployed charms went down as a result:

```
Model Controller Cloud/Region Version SLA Timestamp
finos-legend finos-legend finos-legend 2.9.37 unsupported 14:14:10Z

App Version Status Scale Charm Channel Rev Address Exposed Message
certbot-k8s waiting 0/1 certbot-k8s edge 12 10.100.168.45 no waiting for units to settle down
gitlab-integrator waiting 0/1 finos-legend-gitlab-integrator-k8s latest/edge 33 10.100.122.219 no waiting for units to settle down
legend-db waiting 0/1 finos-legend-db-k8s latest/edge 13 10.100.215.55 no waiting for units to settle down
legend-engine waiting 0/1 finos-legend-engine-k8s latest/edge 25 10.100.250.179 no waiting for units to settle down
legend-ingress waiting 0/1 nginx-ingress-integrator latest/edge 46 10.100.105.142 no waiting for units to settle down
legend-sdlc waiting 0/1 finos-legend-sdlc-k8s latest/edge 48 10.100.134.110 no waiting for units to settle down
legend-studio waiting 0/1 finos-legend-studio-k8s latest/edge 27 10.100.161.43 no waiting for units to settle down
mongodb active 1 mongodb-k8s latest/edge 16 10.100.31.68 no

Unit Workload Agent Address Ports Message
certbot-k8s/0 error lost 192.168.23.1 crash loop backoff: back-off 5m0s restarting failed container=charm-init pod=certbot-k8s-0_finos-legend(9b8bd97f-c57d-421c-96e1-beda999055bc)
gitlab-integrator/0 error lost 192.168.81.251 crash loop backoff: back-off 5m0s restarting failed container=charm pod=gitlab-integrator-0_finos-legend(efd01819-d9e6-4021-b762-03bd30fd11c6)
legend-db/0 error lost 192.168.56.214 crash loop backoff: back-off 5m0s restarting failed container=charm-init pod=legend-db-0_finos-legend(4aa02ef0-1761-4249-a11e-0455ce7cec49)
legend-engine/0 error lost 192.168.77.249 crash loop backoff: back-off 5m0s restarting failed container=charm pod=legend-engine-0_finos-legend(4a21297b-cebf-4c40-b332-0a4e0a7ef9b7)
legend-ingress/0* error lost 192.168.73.143 crash loop backoff: back-off 5m0s restarting failed container=charm pod=legend-ingress-0_finos-legend(81e6a143-b8f7-4faa-b435-a008bb5a3245)
legend-sdlc/0 error lost 192.168.24.28 crash loop backoff: back-off 5m0s restarting failed container=charm pod=legend-sdlc-0_finos-legend(cd657e08-4f9e-4e2b-9f4b-52547ca65ba1)
legend-studio/0* error lost 192.168.30.70 crash loop backoff: back-off 5m0s restarting failed container=charm pod=legend-studio-0_finos-legend(0039b742-2850-4c53-8e68-14cf2c112a4f)
```

Running ``juju debug-log`` or ``juju show-status-log`` does not reveal any useful information:

```
juju debug-log --include certbot-k8s/0

juju show-status-log certbot-k8s/0
Time Type Status Message
21 Nov 2022 13:39:33Z juju-unit error crash loop backoff: back-off 5m0s restarting failed container=charm-init pod=certbot-k8s-0_finos-legend(9b8bd97f-c57d-421c-96e1-beda999055bc)
21 Nov 2022 13:44:26Z juju-unit error container error:
21 Nov 2022 13:44:39Z juju-unit error crash loop backoff: back-off 5m0s restarting failed container=charm-init pod=certbot-k8s-0_finos-legend(9b8bd97f-c57d-421c-96e1-beda999055bc)
21 Nov 2022 13:54:41Z juju-unit error container error:

```

But, if we look at the Kubernetes pod logs, we can see some useful information:

```
kubectl logs pod/certbot-k8s-0 -n finos-legend charm-init
ERROR option provided but not defined: --containeragent-pebble-dir
```

It seems an additional argument was added to the charm's Kubernetes StatefulSet; however, the ``charm-init`` init container's image was **not** updated, it is still ``jujusolutions/jujud-operator:2.9.34``:

```
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    app.juju.is/uuid: 279459a4
    controller.juju.is/id: 578786eb-2e3a-4235-8401-2d97cb151031
    juju.is/version: 2.9.34
    model.juju.is/id: 994a2b25-4ea2-40d0-8940-f51454152132
  creationTimestamp: "2022-02-28T13:06:28Z"
  generation: 15
  labels:
    app.kubernetes.io/managed-by: juju
    app.kubernetes.io/name: certbot-k8s
  name: certbot-k8s
  namespace: finos-legend
  resourceVersion: "63436553"
  uid: 4c6aee14-fd47-4dd1-9ff3-fadea403029d
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: certbot-k8s
  serviceName: certbot-k8s-endpoints
  template:
    metadata:
      annotations:
        controller.juju.is/id: 578786eb-2e3a-4235-8401-2d97cb151031
        juju.is/version: 2.9.34
        model.juju.is/id: 994a2b25-4ea2-40d0-8940-f51454152132
      creationTimestamp: null
      labels:
        app.kubernetes.io/name: certbot-k8s
    spec:
      automountServiceAccountToken: true
      containers:
      - args:
        - run
        - --http
        - :38812
        - --verbose
        command:
        - /charm/bin/pebble
        env:
        - name: JUJU_CONTAINER_NAMES
          value: certbot-nginx
        - name: HTTP_PROBE_PORT
          value: "3856"
        image: jujusolutions/charm-base:ubuntu-20.04
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /v1/health?level=alive
            port: 38812
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: charm
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /v1/health?level=ready
            port: 38812
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        securityContext:
          runAsGroup: 0
          runAsUser: 0
        startupProbe:
          failureThreshold: 2
          httpGet:
            path: /startup
            port: 3856
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/pebble/default
          name: charm-data
          subPath: containeragent/pebble
        - mountPath: /charm/bin
          name: charm-data
          readOnly: true
          subPath: charm/bin
        - mountPath: /var/lib/juju
          name: charm-data
          subPath: var/lib/juju
        - mountPath: /charm/containers
          name: charm-data
          subPath: charm/containers
        workingDir: /var/lib/juju
      - args:
        - run
        - --create-dirs
        - --hold
        - --http
        - :38813
        - --verbose
        command:
        - /charm/bin/pebble
        env:
        - name: JUJU_CONTAINER_NAME
          value: certbot-nginx
        - name: PEBBLE_SOCKET
          value: /charm/container/pebble.socket
        image: registry.jujucharms.com/charm/gmp2uydjv8qosoap03gsy06g9733pexeg9d3k/certbot-nginx-image@sha256:688d49104532c4614f365e2404c98ffa032fc84e5c43dea2e0aab15dc3baed84
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /v1/health?level=alive
            port: 38813
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: certbot-nginx
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /v1/health?level=ready
            port: 38813
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        securityContext:
          runAsGroup: 0
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /charm/bin/pebble
          name: charm-data
          readOnly: true
          subPath: charm/bin/pebble
        - mountPath: /charm/container
          name: charm-data
          subPath: charm/containers/certbot-nginx
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: certbot-k8s-certbot-nginx-secret
      initContainers:
      - args:
        - init
        - --containeragent-pebble-dir
        - /containeragent/pebble
        - --charm-modified-version
        - "3"
        - --data-dir
        - /var/lib/juju
        - --bin-dir
        - /charm/bin
        command:
        - /opt/containeragent
        env:
        - name: JUJU_CONTAINER_NAMES
          value: certbot-nginx
        - name: JUJU_K8S_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: JUJU_K8S_POD_UUID
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.uid
        envFrom:
        - secretRef:
            name: certbot-k8s-application-config
        image: jujusolutions/jujud-operator:2.9.34
        imagePullPolicy: IfNotPresent
        name: charm-init
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/juju
          name: charm-data
          subPath: var/lib/juju
        - mountPath: /containeragent/pebble
          name: charm-data
          subPath: containeragent/pebble
        - mountPath: /charm/bin
          name: charm-data
          subPath: charm/bin
        - mountPath: /charm/containers
          name: charm-data
          subPath: charm/containers
        workingDir: /var/lib/juju
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: certbot-k8s
      serviceAccountName: certbot-k8s
      terminationGracePeriodSeconds: 300
      volumes:
      - emptyDir: {}
        name: charm-data
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
status:
  collisionCount: 0
  currentRevision: certbot-k8s-6dc79f6c4d
  observedGeneration: 15
  replicas: 1
  updateRevision: certbot-k8s-c7bf6949
  updatedReplicas: 1
```

NOTE: There are a few more ``juju.is/version: 2.9.34`` annotations elsewhere in the StatefulSet spec as well (not included above).
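
To check whether a given charm is affected, the ``charm-init`` image can be read straight from the live StatefulSet and compared against the model version reported by ``juju status`` (2.9.37 above). A minimal sketch using this deployment's names (it assumes ``charm-init`` is the first init container, as in the spec above):

```
kubectl get statefulset certbot-k8s -n finos-legend \
  -o jsonpath='{.spec.template.spec.initContainers[0].image}{"\n"}'
```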

This issue **can** be resolved simply by upgrading the charm through ``juju refresh``; however, that will not work if the charm is already at the latest revision:

```
juju refresh certbot-k8s
charm "certbot-k8s": already up-to-date
```

A workaround is to update the StatefulSet itself, setting the ``charm-init`` container image to ``jujusolutions/jujud-operator:2.9.37``. Doing so resolves the error above.
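
A minimal sketch of that manual patch, assuming ``charm-init`` is the first (index 0) init container, as in the spec above:

```
kubectl patch statefulset certbot-k8s -n finos-legend --type='json' -p='[
  {"op": "replace",
   "path": "/spec/template/spec/initContainers/0/image",
   "value": "jujusolutions/jujud-operator:2.9.37"}
]'
```

The StatefulSet controller then rolls the pod, and the patched ``charm-init`` binary recognizes the ``--containeragent-pebble-dir`` flag.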

However, that is not the only problem.

The other charms had been refreshed and did not have this issue; their ``charm-init`` container image was up to date. However, these charms periodically become ``Active`` and, after roughly a minute, drop into an ``Error`` state. ``juju debug-log`` doesn't reveal much information in this case either:

```
unit-certbot-k8s-0: 14:47:51 INFO juju.cmd running containerAgent [2.9.37 fd867c0a267591313571dee9c60f3f9e71120581 gc go1.19.3]
unit-certbot-k8s-0: 14:47:51 INFO juju.cmd.containeragent.unit start "unit"
unit-certbot-k8s-0: 14:47:51 INFO juju.worker.upgradesteps upgrade steps for 2.9.37 have already been run.
unit-certbot-k8s-0: 14:47:51 INFO juju.worker.probehttpserver starting http server on [::]:65301
unit-certbot-k8s-0: 14:47:51 INFO juju.api cannot resolve "controller-service.controller-finos-legend.svc.cluster.local": lookup controller-service.controller-finos-legend.svc.cluster.local: operation was canceled
unit-certbot-k8s-0: 14:47:51 INFO juju.api connection established to "wss://10.100.157.228:17070/model/994a2b25-4ea2-40d0-8940-f51454152132/api"
unit-certbot-k8s-0: 14:47:51 INFO juju.worker.apicaller [994a2b] "unit-certbot-k8s-0" successfully connected to "10.100.157.228:17070"
unit-certbot-k8s-0: 14:47:51 INFO juju.worker.migrationminion migration phase is now: NONE
unit-certbot-k8s-0: 14:47:51 INFO juju.worker.logger logger worker started
unit-certbot-k8s-0: 14:47:51 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
```

``juju show-status-log`` doesn't show much information either:

```
21 Nov 2022 14:48:40Z juju-unit error crash loop backoff: back-off 2m40s restarting failed container=charm pod=certbot-k8s-0_finos-legend(b43f12a0-76ef-4e5c-96c3-ff419ae2af81)
21 Nov 2022 14:48:40Z workload maintenance
21 Nov 2022 14:51:28Z juju-unit executing running start hook
21 Nov 2022 14:51:29Z juju-unit idle
21 Nov 2022 14:51:33Z juju-unit executing running certbot-nginx-pebble-ready hook
21 Nov 2022 14:51:33Z workload active
21 Nov 2022 14:51:33Z juju-unit idle
21 Nov 2022 14:52:08Z workload maintenance stopping charm software
21 Nov 2022 14:52:08Z juju-unit executing running stop hook
21 Nov 2022 14:52:10Z juju-unit error crash loop backoff: back-off 5m0s restarting failed container=charm pod=certbot-k8s-0_finos-legend(b43f12a0-76ef-4e5c-96c3-ff419ae2af81)
21 Nov 2022 14:52:10Z workload maintenance
```

Looking at the Kubernetes pod logs, we can see a bit more information, including the fact that the pod received a termination signal:

```
kubectl logs pod/certbot-k8s-0 -n finos-legend
Defaulted container "charm" out of: charm, certbot-nginx, charm-init (init)
2022-11-21T14:47:51.046Z [pebble] HTTP API server listening on ":38812".
2022-11-21T14:47:51.046Z [pebble] Started daemon.
2022-11-21T14:47:51.053Z [pebble] POST /v1/services 5.725658ms 202
2022-11-21T14:47:51.053Z [pebble] Started default services with change 13.
2022-11-21T14:47:51.056Z [pebble] Service "container-agent" starting: /charm/bin/containeragent unit --data-dir /var/lib/juju --append-env "PATH=$PATH:/charm/bin" --show-log --charm-modified-version 3
2022-11-21T14:47:51.091Z [container-agent] 2022-11-21 14:47:51 INFO juju.cmd supercommand.go:56 running containerAgent [2.9.37 fd867c0a267591313571dee9c60f3f9e71120581 gc go1.19.3]
2022-11-21T14:47:51.091Z [container-agent] starting containeragent unit command
2022-11-21T14:47:51.091Z [container-agent] containeragent unit "unit-certbot-k8s-0" start (2.9.37 [gc])
2022-11-21T14:47:51.091Z [container-agent] 2022-11-21 14:47:51 INFO juju.cmd.containeragent.unit runner.go:556 start "unit"
2022-11-21T14:47:51.091Z [container-agent] 2022-11-21 14:47:51 INFO juju.worker.upgradesteps worker.go:60 upgrade steps for 2.9.37 have already been run.
2022-11-21T14:47:51.092Z [container-agent] 2022-11-21 14:47:51 INFO juju.worker.probehttpserver server.go:157 starting http server on [::]:65301
2022-11-21T14:47:51.102Z [container-agent] 2022-11-21 14:47:51 INFO juju.api apiclient.go:1055 cannot resolve "controller-service.controller-finos-legend.svc.cluster.local": lookup controller-service.controller-finos-legend.svc.cluster.local: operation was canceled
2022-11-21T14:47:51.102Z [container-agent] 2022-11-21 14:47:51 INFO juju.api apiclient.go:688 connection established to "wss://10.100.157.228:17070/model/994a2b25-4ea2-40d0-8940-f51454152132/api"
2022-11-21T14:47:51.105Z [container-agent] 2022-11-21 14:47:51 INFO juju.worker.apicaller connect.go:163 [994a2b] "unit-certbot-k8s-0" successfully connected to "10.100.157.228:17070"
2022-11-21T14:47:51.128Z [container-agent] 2022-11-21 14:47:51 INFO juju.worker.migrationminion worker.go:142 migration phase is now: NONE
2022-11-21T14:47:51.128Z [container-agent] 2022-11-21 14:47:51 INFO juju.worker.logger logger.go:120 logger worker started
2022-11-21T14:47:51.133Z [container-agent] 2022-11-21 14:47:51 WARNING juju.worker.proxyupdater proxyupdater.go:282 unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
2022-11-21T14:48:38.062Z [pebble] Exiting on terminated signal.
2022-11-21T14:48:38.064Z [pebble] Stopping all running services.
2022-11-21T14:48:38.885Z [container-agent] 2022-11-21 14:48:38 ERROR juju.cmd.containeragent.unit runner.go:459 fatal "unit": agent should be terminated
2022-11-21T14:48:38.888Z [pebble] Service "container-agent" stopped
```

By running ``kubectl describe`` on the pod, we can see the events:

```
kubectl describe pod/certbot-k8s-0 -n finos-legend
...
  Normal Created 13m kubelet Created container charm-init
  Normal Started 13m kubelet Started container charm-init
  Normal Started 13m kubelet Started container certbot-nginx
  Normal Pulled 13m kubelet Container image "registry.jujucharms.com/charm/gmp2uydjv8qosoap03gsy06g9733pexeg9d3k/certbot-nginx-image@sha256:688d49104532c4614f365e2404c98ffa032fc84e5c43dea2e0aab15dc3baed84" already present on machine
  Normal Created 13m kubelet Created container certbot-nginx
  Normal Created 11m (x3 over 13m) kubelet Created container charm
  Normal Started 11m (x3 over 13m) kubelet Started container charm
  Normal Pulled 11m (x4 over 13m) kubelet Container image "jujusolutions/charm-base:ubuntu-20.04" already present on machine
  Normal Killing 11m (x3 over 12m) kubelet Container charm failed startup probe, will be restarted
  Warning BackOff 8m35s kubelet Back-off restarting failed container
  Warning Unhealthy 3m6s (x15 over 12m) kubelet Startup probe failed: Get "http://192.168.29.101:3856/startup": dial tcp 192.168.29.101:3856: connect: connection refused
```

Because the startup probe keeps failing, the pod is eventually restarted by Kubernetes.
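
The failing probe can be read back from the live StatefulSet to confirm this; a sketch, with index 0 being the ``charm`` container as in the spec above:

```
kubectl get statefulset certbot-k8s -n finos-legend \
  -o jsonpath='{.spec.template.spec.containers[0].startupProbe}{"\n"}'
```

It still targets ``/startup`` on port 3856, while the agent log above shows the 2.9.37 agent listening on port 65301 instead.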

In parallel, I deployed a new instance of the charm in a different model, and it had no issues. I then compared the Pod specs of the old (now broken) Pod and the new Pod, and saw that the new Pod does not have any startup probe (see the old Pod spec above). If we update the charm's Kubernetes StatefulSet and remove the startup probe, the charm becomes Active and stays that way; a sketch of that patch follows.
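
A sketch of removing the stale probe in place, under the same assumptions as the earlier patch (the ``charm`` container is at index 0):

```
kubectl patch statefulset certbot-k8s -n finos-legend --type='json' -p='[
  {"op": "remove",
   "path": "/spec/template/spec/containers/0/startupProbe"}
]'
```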

tl;dr: Upgrading a Juju Controller and Model updates the charms' Kubernetes StatefulSets (adding the ``--containeragent-pebble-dir`` argument to the ``charm-init`` container), but it does not update the init container's image to the newer version, leaving the charm broken. In addition, it does not update other parts of the StatefulSet spec, including the startup probe, so the Kubernetes Pods are restarted periodically because the stale startup probe no longer passes.

Junien F (axino)
tags: added: canonical-is-upgrades
Joseph Phillips (manadart) wrote :
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.9.38
assignee: nobody → Thomas Miller (tlmiller)
Thomas Miller (tlmiller) wrote :

Hi Claudiu,

We have successfully replicated the issue and have almost completed a fix for it. If you would like to unblock the charms in the meantime, you can:

1. Upgrade the model to 2.9.37
2. Then edit each charm's StatefulSet and update the ``charm-init`` container image to the 2.9.37 tag of the ``jujusolutions/jujud-operator`` image (a sketch of this is below).
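
A sketch of step 2 applied across the whole model namespace, assuming every Juju-managed StatefulSet in it has the ``charm-init`` init container at index 0, as in the spec above:

```
for sts in $(kubectl get statefulsets -n finos-legend -o name); do
  kubectl patch "$sts" -n finos-legend --type='json' -p='[
    {"op": "replace",
     "path": "/spec/template/spec/initContainers/0/image",
     "value": "jujusolutions/jujud-operator:2.9.37"}
  ]'
done
```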

Harry Pidcock (hpidcock) wrote :
Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released