upgrading Podspec to Sidecar charms fails on AKS

Bug #2073529 reported by Orfeas Kourkakis
| Affects | Status | Importance | Assigned to | Milestone |
| --- | --- | --- | --- | --- |
| Canonical Juju | Triaged | Medium | Harry Pidcock | |

Bug Description

When trying out the suggested `juju refresh` upgrade path from podspec charms to sidecar charms, the charms get stuck in the following state:

```
envoy res:oci-image@cc06b3e unknown 0/1 envoy latest/edge 253 10.0.3.103 no
katib-controller res:oci-image@31ccd70 unknown 0/1 katib-controller latest/edge 700 10.0.16.155 no
kubeflow-volumes res:oci-image@2261827 unknown 0/1 kubeflow-volumes latest/edge 326 10.0.14.97 no
```

and are unable to spin up new units. For context, `latest/edge` is the new channel. Looking at the pods, the podspec operator pods appear to still be running.
* Logs from controller `api-server` container from around that time https://pastebin.canonical.com/p/wPYFYWFpWD/
* All logs from the same container are attached as well
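
To check for the leftover podspec operator pods mentioned above, something like the following sketch could be used. It assumes the model lives in the `kubeflow` namespace and that podspec operator pods follow the `<app>-operator-<n>` naming pattern; neither detail is stated in the report, and `list_operator_pods` is a hypothetical helper name.

```shell
# Sketch: list pods that look like leftover podspec operator pods.
# Assumptions (not from the report): the model namespace is "kubeflow" and
# operator pods are named "<app>-operator-<n>".
list_operator_pods() {
    namespace="${1:-kubeflow}"
    # "-o name" prints one "pod/<name>" entry per line.
    # "|| true" keeps the function from failing when nothing matches.
    kubectl get pods -n "$namespace" -o name | grep -- '-operator-' || true
}
```

Calling `list_operator_pods kubeflow` after the refresh prints any pods matching that pattern, and prints nothing if none are found.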

## Debugging tried
1. `juju scale-application` does nothing (to 0 or 1)
2. Cannot `juju refresh` back to the previous charm: `ERROR cannot downgrade from v2 charm format to v1`
3. Tried to completely remove the stuck charms and re-deploy them (so we could follow a possible workaround), twice, on different clusters. Each time they ended up in the same stuck `unknown 0/1` state, with the `operator` pods mentioned above still present.
4. Restarted the controller (by killing its pod) but this didn't unblock the charms

That means that if this happens, there is currently no way to unblock those charms. (I still need to try manually deleting their deployments.)

## Reproduce
1. Create AKS cluster 1.29 https://charmed-kubeflow.io/docs/create-aks-cluster-for-mlops
2. Deploy kubeflow 1.8/stable https://charmed-kubeflow.io/docs/deploy-charmed-kubeflow-to-aks#heading--set-up-juju
3. Try refreshing those specific charms:
```
juju scale-application katib-controller 0
juju scale-application kubeflow-volumes 0
juju scale-application envoy 0
# wait for units to disappear
juju remove-relation mlmd envoy
juju refresh katib-controller --channel latest/edge --trust
juju refresh kubeflow-volumes --channel latest/edge --trust
juju refresh envoy --channel latest/edge --trust
# wait for refresh to complete
juju scale-application katib-controller 1
juju scale-application kubeflow-volumes 1
juju scale-application envoy 1
```
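
The two "wait" comments in the steps above were done by hand. A minimal sketch of how the first one could be scripted is shown below. `wait_until`, `no_units_left` and `POLL_INTERVAL` are hypothetical names, the check assumes `jq` is installed, and it assumes the JSON status nests units under `.applications.<app>.units` (with the key absent once the application has no units).

```shell
# Sketch: poll a command until it succeeds or a timeout (in seconds) expires.
# wait_until and POLL_INTERVAL are hypothetical names, not part of juju.
# POLL_INTERVAL must be at least 1, otherwise the timeout is never reached.
wait_until() {
    timeout_s="$1"; shift
    interval_s="${POLL_INTERVAL:-5}"
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            echo "timed out after ${timeout_s}s waiting for: $*" >&2
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
}

# Succeeds once `juju status` reports no units for the given application.
# Assumes jq is available and that units appear under
# .applications.<app>.units in the JSON output.
no_units_left() {
    count="$(juju status --format=json \
        | jq --arg app "$1" '.applications[$app].units // {} | length')"
    [ "$count" -eq 0 ]
}
```

For example, `juju scale-application envoy 0 && wait_until 300 no_units_left envoy` would block until the envoy units are gone or five minutes pass. The same `wait_until` wrapper could be reused for the "wait for refresh to complete" step with whatever check is appropriate there.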

## Environment
Juju 3.4.4
AKS 1.29
On Microk8s and EKS 1.29, the upgrade path works.

Orfeas Kourkakis (orfeas-k) wrote :
Ian Booth (wallyworld) wrote :

So it's working on microk8s and EKS but not AKS????? Weird.

Ultimately, whoever picks this up will need info from the cluster, not just juju. They'd need get/describe yaml of the affected pods, plus status format yaml of the juju model before and after the upgrade operation.

How feasible is it to redeploy rather than upgrade?

Orfeas Kourkakis (orfeas-k) wrote :

>So it's working on microk8s and EKS but not AKS????? Weird.

Yes exactly.

> They'd need get/describe yaml of the affected pods, plus status format yaml of the juju model before and after the upgrade operation.

The error is reproducible, so it shouldn't be hard to get any information. That being said, I deployed once more and am attaching below what was asked for.
* juju status before refresh: https://pastebin.canonical.com/p/5DSzfmwp88/
* juju status after refresh: https://pastebin.canonical.com/p/tG29Qfhb8k/
* envoy operator pod: https://pastebin.canonical.com/p/TkBYzx4ZJW/
* envoy operator describe: https://pastebin.canonical.com/p/rwjSDJWZDm/
* katib controller operator pod: https://pastebin.canonical.com/p/nrZJXcSt2n/
* katib controller operator describe: https://pastebin.canonical.com/p/VdNz2CmpY5/
* kubeflow volumes pod: https://pastebin.canonical.com/p/8WCWYwGJFt/
* kubeflow volumes describe: https://pastebin.canonical.com/p/WRQpx8YXNb/

(Regarding the status before refresh: I scaled the apps to 0 and back to 1 before refreshing, which is why they show `/1` as units.)
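
For anyone reproducing this, the kind of artifacts listed above could be gathered with a sketch along these lines. `collect_pod_info` and `snapshot_status` are hypothetical helper names, and the namespace and pod names passed in are whatever the affected cluster actually uses; nothing here is taken from the pastes themselves.

```shell
# Sketch: save the manifest and the description of one pod to files.
# collect_pod_info is a hypothetical helper; the caller supplies the
# namespace, the pod name and an output directory.
collect_pod_info() {
    namespace="$1"; pod="$2"; outdir="$3"
    mkdir -p "$outdir"
    kubectl get pod "$pod" -n "$namespace" -o yaml > "$outdir/$pod.yaml"
    kubectl describe pod "$pod" -n "$namespace" > "$outdir/$pod.describe.txt"
}

# Save the current model status as YAML under a label such as
# "before" or "after" the refresh.
snapshot_status() {
    label="$1"; outdir="$2"
    mkdir -p "$outdir"
    juju status --format=yaml > "$outdir/status-$label.yaml"
}
```

A typical run would call `snapshot_status before ./diag` ahead of the refresh, then `snapshot_status after ./diag` and `collect_pod_info <namespace> <pod> ./diag` for each affected pod afterwards.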

> How feasible is it to redeploy rather than upgrade?

It is, as noted in our [alternative upgrade path](https://docs.google.com/document/d/1Wg32O5PF8RMy7ng7hY9gX37lHnwmszyBt4D2lI_MSjQ/edit#heading=h.k8y9mwjyl482), even if it's not an ideal UX.

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Harry Pidcock (hpidcock)