Inconsistent events on pod restart

Bug #2021891 reported by Carl Csaposs
This bug affects 2 people
Affects: Canonical Juju
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

If a pod is deleted with `kubectl delete pod`, these events are fired: stop -> upgrade-charm -> config-changed -> start (consistent with https://juju.is/docs/sdk/start-event#heading--emission-sequence)

If the cluster is stopped and started with `sudo microk8s stop` and `sudo microk8s start`, this event is fired: start

Expected behavior:
`upgrade-charm` is not fired when a pod is deleted with `kubectl delete pod` (or `upgrade-charm` is fired after `sudo microk8s stop` and `sudo microk8s start`)

Steps to reproduce:
1. juju add-model foo
2. juju deploy mysql-router-k8s --channel 8.0/edge
3. Wait for idle
4. kubectl -n foo delete pod mysql-router-k8s-0
5. Wait for idle
6. sudo microk8s stop
7. sudo microk8s start
8. Wait for idle
9. Check juju debug-log --replay or jhack tail --replay

Versions:
MicroK8s v1.26.4 revision 5219 from snap channel 1.26/stable
Juju 2.9.43-ubuntu-amd64 from snap channel 2.9/edge

Also reproduced on MicroK8s v1.26.4 revision 5222 from snap channel 1.26-strict/stable and Juju 3.2-beta3-genericlinux-amd64 from snap channel 3.2/beta

Additional context:
In both pod deletion scenarios, the debug log shows `INFO juju.worker.uniter reboot detected; triggering implicit start hook to notify charm`

Initially discovered in mysql-router-k8s bug report: https://github.com/canonical/mysql-router-k8s-operator/issues/85

Revision history for this message
Ian Booth (wallyworld) wrote :

Pod churn is different from cluster start / stop. In the latter case, there is no pod churn: juju is killed (so it can't run the stop hook), and just running the start hook when the cluster comes up again is reasonable, since the same pod as before is simply being started up. For pod churn, by contrast, juju gets a chance to see that the pod is deleted and hence can run the stop hook; when a replacement pod comes up, it's treated as new, and the standard new agent hooks run.

Charm hooks are supposed to be idempotent - the same hook can be run multiple times and the charm should do the right thing. The linked issue seems to imply that start hook running twice starts a second instance. This is a charm hook implementation issue.
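The idempotency point can be sketched in plain Python (a minimal model, not the ops framework API; `FakeServiceManager` and the hook function are hypothetical names for illustration): the hook checks current state before acting, so a duplicate event delivery does not start a second instance.

```python
# Hypothetical sketch of an idempotent "start" hook. The service
# manager is a stand-in for Pebble/systemd; the key pattern is the
# is_running() guard before start().

class FakeServiceManager:
    def __init__(self):
        self.running = []

    def is_running(self, name):
        return name in self.running

    def start(self, name):
        self.running.append(name)

def on_start(svc, name="mysql-router"):
    # Idempotent: acting on the observed state, not on event count.
    if not svc.is_running(name):
        svc.start(name)

svc = FakeServiceManager()
on_start(svc)
on_start(svc)  # same hook delivered a second time: no second instance
assert len(svc.running) == 1
```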

Changed in juju:
status: New → Invalid
Revision history for this message
Pietro Pasotti (ppasotti) wrote :
description: updated
Revision history for this message
Carl Csaposs (carlcsaposs) wrote (last edit ):

In the case of microk8s stop / start, there appears to be pod churn—a different pod starts when the cluster comes up again

Steps to reproduce:
1. juju add-model foo
2. juju deploy mysql-k8s --channel 8.0/edge --trust
3. juju deploy mysql-router-k8s --channel 8.0/edge
4. juju relate mysql-router-k8s mysql-k8s
5. Wait for idle
6. kubectl -n foo describe pod mysql-router-k8s-0 # Look at container `Restart Count` (should be 0) or State start time
7. kubectl -n foo exec mysql-router-k8s-0 --container mysql-router -- ls /etc/mysqlrouter # Two files exist
8. sudo microk8s stop
9. sudo microk8s start
10. kubectl -n foo exec mysql-router-k8s-0 --container mysql-router -- ls /etc/mysqlrouter # No files exist
11. kubectl -n foo describe pod mysql-router-k8s-0 # Container restart count is 1, state start time is newer, last state is terminated

Revision history for this message
Ian Booth (wallyworld) wrote :

k8s guarantees the pod identity of stateful set pods, but they can be rescheduled as needed to satisfy various constraints. If a pod is rescheduled, any PV will get reattached, but the local file system will not be restored. This is just standard k8s behaviour the charm needs to deal with: it must not assume that info stored on the local container file system is durable. Anything that needs to be persisted across pod restarts (for whatever reason) needs to be put into unit state (for config, key pairs, etc.) or on a storage volume declared in the charm metadata.
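The distinction Ian draws can be modelled in a few lines (an illustrative sketch, not Juju or k8s code; `Pod` and `reschedule` are invented for the example): churn gives the replacement pod a fresh filesystem, while the durable store (unit state or a declared storage volume) is reattached.

```python
# Toy model of pod churn: the container filesystem is ephemeral,
# the durable store (PV / unit state) survives rescheduling.

class Pod:
    def __init__(self, durable):
        self.fs = {}            # container filesystem: lost on churn
        self.durable = durable  # declared storage / unit state: reattached

def reschedule(pod):
    # kubectl-delete-style churn: new pod, empty filesystem, same PV.
    return Pod(pod.durable)

pod = Pod(durable={})
pod.fs["router.conf"] = "local-only"
pod.durable["router.conf"] = "persisted"

pod = reschedule(pod)
assert "router.conf" not in pod.fs                 # local copy is gone
assert pod.durable["router.conf"] == "persisted"   # durable copy survives
```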

Revision history for this message
Carl Csaposs (carlcsaposs) wrote (last edit ):

I understand that the local container file system is not durable. I'm still confused about why for some pod churns (i.e. `kubectl delete`) the charm gets an `upgrade-charm` event after the new pod starts but for other pod churns (i.e. `microk8s stop`) the charm does not get that event. From the debug log, it's clear that juju detected a reboot.

Currently, `upgrade-charm` being fired on `kubectl delete` suggests that charm developers can use that event to detect a pod restart—but it detects only some pod restarts.

Is there a reason that `upgrade-charm` is currently fired after `kubectl delete`?

Revision history for this message
Ian Booth (wallyworld) wrote :

Starting and stopping the cluster does not cause pod churn though - the exact same pods/containers are running after the cluster restarts. This is different to actually asking k8s to delete a pod - a new pod with a different internal id is created to replace the deleted pod. This new pod has containers with fresh container filesystems which means that transient state info used by the unit agent is deleted. This state info includes the "charm modified version" which is used to track whether the charm upgrade hook needs to be run. So a brand new pod will run the upgrade-charm hook, allowing the charm to recreate any container local state it needs to run.

If you want to detect that a pod has been restarted (for whatever reason), can't you use the "start" hook which is run every time?
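The pattern Ian suggests might look like this (an assumed charm-side sketch, not mysql-router-k8s's actual code; the filename and `render_router_config` are hypothetical): because start fires after every restart, the charm can unconditionally recreate its container-local config there, and the write is idempotent.

```python
# Sketch: recreate container-local state on every start event,
# since start is delivered after any kind of pod restart.

def render_router_config():
    # Hypothetical content; a real charm would render from relation data.
    return "[DEFAULT]\nname = mysql-router\n"

def on_start(container_fs):
    # Safe to repeat: writing the same content twice changes nothing.
    container_fs["/etc/mysqlrouter/mysqlrouter.conf"] = render_router_config()

fs = {}       # fresh container filesystem after churn or restart
on_start(fs)
on_start(fs)  # duplicate delivery: same end state
assert list(fs) == ["/etc/mysqlrouter/mysqlrouter.conf"]
```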

Revision history for this message
Carl Csaposs (carlcsaposs) wrote :

> the exact same pods/containers are running after the cluster restarts

I don't think this is correct—using the steps to reproduce from https://bugs.launchpad.net/juju/+bug/2021891/comments/3, different containers are running after the cluster restart.

See here for comparison of `kubectl describe pod` between cluster restart and kubectl delete pod: https://gist.github.com/carlcsaposs-canonical/d51505044dcc830bd40ed2cccce08d71/revisions. The last revision is after `sudo microk8s stop` and `sudo microk8s start`. The second to last revision is after `kubectl delete pod`

> If you want to detect that a pod has been restarted (for whatever reason), can't you use the "start" hook which is run every time?

This is what we're now doing. But why does `upgrade-charm` run for kubectl delete even if the charm isn't being upgraded? Is this because of a limitation of how juju keeps track of upgrades, or is it intended to be used so that the container local state can be recreated by the charm?

Revision history for this message
Ian Booth (wallyworld) wrote :

upgrade-charm runs because the pod is deleted via kubectl; juju tracks that, and when the replacement pod starts with an empty local container filesystem, the unit agent's knowledge of the charm it is running is gone.

A cluster restart does not result in pod deletion from the statefulset, so it is treated differently.
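The "charm modified version" mechanism mentioned earlier in the thread can be paraphrased as follows (an illustrative model only, assumed by the editor and not Juju's actual code): the agent keeps a version marker in its local state; if the marker is missing or stale at agent startup, upgrade-charm is queued before start.

```python
# Toy model of the described tracking: a missing/stale marker in the
# agent's local state triggers upgrade-charm; otherwise only start runs.

def hooks_on_agent_start(agent_state, current_charm_version):
    hooks = []
    if agent_state.get("charm-modified-version") != current_charm_version:
        # Fresh replacement pod (empty state) or a real charm upgrade:
        # either way the agent can't prove the charm is unchanged.
        hooks.append("upgrade-charm")
        agent_state["charm-modified-version"] = current_charm_version
    hooks.append("start")
    return hooks

# kubectl delete pod: the replacement pod's agent state is empty.
assert hooks_on_agent_start({}, 7) == ["upgrade-charm", "start"]

# Cluster restart: juju saw no pod deletion, so in this model the
# marker is intact and only start runs.
assert hooks_on_agent_start({"charm-modified-version": 7}, 7) == ["start"]
```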

tags: added: canonical-data-platform-eng