Workload container probes are too unforgiving
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | High | Ben Hoyt | 3.4.1
Bug Description
In Kubernetes, the default workload container probes are unforgiving, i.e. a single timeout will restart the container (https:/
For reference, the same settings for the controller pod/containers are more forgiving:
https:/
The issue is observed with the mysql-k8s charm when scaling up the application from a non-empty database.
Due to the heavy data transfer during initialization of the new instance(s), liveness probe timeouts were observed (via kubelite logs and a modified Pebble that logs call times) on the transfer source, the destination, or both, causing unnecessary and hard-to-recover errors.
One avenue of testing was to patch/replace the StatefulSet as the charm's first action, in order to apply a more forgiving set of parameters, e.g.:
delay=300s timeout=2s period=5s #success=1 #failure=30
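The line above is in `kubectl describe pod` probe notation; expressed as standard Kubernetes probe fields in the pod spec, it would correspond to something like the following sketch (the `httpGet` path and port are illustrative placeholders, not the actual probe endpoint Juju configures):

```yaml
# Sketch: the "forgiving" parameters as Kubernetes probe fields.
livenessProbe:
  httpGet:
    path: /health            # placeholder endpoint
    port: 8080               # placeholder port
  initialDelaySeconds: 300   # delay=300s
  timeoutSeconds: 2          # timeout=2s
  periodSeconds: 5           # period=5s
  successThreshold: 1        # #success=1
  failureThreshold: 30       # #failure=30
```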
While this does work for the initial deployment, units added afterwards do not pick up these settings.
Instead, on scale-up the StatefulSet spec template seems to be merged with the original StatefulSet (rather than the patched one), resetting these parameters to their stock values.
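For reference, a minimal sketch of what such a StatefulSet patch could look like, assuming the workload container is named `mysql` and substituting the model's namespace for `<model>` (both assumptions; the patch only sets the probe timing fields):

```shell
# Build a strategic-merge patch carrying the forgiving probe parameters.
# Container name "mysql" and the namespace are assumptions for illustration.
PATCH='{"spec":{"template":{"spec":{"containers":[{"name":"mysql",
  "livenessProbe":{"initialDelaySeconds":300,"timeoutSeconds":2,
  "periodSeconds":5,"successThreshold":1,"failureThreshold":30}}]}}}}'

echo "$PATCH"

# Apply it (needs cluster access, so commented out here):
# kubectl -n <model> patch statefulset mysql-k8s -p "$PATCH"
```

As the report notes, this approach is fragile: the patched template is not what Kubernetes appears to use when new replicas are created on scale-up.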
Another avenue of testing was to modify these parameters directly in Juju, giving all workload containers the more forgiving liveness probe parameters, which effectively solves the issue.
In summary, what's a sane approach for making these parameters (or even the probe type) configurable by the charm at deployment time?
Would it be too much of a stretch to define them per application in the charm metadata?
tags: added: canonical-data-platform-eng
Changed in juju:
  status: New → In Progress
  importance: Undecided → High
  assignee: nobody → Ben Hoyt (benhoyt)
  milestone: none → 3.4.1
Changed in juju:
  status: In Progress → Fix Committed
Changed in juju:
  status: Fix Committed → Fix Released
Versions: juju-3.1/3.3, microk8s-1.28/1.29
Steps to reproduce:
juju deploy mysql-k8s -n 3 --trust [--config profile=testing OR --config profile-limit-memory=2400 if memory is constrained]
juju deploy mysql-test-app
juju relate mysql-k8s:database mysql-test-app
# I've observed that having COS lite seems to speed up the failures
juju deploy cos-lite --trust [--overlay storage-small-overlay.yaml]
# wait ~10 minutes for mysql-test-app to generate some data in the database
juju scale-application mysql-k8s 7
# watch container churn