Workload container probes are too unforgiving

Bug #2052517 reported by Paulo Machado
This bug affects 6 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Ben Hoyt

Bug Description

In Kubernetes, the default workload container probes are unforgiving, i.e. a single timeout will restart the container (https://github.com/juju/juju/blob/a90328ccf574d4be960b0236a5c21cd030111fd2/caas/kubernetes/provider/application/application.go#L68).

For reference, the same settings for the controller pod/containers are more forgiving:
https://github.com/juju/juju/blob/a90328ccf574d4be960b0236a5c21cd030111fd2/caas/kubernetes/provider/bootstrap.go#L1313

The issue is being observed with the mysql-k8s charm when scaling up the application from a non-empty database.

Due to the heavy data transfer during initialization of the new instance(s), liveness-probe timeouts were observed (via kubelite logs and a modified Pebble that logs call time) on the transfer source, the destination, or both, causing unnecessary and hard-to-recover failures.

One avenue of testing was to patch/replace the StatefulSet as the first charm action, in order to apply a more forgiving set of parameters, e.g.:
   delay=300s timeout=2s period=5s #success=1 #failure=30
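
Expressed as a kubectl patch, a change along those lines might look roughly like this (a sketch only: the namespace, container index and JSON paths are assumptions, not taken from the charm code):

# patch the workload container's liveness probe on the mysql-k8s StatefulSet
kubectl -n <model> patch statefulset mysql-k8s --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/1/livenessProbe/initialDelaySeconds", "value": 300},
  {"op": "replace", "path": "/spec/template/spec/containers/1/livenessProbe/timeoutSeconds", "value": 2},
  {"op": "replace", "path": "/spec/template/spec/containers/1/livenessProbe/periodSeconds", "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/1/livenessProbe/failureThreshold", "value": 30}
]'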

While this does work for the initial deployment, units added afterwards do not pick up these settings.

Instead, on scale-up the StatefulSet spec template seems to be merged from the original StatefulSet (rather than the patched one), leaving these parameters at their stock values.

Another avenue of testing was to modify these parameters directly in Juju, giving all workload containers the more forgiving liveness-probe parameters, which virtually solves the issue.

In summary, what's a sane approach for making these parameters (or even the probe type) configurable by the charm at deploy time?
Would it be too much of a stretch to define them per application in the charm metadata?

tags: added: canonical-data-platform-eng
Revision history for this message
Paulo Machado (paulomachado) wrote (last edit ):

Versions: juju-3.1/3.3, microk8s-1.28/1.29

Steps to reproduce:

juju deploy mysql-k8s -n 3 --trust [--config profile=testing OR --config profile-limit-memory=2400 if memory is constrained]
juju deploy mysql-test-app
juju relate mysql-k8s:database mysql-test-app

# I've observed that having COS lite seems to speed up the failures
juju deploy cos-lite --trust [--overlay storage-small-overlay.yaml]

# wait ~10 minutes for mysql-test-app to generate some data in the database
juju scale-application mysql-k8s 7

# watch container churn
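
For example (a sketch; the model namespace below is a placeholder), the churn can be observed with:

# watch pods restart as units cycle
kubectl get pods -n <model> -w
# or track restart counts per container
kubectl get pods -n <model> -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount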

Revision history for this message
Thomas Miller (tlmiller) wrote :

Hi Paulo,

Are you able to provide the checks you are using in your Pebble service layer, so that we can correlate everything that's going on?

Kind Regards
Tom

Revision history for this message
Paulo Machado (paulomachado) wrote :

Logs for the containers attached.

I've also included logs for the patched juju [3] and pebble [4].

[3] forgiving liveness parameter https://github.com/juju/juju/compare/3.3...paulomach:juju:3.3
[4] log health endpoint elapsed time https://github.com/canonical/pebble/compare/master...paulomach:pebble:master

Revision history for this message
Paulo Machado (paulomachado) wrote :

Another idea is to get rid of the workload container livenessProbe completely (assuming that *_pebble_ready is triggered by the readinessProbe).
We have the charm container to do that check, which would also bring the K8s behaviour closer to the VM behaviour.

Revision history for this message
Thomas Miller (tlmiller) wrote :

Hey Paulo,

Are you able to provide the output of kubectl describe pod for the pod in question that is failing without any patched values?
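
For reference, one way to capture that (pod name and namespace below are placeholders) might be:

kubectl -n <model> describe pod mysql-k8s-0 > mysql-k8s-0-describe.txt
# the probe settings and any probe failures show up here:
grep -E 'Liveness|Readiness|Unhealthy' mysql-k8s-0-describe.txt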

Revision history for this message
Paulo Machado (paulomachado) wrote :

Hi Thomas, I can provide you with that. Just note that the attached logs contain *both* patched and non-patched pod logs.

Revision history for this message
John A Meinel (jameinel) wrote :

I thought the intent was that Pebble itself should always be quick to respond, and charms get to define how Pebble queries the workload; thus, if *Pebble* is being unresponsive, it should be restarted.

So charms do have a way to define what their health checks should be (wrt pebble <-> application) but not wrt K8s <-> pebble.

Revision history for this message
Paulo Machado (paulomachado) wrote :

Hi John

> So charms do have a way to define what their health checks should be (wrt pebble <-> application) but not wrt K8s <-> pebble

That's super reasonable and blends in nicely.
But what do we do when Pebble is not quick enough on its own (no charm checks involved)?

As observed, the responses do arrive with status code 20x (i.e. healthy); they only fail because they exceed the 1s timeout.

Revision history for this message
Thomas Miller (tlmiller) wrote :

Hey John,

You are correct. I suspect Pebble is being starved of CPU in this instance and that is why it can't respond in time. The alternative theory is that we are hitting the deadlock in Pebble that is currently being fixed. I'm just trying to gather the information so I can build out monitoring to prove which case it is. There's a good chance it could be something else too, but the above are my two best guesses at the moment.
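
A quick way to check the CPU-starvation theory (assuming a metrics add-on such as microk8s's metrics-server is enabled; the namespace is a placeholder) would be something like:

# per-node and per-container CPU/memory usage
kubectl top node
kubectl top pod -n <model> --containers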

Revision history for this message
James Page (james-page) wrote :

Back-linking this to bug 2049857 in snap-openstack (microstack): we have a fair few users reporting this issue during resize operations, as well as Solutions QA, who are hitting it fairly reliably in testing.

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

Hi @paulomachado, the pebble issue that @tlmiller is referring to is:
https://github.com/canonical/pebble/issues/314

It has been marked as fixed in the latest version of Pebble (1.8.0, I believe), released last Friday. Can you give it a try?

Revision history for this message
Paulo Machado (paulomachado) wrote :

Hi @pguimaraes, thank you for pointing that out.
Although the symptom that I'm seeing seems to differ, I'll give it a try today and report back.

Revision history for this message
Paulo Machado (paulomachado) wrote :

No progress with pebble 1.8.0:

❯ for p in {0..8}; do kubectl logs -n ${MODEL} mysql-k8s-$p --container=mysql |rg "alive.*[0-9]s"|tail -n 1; done
2024-02-19T14:07:11.720Z [pebble] 172.31.2.182:48450 GET /v1/health?level=alive 1.004315412s 200
2024-02-19T14:43:06.442Z [pebble] 172.31.2.182:43490 GET /v1/health?level=alive 2.097824161s 200
2024-02-19T14:43:07.789Z [pebble] 172.31.2.182:54682 GET /v1/health?level=alive 1.115903076s 200
2024-02-19T14:30:03.865Z [pebble] 172.31.2.182:60740 GET /v1/health?level=alive 1.448630762s 200
2024-02-19T14:34:08.549Z [pebble] 172.31.2.182:55330 GET /v1/health?level=alive 1.166540946s 200
2024-02-19T14:24:37.535Z [pebble] 172.31.2.182:47080 GET /v1/health?level=alive 1.400687858s 200
2024-02-19T14:29:05.302Z [pebble] 172.31.2.182:44432 GET /v1/health?level=alive 1.454553133s 200

Revision history for this message
Paulo Machado (paulomachado) wrote :

@tlmiller @jameinel what workaround can we have for this issue?

Revision history for this message
Ben Hoyt (benhoyt) wrote :

I talked to the Juju team today, and they'd like to understand what's causing this first and why Pebble is responding so slowly, before making any changes. We *could* just bump up the timeout in Juju, or the number of retries, but that might have unintended consequences and mask legitimate problems for other users.

That Pebble health check endpoint (GET /v1/health) should be ultra-quick to respond, as it runs extremely little code (especially if there are no checks configured). So the fact that it's regularly taking over 1s is concerning, and is the underlying issue here. Are you able to measure what CPU and I/O usage is going on during that time? Is it the K8s host that's overloaded, or the container itself?

One thing the Juju team suggested is that you could use the "renice" and "ionice" commands to bump up the priority of Pebble (PID 1) and see if that makes a difference.
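
A rough sketch of that suggestion (the container name and namespace are placeholders; PID 1 inside the workload container is Pebble for sidecar charms):

# raise CPU scheduling priority of Pebble (PID 1) in the workload container
kubectl -n <model> exec mysql-k8s-0 -c mysql -- renice -n -5 -p 1
# raise its I/O priority within the best-effort class
kubectl -n <model> exec mysql-k8s-0 -c mysql -- ionice -c 2 -n 0 -p 1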

Revision history for this message
Paulo Machado (paulomachado) wrote (last edit ):

Some remarks:

* it only happens on mysql workload containers, under write/replication load
* the node is not under memory pressure, as reported by k8s
* there's no limit/request being applied to containers
* pod QoS class is BestEffort

Tests made were:

1. set pebble niceness to -5: no difference
2. set pebble ioniceness to 5: no difference

Attached are iotop, cpustat and pod logs (filtered for Pebble responses over 1s) captured over a ~7-minute interval.
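
For reference, the captures were along these lines (a sketch only; the exact flags, the mpstat stand-in for cpustat, and the namespace are assumptions):

# batch-mode iotop and per-CPU stats on the host, sampled every 5s
sudo iotop -botqqq -d 5 > iotop.log &
mpstat -P ALL 5 > cpustat.log &
# pebble health responses slower than 1s, per pod
for p in {0..6}; do kubectl logs -n ${MODEL} mysql-k8s-$p -c mysql | rg "alive.*[0-9]s"; done > pebble-slow.log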

Revision history for this message
Paulo Machado (paulomachado) wrote :

FWIW, I've tested with the postgresql-k8s charm and also got the timeouts.

Revision history for this message
John A Meinel (jameinel) wrote :

Simon dug into the Pebble code a bit, and came across this function:

https://github.com/canonical/pebble/blob/150cb9fe962c33856fa8839afbdc8d7ae27e4638/internals/overlord/checkstate/manager.go#L218

It seems to do the right thing and runs the `check` before grabbing the mutex. However, once it grabs the mutex, if the check has failed it then goes down into `action()`:
https://github.com/canonical/pebble/blob/150cb9fe962c33856fa8839afbdc8d7ae27e4638/internals/overlord/checkstate/manager.go#L250

That potentially does a lot of work while holding the mutex, and since it is the same mutex that is held by info(), it could easily be blocking the health check.

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

Hi team, I've done some tests alongside Paulo. Find here the logs, cluster status, juju status:
https://github.com/phvalguima/pebble/tree/lp2052517-pebble-timestamps/TEST_LOGS

This branch adds a lot of logging in the critical path of the v1Health logic.

Revision history for this message
Pedro Guimarães (pguimaraes) wrote :

Looking at the logs, the slowdown does indeed happen at this line; I can see it several times:
https://github.com/phvalguima/pebble/blob/lp2052517-pebble-timestamps/internals/daemon/daemon.go#L267

These are extracts from two of the seven mysql units; one of them saw a very long delay:

2024-02-22T16:36:10.648Z [pebble] daemon.go ServerHTTP from 192.168.0.228:33542 request state Lock: 271.987µs
2024-02-22T16:36:10.919Z [pebble] daemon.go ServerHTTP from 192.168.0.228:33530 acquired state Lock: 1.362945519s

2024-02-22T16:36:04.941Z [pebble] daemon.go ServerHTTP from 192.168.0.228:35728 request state Lock: 165.343µs
2024-02-22T16:36:05.133Z [pebble] daemon.go ServerHTTP from 192.168.0.228:35728 acquired state Lock: 191.844274ms

2024-02-22T16:36:04.941Z [pebble] daemon.go ServerHTTP from 192.168.0.228:35738 request state Lock: 206.75µs
2024-02-22T16:36:05.131Z [pebble] daemon.go ServerHTTP from 192.168.0.228:35738 acquired state Lock: 190.404511ms

2024-02-22T16:36:09.939Z [pebble] daemon.go ServerHTTP from 192.168.0.228:37566 request state Lock: 388.745µs
2024-02-22T16:36:10.163Z [pebble] daemon.go ServerHTTP from 192.168.0.228:37566 acquired state Lock: 224.851392ms

Revision history for this message
Ben Hoyt (benhoyt) wrote (last edit ):

Pedro and Paulo: really good digging, thanks. I've just had a good discussion with them, and I'll try to summarize here.

Responding to John's message and Simon's find about the check manager calling c.action() with the checkData mutex held: this is an interesting find, and I'll put it on my to-do list, as we probably should release the lock while calling c.action(). However, the action should be quick, and in addition, it's a red herring in this case, because in Pedro and Paulo's case they have no health checks defined at all, so this code path isn't being exercised. Juju is asking K8s to hit the Pebble /v1/health check endpoint, but if you have no health checks defined, it basically does nothing.

Does nothing ... except that ServeHTTP (up one level) acquires the state lock in daemon.go:267 that Pedro linked to. So if other Pebble operations have acquired the state lock and are taking their time (due to load or whatever), this will prevent /v1/health from responding in a timely fashion. This is what's happening: the charm is running various commands using the Pebble exec API, which needs the state lock, and sometimes /v1/health gets a delay as a result of it needing the lock in ServeHTTP. There are quite a number of commands being exec'd by the charm, though it's one at a time ... so it shouldn't be a problem, but with the load, every so often it is.

In terms of what I think we should do:

1) I'll see if we can change Pebble's ServeHTTP to avoid acquiring the state lock at all (I'm not even sure what restart.Pending does -- it's a holdover from snapd, I think). This will mean that not every handler needs to acquire it, and certainly /v1/health shouldn't need to.

2) I'm going to talk to the Juju team today, requesting that we bump up the container probe timeout from 1s to 10s. One second is very aggressive when we have no retries, and is clearly causing problems like we see here when the system is under load.

3) In the meantime, Paulo is going to see if they can reduce the load from MySQL, for example reducing the number of cores/threads to one less than the number of cores on the system, so that Pebble and other tasks have some room to breathe. He's not sure what knobs are available yet, but there does seem to be some that might help.

I'd also like to do a bit more digging into their logs and reproduce the issue locally with stress testing to see if we can understand why Pebble seems to get "stuck" so easily.

Revision history for this message
Ben Hoyt (benhoyt) wrote (last edit ):

Just talked to the Juju team about 2), and there's some debate about whether the better thing there would be to increase the K8s probe timeout (Harry's preference) or the number of retries (John's preference). So as far as Pebble+Juju goes, we're going to start with the Pebble ServeHTTP locking fix and go from there -- I've opened https://github.com/canonical/pebble/issues/366 to track that.

Revision history for this message
Alex Lutay (taurus) wrote (last edit ):

Thank you Ben for the good news!

Re: increase the K8s probe timeout or the number of retries.

Maybe allow the charm to set it, depending on installation size, historical load, hardware age, etc.?

re: 3) Paulo to reduce the load from MySQL ... keep one core for Pebble to breathe.

IMHO, Pebble should be 'niced' above the workload (or Juju should nice/ionice the workload down) to guarantee priority for Pebble (K8s) / Systemd+Snap (VM). We should utilize all resources effectively, but order consumers by priority.

Revision history for this message
John A Meinel (jameinel) wrote :

```
  _, rst := restart.Pending(st)
```

That is a very weird statement, and it's why I always recommend that people spell out their variable names. I think it is checking whether a restart is needed, but why would you throw away the first return value?

Revision history for this message
Ben Hoyt (benhoyt) wrote :

Hi folks, an update here for you to try, though you'll likely have to build Juju from source to try it. I've made the Pebble health check endpoint not acquire the state lock (https://github.com/canonical/pebble/pull/369), and that's merged and released in Pebble. I've also updated Juju 3.4 to use that version (https://github.com/juju/juju/pull/16981). As I write this, that PR isn't merged yet, but it should be soon. If the PR is merged, you can check out the Juju 3.4 branch and build and use that; if it's not merged, check out the PR branch and build that.

From my testing, I'm almost certain this will fix the issue you're having, but if you are able to test with a custom version of Juju before the next 3.4 version is released, that would be great.

Revision history for this message
Paulo Machado (paulomachado) wrote :

Hi Ben,

I'll give it a go today and report back here. Thank you

Revision history for this message
Paulo Machado (paulomachado) wrote :

Hi all,

on my test setup (where I was getting the timeouts), I'm happy to say that I'm now seeing response times no bigger than 10ms.
I'll keep the tests running today and report back if there are any issues.

Thanks

Revision history for this message
Ben Hoyt (benhoyt) wrote :

Paulo, that's great to hear, thanks for testing it.

Changed in juju:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Ben Hoyt (benhoyt)
milestone: none → 3.4.1
Ben Hoyt (benhoyt)
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Haw Loeung (hloeung) wrote :

@benhoyt, is it possible to backport this to Juju 2.9? I think that would mean backporting the changes in the PR below to Pebble 1.4 and updating Juju to use the latest Pebble 1.4:

https://github.com/canonical/pebble/pull/369

Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Ben Hoyt (benhoyt) wrote :

See also https://github.com/juju/juju/pull/17037, where the Juju team bumped the failure threshold for liveness probes from 1 to 3 (the failure threshold for readiness and startup probes stays at 1). That PR also adds a separate startup probe pointing at the same endpoint as the liveness check: /v1/health?level=alive.
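
To check the probe settings a given Juju version actually applies (a sketch; the application name and namespace are placeholders):

# dump the liveness and startup probes from the StatefulSet template
kubectl -n <model> get statefulset mysql-k8s -o jsonpath='{.spec.template.spec.containers[*].livenessProbe}{"\n"}{.spec.template.spec.containers[*].startupProbe}{"\n"}'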
