Comment 1 for bug 1989076

Revision history for this message
Ben Hoyt (benhoyt) wrote :

I believe John Meinel's diagnosis (with Tom Haddon) on Mattermost explains this (please re-open if not):

> 1) pebble health checks normally take about 10ms to complete (running in a loop w/ curl) so 1s does seem generous there
> 2) We have evidence in one of the pods that died that 'normally' does not mean 'always'. We had:
> GET /...task?timeout=4s 4.010
> GET /...task?timeout=4s 7.603
>
> IOW, while Pebble normally responded almost immediately after the 4s timeout, there were times when it suddenly took >3.6s beyond that to actually respond.
> 3) The current thought is that the k8s worker node is using Ceph backed by spinning rust and that is causing latency in the system. And while IS can certainly not poke the bear, all of our tooling should be cognizant of that kind of inconsistency, and not fall over when it happens.
> 4) I think for health checks it is probably fine to just change from failure=1 to failure=3, so that one blip doesn't immediately restart containers. (These restarts are what explain the "ConnectionError: no socket found" errors: when K8s reschedules one of the containers, its socket goes away for a bit.)
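
For reference, here is a rough sketch of what the failure=1 → failure=3 change looks like at the Kubernetes level, assuming the health check is wired up as an HTTP liveness probe against Pebble's health endpoint. The port and timing values below are illustrative placeholders, not the actual charm/Juju settings:

```yaml
livenessProbe:
  httpGet:
    path: /v1/health   # Pebble's health endpoint
    port: 38813        # illustrative port only
  timeoutSeconds: 1    # each probe still has a 1s budget
  periodSeconds: 5     # illustrative probe interval
  failureThreshold: 3  # was effectively 1; now three consecutive
                       # failures are needed before a restart
```

With `failureThreshold: 3`, a single slow response (like the 7.6s outlier above) no longer triggers an immediate container restart; the probe has to fail three times in a row first.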