FileNotFoundError as part of container.exec inside a pebble poll hook

Bug #1989076 reported by Arturo Enrique Seijas Fernández
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

This pastebin illustrates the problem best: https://paste.ubuntu.com/p/pNxcZCvnRK/

We have an indico-nginx-pebble-ready hook that's firing and resulting in a FileNotFoundError when pebble tries to connect through the socket, as can be seen in the pastebin above.

Unfortunately, we can't give instructions to reproduce the issue but we have two deployments in which this is happening recurrently.

I provide a link to the charm sources in case it's useful: https://github.com/canonical/indico-operator/blob/main/src/charm.py
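
For context, the failing code path is a pebble-ready handler that execs into the workload container. A minimal sketch of that pattern, assuming the ops framework APIs (the exec command and the defer-on-error policy are illustrative, not taken from the linked charm):

```python
import ops
from ops import pebble


class IndicoOperatorCharm(ops.CharmBase):
    """Sketch of a charm that execs a command once the workload container is ready."""

    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(
            self.on.indico_nginx_pebble_ready, self._on_indico_nginx_pebble_ready
        )

    def _on_indico_nginx_pebble_ready(self, event: ops.PebbleReadyEvent) -> None:
        container = event.workload
        try:
            # exec() goes through Pebble's unix socket; if the container has just
            # been restarted or rescheduled, that socket can be missing, which is
            # what surfaces as FileNotFoundError inside the hook.
            process = container.exec(["nginx", "-t"])  # illustrative command
            process.wait_output()
        except (pebble.ConnectionError, FileNotFoundError):
            # One way to cope: defer and retry on a later event instead of erroring.
            event.defer()
            return
        self.unit.status = ops.ActiveStatus()
```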

Tags: canonical-is
Tom Haddon (mthaddon)
tags: added: canonical-is
Revision history for this message
Ben Hoyt (benhoyt) wrote :

I believe John Meinel's diagnosis (with Tom Haddon) on Mattermost explains this (please re-open if not):

> 1) pebble health checks normally take about 10ms to complete (running in a loop w/ curl) so 1s does seem generous there
> 2) We have evidence in one of the pods that died that 'normally' does not mean 'always'. We had:
> GET /...task?=timeout=4s 4.010
> GET /...task?=timeout=4s 7.603
>
> IOW, while normally Pebble responded very fast after the 4s timeout, there were times when it suddenly took >3.6s for it to actually respond.
> 3) The current thought is that the k8s worker node is using Ceph backed by spinning rust and that is causing latency in the system. And while IS can certainly not poke the bear, all of our tooling should be cognizant of that kind of inconsistency, and not fall over when it happens.
> 4) I think for Health checks, it is probably fine to just change it from failure=1 to failure=3, so that one blip doesn't immediately restart containers (these restarts are what explains the ConnectionError no socket found, because when K8s rescheduled one of the containers, its socket goes away for a bit)
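
Point 4 appears to refer to the Kubernetes liveness probe that Juju configures against Pebble's health endpoint, so a change from failure=1 to failure=3 would land on Juju's side. For checks a charm defines itself, Pebble's layer format exposes the same idea via `threshold`; a minimal sketch, with the check name, URL and timing values purely illustrative:

```python
from ops import pebble

# `threshold` plays the role of the probe's failure count: the check only flips
# to "down" after this many consecutive failures, so a single slow response
# does not immediately count as a dead workload.
layer = pebble.Layer(
    {
        "summary": "nginx layer",
        "checks": {
            "nginx-up": {
                "override": "replace",
                "level": "alive",
                "period": "10s",
                "timeout": "3s",
                "threshold": 3,
                "http": {"url": "http://localhost:8080/health"},
            }
        },
    }
)
```

A charm would apply this with something like `container.add_layer("nginx", layer, combine=True)` followed by `container.replan()`.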

Revision history for this message
Ben Hoyt (benhoyt) wrote :

Well, there are a couple of issues to fix here (e.g. we may change failure=1 to failure=3), so I'll leave this open to represent that.

Changed in juju:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Tom Haddon (mthaddon) wrote :

Hi Ben,

I'm not quite sure what you mean by "please re-open if not". That sounds to me like you're closing the issue, but you haven't closed the issue. And I agree with the comments you added here from John's diagnosis, but that's an explanation of the issue, not a fix for it, as I understand it. Can you clarify?

It seems to me we may have a problem where, if a check takes more than 4 seconds, process.wait_output will fail, but the pebble health check may also fail, causing containers to be rescheduled. That seems problematic: shouldn't the health check execution time be independent of whatever pebble is doing elsewhere in terms of executing commands in the container or waiting for command execution to finish?

Thanks, Tom
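
Regardless of how the health-check question is resolved, a charm can be defensive about slow or failing execs by putting an explicit timeout on the exec and catching the Pebble errors rather than letting the hook crash. A rough sketch, assuming the ops pebble API; the helper name, timeout value and logging policy are illustrative:

```python
import logging
from typing import List, Optional

from ops import pebble

logger = logging.getLogger(__name__)


def run_in_workload(container, command: List[str], timeout: float = 30.0) -> Optional[str]:
    """Run a command via Pebble, tolerating slow commands and a missing socket."""
    try:
        process = container.exec(command, timeout=timeout)
        stdout, _ = process.wait_output()
        return stdout
    except pebble.ExecError as e:
        # The command ran but exited non-zero.
        logger.warning("%s failed with exit code %s", command, e.exit_code)
    except pebble.ChangeError as e:
        # The exec change itself failed on the Pebble side.
        logger.warning("exec change failed: %s", e.err)
    except (pebble.ConnectionError, FileNotFoundError):
        # Pebble socket not available, e.g. the container was just (re)scheduled.
        logger.warning("cannot reach Pebble yet; will retry on a later hook")
    return None
```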

Revision history for this message
Tom Haddon (mthaddon) wrote :

I've tried to reproduce this issue locally by modifying the charm to do a "sleep 10" as part of an update-status hook. However, this doesn't seem to cause container restarts:

https://paste.ubuntu.com/p/p3M47RdQTc/

In this case, pebble appears to be retrying the check until it fails. However, looking at the logs for the environment where this problem is occurring, I see something similar, e.g.:

2022-09-07T08:13:22.934Z [pebble] GET /v1/changes/3/wait?timeout=4.000s 4.000201612s 504
2022-09-07T08:13:25.319Z [pebble] GET /v1/changes/3/wait?timeout=4.000s 2.383641707s 200

2022-09-07T08:13:30.000Z [pebble] GET /v1/changes/4/wait?timeout=4.000s 4.001198157s 504
2022-09-07T08:13:30.801Z [pebble] GET /v1/changes/4/wait?timeout=4.000s 800.091334ms 200

So it doesn't appear to be as straightforward as "if pebble is busy executing something, health checks will fail".
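
For reference, those paired log lines are Pebble's long-poll on the change created for the exec: the client calls /v1/changes/<id>/wait with a roughly 4 s per-request timeout, gets a 504 if the change has not finished within that window, and simply issues the same request again, so a 504 on its own does not fail anything. A rough sketch of that retry loop, assuming raw access to the Pebble socket and the usual {"result": ...} response envelope (this approximates what the ops Pebble client does internally, not something a charm would normally write):

```python
import http.client
import json
import socket


class _UnixHTTPConnection(http.client.HTTPConnection):
    """http.client connection that talks over Pebble's unix socket."""

    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self._socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self._socket_path)
        self.sock = sock


def wait_for_change(socket_path: str, change_id: str) -> dict:
    """Long-poll /v1/changes/<id>/wait, retrying on 504 as in the log lines above."""
    while True:
        conn = _UnixHTTPConnection(socket_path)
        try:
            conn.request("GET", f"/v1/changes/{change_id}/wait?timeout=4.000s")
            resp = conn.getresponse()
            body = resp.read()
        finally:
            conn.close()
        if resp.status == 504:
            continue  # change not done within this request's window; ask again
        return json.loads(body)["result"]
```

From inside the charm container of a sidecar charm, the socket path would be something like /charm/containers/indico-nginx/pebble.socket (an assumption based on the standard sidecar layout).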

Revision history for this message
Weii Wang (weii-wang) wrote :

I created a timeline of the events happening during the crashloop, and can confirm that the health check timeout is the reason for the container restarting. Health checks sometimes take more than 1 second to finish; the timing of that appears to be random.

https://paste.ubuntu.com/p/b2G8hQndZN/

Revision history for this message
Tom Haddon (mthaddon) wrote :

Fwiw, we discovered that the k8s worker this was running on was using a root disk on slow Ceph (Ceph backed by HDDs). We switched it to a Ceph pool backed by SSDs and the problems disappeared. This confirms the suspicion that it's not an inherent problem with juju/pebble, but one that only manifests under certain poor infrastructure performance conditions.
