jujuc unable to execute in k8s charm container

Bug #2051179 reported by Danny Cocks
This bug affects 2 people
Affects: Canonical Juju
Status: Triaged
Importance: Medium
Assigned to: Ian Booth

Bug Description

For some unknown reason, a customer reported to us that a unit was in a bad state. When we investigated, we found that the charm hooks were unable to run properly: whenever they invoked `juju-log`, it died with a SIGKILL.

We found no connection to an out-of-memory kill or any other cause coming from the k8s host. Nothing relevant was reported in the host's kern.log or syslog.
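
For reference, the check was along these lines on the host (standard Ubuntu log locations assumed; adjust paths for your node image):

  # Look for OOM-killer activity in the kernel ring buffer and persisted logs
  dmesg -T | grep -iE 'oom|killed process'
  grep -iE 'oom|killed process' /var/log/kern.log /var/log/syslog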

By manually doing a `kubectl exec -it ... -- /bin/bash` into the container, I was able to run `/charm/bin/jujuc`, and it died immediately with the message "KILLED". On a different pod which was working properly, it instead exited with the message "ERROR JUJU_CONTEXT_ID not set".
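
To reproduce the check (pod and namespace names below are placeholders; the container name `charm` matches the describe output further down):

  # Exec into the charm container of the affected unit
  kubectl exec -it -n <model-namespace> <unit-pod> -c charm -- /bin/bash
  # Inside the container, run the binary directly
  /charm/bin/jujuc
  # Broken pod: dies immediately with "Killed"
  # Healthy pod: exits with "ERROR JUJU_CONTEXT_ID not set"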

The "fix" was to simply restart the pod, however we don't know why it entered this state in the first place.

This happened for 3 pods, corresponding to the charms:
- jupyter-ui (rev 25)
- seldon-core (rev 354)
- training-operator (rev 215)

I tried some more debugging:
- Running `/charm/bin/pebble` manually in the container did not provoke the same KILL result. It exited with the expected help text.
- Installing strace and running `strace jujuc` showed that the process was being KILLed very early, even before it had a chance to set up its signal handlers.
- Running `strace --inject=all:delay_enter=10000 ./jujuc` slowed down the syscalls, and the process was KILLED right after the first one (see the sketch after this list).
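
For completeness, the strace steps were roughly as follows, run inside the charm container (strace was installed ad hoc; the binary path is the same as above):

  # Install strace inside the container (Ubuntu base image)
  apt-get update && apt-get install -y strace
  # Trace the binary; the SIGKILL arrives before any signal handlers are installed
  strace /charm/bin/jujuc
  # Delay every syscall by 10ms on entry; the process is killed right after the first one
  strace --inject=all:delay_enter=10000 /charm/bin/jujuc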

Could this be due to pebble? Or an odd configuration issue with the kubelet?

Revision history for this message
Danny Cocks (dannycocks) wrote :

I forgot the Juju versions here. This is on a k8s model with juju agent version 2.9.43.

Revision history for this message
Danny Cocks (dannycocks) wrote :

In case it helps, here is the charm container info from kubectl describe.

Containers:
  charm:
    Container ID: containerd://3f0fe1e29d4be42b1346d18c835232d245203d5fd8e282d09b47078103015505
    Image: jujusolutions/charm-base:ubuntu-20.04
    Image ID: docker.io/jujusolutions/charm-base@sha256:65645e2aaa0632da3e4f79d4a02ed32f891f049858cb1a5b5a8ad4cf3628f726

Revision history for this message
Joseph Phillips (manadart) wrote :

Assigning this one to Ian for further comment.

Being very hand-wavy: is it possible that some combination of hook deferral or restart, combined with the behaviour in [1], might result in a strange loop like this?

[1] https://github.com/juju/juju/pull/15224

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Ian Booth (wallyworld)
Revision history for this message
Thomas Miller (tlmiller) wrote :

Harry and I are currently investigating another case of this.

Revision history for this message
Thomas Miller (tlmiller) wrote :

We did some debugging on this today with Harry. Looking at the SOS report from one of the affected nodes, we can see the Grafana charm repeatedly being hit with SIGKILL on the juju debug-log command. Running the command manually under strace, we can also see that the program doesn't even reach its main function before being killed.

The kernel logs look very empty, and we currently suspect that if we can get the full version of these logs, they will contain the reason why this is happening.

There do not appear to be any Juju code paths that could be causing this at the moment.

The problem also goes away when the pod is scheduled to another node, so at the moment our thinking is that this is node-specific.
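
For anyone hitting this, the rescheduling workaround is along these lines (node and pod names are placeholders):

  # Keep new pods off the suspect node while investigating
  kubectl cordon <affected-node>
  # Delete the pod so its controller recreates it on another node
  kubectl delete pod -n <model-namespace> <unit-pod>
  # Re-enable scheduling once done
  kubectl uncordon <affected-node>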
