Switching primaries in a Juju controller caused CrashLoopBackOff in some pods

Bug #2039418 reported by Tom Haddon
This bug affects 3 people
Affects: Canonical Juju
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Earlier today we had an issue with high load on a Juju controller. We switched primaries, and a number of k8s pods (not all) went into CrashLoopBackOff, with the following being the entirety of the log output in the charm-init container:

ERROR opening "/charm/bin/containeragent" for writing: open /charm/bin/containeragent: text file busy

Deleting the pods (or triggering a charm upgrade, which leads to a rescheduling of the pods) seems to fix the issue.
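
For reference, the workaround looks roughly like this (a minimal sketch; the pod and namespace names are placeholders):

$ kubectl get pods -n <namespace> | grep CrashLoopBackOff   # find the affected pods
$ kubectl delete pod <unit-0> -n <namespace>                # the StatefulSet recreates the pod and it comes up cleanly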

The controller in this case was running Juju 2.9.44.

Tags: canonical-is

Kian Parvin (kian-parvin) wrote:

Similarly, in our case we have a k8s charm deployed against the same controller, and we saw the number of ready pods in the StatefulSet drop to 0 when the Juju controller primary was changed.

Looking at our Grafana dashboards, the number of ready pods dropped almost to the minute that `rs.stepDown(120)` was run on the primary, and then started coming back up 4-5 minutes later. In this case the pods didn't enter CrashLoopBackOff, though.
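
For context, the step-down itself was nothing unusual (a minimal sketch, assuming a mongo shell already connected to the controller's replica set; connection details omitted):

rs.status()        // confirm which member is currently PRIMARY
rs.stepDown(120)   // ask the primary to step down and not seek re-election for 120 seconds
rs.status()        // check that a new PRIMARY has been elected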

This behaviour is identical to what we see when the Juju controllers are restarted, as mentioned in https://bugs.launchpad.net/juju/+bug/2036594

Here is the output from `juju debug-log --replay` around the time. I've removed the controller IPs, but out of an abundance of caution it's a Canonical-only pastebin: https://pastebin.canonical.com/p/hQC53MqGr3/

And finally, the logs from the charm-init container don't seem all that helpful:
$ kubectl logs <unit-0> -n <namespace> -c charm-init
starting containeragent init command
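
If useful, here are a couple of extra commands to capture more detail the next time this happens (same placeholder names as above):

$ kubectl logs <unit-0> -n <namespace> -c charm-init --previous   # logs from the previous, crashed charm-init instance
$ kubectl describe pod <unit-0> -n <namespace>                    # restart counts, last state and exit code per container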

Thomas Miller (tlmiller) wrote:

Thanks for the information, everyone. I have had a look through some logs with Tom, and for the time being I don't think the CrashLoopBackOff is related to Juju; it seems related to the underlying infrastructure. That is not to say we shouldn't still work to rule it out.

As for the pods restarting along with the controller, that is a bug that we need to deal with.

Haw Loeung (hloeung)
Changed in juju:
status: New → Confirmed

Tom Haddon (mthaddon) wrote:

Found another instance of this on a wordpress model today. I'm not convinced this is related to the underlying infrastructure; per the title of the bug, it seems to happen when primaries switch. Deleted the pod to restore service.
