Switching primaries in a Juju controller caused CrashLoopBackOff in some pods

Bug #2039418 reported by Tom Haddon
This bug affects 3 people
Affects: Canonical Juju
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Earlier today we had an issue with high load on a Juju controller. We switched primaries, and a number of k8s pods (not all) went into CrashLoopBackOff, with the following being the entirety of the log output in the charm-init container:

ERROR opening "/charm/bin/containeragent" for writing: open /charm/bin/containeragent: text file busy

Deleting the pods (or triggering a charm upgrade, which leads to a rescheduling of the pods) seems to fix the issue.
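
For reference, the workaround looks roughly like this (a minimal sketch; the pod and namespace names are placeholders):

$ kubectl get pods -n <namespace> | grep CrashLoopBackOff   # find the affected pods
$ kubectl delete pod <unit-0> -n <namespace>                # the StatefulSet recreates the pod and it comes up cleanly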

The controller in this case was running Juju 2.9.44.

Tags: canonical-is

Kian Parvin (kian-parvin) wrote:

Similarly, in our case we have a k8s charm deployed against the same controller, and we saw the number of ready pods in the StatefulSet drop to 0 when the Juju controller primary was changed.

Looking at our Grafana dashboards, the number of ready pods dropped almost to the minute that `rs.stepDown(120)` was run on the primary, and then started coming back up 4-5 minutes later. In this case the pods didn't enter CrashLoopBackOff, though.
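
For context, the step-down itself was nothing unusual (a minimal sketch, assuming a mongo shell already connected to the controller's replica set; connection details omitted):

rs.status()        // confirm which member is currently PRIMARY
rs.stepDown(120)   // ask the primary to step down and not seek re-election for 120 seconds
rs.status()        // check that a new PRIMARY has been elected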

This behaviour is identical to what we see when the Juju controllers are restarted, as mentioned in https://bugs.launchpad.net/juju/+bug/2036594

Here is the output from `juju debug-log --replay` around the time. I've removed the controller IPs, but out of an abundance of caution it's a Canonical-only pastebin: https://pastebin.canonical.com/p/hQC53MqGr3/

And finally, the logs from the charm-init container don't seem all that helpful:
$ kubectl logs <unit-0> -n <namespace> -c charm-init
starting containeragent init command
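
If useful, here are a couple of extra commands to capture more detail the next time this happens (same placeholder names as above):

$ kubectl logs <unit-0> -n <namespace> -c charm-init --previous   # logs from the previous, crashed charm-init instance
$ kubectl describe pod <unit-0> -n <namespace>                    # restart counts, last state and exit code per container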

Thomas Miller (tlmiller) wrote:

Thanks for the information, everyone. I have had a look through some logs with Tom, and for the time being I don't think the CrashLoopBackOff is related to Juju; it seems related to the underlying infrastructure. That is not to say we shouldn't still work to rule it out.

As for the pods restarting along with the controller, that is a bug that we need to deal with.

Haw Loeung (hloeung)
Changed in juju:
status: New → Confirmed

Tom Haddon (mthaddon) wrote:

Found another instance of this on a wordpress model today. I'm not convinced this is related to the underlying infrastructure; per the title of the bug, it seems to happen when primaries switch. Deleted the pod to restore service.
