Kubernetes master lost all configuration/reset

Bug #1816635 reported by Tom Haddon on 2019-02-19
This bug affects 1 person
Affects: Kubernetes Master Charm
Assigned to: Mike Wilson

Bug Description

About 11 hours ago, a relatively newly provisioned k8s cluster ran out of memory on the k8s-master (it was running in a VM with only 2GB of RAM and was the only master in the cluster). We can see a number of tracebacks in syslog, and the end result was that the cluster config was entirely reset - secrets that had been created disappeared, applications and pods that had been created disappeared.

Please let me know which logs you'd like to see to help figure out what the problem was.

This appears to be the first interesting log entry in syslog on the kubernetes-master from the time of the incident https://pastebin.canonical.com/p/4Hjy8Q8tK7/.

Mike Wilson (knobby) wrote :

The official CDK bundles have a constraint on the master nodes for 4 gigs of memory. This is required to prevent the master node from running out of memory. I think this is just a case of not provisioning enough memory for the master.

Tom Haddon (mthaddon) wrote :

Ok, thanks. Sounds like an invalid bug.

Changed in charm-kubernetes-master:
status: New → Invalid

Let’s find out why that happened. Running out of memory shouldn’t
catastrophically kill a cluster like that. There are any number of reasons
for an OOM. Everything else disappearing is the scary part.

Joel Sing (jsing) on 2019-02-21
Changed in charm-kubernetes-master:
status: Invalid → New
summary: - Kubernetes master ran out of memory, and ended up resetting the cluster
+ Kubernetes master lost all configuration/reset
Tom Haddon (mthaddon) wrote :

It sounds like we do want to track down why this particular failure case caused a cluster reset, as even with more RAM provisioned, this could happen for other reasons (other processes on the master running out of control, for instance).

Also, if the recommendation of 4G RAM minimum remains, could we have the charms/juju register an error in juju status if you try to provision a kubernetes master with less than this?
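
As a rough illustration of the kind of check being requested here, the following Python sketch reads the machine's total memory and maps it to a status a charm could report. This is not the charm's actual code: the function names, the message text, and the exact 4 GiB threshold are assumptions for illustration. The state strings are borrowed from Juju's workload status names ("blocked", "active", "unknown").

```python
from typing import Optional, Tuple

# Assumed threshold: the 4G recommendation from this thread, expressed in KiB.
MIN_MASTER_MEM_KIB = 4 * 1024 * 1024


def parse_mem_total_kib(meminfo_text: str) -> Optional[int]:
    """Return the MemTotal value (in KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line looks like: "MemTotal:        2040000 kB"
            return int(line.split()[1])
    return None


def memory_status(
    meminfo_text: str, minimum_kib: int = MIN_MASTER_MEM_KIB
) -> Tuple[str, str]:
    """Return a (state, message) pair that a charm could surface in juju status."""
    total = parse_mem_total_kib(meminfo_text)
    if total is None:
        return ("unknown", "could not determine total memory")
    if total < minimum_kib:
        return (
            "blocked",
            f"master has {total // 1024} MiB RAM; "
            f"at least {minimum_kib // 1024} MiB recommended",
        )
    return ("active", "memory check passed")


if __name__ == "__main__":
    # Only meaningful on Linux, where /proc/meminfo exists.
    with open("/proc/meminfo") as f:
        print(memory_status(f.read()))
```

One caveat with this sketch: the kernel reserves some memory, so a VM provisioned with exactly 4G will report a MemTotal slightly below 4 GiB. A real check would probably want some tolerance below the nominal threshold.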

Mike Wilson (knobby) on 2019-02-21
Changed in charm-kubernetes-master:
assignee: nobody → Mike Wilson (knobby)
tags: added: ci-regression-test
Mike Wilson (knobby) wrote :

I agree that the charms should tell you if the machine isn't up to spec.

As for oom killing, can you tell me more about your cluster? I have a cluster running on 2 gigs and I've thrown a bunch of helm charts at it, but so far things are ok. Was there something special about your setup? Did it take hundreds of pods before you noticed the issue?

Tom Haddon (mthaddon) wrote :

Sorry for the delayed response. We'd been running it for about a week with one application, and had done about 20 deployments (image updates) to that application. The cluster wasn't heavily loaded as far as we could tell.
