Kubernetes master lost all configuration/reset after OOM

Bug #1816635 reported by Tom Haddon
This bug affects 1 person
Affects: Kubernetes Control Plane Charm
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

About 11 hours ago, a relatively newly provisioned k8s cluster ran out of memory on the k8s-master (it was running in a VM with only 2GB of RAM and was the only master in the cluster). We can see a number of tracebacks in syslog, and the end result was that the cluster config was entirely reset: secrets, applications, and pods that had been created all disappeared.

Please let me know which logs you'd like to see to help figure out what the problem was.

This appears to be the first interesting log entry in syslog on the kubernetes-master from the time of the incident: https://pastebin.canonical.com/p/4Hjy8Q8tK7/
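
For reference, the OOM events themselves can usually be confirmed from the kernel log on the master; a rough sketch along these lines (paths are the Ubuntu defaults, adjust as needed):

    # Kernel OOM-killer messages, if any
    dmesg -T | grep -i -E 'out of memory|oom-kill'

    # The same events as recorded in syslog, plus surrounding service tracebacks
    grep -i -E 'out of memory|oom-kill' /var/log/syslog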

Revision history for this message
Mike Wilson (knobby) wrote :

The official CDK bundles have a constraint on the master nodes for 4 gigs of memory. This is required to prevent the master node from running out of memory. I think this is just a case of not provisioning enough memory for the master.
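
For anyone hitting this, the memory constraint can also be applied explicitly at deploy time; a minimal illustration, assuming the application is named kubernetes-master as in the CDK bundles:

    # Deploy the master with at least 4G of RAM
    juju deploy kubernetes-master --constraints "mem=4G"

    # Or raise the constraint on an existing application (applies to units added afterwards)
    juju set-constraints kubernetes-master mem=4G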

Revision history for this message
Tom Haddon (mthaddon) wrote :

Ok, thanks. Sounds like an invalid bug.

Changed in charm-kubernetes-master:
status: New → Invalid
Revision history for this message
Dean Henrichsmeyer (dean) wrote : Re: [Bug 1816635] Re: Kubernetes master ran out of memory, and ended up resetting the cluster

Let’s find out why that happened. Out of memory shouldn’t catastrophically kill a cluster like that. There are any number of reasons for an OOM. Everything else disappearing is scary.

Joel Sing (jsing)
Changed in charm-kubernetes-master:
status: Invalid → New
summary: - Kubernetes master ran out of memory, and ended up resetting the cluster
+ Kubernetes master lost all configuration/reset
Revision history for this message
Tom Haddon (mthaddon) wrote : Re: Kubernetes master lost all configuration/reset

It sounds like we do want to track down why this particular failure case caused a cluster reset, as even with more RAM provisioned, this could happen for other reasons (other processes on the master running out of control, for instance).

Also, if the recommendation of 4G RAM minimum remains, could we have the charms/juju register an error in juju status if you try to provision a kubernetes master with less than this?
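
As a rough sketch of what such a check could look like inside a charm hook (hypothetical snippet, not current charm code; the ~4G threshold and message are illustrative), using the standard status-set hook tool:

    #!/bin/sh
    # Hypothetical guard: block the unit if it has less than roughly 4G of RAM.
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    if [ "$mem_kb" -lt 3900000 ]; then
        status-set blocked "kubernetes-master needs at least 4G of RAM"
    fi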

Mike Wilson (knobby)
Changed in charm-kubernetes-master:
assignee: nobody → Mike Wilson (knobby)
tags: added: ci-regression-test
Revision history for this message
Mike Wilson (knobby) wrote :

I agree that the charms should tell you if the machine isn't up to spec.

As for OOM killing, can you tell me more about your cluster? I have a cluster running on 2 gigs and I've thrown a bunch of helm charts at it, but so far things are OK. Was there something special about your setup? Did it take hundreds of pods before you noticed the issue?

Revision history for this message
Tom Haddon (mthaddon) wrote :

Sorry for the delayed response. We'd been running it for about a week with one application, and had done about 20 deployments (image updates) to that application. The cluster wasn't heavily loaded as far as we could tell.

Changed in charm-kubernetes-master:
assignee: Mike Wilson (knobby) → nobody
George Kraft (cynerva)
summary: - Kubernetes master lost all configuration/reset
+ Kubernetes master lost all configuration/reset after OOM
Changed in charm-kubernetes-master:
importance: Undecided → Critical
status: New → Triaged
Revision history for this message
George Kraft (cynerva) wrote :

I doubt we will have any luck reproducing this, but we can give it a try.

George Kraft (cynerva)
Changed in charm-kubernetes-master:
importance: Critical → Medium