Kubernetes master lost all configuration/reset after OOM

Bug #1816635 reported by Tom Haddon
This bug affects 1 person
Affects: Kubernetes Control Plane Charm
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

About 11 hours ago, a relatively newly provisioned k8s cluster ran out of memory on the k8s-master (it was running in a VM with only 2GB of RAM and was the only master in the cluster). We can see a number of tracebacks in syslog, and the end result was that the cluster config was entirely reset: secrets, applications, and pods that had been created all disappeared.

Please let me know which logs you'd like to see to help figure out what the problem was.

This appears to be the first interesting log entry in syslog on the kubernetes-master from the time of the incident: https://pastebin.canonical.com/p/4Hjy8Q8tK7/
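
For reference, the OOM events themselves can usually be confirmed from the kernel log on the master; a rough sketch along these lines (paths are the Ubuntu defaults, adjust as needed):

    # Kernel OOM-killer messages, if any
    dmesg -T | grep -i -E 'out of memory|oom-kill'

    # The same events as recorded in syslog, plus surrounding service tracebacks
    grep -i -E 'out of memory|oom-kill' /var/log/syslog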

Revision history for this message
Mike Wilson (knobby) wrote :

The official CDK bundles have a constraint on the master nodes for 4 gigs of memory. This is required to prevent the master node from running out of memory. I think this is just a case of not provisioning enough memory for the master.
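
For anyone hitting this, the memory constraint can also be applied explicitly at deploy time; a minimal illustration, assuming the application is named kubernetes-master as in the CDK bundles:

    # Deploy the master with at least 4G of RAM
    juju deploy kubernetes-master --constraints "mem=4G"

    # Or raise the constraint on an existing application (applies to units added afterwards)
    juju set-constraints kubernetes-master mem=4G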

Revision history for this message
Tom Haddon (mthaddon) wrote :

Ok, thanks. Sounds like an invalid bug.

Changed in charm-kubernetes-master:
status: New → Invalid
Revision history for this message
Dean Henrichsmeyer (dean) wrote : Re: [Bug 1816635] Re: Kubernetes master ran out of memory, and ended up resetting the cluster

Let’s find out why that happened. Out of memory shouldn’t catastrophically kill a cluster like that. There are any number of reasons for an OOM. Everything else disappearing is scary.

Joel Sing (jsing)
Changed in charm-kubernetes-master:
status: Invalid → New
summary: - Kubernetes master ran out of memory, and ended up resetting the cluster
+ Kubernetes master lost all configuration/reset
Revision history for this message
Tom Haddon (mthaddon) wrote : Re: Kubernetes master lost all configuration/reset

It sounds like we do want to track down why this particular failure case caused a cluster reset, as even with more RAM provisioned, this could happen for other reasons (other processes on the master running out of control, for instance).

Also, if the recommendation of 4G RAM minimum remains, could we have the charms/juju register an error in juju status if you try to provision a kubernetes master with less than this?
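
As a rough sketch of what such a check could look like inside a charm hook (hypothetical snippet, not current charm code; the ~4G threshold and message are illustrative), using the standard status-set hook tool:

    #!/bin/sh
    # Hypothetical guard: block the unit if it has less than roughly 4G of RAM.
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    if [ "$mem_kb" -lt 3900000 ]; then
        status-set blocked "kubernetes-master needs at least 4G of RAM"
    fi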

Mike Wilson (knobby)
Changed in charm-kubernetes-master:
assignee: nobody → Mike Wilson (knobby)
tags: added: ci-regression-test
Revision history for this message
Mike Wilson (knobby) wrote :

I agree that the charms should tell you if the machine isn't up to spec.

As for OOM killing, can you tell me more about your cluster? I have a cluster running on 2 gigs and I've thrown a bunch of helm charts at it, but so far things are OK. Was there something special about your setup? Did it take hundreds of pods before you noticed the issue?

Revision history for this message
Tom Haddon (mthaddon) wrote :

Sorry for the delayed response. We'd been running it for about a week with one application, and had done about 20 deployments (image updates) to that application. The cluster wasn't heavily loaded as far as we could tell.

Changed in charm-kubernetes-master:
assignee: Mike Wilson (knobby) → nobody
George Kraft (cynerva)
summary: - Kubernetes master lost all configuration/reset
+ Kubernetes master lost all configuration/reset after OOM
Changed in charm-kubernetes-master:
importance: Undecided → Critical
status: New → Triaged
Revision history for this message
George Kraft (cynerva) wrote :

I doubt we will have any luck reproducing this, but we can give it a try.

George Kraft (cynerva)
Changed in charm-kubernetes-master:
importance: Critical → Medium