Kubernetes: compute hosts run out of memory and reboot
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | High | Jim Gauld |
Bug Description
Brief Description
-----------------
While testing a 2+2+2 Kubernetes configuration, I saw compute-0 reboot spontaneously. Maintenance rebooted the host due to a heartbeat failure, but the logs on compute-0 suggest the underlying cause was that the host ran out of memory and the OOM killer kicked in. The host had been up for less than 13 hours.
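That the OOM killer fired can be confirmed from the kernel logs on the rebooted host; a minimal check, assuming stock CentOS log locations (the dmesg ring buffer is cleared at reboot, so /var/log/messages carries the pre-reboot evidence):
compute-0:~$ dmesg -T | grep -iE 'out of memory|oom-killer'
compute-0:~$ sudo grep -iE 'oom-killer|Killed process' /var/log/messages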
Running memtop on either compute host shows the Avail column dropping by more than 130 MiB every 10 minutes. For example:
compute-0:~$ memtop --delay=30 --repeat 10000
memtop 0.1 -- selected options: delay = 30.000s, repeat = 10000, period = 300000.000s, non-strict, unit = MiB
yyyy-mm-dd hh:mm:ss.fff Tot Used Free Ca Buf Slab CAS CLim Dirty WBack Anon Avail 0:Avail 0:HFree 1:Avail 1:HFree
2019-02-06 21:42:02.213 128726.2 112389.3 14135.2 1386.1 75.6 2620.4 7924.8 11341.1 0.1 0.0 2688.4 16336.9 10711.9 48640.0 5625.0 54844.0
2019-02-06 21:42:32.213 128726.2 112392.8 14130.4 1386.1 75.7 2626.0 7924.5 11341.1 0.1 0.0 2688.5 16333.5 10710.4 48640.0 5623.1 54844.0
2019-02-06 21:43:02.213 128726.2 112400.4 14121.4 1386.1 75.8 2631.6 7927.9 11341.1 0.1 0.0 2690.4 16325.8 10707.2 48640.0 5618.7 54844.0
2019-02-06 21:43:32.214 128726.2 112404.0 14116.4 1386.2 75.9 2637.0 7928.0 11341.1 0.1 0.0 2690.3 16322.2 10706.7 48640.0 5615.5 54844.0
2019-02-06 21:44:02.214 128726.2 112415.0 14104.2 1386.2 76.0 2644.0 7929.5 11341.1 0.1 0.0 2693.8 16311.3 10700.7 48640.0 5610.5 54844.0
2019-02-06 21:44:32.214 128726.2 112420.5 14097.4 1386.3 76.1 2649.5 7929.9 11341.1 0.1 0.0 2693.5 16305.8 10698.9 48640.0 5606.8 54844.0
2019-02-06 21:45:02.215 128726.2 112432.9 14083.5 1386.3 76.2 2655.3 7943.6 11341.1 0.1 0.0 2698.9 16293.3 10691.4 48640.0 5602.0 54844.0
2019-02-06 21:45:32.215 128726.2 112433.5 14081.5 1386.3 76.2 2661.1 7943.4 11341.1 0.1 0.0 2699.7 16292.7 10692.3 48640.0 5600.4 54844.0
2019-02-06 21:46:02.215 128726.2 112443.5 14069.8 1386.4 76.3 2667.7 7944.3 11341.1 0.1 0.0 2700.7 16282.7 10688.5 48640.0 5594.2 54844.0
2019-02-06 21:46:32.216 128726.2 112446.7 14065.4 1386.4 76.4 2672.5 7944.3 11341.1 0.1 0.0 2699.5 16279.5 10687.1 48640.0 5592.4 54844.0
2019-02-06 21:47:02.216 128726.2 112459.8 14050.9 1386.4 76.5 2679.0 7950.0 11341.1 0.1 0.0 2705.3 16266.4 10682.3 48640.0 5584.1 54844.0
2019-02-06 21:47:32.216 128726.2 112464.7 14045.2 1386.5 76.6 2683.5 7949.8 11341.1 0.1 0.0 2706.7 16261.5 10679.3 48640.0 5582.2 54844.0
2019-02-06 21:48:02.217 128726.2 112477.1 14031.0 1386.5 76.7 2690.8 7957.1 11341.1 0.1 0.0 2711.0 16249.1 10670.7 48640.0 5578.4 54844.0
2019-02-06 21:48:32.217 128726.2 112479.1 14027.7 1386.5 76.7 2696.4 8039.1 11341.1 0.1 0.0 2710.6 16247.1 10670.6 48640.0 5577.0 54844.0
2019-02-06 21:49:02.217 128726.2 112486.7 14018.6 1386.6 76.8 2701.0 7962.6 11341.1 0.1 0.0 2711.6 16239.5 10664.3 48640.0 5575.1 54844.0
2019-02-06 21:49:32.218 128726.2 112489.9 14014.2 1386.6 76.9 2706.8 7959.0 11341.1 0.1 0.0 2711.9 16236.3 10664.0 48640.0 5572.3 54844.0
2019-02-06 21:50:02.218 128726.2 112515.3 13987.5 1386.6 77.0 2712.8 7973.9 11341.1 0.1 0.0 2730.5 16210.9 10645.4 48640.0 5565.5 54844.0
2019-02-06 21:50:32.218 128726.2 112517.4 13984.1 1386.7 77.1 2718.1 7973.4 11341.1 0.1 0.0 2730.3 16208.8 10643.8 48640.0 5565.1 54844.0
2019-02-06 21:51:02.218 128726.2 112526.5 13973.7 1386.7 77.2 2724.7 7973.7 11341.1 0.1 0.0 2730.6 16199.7 10639.4 48640.0 5560.3 54844.0
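For reference, the leak rate can be computed directly from the Avail column (field 14) of a saved capture; a rough sketch, assuming the output above was saved to memtop.log (hypothetical filename):
compute-0:~$ awk '/^20/ { if (!n) first = $14; last = $14; n++ }
    END { d = first - last; m = (n - 1) * 30 / 60    # 30 s between samples
          printf "%.1f MiB in %.0f min (~%.0f MiB per 10 min)\n", d, m, d / m * 10
    }' memtop.log
137.2 MiB in 9 min (~152 MiB per 10 min)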
On this compute host, the problem pod appears to be garbd; it was created here:
[kubelet/Calico log excerpt, truncated in the original report: 2019-02-xx entries showing kubernetes.io/secret volumes being mounted and a Calico workload endpoint (…dcb95d7--wjv55-eth0, osh-openstack-…, ID="a994bf8b0f9…") being created for the pod]
It looks like the pod was deleted here:
[kubelet/Calico log excerpt, truncated in the original report: 2019-02-xx entries showing the pod's secret volumes being unmounted, volumes detached on node "compute-0" (DevicePath ""), and the corresponding Calico v3 WorkloadEndpoint object (osh-openstack-…) being deleted]
Ever since then, the following logs have been appearing:
[repeated 2019-02-xx log entries, truncated in the original report, each summarized as "1 errors similar to this. Turn up verbosity to see them."]
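The "Turn up verbosity" wording matches the rate-limited error summary that kubelet prints; if these are indeed kubelet messages on a systemd-managed node, the suppressed errors can be surfaced by raising the log level (the --v flag is standard kubelet; where its arguments live on this host is an assumption):
compute-0:~$ # add --v=4 to the kubelet arguments (file location varies by setup)
compute-0:~$ sudo systemctl restart kubelet
compute-0:~$ journalctl -u kubelet --since '10 minutes ago' | tail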
An upstream bug report that seems to describe this issue (it hasn’t been fixed yet):
https:/
Severity
--------
Major: System/Feature is usable but degraded
Steps to Reproduce
------------------
Not sure what triggered the issue.
Expected Behavior
------------------
Compute hosts should not run out of memory over time.
Actual Behavior
----------------
Compute hosts run out of memory and reboot after approximately 12 hours.
Reproducibility
---------------
Intermittent - not seen in all labs.
System Configuration
--------------------
2+2+2 system
Branch/Pull Time/Commit
-----------------------
###
### StarlingX
### Release 19.01
###
OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="f/stein"
JOB="STX_
<email address hidden>"
BUILD_NUMBER="40"
BUILD_HOST=
BUILD_DATE=
Timestamp/Logs
--------------
See above
Changed in starlingx:
assignee: Chris Friesen (cbf123) → Jim Gauld (jgauld)
tags: added: stx.2.0; removed: stx.2019.05
Marking as release gating; issue related to container env.