Kubernetes: compute hosts run out of memory and reboot

Bug #1815106 reported by Frank Miller
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
While testing in a 2+2+2 kubernetes configuration, compute-0 rebooted spontaneously. Maintenance rebooted the host due to a heartbeat failure, but the logs on compute-0 indicate the underlying cause was the host running out of memory and the OOM killer kicking in. The host had been up for less than 13 hours.
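
A quick way to confirm that the OOM killer actually fired on the affected host (a minimal sketch; exact log locations can vary, so treat the commands as illustrative rather than part of the report):

compute-0:~$ dmesg -T | grep -iE "out of memory|oom-killer"
compute-0:~$ journalctl -k --no-pager | grep -i oom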

Running memtop for 10 minutes on either compute host shows a leak of more than 130 MB over that interval, going by the Avail column. For example:

compute-0:~$ memtop --delay=30 --repeat 10000
memtop 0.1 -- selected options: delay = 30.000s, repeat = 10000, period = 300000.000s, non-strict, unit = MiB
yyyy-mm-dd hh:mm:ss.fff Tot Used Free Ca Buf Slab CAS CLim Dirty WBack Anon Avail 0:Avail 0:HFree 1:Avail 1:HFree
2019-02-06 21:42:02.213 128726.2 112389.3 14135.2 1386.1 75.6 2620.4 7924.8 11341.1 0.1 0.0 2688.4 16336.9 10711.9 48640.0 5625.0 54844.0
2019-02-06 21:42:32.213 128726.2 112392.8 14130.4 1386.1 75.7 2626.0 7924.5 11341.1 0.1 0.0 2688.5 16333.5 10710.4 48640.0 5623.1 54844.0
2019-02-06 21:43:02.213 128726.2 112400.4 14121.4 1386.1 75.8 2631.6 7927.9 11341.1 0.1 0.0 2690.4 16325.8 10707.2 48640.0 5618.7 54844.0
2019-02-06 21:43:32.214 128726.2 112404.0 14116.4 1386.2 75.9 2637.0 7928.0 11341.1 0.1 0.0 2690.3 16322.2 10706.7 48640.0 5615.5 54844.0
2019-02-06 21:44:02.214 128726.2 112415.0 14104.2 1386.2 76.0 2644.0 7929.5 11341.1 0.1 0.0 2693.8 16311.3 10700.7 48640.0 5610.5 54844.0
2019-02-06 21:44:32.214 128726.2 112420.5 14097.4 1386.3 76.1 2649.5 7929.9 11341.1 0.1 0.0 2693.5 16305.8 10698.9 48640.0 5606.8 54844.0
2019-02-06 21:45:02.215 128726.2 112432.9 14083.5 1386.3 76.2 2655.3 7943.6 11341.1 0.1 0.0 2698.9 16293.3 10691.4 48640.0 5602.0 54844.0
2019-02-06 21:45:32.215 128726.2 112433.5 14081.5 1386.3 76.2 2661.1 7943.4 11341.1 0.1 0.0 2699.7 16292.7 10692.3 48640.0 5600.4 54844.0
2019-02-06 21:46:02.215 128726.2 112443.5 14069.8 1386.4 76.3 2667.7 7944.3 11341.1 0.1 0.0 2700.7 16282.7 10688.5 48640.0 5594.2 54844.0
2019-02-06 21:46:32.216 128726.2 112446.7 14065.4 1386.4 76.4 2672.5 7944.3 11341.1 0.1 0.0 2699.5 16279.5 10687.1 48640.0 5592.4 54844.0
2019-02-06 21:47:02.216 128726.2 112459.8 14050.9 1386.4 76.5 2679.0 7950.0 11341.1 0.1 0.0 2705.3 16266.4 10682.3 48640.0 5584.1 54844.0
2019-02-06 21:47:32.216 128726.2 112464.7 14045.2 1386.5 76.6 2683.5 7949.8 11341.1 0.1 0.0 2706.7 16261.5 10679.3 48640.0 5582.2 54844.0
2019-02-06 21:48:02.217 128726.2 112477.1 14031.0 1386.5 76.7 2690.8 7957.1 11341.1 0.1 0.0 2711.0 16249.1 10670.7 48640.0 5578.4 54844.0
2019-02-06 21:48:32.217 128726.2 112479.1 14027.7 1386.5 76.7 2696.4 8039.1 11341.1 0.1 0.0 2710.6 16247.1 10670.6 48640.0 5577.0 54844.0
2019-02-06 21:49:02.217 128726.2 112486.7 14018.6 1386.6 76.8 2701.0 7962.6 11341.1 0.1 0.0 2711.6 16239.5 10664.3 48640.0 5575.1 54844.0
2019-02-06 21:49:32.218 128726.2 112489.9 14014.2 1386.6 76.9 2706.8 7959.0 11341.1 0.1 0.0 2711.9 16236.3 10664.0 48640.0 5572.3 54844.0
2019-02-06 21:50:02.218 128726.2 112515.3 13987.5 1386.6 77.0 2712.8 7973.9 11341.1 0.1 0.0 2730.5 16210.9 10645.4 48640.0 5565.5 54844.0
2019-02-06 21:50:32.218 128726.2 112517.4 13984.1 1386.7 77.1 2718.1 7973.4 11341.1 0.1 0.0 2730.3 16208.8 10643.8 48640.0 5565.1 54844.0
2019-02-06 21:51:02.218 128726.2 112526.5 13973.7 1386.7 77.2 2724.7 7973.7 11341.1 0.1 0.0 2730.6 16199.7 10639.4 48640.0 5560.3 54844.0
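
The 130 MB figure comes from the drop in the Avail column (roughly 16336.9 MiB down to 16199.7 MiB over about nine minutes in the sample above). A rough sketch of measuring that drift directly, assuming Avail stays in the 14th column as in the header shown above:

compute-0:~$ memtop --delay=30 --repeat 20 | awk '
    NF>=14 && $3+0>0 { if (!seen) { first=$14; seen=1 } last=$14 }   # data rows only: 3rd field (Tot) is numeric
    END { printf "Avail dropped %.1f MiB over the sample\n", first-last }'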

On this compute host, the problem pod appears to be garbd; it was created here:

2019-02-06T19:06:44.312 compute-0 kubelet[57915]: info I0206 19:06:44.312423 57915 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "osh-openstack-garbd-garbd-token-9tnbq" (UniqueName: "kubernetes.io/secret/57dcc8ff-2a42-11e9-9512-6805ca3a1a98-osh-openstack-garbd-garbd-token-9tnbq") pod "osh-openstack-garbd-garbd-cddcb95d7-wjv55" (UID: "57dcc8ff-2a42-11e9-9512-6805ca3a1a98")
2019-02-06T19:06:44.312 compute-0 kubelet[57915]: info I0206 19:06:44.312459 57915 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "garbd-bin" (UniqueName: "kubernetes.io/configmap/57dcc8ff-2a42-11e9-9512-6805ca3a1a98-garbd-bin") pod "osh-openstack-garbd-garbd-cddcb95d7-wjv55" (UID: "57dcc8ff-2a42-11e9-9512-6805ca3a1a98")
2019-02-06T19:06:44.416 compute-0 systemd[1]: info Started Kubernetes transient mount for /var/lib/kubelet/pods/57dcc8ff-2a42-11e9-9512-6805ca3a1a98/volumes/kubernetes.io~secret/osh-openstack-garbd-garbd-token-9tnbq.
2019-02-06T19:06:44.416 compute-0 systemd[1]: info Starting Kubernetes transient mount for /var/lib/kubelet/pods/57dcc8ff-2a42-11e9-9512-6805ca3a1a98/volumes/kubernetes.io~secret/osh-openstack-garbd-garbd-token-9tnbq.
2019-02-06T19:06:44.779 compute-0 kubelet[57915]: info 2019-02-06 19:06:44.779 [INFO][75660] calico.go 166: Calico CNI found existing endpoint: &{{WorkloadEndpoint projectcalico.org/v3} {compute--0-k8s-osh--openstack--garbd--garbd--cddcb95d7--wjv55-eth0 osh-openstack-garbd-garbd-cddcb95d7- openstack 57dcc8ff-2a42-11e9-9512-6805ca3a1a98 810097 0 2019-02-06 19:06:44 +0000 UTC <nil> <nil> map[projectcalico.org/orchestrator:k8s application:garbd component:server pod-template-hash:cddcb95d7 release_group:osh-openstack-garbd projectcalico.org/namespace:openstack] map[] [] nil [] } {k8s compute-0 osh-openstack-garbd-garbd-cddcb95d7-wjv55 eth0 [] [] [kns.openstack] calibdcd363781d []}} ContainerID="a994bf8b0f963f5b002ed6b1af6e3399df6fde33eed50f263983308ba39a9c81" Namespace="openstack" Pod="osh-openstack-garbd-garbd-cddcb95d7-wjv55" WorkloadEndpoint="compute--0-k8s-osh--openstack--garbd--garbd--cddcb95d7--wjv55-"

Looks like the pod was deleted here:

2019-02-07T07:20:12.664 compute-0 kubelet[57927]: info I0207 07:20:12.664107 57927 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "garbd-bin" (UniqueName: "kubernetes.io/configmap/57dcc8ff-2a42-11e9-9512-6805ca3a1a98-garbd-bin") pod "osh-openstack-garbd-garbd-cddcb95d7-wjv55" (UID: "57dcc8ff-2a42-11e9-9512-6805ca3a1a98")
2019-02-07T07:20:12.664 compute-0 kubelet[57927]: info I0207 07:20:12.664778 57927 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "osh-openstack-garbd-garbd-token-9tnbq" (UniqueName: "kubernetes.io/secret/57dcc8ff-2a42-11e9-9512-6805ca3a1a98-osh-openstack-garbd-garbd-token-9tnbq") pod "osh-openstack-garbd-garbd-cddcb95d7-wjv55" (UID: "57dcc8ff-2a42-11e9-9512-6805ca3a1a98")
2019-02-07T07:20:12.776 compute-0 kubelet[57927]: info I0207 07:20:12.776189 57927 reconciler.go:301] Volume detached for volume "osh-openstack-garbd-garbd-token-9tnbq" (UniqueName: "kubernetes.io/secret/57dcc8ff-2a42-11e9-9512-6805ca3a1a98-osh-openstack-garbd-garbd-token-9tnbq") on node "compute-0" DevicePath ""
2019-02-07T07:20:12.776 compute-0 kubelet[57927]: info I0207 07:20:12.776211 57927 reconciler.go:301] Volume detached for volume "garbd-bin" (UniqueName: "kubernetes.io/configmap/57dcc8ff-2a42-11e9-9512-6805ca3a1a98-garbd-bin") on node "compute-0" DevicePath ""
2019-02-07T07:20:12.911 compute-0 kubelet[57927]: info 2019-02-07 07:20:12.911 [INFO][59320] k8s.go 349: Endpoint deletion will be handled by Kubernetes deletion of the Pod. ContainerID="a994bf8b0f963f5b002ed6b1af6e3399df6fde33eed50f263983308ba39a9c81" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"compute--0-k8s-osh--openstack--garbd--garbd--cddcb95d7--wjv55-eth0", GenerateName:"osh-openstack-garbd-garbd-cddcb95d7-", Namespace:"openstack", SelfLink:"", UID:"57dcc8ff-2a42-11e9-9512-6805ca3a1a98", ResourceVersion:"955208", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63685076804, loc:(*time.Location)(0x1da8ce0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"application":"garbd", "component":"server", "pod-template-hash":"cddcb95d7", "release_group":"osh-openstack-garbd", "projectcalico.org/namespace":"openstack", "projectcalico.org/orchestrator":"k8s"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v3.WorkloadEndpointSpec{Orchestrator:"k8s", Workload:"", Node:"compute-0", ContainerID:"", Pod:"osh-openstack-garbd-garbd-cddcb95d7-wjv55", Endpoint:"eth0", IPNetworks:[]string{"172.16.2.5/32"}, IPNATs:[]v3.IPNAT(nil), IPv4Gateway:"", IPv6Gateway:"", Profiles:[]string{"kns.openstack"}, InterfaceName:"calibdcd363781d", MAC:"", Ports:[]v3.EndpointPort(nil)}}

Ever since then, the following logs have been repeating:

2019-02-07T07:20:22.285 compute-0 kubelet[57927]: info E0207 07:20:22.285418 57927 kubelet_volumes.go:140] Orphaned pod "57dcc8ff-2a42-11e9-9512-6805ca3a1a98" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
2019-02-07T07:20:24.269 compute-0 kubelet[57927]: info E0207 07:20:24.269096 57927 kubelet_volumes.go:140] Orphaned pod "57dcc8ff-2a42-11e9-9512-6805ca3a1a98" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
2019-02-07T07:20:26.274 compute-0 kubelet[57927]: info E0207 07:20:26.274389 57927 kubelet_volumes.go:140] Orphaned pod "57dcc8ff-2a42-11e9-9512-6805ca3a1a98" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.

An upstream bug report that seems to describe this issue (it hasn’t been fixed yet):
https://github.com/kubernetes/kubernetes/issues/60987
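
One way to confirm the orphaned volume paths the kubelet is complaining about (a sketch only; the pod UID is the one from the logs above):

compute-0:~$ POD_UID=57dcc8ff-2a42-11e9-9512-6805ca3a1a98
compute-0:~$ ls -lR /var/lib/kubelet/pods/${POD_UID}/volumes
compute-0:~$ mount | grep ${POD_UID}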

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
Not sure what triggered the issue.

Expected Behavior
------------------
Compute hosts should not run out of memory over time.

Actual Behavior
----------------
Compute hosts run out of memory and reboot after approximately 12 hours.

Reproducibility
---------------
Intermittent - not seen in all labs.

System Configuration
--------------------
2+2+2 system

Branch/Pull Time/Commit
-----------------------
###
### StarlingX
### Release 19.01
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="f/stein"

JOB="STX_build_stein_master"
<email address hidden>"
BUILD_NUMBER="40"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-02-01 19:58:51 +0000"

Timestamp/Logs
--------------
See above

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; issue related to container env.

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.2019.05
Revision history for this message
Chris Friesen (cbf123) wrote :

It's worth noting that this has not been seen in other labs, and after Bart's lab was reinstalled we haven't seen the problem there again yet.

Frank Miller (sensfan22)
Changed in starlingx:
assignee: Chris Friesen (cbf123) → Jim Gauld (jgauld)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/637051

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/637051
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=acefd544f0f02aa348e29a46be925436349e542d
Submitter: Zuul
Branch: master

commit acefd544f0f02aa348e29a46be925436349e542d
Author: Jim Gauld <email address hidden>
Date: Thu Feb 14 15:42:07 2019 -0500

    Mitigate memory leak of sessions by disabling sudo for sriov agent

    The sriov agent was polling devices via 'sudo ip link show',
    and this resulted in a severe memory leak. Each 'sudo' invocation
    goes through the host 'dbus-daemon', and somewhere along that path
    the host does not clean up login sessions.

    Symptoms:
    - gradual run out of memory until system unstable, host spontaneous
      reboot due to delay or OOM
    - huge growth of kernel slab
    - thousands of /sys/fs/cgroup/systemd/user.slice/user-0.slice
      session-x*.scope files with empty 'tasks', i.e., sessions
      that should have been deleted
    - huge latency seen with ssh and various systemd commands

    The problem is mitigated by disabling 'sudo' for sriov agent, using
    a helm override that configures [agent]/root_helper='' .

    Testing:
    - Verified that we could launch a VM with SR-IOV interface;
      VFs were able to set MAC and VLAN attributes.

    Closes-Bug: 1815106

    Change-Id: I0c57629c01b7407c99cc7f38b409019ab87af859
    Signed-off-by: Jim Gauld <email address hidden>
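
For anyone checking whether a host is accumulating the leaked login sessions described in the commit message above, a minimal sketch (the cgroup path is the one named under Symptoms; the commands themselves are illustrative, not part of the fix):

compute-0:~$ ls -d /sys/fs/cgroup/systemd/user.slice/user-0.slice/session-*.scope | wc -l
compute-0:~$ # count session scopes whose 'tasks' file is empty, i.e. sessions that were never cleaned up
compute-0:~$ for d in /sys/fs/cgroup/systemd/user.slice/user-0.slice/session-*.scope; do [ -s "$d/tasks" ] || echo "$d"; done | wc -l
compute-0:~$ loginctl list-sessions | wc -l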

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (f/stein)

Fix proposed to branch: f/stein
Review: https://review.openstack.org/637977

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-config (f/stein)

Change abandoned by Saul Wold (<email address hidden>) on branch: f/stein
Review: https://review.openstack.org/637977
Reason: Scott will provide a correct merge, sorry for the noise here.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (f/stein)

Fix proposed to branch: f/stein
Review: https://review.openstack.org/638217

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (f/stein)

Reviewed: https://review.openstack.org/638217
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=b09d0898b6eaec572be3195ae25ec15413136552
Submitter: Zuul
Branch: f/stein

commit 1c467789c43827321e4319d50065fdbab1be35a2
Author: David Sullivan <email address hidden>
Date: Wed Feb 20 00:49:17 2019 -0500

    Add replica settings for mariadb ingress pod

    There was no mariadb replica override for the ingress pod. On AIO-SX
    this caused two pods to be scheduled. When anti-affinity was added to
    mariadb this broke application-apply on AIO-SX.

    The mariadb ingress pod replication will be set to the number of
    controllers.

    Change-Id: Icf3f1979720629904ca9ddcabf59e8ecfab709e5
    Story: 2004520
    Task: 29570
    Signed-off-by: David Sullivan <email address hidden>

commit ed3c63a06da2cb04b7415cb1b5ba6340c3fa229a
Author: Erich Cordoba <email address hidden>
Date: Tue Feb 19 12:09:42 2019 -0600

    Add DNS requirement for kubernetes and helm.

    `helm init` is being executed before networking and DNS are properly
    configured on the controller. A dependency was added to kubernetes to
    set up DNS first, and the helm manifest was updated to depend on kubernetes.

    Also, the `--skip-refresh` flag was added to helm init on the second
    controller to avoid timeout scenarios in proxy environments.

    Closes-Bug: 1814968

    Change-Id: I65759314b3a861e7fdb428889aa5f5c1c7037661
    Suggested-by: Mingyuan Qi <email address hidden>
    Signed-off-by: Erich Cordoba <email address hidden>

commit 70ed5b099496c98b37a94b061610d48c9263f554
Author: Alex Kozyrev <email address hidden>
Date: Fri Feb 15 15:46:32 2019 -0500

    Enable Barbican provisioning in SM in kubernetes environment

    Since Barbican is in charge of storing BMC passwords for MTCE now
    we need it to run as a bare-metal service alongside with kubernetes.
    This patch enables SM provisioning for barbican in this case.

    Change-Id: Id51f679738d429e78f388b6dc42e7606ef0c41ab
    Story: 2003108
    Task: 27700
    Signed-off-by: Alex Kozyrev <email address hidden>

commit 0dd4b86526609b86d8c7395a7c9af13e7f769596
Author: David Sullivan <email address hidden>
Date: Tue Feb 12 14:09:10 2019 -0500

    Add replica and anti-affinity settings

    Add anti-affinity settings to openstack pods. Add replication to
    novncproxy, aodh, panko and rbd_provisioner services.

    Change-Id: I8091a54cab98ff295eba6e7dd6fa76827d149b5f
    Story: 2004520
    Task: 29418
    Signed-off-by: David Sullivan <email address hidden>

commit 5b94294002617b18bc0f98b206a24cec38a5b929
Author: Angie Wang <email address hidden>
Date: Thu Feb 7 23:42:25 2019 -0500

    Support stx-openstack app install with the authed local registry

    The functionality of local docker registry authentication will be
    enabled in commit https://review.openstack.org/#/c/626355/.
    However, local docker registry is currently used to pull/push images
    during application apply without authentication and no credentials
    passed to the kubernetes when pulling images ...


tags: added: in-f-stein
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05