Some pods fail during sanity execution due to the OOM killer
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Critical | Al Bailey |
Bug Description
Brief Description
-----------------
Some pods fail during sanity execution.
Severity
--------
Major
Steps to Reproduce
------------------
kubectl get pods --all-namespaces |egrep -v "Completed|Running"
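The `egrep -v` filter above drops any line that contains "Completed" or "Running" anywhere in the text. A column-aware equivalent checks only the STATUS field. This is a minimal Python sketch, assuming the default `kubectl get pods --all-namespaces` column layout with a header row; the function name `unhealthy_pods` is hypothetical.

```python
def unhealthy_pods(kubectl_output: str) -> list[tuple[str, str, str]]:
    """Return (namespace, name, status) for every pod whose STATUS column
    is neither Running nor Completed.

    Assumes the default `kubectl get pods --all-namespaces` layout with a
    header row: NAMESPACE NAME READY STATUS RESTARTS AGE.
    """
    rows = []
    for line in kubectl_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        namespace, name, _ready, status = fields[:4]
        if status not in ("Running", "Completed"):
            rows.append((namespace, name, status))
    return rows


# Rows copied from the pod listings in this report.
sample = """NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system ingress-4qhqh 0/1 CrashLoopBackOff 13 115m
kube-system calico-node-5t4v4 1/1 Running 49 9h"""

print(unhealthy_pods(sample))  # [('kube-system', 'ingress-4qhqh', 'CrashLoopBackOff')]
```

Like the `egrep` one-liner, this only reports pods that are not currently Running, so a pod stuck in a restart loop that happens to be Running at that moment is not reported.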
Expected Behavior
------------------
All pods in the Running state
Actual Behavior
----------------
Pods in CrashLoopBackOff
Reproducibility
---------------
100% on simplex and duplex (bare metal)
System Configuration
--------------------
Simplex and duplex
Branch/Pull Time/Commit
-----------------------
OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID=
JOB="STX_
<email address hidden>"
BUILD_NUMBER="78"
BUILD_HOST=
BUILD_DATE=
http://
Timestamp/Logs
--------------
controller-0:~$ export OS_CLOUD=
controller-0:~$ openstack service list
Failed to discover available identity versions when contacting http://
Unable to establish connection to http://
controller-0:~$ kubectl get pods --all-namespaces |egrep -v "Completed|Running"
The connection to the server 192.168.206.2:6443 was refused - did you specify the right host or port?
controller-0:~$ less -R /var/log/sysinv.log
controller-0:~$ cat /etc/build.info
###
### StarlingX
### Built from master
controller-0:~$ kubectl get pods --all-namespaces |egrep -v "Completed|Running"
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-jwzbb 0/1 CreateContainer
kube-system coredns-
kube-system ingress-4qhqh 0/1 CrashLoopBackOff 13 115m
kube-system kube-scheduler-
openstack ingress-
openstack ingress-
openstack mariadb-
openstack panko-events-
openstack rbd-provisioner
Test Activity
-------------
[Sanity]
Changed in starlingx:
status: New → Confirmed
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Yi Wang (wangyi4)
Changed in starlingx:
assignee: Yi Wang (wangyi4) → Al Bailey (albailey1974)
Further information gathered during a troubleshooting Zoom session with Al:
SIMPLEX
Kube-system pods are continuously restarting:
NAMESPACE NAME READY STATUS RESTARTS AGE
84bb87857f-6psdr 1/1 Running 88 9h
error-pages-69d8d88bd4-qz6ld 1/1 Running 68 9h
controller-0 1/1 Running 108 9h
-manager-controller-0 1/1 Running 46 9h
controller-0 1/1 Running 46 9h
deploy-d87d7bd75-zbtc8 1/1 Running 124 9h
kube-system calico-node-5t4v4 1/1 Running 49 9h
kube-system coredns-
kube-system ingress-
kube-system ingress-q6fgn 1/1 Running 138 9h
kube-system kube-apiserver-
kube-system kube-controller
kube-system kube-scheduler-
kube-system tiller-
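These pods show STATUS Running, so the `egrep -v "Completed|Running"` check from the reproduction steps does not report them even though they have restarted dozens of times. A restart-count check catches them. The sketch below assumes the default `kubectl get pods --all-namespaces` column layout; the function name `restarting_pods` and the default threshold of 10 restarts are hypothetical choices.

```python
def restarting_pods(kubectl_output: str, threshold: int = 10) -> list[tuple[str, int]]:
    """Return (name, restarts) for pods whose RESTARTS count exceeds threshold.

    Assumes the default `kubectl get pods --all-namespaces` column layout:
    NAMESPACE NAME READY STATUS RESTARTS AGE.
    """
    flagged = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Data rows have at least six columns and a numeric RESTARTS field;
        # the header row fails the isdigit() check and is skipped.
        if len(fields) < 6 or not fields[4].isdigit():
            continue
        restarts = int(fields[4])
        if restarts > threshold:
            flagged.append((fields[1], restarts))
    return flagged


# Rows copied from the simplex listing above; both report STATUS Running.
sample = """NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-node-5t4v4 1/1 Running 49 9h
kube-system ingress-q6fgn 1/1 Running 138 9h"""

print(restarting_pods(sample))                 # [('calico-node-5t4v4', 49), ('ingress-q6fgn', 138)]
print(restarting_pods(sample, threshold=100))  # [('ingress-q6fgn', 138)]
```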
The cause appears to be memory exhaustion: out-of-memory (OOM) kills are logged in /var/log/kern.log. Some lines from the log:
2019-04-25T12:22:24.427 controller-0 kernel: err [38511.705026] Out of memory: Kill process 23019 (apache2) score 1002 or sacrifice child
2019-04-25T12:22:42.593 controller-0 kernel: warning [38529.834880] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:22:42.638 controller-0 kernel: err [38529.869458] Out of memory: Kill process 23023 (apache2) score 1002 or sacrifice child
2019-04-25T12:22:47.526 controller-0 kernel: warning [38534.756610] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:22:47.553 controller-0 kernel: err [38534.773152] Out of memory: Kill process 158263 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:16.786 controller-0 kernel: warning [38563.947968] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:23:16.796 controller-0 kernel: err [38563.955317] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:32.242 controller-0 kernel: warning [38579.367722] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:23:32.261 controller-0 kernel: err [38579.379438] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:35.708 controller-0 kernel: warning [38582.825486] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:23:35.777 controller-0 kernel: err [38582.886363] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
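To see which processes the OOM killer is targeting, the kill messages can be tallied per process name. This is a minimal Python sketch, assuming the message format shown in this excerpt. Other kernel versions word this message differently, so the pattern may need adjusting. The helper name `oom_kills_by_process` is hypothetical.

```python
import re
from collections import Counter

# Matches OOM-kill messages in the format logged above, e.g.
# "Out of memory: Kill process 23019 (apache2) score 1002 or sacrifice child".
# Groups: PID, process name, OOM score.
OOM_KILL = re.compile(r"Out of memory: Kill process (\d+) \(([^)]+)\) score (\d+)")


def oom_kills_by_process(log_text: str) -> Counter:
    """Count OOM-kill messages per process name in kern.log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = OOM_KILL.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Lines taken from the /var/log/kern.log excerpt above; the warning line
# (a stack frame) does not match the pattern and is ignored.
log_text = """\
2019-04-25T12:22:24.427 controller-0 kernel: err [38511.705026] Out of memory: Kill process 23019 (apache2) score 1002 or sacrifice child
2019-04-25T12:22:42.593 controller-0 kernel: warning [38529.834880] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:22:42.638 controller-0 kernel: err [38529.869458] Out of memory: Kill process 23023 (apache2) score 1002 or sacrifice child
2019-04-25T12:22:47.553 controller-0 kernel: err [38534.773152] Out of memory: Kill process 158263 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:16.796 controller-0 kernel: err [38563.955317] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:32.261 controller-0 kernel: err [38579.379438] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:35.777 controller-0 kernel: err [38582.886363] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
"""

print(oom_kills_by_process(log_text))  # Counter({'cinder-api': 4, 'apache2': 2})
```

On this excerpt it reports four kill messages for cinder-api and two for apache2. Three of the cinder-api messages refer to the same PID, 157281.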
This system has 93GB of memory:
[wrsroot@controller-0 ~(keystone_admin)]$ free -h
total used free shared buff/cache available
Mem: 93G 86G 663M 72M 6.2G 2.0G
Swap: 0B 0B 0B
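As a check on the figures above, 86G used out of 93G total is roughly 92% of RAM. Only about 2% (2.0G) is reported as available, and no swap is configured. The small Python sketch below derives these percentages from the `Mem:` row. It assumes procps `free -h` output whose unit suffixes are powers of 1024, and it also accepts newer "Gi"-style suffixes by stripping the trailing "i". The helper names are hypothetical. Because `-h` rounds every value, the percentages are approximate.

```python
# Multipliers for the unit suffixes printed by `free -h` (assumed to be
# powers of 1024). Newer "Gi"/"Mi" suffixes are normalised in to_bytes().
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(value: str) -> float:
    """Convert a `free -h` value such as '93G', '663M' or '0B' to bytes."""
    value = value.rstrip("i")  # '93Gi' -> '93G'
    return float(value[:-1]) * UNITS[value[-1]]


def memory_summary(mem_line: str) -> dict[str, float]:
    """Express used and available memory from the 'Mem:' row as % of total."""
    # Columns: label, total, used, free, shared, buff/cache, available
    _label, total, used, _free, _shared, _cache, available = mem_line.split()
    total_bytes = to_bytes(total)
    return {
        "used_pct": 100 * to_bytes(used) / total_bytes,
        "available_pct": 100 * to_bytes(available) / total_bytes,
    }


# The Mem: row reported on controller-0 above.
summary = memory_summary("Mem: 93G 86G 663M 72M 6.2G 2.0G")
print(f"used {summary['used_pct']:.1f}%, available {summary['available_pct']:.1f}%")
# prints: used 92.5%, available 2.2%
```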
[wrsroot@controller-0 ~(keystone_admin)]$ system host-memory-list controller-0
...