Further information gathered from a troubleshoot zoom with Al.
SIMPLEX
Kube-system pods are continuously restarting:
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
kube-system   calico-node-5t4v4                      1/1     Running   49         9h
kube-system   coredns-84bb87857f-6psdr               1/1     Running   88         9h
kube-system   ingress-error-pages-69d8d88bd4-qz6ld   1/1     Running   68         9h
kube-system   ingress-q6fgn                          1/1     Running   138        9h
kube-system   kube-apiserver-controller-0            1/1     Running   108        9h
kube-system   kube-controller-manager-controller-0   1/1     Running   46         9h
kube-system   kube-scheduler-controller-0            1/1     Running   46         9h
kube-system   tiller-deploy-d87d7bd75-zbtc8          1/1     Running   124        9h
The cause seems to be memory usage; Out of Memory kills are logged in /var/log/kern.log. Some lines from the log:
2019-04-25T12:22:24.427 controller-0 kernel: err [38511.705026] Out of memory: Kill process 23019 (apache2) score 1002 or sacrifice child
2019-04-25T12:22:42.593 controller-0 kernel: warning [38529.834880] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:22:42.638 controller-0 kernel: err [38529.869458] Out of memory: Kill process 23023 (apache2) score 1002 or sacrifice child
2019-04-25T12:22:47.526 controller-0 kernel: warning [38534.756610] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:22:47.553 controller-0 kernel: err [38534.773152] Out of memory: Kill process 158263 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:16.786 controller-0 kernel: warning [38563.947968] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:23:16.796 controller-0 kernel: err [38563.955317] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:32.242 controller-0 kernel: warning [38579.367722] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:23:32.261 controller-0 kernel: err [38579.379438] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:35.708 controller-0 kernel: warning [38582.825486] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:23:35.777 controller-0 kernel: err [38582.886363] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
This system has 93GB of memory:
[wrsroot@controller-0 ~(keystone_admin)]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            93G         86G        663M         72M        6.2G        2.0G
Swap:            0B          0B          0B
[wrsroot@controller-0 ~(keystone_admin)]$ system host-memory-list controller-0
+-----------+---------+------------+---------+----------------+--------+--------+--------+-------+-------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+
| processor | mem_tot | mem_platfo | mem_ava | hugepages(hp)_ | vs_hp_ | vs_hp_ | vs_hp_ | vs_hp | vm_total_4K | vm_hp_total_2M | vm_hp_avail_2M | vm_hp_pending_2M | vm_hp_total_1G | vm_hp_avail_1G | vm_hp_pending_1G | vm_hp_use_1G |
|           | al(MiB) | rm(MiB)    | il(MiB) | configured     | size(M | total  | avail  | _reqd |             |                |                |                  |                |                |                  |              |
|           |         |            |         |                | iB)    |        |        |       |             |                |                |                  |                |                |                  |              |
+-----------+---------+------------+---------+----------------+--------+--------+--------+-------+-------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+
| 0         | 31903   | 14500      | 31903   | True           | 1024   | 1      | 0      | None  | 790784      | 13895          | 13895          | None             | 0              | 1              | None             | True         |
| 1         | 45932   | 2000       | 45932   | True           | 1024   | 1      | 0      | None  | 1149440     | 20209          | 20209          | None             | 0              | 1              | None             | True         |
+-----------+---------+------------+---------+----------------+--------+--------+--------+-------+-------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+
The fm alarms also report memory issues:
| 270.001 | Host controller-0 compute services failure, failed to enable kubernetes services | host=controller-0.services=compute | critical | 2019-04-25T12:25:26.585899 |
| 100.103 | Platform Memory threshold exceeded ; threshold 90%, actual 90%                   | host=controller-0.numa=node0       | critical | 2019-04-25T12:16:54.929410 |
| 100.103 | Platform Memory threshold exceeded ; threshold 80%, actual 80%                   | host=controller-0                  | major    | 2019-04-25T11:28:54.929803 |
DUPLEX
For duplex, similar behavior was reported in sanity, where keystone stops responding. The system seems to be stable now, but the keystone pods logged restarts:
[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces -o wide | grep -i keys
openstack   keystone-api-5d4679b847-jwpsj   1/1   Running   1   173m   172.16.166.134   controller-1   <none>   <none>
openstack   keystone-api-5d4679b847-qvk8c   1/1   Running   5   173m   172.16.192.84    controller-0   <none>   <none>
A new sanity is being executed on Duplex now that the platform is stable. I’ll update the bug with the results.