Further information gathered from a troubleshooting Zoom call with Al.

SIMPLEX

Kube-system pods are continuously restarting:

NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
kube-system   calico-node-5t4v4                      1/1     Running   49         9h
kube-system   coredns-84bb87857f-6psdr               1/1     Running   88         9h
kube-system   ingress-error-pages-69d8d88bd4-qz6ld   1/1     Running   68         9h
kube-system   ingress-q6fgn                          1/1     Running   138        9h
kube-system   kube-apiserver-controller-0            1/1     Running   108        9h
kube-system   kube-controller-manager-controller-0   1/1     Running   46         9h
kube-system   kube-scheduler-controller-0            1/1     Running   46         9h
kube-system   tiller-deploy-d87d7bd75-zbtc8          1/1     Running   124        9h

The cause appears to be memory exhaustion; Out of Memory (OOM) kills are logged in /var/log/kern.log. Some lines from the log:

2019-04-25T12:22:24.427 controller-0 kernel: err [38511.705026] Out of memory: Kill process 23019 (apache2) score 1002 or sacrifice child
2019-04-25T12:22:42.593 controller-0 kernel: warning [38529.834880] [] out_of_memory+0x4d3/0x510
2019-04-25T12:22:42.638 controller-0 kernel: err [38529.869458] Out of memory: Kill process 23023 (apache2) score 1002 or sacrifice child
2019-04-25T12:22:47.526 controller-0 kernel: warning [38534.756610] [] out_of_memory+0x4d3/0x510
2019-04-25T12:22:47.553 controller-0 kernel: err [38534.773152] Out of memory: Kill process 158263 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:16.786 controller-0 kernel: warning [38563.947968] [] out_of_memory+0x4d3/0x510
2019-04-25T12:23:16.796 controller-0 kernel: err [38563.955317] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:32.242 controller-0 kernel: warning [38579.367722] [] out_of_memory+0x4d3/0x510
2019-04-25T12:23:32.261 controller-0 kernel: err [38579.379438] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:35.708 controller-0 kernel: warning [38582.825486] [] out_of_memory+0x4d3/0x510
2019-04-25T12:23:35.777 controller-0 kernel: err [38582.886363] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child

The system has 93 GB of memory:

[wrsroot@controller-0 ~(keystone_admin)]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            93G         86G        663M         72M        6.2G        2.0G
Swap:            0B          0B          0B

[wrsroot@controller-0 ~(keystone_admin)]$ system host-memory-list controller-0
+-----------+----------------+-------------------+----------------+--------------------------+-----------------+-------------+-------------+------------+-------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+
| processor | mem_total(MiB) | mem_platform(MiB) | mem_avail(MiB) | hugepages(hp)_configured | vs_hp_size(MiB) | vs_hp_total | vs_hp_avail | vs_hp_reqd | vm_total_4K | vm_hp_total_2M | vm_hp_avail_2M | vm_hp_pending_2M | vm_hp_total_1G | vm_hp_avail_1G | vm_hp_pending_1G | vm_hp_use_1G |
+-----------+----------------+-------------------+----------------+--------------------------+-----------------+-------------+-------------+------------+-------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+
| 0         | 31903          | 14500             | 31903          | True                     | 1024            | 1           | 0           | None       | 790784      | 13895          | 13895          | None             | 0              | 1              | None             | True         |
| 1         | 45932          | 2000              | 45932          | True                     | 1024            | 1           | 0           | None       | 1149440     | 20209          | 20209          | None             | 0              | 1              | None             | True         |
+-----------+----------------+-------------------+----------------+--------------------------+-----------------+-------------+-------------+------------+-------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+
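To narrow down which processes are driving the memory pressure, the OOM victims can be tallied and the largest resident-memory consumers listed. This is only a sketch using standard Linux tooling (grep, awk, ps); the awk field index assumes the exact kern.log line format shown above:

# Count OOM kills per process name (field NF-5 is the "(name)" token in the lines above)
grep "Out of memory: Kill process" /var/log/kern.log | awk '{print $(NF-5)}' | sort | uniq -c | sort -rn

# Show the ten largest processes by resident memory
ps -eo pid,rss,comm --sort=-rss | head -n 11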
fm alarm-list also reports memory issues:

| 270.001 | Host controller-0 compute services failure, failed to enable kubernetes services | host=controller-0.services=compute | critical | 2019-04-25T12:25:26.585899 |
| 100.103 | Platform Memory threshold exceeded ; threshold 90%, actual 90%                    | host=controller-0.numa=node0       | critical | 2019-04-25T12:16:54.929410 |
| 100.103 | Platform Memory threshold exceeded ; threshold 80%, actual 80%                    | host=controller-0                  | major    | 2019-04-25T11:28:54.929803 |

DUPLEX

On Duplex, similar behavior was reported during sanity, where keystone stopped responding. The system seems stable now, but the keystone pods have logged restarts:

[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces -o wide | grep -i keys
openstack   keystone-api-5d4679b847-jwpsj   1/1   Running   1   173m   172.16.166.134   controller-1
openstack   keystone-api-5d4679b847-qvk8c   1/1   Running   5   173m   172.16.192.84    controller-0

A new sanity run is being executed on Duplex now that the platform is stable. I'll update the bug with the results.
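In the meantime, to confirm whether the keystone-api restarts on controller-0 were also OOM kills, the previous container state and log tail could be inspected. This is a suggested check, not something already run; standard kubectl commands, with the pod name taken from the listing above:

# Last termination state of the restarted pod (a Reason of OOMKilled would confirm the memory theory)
kubectl -n openstack describe pod keystone-api-5d4679b847-qvk8c | grep -A 5 'Last State'

# Tail of the previous container instance's log
kubectl -n openstack logs keystone-api-5d4679b847-qvk8c --previous | tail -n 50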