Comment 2 for bug 1826308

Cristopher Lemus (cjlemusc) wrote : Re: some pods are failing during sanity execution.

Further information gathered from a troubleshooting Zoom session with Al.

SIMPLEX

Kube-system pods are continuously restarting:

NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
kube-system   calico-node-5t4v4                      1/1     Running   49         9h
kube-system   coredns-84bb87857f-6psdr               1/1     Running   88         9h
kube-system   ingress-error-pages-69d8d88bd4-qz6ld   1/1     Running   68         9h
kube-system   ingress-q6fgn                          1/1     Running   138        9h
kube-system   kube-apiserver-controller-0            1/1     Running   108        9h
kube-system   kube-controller-manager-controller-0   1/1     Running   46         9h
kube-system   kube-scheduler-controller-0            1/1     Running   46         9h
kube-system   tiller-deploy-d87d7bd75-zbtc8          1/1     Running   124        9h
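
To confirm whether these restarts are memory-related, the last termination state of any pod above can be checked (a hedged sketch using standard kubectl; a container killed by its cgroup limit normally reports "OOMKilled", while restarts caused by failing liveness probes show up in the pod events instead):

kubectl -n kube-system describe pod kube-apiserver-controller-0 | grep -A 5 "Last State"
# or, more compactly, print only the last termination reason:
kubectl -n kube-system get pod kube-apiserver-controller-0 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'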

The cause seems to be memory exhaustion; Out of Memory kills are logged in /var/log/kern.log. Some lines from the log:

2019-04-25T12:22:24.427 controller-0 kernel: err [38511.705026] Out of memory: Kill process 23019 (apache2) score 1002 or sacrifice child
2019-04-25T12:22:42.593 controller-0 kernel: warning [38529.834880] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:22:42.638 controller-0 kernel: err [38529.869458] Out of memory: Kill process 23023 (apache2) score 1002 or sacrifice child
2019-04-25T12:22:47.526 controller-0 kernel: warning [38534.756610] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:22:47.553 controller-0 kernel: err [38534.773152] Out of memory: Kill process 158263 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:16.786 controller-0 kernel: warning [38563.947968] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:23:16.796 controller-0 kernel: err [38563.955317] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:32.242 controller-0 kernel: warning [38579.367722] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:23:32.261 controller-0 kernel: err [38579.379438] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:35.708 controller-0 kernel: warning [38582.825486] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:23:35.777 controller-0 kernel: err [38582.886363] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
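
A quick way to see which processes the kernel is killing most often (a rough one-liner against the kern.log format above; prints counts per process name, largest first):

grep 'Out of memory: Kill process' /var/log/kern.log \
  | sed 's/.*(\([^)]*\)).*/\1/' | sort | uniq -c | sort -rn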

This system has 93 GB of memory, most of it already in use:

[wrsroot@controller-0 ~(keystone_admin)]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            93G         86G        663M         72M        6.2G        2.0G
Swap:            0B          0B          0B
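
With only ~2 GB still available, it would also help to capture the top memory consumers at the time of the failure (standard Linux/kubectl tooling, nothing specific to this lab):

# largest resident-memory processes first
ps aux --sort=-rss | head -15
# per-pod figures, if metrics-server is deployed
kubectl top pods --all-namespaces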

[wrsroot@controller-0 ~(keystone_admin)]$ system host-memory-list controller-0
+-----------+---------+------------+---------+----------------+--------+--------+--------+-------+-------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+
| processor | mem_tot | mem_platfo | mem_ava | hugepages(hp)_ | vs_hp_ | vs_hp_ | vs_hp_ | vs_hp | vm_total_4K | vm_hp_total_2M | vm_hp_avail_2M | vm_hp_pending_2M | vm_hp_total_1G | vm_hp_avail_1G | vm_hp_pending_1G | vm_hp_use_1G |
| | al(MiB) | rm(MiB) | il(MiB) | configured | size(M | total | avail | _reqd | | | | | | | | |
| | | | | | iB) | | | | | | | | | | | |
+-----------+---------+------------+---------+----------------+--------+--------+--------+-------+-------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+
| 0 | 31903 | 14500 | 31903 | True | 1024 | 1 | 0 | None | 790784 | 13895 | 13895 | None | 0 | 1 | None | True |
| 1 | 45932 | 2000 | 45932 | True | 1024 | 1 | 0 | None | 1149440 | 20209 | 20209 | None | 0 | 1 | None | True |
+-----------+---------+------------+---------+----------------+--------+--------+--------+-------+-------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+
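
If I read the table correctly, only 14500 MiB (processor 0) and 2000 MiB (processor 1) are reserved as platform memory, with most of the remainder set aside as hugepages. If the platform reservation is simply too small for this containerized load, it can be raised per NUMA node. A hedged sketch, assuming the usual StarlingX host-memory-modify syntax and a hypothetical 16000 MiB value (the host has to be locked first; worth confirming the flags on this build):

system host-lock controller-0
system host-memory-modify -m 16000 controller-0 0   # -m = platform reserved MiB, trailing 0 = processor
system host-unlock controller-0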

fm alarm-list also reports memory issues:

| Alarm ID | Reason Text                                                                       | Entity ID                          | Severity | Time Stamp                 |
| 270.001  | Host controller-0 compute services failure, failed to enable kubernetes services | host=controller-0.services=compute | critical | 2019-04-25T12:25:26.585899 |
| 100.103  | Platform Memory threshold exceeded ; threshold 90%, actual 90%                   | host=controller-0.numa=node0       | critical | 2019-04-25T12:16:54.929410 |
| 100.103  | Platform Memory threshold exceeded ; threshold 80%, actual 80%                   | host=controller-0                  | major    | 2019-04-25T11:28:54.929803 |
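
For reference, the alarms above can be re-queried at any time with the fault-management CLI (the UUID for the detail view is whatever alarm-list prints; syntax as I recall it, worth double-checking on this build):

fm alarm-list                # active alarms, as pasted above
fm alarm-summary             # counts by severity
fm alarm-show <alarm-uuid>   # full detail for one alarm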

DUPLEX
For duplex, similar behavior was reported during sanity, with keystone no longer responding. The system seems stable now, but the keystone pods logged restarts:

[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces -o wide |grep -i keys
openstack keystone-api-5d4679b847-jwpsj 1/1 Running 1 173m 172.16.166.134 controller-1 <none> <none>
openstack keystone-api-5d4679b847-qvk8c 1/1 Running 5 173m 172.16.192.84 controller-0 <none> <none>
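
To check whether the five restarts of the controller-0 keystone pod were also memory-driven, the previous container's logs and the pod events can be pulled (hedged sketch; pod name taken from the listing above):

kubectl -n openstack logs keystone-api-5d4679b847-qvk8c --previous | tail -50
kubectl -n openstack describe pod keystone-api-5d4679b847-qvk8c | grep -A 10 Events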

A new sanity is being executed on Duplex now that the platform is stable. I’ll update the bug with the results.