Comment 2 for bug 1826308

Cristopher Lemus (cjlemusc) wrote : Re: some pods are failing during sanity execution.

Further information gathered from a troubleshooting Zoom session with Al.

SIMPLEX

Kube-system pods are continuously restarting:

NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
kube-system   calico-node-5t4v4                      1/1     Running   49         9h
kube-system   coredns-84bb87857f-6psdr               1/1     Running   88         9h
kube-system   ingress-error-pages-69d8d88bd4-qz6ld   1/1     Running   68         9h
kube-system   ingress-q6fgn                          1/1     Running   138        9h
kube-system   kube-apiserver-controller-0            1/1     Running   108        9h
kube-system   kube-controller-manager-controller-0   1/1     Running   46         9h
kube-system   kube-scheduler-controller-0            1/1     Running   46         9h
kube-system   tiller-deploy-d87d7bd75-zbtc8          1/1     Running   124        9h
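
To confirm whether these restarts are memory-related, the last termination state of any pod above can be checked (a hedged sketch using standard kubectl; a container killed by its cgroup limit normally reports "OOMKilled", while restarts caused by failing liveness probes show up in the pod events instead):

kubectl -n kube-system describe pod kube-apiserver-controller-0 | grep -A 5 "Last State"
# or, more compactly, print only the last termination reason:
kubectl -n kube-system get pod kube-apiserver-controller-0 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'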

The cause seems to be memory exhaustion; Out of Memory kills are logged in /var/log/kern.log. Some lines from the log:

2019-04-25T12:22:24.427 controller-0 kernel: err [38511.705026] Out of memory: Kill process 23019 (apache2) score 1002 or sacrifice child
2019-04-25T12:22:42.593 controller-0 kernel: warning [38529.834880] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:22:42.638 controller-0 kernel: err [38529.869458] Out of memory: Kill process 23023 (apache2) score 1002 or sacrifice child
2019-04-25T12:22:47.526 controller-0 kernel: warning [38534.756610] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:22:47.553 controller-0 kernel: err [38534.773152] Out of memory: Kill process 158263 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:16.786 controller-0 kernel: warning [38563.947968] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:23:16.796 controller-0 kernel: err [38563.955317] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:32.242 controller-0 kernel: warning [38579.367722] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:23:32.261 controller-0 kernel: err [38579.379438] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
2019-04-25T12:23:35.708 controller-0 kernel: warning [38582.825486] [<ffffffff98188cd3>] out_of_memory+0x4d3/0x510
2019-04-25T12:23:35.777 controller-0 kernel: err [38582.886363] Out of memory: Kill process 157281 (cinder-api) score 1002 or sacrifice child
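
A quick way to see which processes the kernel is killing most often (a rough one-liner against the kern.log format above; prints counts per process name, largest first):

grep 'Out of memory: Kill process' /var/log/kern.log \
  | sed 's/.*(\([^)]*\)).*/\1/' | sort | uniq -c | sort -rn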

This system has 93 GB of memory, most of it already in use:

[wrsroot@controller-0 ~(keystone_admin)]$ free -h
              total        used        free      shared  buff/cache   available
Mem:            93G         86G        663M         72M        6.2G        2.0G
Swap:            0B          0B          0B
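
With only ~2 GB still available, it would also help to capture the top memory consumers at the time of the failure (standard Linux/kubectl tooling, nothing specific to this lab):

# largest resident-memory processes first
ps aux --sort=-rss | head -15
# per-pod figures, if metrics-server is deployed
kubectl top pods --all-namespaces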

[wrsroot@controller-0 ~(keystone_admin)]$ system host-memory-list controller-0
+-----------+---------+------------+---------+----------------+--------+--------+--------+-------+-------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+
| processor | mem_tot | mem_platfo | mem_ava | hugepages(hp)_ | vs_hp_ | vs_hp_ | vs_hp_ | vs_hp | vm_total_4K | vm_hp_total_2M | vm_hp_avail_2M | vm_hp_pending_2M | vm_hp_total_1G | vm_hp_avail_1G | vm_hp_pending_1G | vm_hp_use_1G |
| | al(MiB) | rm(MiB) | il(MiB) | configured | size(M | total | avail | _reqd | | | | | | | | |
| | | | | | iB) | | | | | | | | | | | |
+-----------+---------+------------+---------+----------------+--------+--------+--------+-------+-------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+
| 0 | 31903 | 14500 | 31903 | True | 1024 | 1 | 0 | None | 790784 | 13895 | 13895 | None | 0 | 1 | None | True |
| 1 | 45932 | 2000 | 45932 | True | 1024 | 1 | 0 | None | 1149440 | 20209 | 20209 | None | 0 | 1 | None | True |
+-----------+---------+------------+---------+----------------+--------+--------+--------+-------+-------------+----------------+----------------+------------------+----------------+----------------+------------------+--------------+
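
If I read the table correctly, only 14500 MiB (processor 0) and 2000 MiB (processor 1) are reserved as platform memory, with most of the remainder set aside as hugepages. If the platform reservation is simply too small for this containerized load, it can be raised per NUMA node. A hedged sketch, assuming the usual StarlingX host-memory-modify syntax and a hypothetical 16000 MiB value (the host has to be locked first; worth confirming the flags on this build):

system host-lock controller-0
system host-memory-modify -m 16000 controller-0 0   # -m = platform reserved MiB, trailing 0 = processor
system host-unlock controller-0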

fm alarm-list also reports memory issues:

| Alarm ID | Reason Text                                                                       | Entity ID                          | Severity | Time Stamp                 |
| 270.001  | Host controller-0 compute services failure, failed to enable kubernetes services | host=controller-0.services=compute | critical | 2019-04-25T12:25:26.585899 |
| 100.103  | Platform Memory threshold exceeded ; threshold 90%, actual 90%                   | host=controller-0.numa=node0       | critical | 2019-04-25T12:16:54.929410 |
| 100.103  | Platform Memory threshold exceeded ; threshold 80%, actual 80%                   | host=controller-0                  | major    | 2019-04-25T11:28:54.929803 |
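
For reference, the alarms above can be re-queried at any time with the fault-management CLI (the UUID for the detail view is whatever alarm-list prints; syntax as I recall it, worth double-checking on this build):

fm alarm-list                # active alarms, as pasted above
fm alarm-summary             # counts by severity
fm alarm-show <alarm-uuid>   # full detail for one alarm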

DUPLEX
For duplex, similar behavior was reported during sanity, with keystone no longer responding. The system seems stable now, but the keystone pods logged restarts:

[wrsroot@controller-0 ~(keystone_admin)]$ kubectl get pods --all-namespaces -o wide |grep -i keys
openstack keystone-api-5d4679b847-jwpsj 1/1 Running 1 173m 172.16.166.134 controller-1 <none> <none>
openstack keystone-api-5d4679b847-qvk8c 1/1 Running 5 173m 172.16.192.84 controller-0 <none> <none>
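
To check whether the five restarts of the controller-0 keystone pod were also memory-driven, the previous container's logs and the pod events can be pulled (hedged sketch; pod name taken from the listing above):

kubectl -n openstack logs keystone-api-5d4679b847-qvk8c --previous | tail -50
kubectl -n openstack describe pod keystone-api-5d4679b847-qvk8c | grep -A 10 Events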

A new sanity is being executed on Duplex now that the platform is stable. I’ll update the bug with the results.