Stx-openstack apply timeout because some pods are not ready

Bug #1970645 reported by Alexandru Dimofte
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Pedro Monteiro Azevedo de Moura Almeida

Bug Description

Brief Description
-----------------
The stx-openstack apply times out (600s) and sometimes fails because some pods are not ready

Severity
--------
Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
Install stx master 20220427T013757Z and then try to lock/unlock the computes, or simply try to reapply stx-openstack

Expected Behavior
------------------
stx-openstack should apply fine

Actual Behavior
----------------
The stx-openstack apply exits with a timeout, or sometimes fails outright. It gets stuck at 55.0%.

| stx-openstack | 1.0-192-centos-stable-versioned | openstack-manifest | stx-openstack.yaml | applying | processing chart: osh-openstack-placemen... overall completion: 55.0% |

...
E utils.exceptions.ContainerError: Container error.
E Details: ['stx-openstack'] did not reach status applied within 600s

from /var/log/armada/stx-openstack-apply_2022-04-27-15-19-06.log:
...
2022-04-27 15:49:10.199 177 ERROR armada.handlers.wait [-] [chart=openstack-mariadb]: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-mariadb)). These pods were not ready=['mariadb-ingress-bcd8fb475-kqcwl', 'mariadb-ingress-bcd8fb475-n6m7x']
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada [-] Chart deploy [openstack-mariadb] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-mariadb)). These pods were not ready=['mariadb-ingress-bcd8fb475-kqcwl', 'mariadb-ingress-bcd8fb475-n6m7x']
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada Traceback (most recent call last):
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 170, in handle_result
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada result = get_result()
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 181, in <lambda>
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada if (handle_result(chart, lambda: deploy_chart(chart, 1))):
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 159, in deploy_chart
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada concurrency)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 55, in execute
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada ch, cg_test_all_charts, prefix, known_releases)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 267, in _execute
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada chart_wait.wait(timer)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 142, in wait
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada wait.wait(timeout=timeout)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 302, in wait
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada modified = self._wait(deadline)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/decorator.py", line 232, in fun
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada return caller(func, *(extras + args), **kw)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/retry/api.py", line 74, in retry_decorator
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada logger)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/retry/api.py", line 33, in __retry_internal
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada return f()
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 372, in _wait
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada raise k8s_exceptions.KubernetesWatchTimeoutException(error)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-mariadb)). These pods were not ready=['mariadb-ingress-bcd8fb475-kqcwl', 'mariadb-ingress-bcd8fb475-n6m7x']

...

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
all configurations are affected

Branch/Pull Time/Commit
-----------------------
master - 20220427T013757Z

Last Pass
---------
master - 20220420T033744Z

Timestamp/Logs
--------------
Will be attached

Test Activity
-------------
Sanity

Workaround
-------------
-

Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I found these pods in the Evicted state:
openstack   cinder-volume-bb64cb587-lvqxh      0/0   0   Evicted   n/a   controller-1   172m
openstack   fm-rest-api-649cd5b99d-8f9v6       0/0   0   Evicted   n/a   controller-1   172m
openstack   ingress-6b65fb7fc9-g6cf5           0/0   0   Evicted   n/a   controller-1   172m
openstack   keystone-api-66b87d555b-xgxlm      0/0   0   Evicted   n/a   controller-1   172m
openstack   mariadb-ingress-bcd8fb475-r8tj7    0/0   0   Evicted   n/a   controller-1   3h36m
openstack   mariadb-ingress-bcd8fb475-vpszs    0/0   0   Evicted   n/a   controller-0   3h36m
openstack   nova-api-proxy-78c6447cb-69txr     0/0   0   Evicted   n/a   controller-1   176m
openstack   nova-scheduler-7b6fd68499-5hwwh    0/0   0   Evicted   n/a   controller-1   176m
openstack   placement-api-68668d8dd5-mttwl     0/0   0   Evicted   n/a   controller-1   176m

Describing the pods, I found messages like this:
 Message: The node was low on resource: ephemeral-storage. Container ingress was using 112Ki, which exceeds its request of 0.
ex:
 Events:
   Type    Reason     Age    From               Message
   ----    ------     ----   ----               -------
   Normal  Scheduled  3h41m  default-scheduler  Successfully assigned op...
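
For context, the eviction message above reflects standard Kubernetes behavior: when a node runs low on ephemeral-storage, the kubelet evicts pods whose usage exceeds their requests first, and with no request declared the request is effectively 0, so even 112Ki of writes makes the pod an immediate eviction candidate. A minimal sketch of declaring an ephemeral-storage request in a container spec is shown below; the pod name, image and amounts are placeholders, not the values used by the eventual fix:

apiVersion: v1
kind: Pod
metadata:
  name: ingress-example            # placeholder, not one of the pods listed above
  namespace: openstack
spec:
  containers:
  - name: ingress
    image: example/ingress:latest  # placeholder image
    resources:
      requests:
        ephemeral-storage: "50Mi"  # non-zero request: usage like 112Ki no longer exceeds it,
                                   # so the pod is a far less likely eviction target under disk pressure
      limits:
        ephemeral-storage: "256Mi" # hard cap; the kubelet evicts the pod if usage exceeds this limit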


Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: stx.7.0 / critical - issue resulting in red sanity.
Assigning to Douglas to assign to the WR openstack team for investigation.

tags: added: stx.7.0 stx.distro.openstack
Changed in starlingx:
importance: Undecided → Critical
status: New → Triaged
assignee: nobody → Douglas Lopes Pereira (douglaspereira)
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Douglas Lopes Pereira (douglaspereira) → Thales Elero Cervi (tcervi)
Revision history for this message
Thales Elero Cervi (tcervi) wrote (last edit ):

From the attached logs I can see more than one stx-openstack reapply failing: one fails waiting for the mariadb-ingress pods and the other fails waiting for the openstack-ingress pods:

2022-04-27 15:49:10.199 177 ERROR armada.handlers.wait [-] [chart=openstack-mariadb]: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-mariadb)). These pods were not ready=['mariadb-ingress-bcd8fb475-kqcwl', 'mariadb-ingress-bcd8fb475-n6m7x']
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada [-] Chart deploy [openstack-mariadb] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-mariadb)). These pods were not ready=['mariadb-ingress-bcd8fb475-kqcwl', 'mariadb-ingress-bcd8fb475-n6m7x']

2022-04-27 16:22:18.694 232 ERROR armada.handlers.wait [-] [chart=openstack-ingress]: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-ingress)). These pods were not ready=['ingress-6b65fb7fc9-pls8m', 'ingress-6b65fb7fc9-qsgfv']
2022-04-27 16:22:18.694 232 ERROR armada.handlers.armada [-] Chart deploy [openstack-ingress] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-ingress)). These pods were not ready=['ingress-6b65fb7fc9-pls8m', 'ingress-6b65fb7fc9-qsgfv']

Since those pods were later deleted, I won't find any further information in the collected logs. I will try to reproduce this issue on my side to figure out what's happening.

Revision history for this message
Thales Elero Cervi (tcervi) wrote (last edit ):

I finally managed to reproduce a similar issue using stx and stx-openstack master builds. In my case, the reapply is breaking because of missing overrides, something I could not find in the attached logs though.

The critical point is: all the cg-openstack-*.yaml files inside /opt/platform/armada/22.06/stx-openstack/1.0-192-centos-stable-versioned seem to be disappearing on the first stx-os reapply after a controller node reboot, during the "generating application overrides" step. It happens because the first stx-os reapply after a controller node reboot is clearing the armada-overrides.yaml content:

$ cat /opt/platform/helm/22.06/stx-openstack/1.0-192-centos-stable-versioned/armada-overrides.yaml
[]

Definitely something to look into; this is probably the root cause of the lock/unlock test failures. I need to understand what's making those files vanish and which recent change, to the platform or the application, introduced this problematic behavior.

Revision history for this message
Thales Elero Cervi (tcervi) wrote (last edit ):

This issue seems to have the same root cause as this other bug: https://bugs.launchpad.net/starlingx/+bug/1972019

Apparently the fix released for that bug also fixed the stx-openstack problem, since I could not reproduce the reapply failure after lock/unlock using the master build (05 12 2022).

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I cannot check yet whether this issue is fixed on bare-metal servers because those servers are affected by:
https://bugs.launchpad.net/starlingx/+bug/1973888
But I checked it on the virtual configurations and observed that the stx-openstack apply is still failing.
I checked the pods and all of them are fine.

[sysadmin@controller-1 log(keystone_admin)]$ system application-list
+--------------------------+-------------------------------+-------------------------------+--------------------+--------------+------------------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+-------------------------------+-------------------------------+--------------------+--------------+------------------------------------------------------+
| cert-manager | 1.0-33 | fluxcd-manifests | fluxcd-manifests | applied | completed |
| nginx-ingress-controller | 1.1-26 | fluxcd-manifests | fluxcd-manifests | applied | completed |
| oidc-auth-apps | 1.0-64 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-46 | platform-integration-manifest | manifest.yaml | applied | completed |
| rook-ceph-apps | 1.0-16 | rook-ceph-manifest | manifest.yaml | uploaded | completed |
| stx-openstack | 1.0-199-centos-stable- | openstack-manifest | stx-openstack.yaml | apply-failed | Unexpected process termination while application- |
| | versioned | | | | apply was in progress. The application status has |
| | | | | | changed from 'applying' to 'apply-failed'. |
| | | | | | |
+--------------------------+-------------------------------+-------------------------------+--------------------+--------------+------------------------------------------------------+

[sysadmin@controller-1 log(keystone_admin)]$ fm alarm-list
+----------+----------------------------------------------------------------------+--------------------------------------+----------+-------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------------------------------------------...


Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Hi Alexandru, thanks for checking it on a virtual environment. I also used a virtual AIO-SX to test the reapply after lock/unlock and was not able to reproduce the apply failure using the master build (20220512T035413Z).

Was this failure 100% reproducible for you? I ask because I can see two alarms related to high resource usage on your system at the same time the apply failed:

* There was an instance trying to reboot on your active controller:
| 700.005 | Instance admin-vm-1 owned by admin is rebooting on host controller-0 |
* There was a peak of memory usage:
| 100.103 | Memory threshold exceeded ; threshold 90.00%, actual 111.50% | host=controller-1.memory=platform |

So I wonder whether this was a problem related to physical resource constraints during the reapply. From your logs I can see that the reapply timed out waiting for the "nova-api-proxy-cdffff877-fc6fb" pod around 2022-05-18 09:02:23:

2022-05-18 09:02:23.142 454 ERROR armada.handlers.wait [-] [chart=openstack-nova-api-proxy]: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-nova-api-proxy)). These pods were not ready=['nova-api-proxy-cdffff877-fc6fb']
2022-05-18 09:02:23.143 454 ERROR armada.handlers.armada [-] Chart deploy [openstack-nova-api-proxy] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-nova-api-proxy)). These pods were not ready=['nova-api-proxy-cdffff877-fc6fb']
2022-05-18 09:02:23.143 454 ERROR armada.handlers.armada Traceback (most recent call last)

But I can also see that later this pod was able to come up, probably ~5min after the reapply timeout. From containerization_kube.info:

Wed May 18 09:37:28 UTC 2022 : : kubectl describe nodes
...
  Namespace   Name                             CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------   ----                             ------------  ----------  ---------------  -------------  ---
  openstack   nova-api-proxy-cdffff877-6lkxm   0 (0%)        0 (0%)      0 (0%)           0 (0%)         32m

09:37:28 - 32m ~= 09:05:28

I will try to reproduce it again in my environment later today; I just wanted to point out the resource constraint on your virtual environment that could be causing your apply failure.

Anyways, thanks for updating your test results here!

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

We have been using the same virtual configurations for daily sanity for a few years now.
Hmm.. I don't think the issue is related to resource constraints.
The same issue can be seen on all virtual configurations: Simplex, Duplex, Standard and Standard External.

Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Hi Alexandru, thanks for providing the information here.
I was testing the master build 20220518 on an AIO-SX virtual environment and unfortunately could not reproduce the issue anymore :(

Tried several lock/unlock operations followed by manual stx-openstack application applies. During my tests all the apply operations worked, and the only issue I faced was the one I reported here: https://bugs.launchpad.net/starlingx/+bug/1974221
That's just a corner-case error message issue though.

Not sure if we should wait for a sanity run after https://bugs.launchpad.net/starlingx/+bug/1973888 is fixed to re-check whether the stx-openstack issue is still reproducible.
Or maybe you could provide me with more information about your test scenario on the virtual environment. I saw you were using a DX and had at least one instance running that was probably migrated during the lock/unlock (it was rebooting). My virtual environment has fewer processor resources, so I did not bring any instances up before locking/unlocking...

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

Some more information from: /var/log/armada/stx-openstack-apply_2022-05-22-08-45-38.log

...
2022-05-22 09:13:43.700 138 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:176
2022-05-22 09:14:43.774 138 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:176
2022-05-22 09:15:43.841 138 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:176
2022-05-22 09:15:44.582 138 ERROR armada.handlers.wait [-] [chart=openstack-nova-api-proxy]: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-nova-api-proxy)). These pods were not ready=['nova-api-proxy-667468b59d-vzzsb']
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada [-] Chart deploy [openstack-nova-api-proxy] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-nova-api-proxy)). These pods were not ready=['nova-api-proxy-667468b59d-vzzsb']
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada Traceback (most recent call last):
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 170, in handle_result
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada result = get_result()
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada return self.__get_result()
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada raise self._exception
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada result = self.fn(*self.args, **self.kwargs)
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 159, in deploy_chart
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada concurrency)
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 55, in execute
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada ch, cg_test_all_charts, prefix, known_releases)
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 267, in _execute
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada chart_wait.wait(timer)
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 142, in wait
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada wait.wait(timeout=timeout)
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada ...

Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Thanks Alexandru.

Where can I find the test logs as well? I mean the actual automation script logs, to follow the test steps and understand the actions and expected outcomes...
That would be very helpful for going through your recently added logs.

Cheers

Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Hi, thanks for sharing those logs.
They were indeed a great help in understanding the full picture better. Apparently the majority of the tests passed during that execution, but some "key tests" did not. In fact, the first test to fail was "test_lock_unlock_standby_controller", which tried to lock and unlock controller-1, but stx-os could not reach the applied status before the timeout:

[2022-05-22 07:24:44,669] 290 WARNING MainThread container_helper.wait_for_apps_status:: ['stx-openstack'] did not reach status applied within 360s

That reapply did not fail within the timeout though; the last status update I can see is from [2022-05-22 07:24:34,605] and it was:
 applying | processing chart: osh-openstack-nginx-ports-control, overall completion: 10.0%

And around that time controller-1 was not stable:
sysinv 2022-05-22 07:22:47.882 835363 INFO sysinv.conductor.manager [-] Node(s) are in an unstable state. Defer audit.
sysinv 2022-05-22 07:23:29.696 835363 INFO sysinv.conductor.manager [-] Updating platform data for host: 4ddfe5d8-92ed-4df0-9912-2136257f3a81 with: {u'first_report': True}

What I could find in the armada log right before the reapply error:
2022-05-22 07:23:35.853 344 WARNING urllib3.connectionpool [-] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /apis/armada.process/v1/namespaces/kube-system/locks/locks.armada.process.lock

Not sure what caused that unstable status on controller-1, but the logs have a few leads on it:

[2022-05-22 07:18:31,993] 4808 WARNING MainThread host_helper.wait_for_tasks_affined:: /etc/platform/.task_affining_incomplete did not clear on controller-1

2022-05-22T07:22:02.753737 | log | 200.022 | controller-1 is now 'offline' host=controller-1.status=offline
2022-05-22T07:22:02.752373 | set | 400.005 | Communication failure detected with peer over port eno1 on host controller-0 | host=controller-0.network=oam
2022-05-22T07:22:02.751492 | log | 200.022 | controller-1 is now 'disabled' host=controller-1.state=disabled
2022-05-22T07:22:02.749402 | set | 200.004 | controller-1 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful host=controller-1
2022-05-22T07:22:02.748747 | log | 401.005 | Communication failure detected with peer over port eno1 on host controller-0 | host=controller-0.network=oam
2022-05-22T07:22:02.748030 | set | 400.005 | Communication failure detected with peer over port eno2 on host controller-0 | host=controller-0.network=cluster
2022-05-22T07:22:02.684878 | set | 200.005 | controller-1 experienced a persistent critical 'Management Network' communication failure.| host=controller-1.network=Management

Around the same time I could also see 3...


Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Pedro Almeida will be trying to reproduce the issue in a physical lab and, if he is able to, we will proceed with testing a possible solution for the evicted pods issue.

Changed in starlingx:
assignee: Thales Elero Cervi (tcervi) → nobody
Ghada Khalil (gkhalil)
tags: added: stx.cherrypickneeded
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Thales Elero Cervi (tcervi)
Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Since Pedro was also not able to reproduce this issue in a physical lab, during today's community call we decided to mark the bug as "can not reproduce" and wait for the result of our next stx sanity with stx-openstack to check it again.

Changed in starlingx:
status: Triaged → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)
Changed in starlingx:
status: Invalid → In Progress
Revision history for this message
Thales Elero Cervi (tcervi) wrote :

We were able to reproduce the issue internally and Pedro will be applying and testing the suggested fix.
Thanks, Pedro!

Changed in starlingx:
assignee: Thales Elero Cervi (tcervi) → nobody
Changed in starlingx:
assignee: nobody → Pedro Monteiro Azevedo de Moura Almeida (pmonteir)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to helm-charts (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to helm-charts (master)

Reviewed: https://review.opendev.org/c/starlingx/helm-charts/+/854006
Committed: https://opendev.org/starlingx/helm-charts/commit/b34b86880113b69840ad1a12f23d0dde62b52373
Submitter: "Zuul (22348)"
Branch: master

commit b34b86880113b69840ad1a12f23d0dde62b52373
Author: Rafael Falcão <email address hidden>
Date: Mon Aug 22 12:08:43 2022 -0300

    Add resources specification to fm-rest-api

    Some pods are being evicted due to some containers
    exceeding the usage of a specific resource. The goal
    of this change is to be able to specify values of
    limits and requests for fm-rest-api pods. In this
    way we can make sure that the system has the necessary
    resources to support the api.

    Test Plan:

    PASS: Check that with the 'enabled' flag set to 'false' no
          values of requests and limits are specified.
    PASS: Check that with the 'enabled' flag set to 'true' the
          default values of requests and limits takes place.
    PASS: Override the default value of requests/limits and set the
          'enabled' flag to 'true' and check that the new value
          takes place in the description of the pod.

    Partial-bug: 1970645

    Signed-off-by: Rafael Falcao <email address hidden>
    Change-Id: I8a247c09643303f80a61b989d4b82c3835b7e601
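
For readers following the test plan above, the values shape for this kind of toggle typically looks like the sketch below. This is only an illustration of the pattern: the key names and default amounts are assumptions based on common Helm chart conventions, not the exact contents of the fm-rest-api chart.

resources:
  enabled: true                  # when false, the chart renders no requests/limits at all
  api:
    requests:
      cpu: "100m"                # placeholder defaults; a deployment can override them
      memory: "128Mi"
      ephemeral-storage: "50Mi"
    limits:
      cpu: "2000m"
      memory: "1024Mi"
      ephemeral-storage: "256Mi"

With enabled set to true and values overridden, as in the last test-plan item, the overridden amounts are what show up in the description of the rendered pod.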

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to helm-charts (r/stx.7.0)

Fix proposed to branch: r/stx.7.0
Review: https://review.opendev.org/c/starlingx/helm-charts/+/854780

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/853693
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/e2d9f126ef81eafa3b8ee00fb62b52898aed2b5b
Submitter: "Zuul (22348)"
Branch: master

commit e2d9f126ef81eafa3b8ee00fb62b52898aed2b5b
Author: Pedro Almeida <email address hidden>
Date: Thu Aug 18 15:54:31 2022 -0300

    Adding ephemeral-storage request on pods

    Some pods are being evicted due to some containers exceeding the
    usage of ephemeral-storage: since there was no value on the manifest,
    it was set to 0, which means that basically any value would cause the
    pod to fail.

    Test Plan:
    PASS - Build stx-openstack tarball.
    PASS - Upload/apply stx-openstack and check that new values of
           ephemeral-storage took place.
    PASS - Remove/delete stx-openstack.

    Depends-On: https://review.opendev.org/c/starlingx/helm-charts/+/854006

    Closes-Bug: #1970645

    Signed-off-by: Pedro Almeida <email address hidden>
    Co-authored-by: Rafael Falcao <email address hidden>
    Change-Id: Idcb61c976820574d9ac771cd3bbc1f91f8651f54

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to helm-charts (r/stx.7.0)

Reviewed: https://review.opendev.org/c/starlingx/helm-charts/+/854780
Committed: https://opendev.org/starlingx/helm-charts/commit/58c69559f56459628db386b9846583580668290a
Submitter: "Zuul (22348)"
Branch: r/stx.7.0

commit 58c69559f56459628db386b9846583580668290a
Author: Rafael Falcão <email address hidden>
Date: Mon Aug 22 12:08:43 2022 -0300

    Add resources specification to fm-rest-api

    Some pods are being evicted due to some containers
    exceeding the usage of a specific resource. The goal
    of this change is to be able to specify values of
    limits and requests for fm-rest-api pods. In this
    way we can make sure that the system has the necessary
    resources to support the api.

    Test Plan:

    PASS: Check that with the 'enabled' flag set to 'false' no
          values of requests and limits are specified.
    PASS: Check that with the 'enabled' flag set to 'true' the
          default values of requests and limits takes place.
    PASS: Override the default value of requests/limits and set the
          'enabled' flag to 'true' and check that the new value
          takes place in the description of the pod.

    Partial-bug: 1970645

    Signed-off-by: Rafael Falcao <email address hidden>
    Change-Id: I8a247c09643303f80a61b989d4b82c3835b7e601
    (cherry picked from commit b34b86880113b69840ad1a12f23d0dde62b52373)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (r/stx.7.0)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (r/stx.7.0)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/854836
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/224b44c3cc4b3dead3a57080cc56b1f2519db863
Submitter: "Zuul (22348)"
Branch: r/stx.7.0

commit 224b44c3cc4b3dead3a57080cc56b1f2519db863
Author: Pedro Almeida <email address hidden>
Date: Thu Aug 18 15:54:31 2022 -0300

    Adding ephemeral-storage request on pods

    Some pods are being evicted due to some containers exceeding the
    usage of ephemeral-storage: since there was no value on the manifest,
    it was set to 0, which means that basically any value would cause the
    pod to fail.

    Test Plan:
    PASS - Build stx-openstack tarball.
    PASS - Upload/apply stx-openstack and check that new values of
           ephemeral-storage took place.
    PASS - Remove/delete stx-openstack.

    Depends-On: https://review.opendev.org/c/starlingx/helm-charts/+/854006

    Closes-Bug: #1970645

    Signed-off-by: Pedro Almeida <email address hidden>
    Co-authored-by: Rafael Falcao <email address hidden>
    Change-Id: Idcb61c976820574d9ac771cd3bbc1f91f8651f54
    (cherry picked from commit e2d9f126ef81eafa3b8ee00fb62b52898aed2b5b)

Ghada Khalil (gkhalil)
tags: added: in-r-stx70
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/859467
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/9fed9555eddbbb89bdbbc76b5f570a9245a56308
Submitter: "Zuul (22348)"
Branch: master

commit 9fed9555eddbbb89bdbbc76b5f570a9245a56308
Author: Rafael Falcão <email address hidden>
Date: Tue Sep 27 14:48:20 2022 -0300

    Adding ephemeral-storage request on pods (FluxCD)

    Some pods are being evicted due to some containers exceeding the
    usage of ephemeral-storage. This modifications has been applied by
    [1] in the armada application. This review aims to bring this
    update to the FluxCD structure.

    Test Plan:
    PASS - Build stx-openstack tarball.
    PASS - Upload/apply stx-openstack and check that new values of
           ephemeral-storage took place.
    PASS - Remove/delete stx-openstack.

    [1] - https://review.opendev.org/c/starlingx/openstack-armada-app/+/853693

    Partial-bug: 1970645

    Signed-off-by: Rafael Falcao <email address hidden>
    Change-Id: I56c106508091bdaaf63f441972c67970cb8835cc
