Stx-openstack apply timeout because some pods are not ready

Bug #1970645 reported by Alexandru Dimofte
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Pedro Monteiro Azevedo de Moura Almeida

Bug Description

Brief Description
-----------------
The stx-openstack apply times out (600s) and sometimes fails because some pods are not ready

Severity
--------
Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
Install stx master 20220427T013757Z and then try to lock/unlock the computes, or simply try to reapply stx-openstack

Expected Behavior
------------------
stx-openstack should apply fine

Actual Behavior
----------------
The stx-openstack apply exits with a timeout, or sometimes fails outright. It gets stuck at 55.0%.

| stx-openstack | 1.0-192-centos-stable-versioned | openstack-manifest | stx-openstack.yaml | applying | processing chart: osh-openstack-placemen... overall completion: 55.0% |

...
E utils.exceptions.ContainerError: Container error.
E Details: ['stx-openstack'] did not reach status applied within 600s

from /var/log/armada/stx-openstack-apply_2022-04-27-15-19-06.log:
...
2022-04-27 15:49:10.199 177 ERROR armada.handlers.wait [-] [chart=openstack-mariadb]: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-mariadb)). These pods were not ready=['mariadb-ingress-bcd8fb475-kqcwl', 'mariadb-ingress-bcd8fb475-n6m7x']
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada [-] Chart deploy [openstack-mariadb] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-mariadb)). These pods were not ready=['mariadb-ingress-bcd8fb475-kqcwl', 'mariadb-ingress-bcd8fb475-n6m7x']
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada Traceback (most recent call last):
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 170, in handle_result
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada result = get_result()
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 181, in <lambda>
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada if (handle_result(chart, lambda: deploy_chart(chart, 1))):
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 159, in deploy_chart
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada concurrency)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 55, in execute
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada ch, cg_test_all_charts, prefix, known_releases)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 267, in _execute
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada chart_wait.wait(timer)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 142, in wait
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada wait.wait(timeout=timeout)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 302, in wait
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada modified = self._wait(deadline)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/decorator.py", line 232, in fun
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada return caller(func, *(extras + args), **kw)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/retry/api.py", line 74, in retry_decorator
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada logger)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/retry/api.py", line 33, in __retry_internal
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada return f()
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 372, in _wait
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada raise k8s_exceptions.KubernetesWatchTimeoutException(error)
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-mariadb)). These pods were not ready=['mariadb-ingress-bcd8fb475-kqcwl', 'mariadb-ingress-bcd8fb475-n6m7x']

...

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
all configurations are affected

Branch/Pull Time/Commit
-----------------------
master - 20220427T013757Z

Last Pass
---------
master - 20220420T033744Z

Timestamp/Logs
--------------
Will be attached

Test Activity
-------------
Sanity

Workaround
-------------
-

Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I found these pods in the Evicted state:
openstack   cinder-volume-bb64cb587-lvqxh      0/0   0   Evicted   n/a   controller-1   172m
openstack   fm-rest-api-649cd5b99d-8f9v6       0/0   0   Evicted   n/a   controller-1   172m
openstack   ingress-6b65fb7fc9-g6cf5           0/0   0   Evicted   n/a   controller-1   172m
openstack   keystone-api-66b87d555b-xgxlm      0/0   0   Evicted   n/a   controller-1   172m
openstack   mariadb-ingress-bcd8fb475-r8tj7    0/0   0   Evicted   n/a   controller-1   3h36m
openstack   mariadb-ingress-bcd8fb475-vpszs    0/0   0   Evicted   n/a   controller-0   3h36m
openstack   nova-api-proxy-78c6447cb-69txr     0/0   0   Evicted   n/a   controller-1   176m
openstack   nova-scheduler-7b6fd68499-5hwwh    0/0   0   Evicted   n/a   controller-1   176m
openstack   placement-api-68668d8dd5-mttwl     0/0   0   Evicted   n/a   controller-1   176m

Describing the pods, I found messages like this:
 Message: The node was low on resource: ephemeral-storage. Container ingress was using 112Ki, which exceeds its request of 0.
ex:
 Events:
   Type    Reason     Age    From               Message
   ----    ------     ----   ----               -------
   Normal  Scheduled  3h41m  default-scheduler  Successfully assigned op...
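
For context, the eviction message above reflects standard Kubernetes behavior: when a node runs low on ephemeral-storage, the kubelet evicts pods whose usage exceeds their requests first, and with no request declared the request is effectively 0, so even 112Ki of writes makes the pod an immediate eviction candidate. A minimal sketch of declaring an ephemeral-storage request in a container spec is shown below; the pod name, image and amounts are placeholders, not the values used by the eventual fix:

apiVersion: v1
kind: Pod
metadata:
  name: ingress-example            # placeholder, not one of the pods listed above
  namespace: openstack
spec:
  containers:
  - name: ingress
    image: example/ingress:latest  # placeholder image
    resources:
      requests:
        ephemeral-storage: "50Mi"  # non-zero request: usage like 112Ki no longer exceeds it,
                                   # so the pod is a far less likely eviction target under disk pressure
      limits:
        ephemeral-storage: "256Mi" # hard cap; the kubelet evicts the pod if usage exceeds this limit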


Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: stx.7.0 / critical - issue resulting in red sanity.
Assigning to Douglas to assign to the WR openstack team for investigation.

tags: added: stx.7.0 stx.distro.openstack
Changed in starlingx:
importance: Undecided → Critical
status: New → Triaged
assignee: nobody → Douglas Lopes Pereira (douglaspereira)
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Douglas Lopes Pereira (douglaspereira) → Thales Elero Cervi (tcervi)
Revision history for this message
Thales Elero Cervi (tcervi) wrote (last edit ):

From the attached logs I can see more than one stx-openstack reapply failing: one fails waiting for the mariadb-ingress pods and the other fails waiting for the openstack-ingress pods:

2022-04-27 15:49:10.199 177 ERROR armada.handlers.wait [-] [chart=openstack-mariadb]: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-mariadb)). These pods were not ready=['mariadb-ingress-bcd8fb475-kqcwl', 'mariadb-ingress-bcd8fb475-n6m7x']
2022-04-27 15:49:10.199 177 ERROR armada.handlers.armada [-] Chart deploy [openstack-mariadb] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-mariadb)). These pods were not ready=['mariadb-ingress-bcd8fb475-kqcwl', 'mariadb-ingress-bcd8fb475-n6m7x']

2022-04-27 16:22:18.694 232 ERROR armada.handlers.wait [-] [chart=openstack-ingress]: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-ingress)). These pods were not ready=['ingress-6b65fb7fc9-pls8m', 'ingress-6b65fb7fc9-qsgfv']
2022-04-27 16:22:18.694 232 ERROR armada.handlers.armada [-] Chart deploy [openstack-ingress] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-ingress)). These pods were not ready=['ingress-6b65fb7fc9-pls8m', 'ingress-6b65fb7fc9-qsgfv']

Since those pods were later deleted, I won't find any further information in the collected logs. I will try to reproduce this issue on my side to figure out what's happening.

Revision history for this message
Thales Elero Cervi (tcervi) wrote (last edit ):

I finally managed to reproduce a similar issue using stx and stx-openstack master builds. In my case, the reapply is breaking because of missing overrides, something I could not find in the attached logs though.

The critical point is: all the cg-openstack-*.yaml files inside /opt/platform/armada/22.06/stx-openstack/1.0-192-centos-stable-versioned seem to be disappearing on the first stx-os reapply after a controller node reboot, during the "generating application overrides" step. It happens because the first stx-os reapply after a controller node reboot is clearing the armada-overrides.yaml content:

$ cat /opt/platform/helm/22.06/stx-openstack/1.0-192-centos-stable-versioned/armada-overrides.yaml
[]

Definitely something to look into; this is probably the root cause of the lock/unlock test failures. I need to understand what's making those files vanish and which recent change, to the platform or the application, introduced this problematic behavior.

Revision history for this message
Thales Elero Cervi (tcervi) wrote (last edit ):

This issue seems to have the same root cause as this other bug: https://bugs.launchpad.net/starlingx/+bug/1972019

Apparently the fix released for that bug also fixed the stx-openstack problem, since I could not reproduce the reapply failure after lock/unlock using the master build (05 12 2022).

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I cannot check yet whether this issue is fixed on bare-metal servers because those servers are affected by:
https://bugs.launchpad.net/starlingx/+bug/1973888
But I checked it on the virtual configurations and observed that the stx-openstack apply is still failing.
I checked the pods and all of them are fine.

[sysadmin@controller-1 log(keystone_admin)]$ system application-list
+--------------------------+-------------------------------+-------------------------------+--------------------+--------------+------------------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+-------------------------------+-------------------------------+--------------------+--------------+------------------------------------------------------+
| cert-manager | 1.0-33 | fluxcd-manifests | fluxcd-manifests | applied | completed |
| nginx-ingress-controller | 1.1-26 | fluxcd-manifests | fluxcd-manifests | applied | completed |
| oidc-auth-apps | 1.0-64 | oidc-auth-manifest | manifest.yaml | uploaded | completed |
| platform-integ-apps | 1.0-46 | platform-integration-manifest | manifest.yaml | applied | completed |
| rook-ceph-apps | 1.0-16 | rook-ceph-manifest | manifest.yaml | uploaded | completed |
| stx-openstack | 1.0-199-centos-stable- | openstack-manifest | stx-openstack.yaml | apply-failed | Unexpected process termination while application- |
| | versioned | | | | apply was in progress. The application status has |
| | | | | | changed from 'applying' to 'apply-failed'. |
| | | | | | |
+--------------------------+-------------------------------+-------------------------------+--------------------+--------------+------------------------------------------------------+

[sysadmin@controller-1 log(keystone_admin)]$ fm alarm-list
+----------+----------------------------------------------------------------------+--------------------------------------+----------+-------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------------------------------------------...


Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Hi Alexandru, thanks for checking it on a virtual environment. I also used a virtual AIO-SX to test the reapply after lock/unlock and was not able to reproduce the apply failure using the master build (20220512T035413Z).

Was this failure 100% reproducible for you? I ask because I can see two alarms related to high resource usage on your system at the same time the apply failed:

* There was an instance trying to reboot on your active controller:
| 700.005 | Instance admin-vm-1 owned by admin is rebooting on host controller-0 |
* There was a peak of memory usage:
| 100.103 | Memory threshold exceeded ; threshold 90.00%, actual 111.50% | host=controller-1.memory=platform |

So I wonder whether this was a problem related to physical resource constraints during the reapply. From your logs I can see that the reapply timed out waiting for the "nova-api-proxy-cdffff877-fc6fb" pod around 2022-05-18 09:02:23:

2022-05-18 09:02:23.142 454 ERROR armada.handlers.wait [-] [chart=openstack-nova-api-proxy]: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-nova-api-proxy)). These pods were not ready=['nova-api-proxy-cdffff877-fc6fb']
2022-05-18 09:02:23.143 454 ERROR armada.handlers.armada [-] Chart deploy [openstack-nova-api-proxy] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-nova-api-proxy)). These pods were not ready=['nova-api-proxy-cdffff877-fc6fb']
2022-05-18 09:02:23.143 454 ERROR armada.handlers.armada Traceback (most recent call last)

But I can also see that later this pod was able to come up, probably ~5min after the reapply timeout. From containerization_kube.info:

Wed May 18 09:37:28 UTC 2022 : : kubectl describe nodes
...
  Namespace   Name                             CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------   ----                             ------------  ----------  ---------------  -------------  ---
  openstack   nova-api-proxy-cdffff877-6lkxm   0 (0%)        0 (0%)      0 (0%)           0 (0%)         32m

09:37:28 - 32m ~= 09:05:28

I will try to reproduce it again in my environment later today; I just wanted to point out the resource constraint on your virtual environment that could be causing your apply failure.

Anyways, thanks for updating your test results here!

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

We have been using the same virtual configurations for daily sanity for a few years now.
Hmm.. I don't think the issue is related to resource constraints.
The same issue can be seen on all virtual configurations: Simplex, Duplex, Standard and Standard External.

Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Hi Alexandru, thanks for providing the information here.
I was testing the master build 20220518 on an AIO-SX virtual environment and unfortunately could not reproduce the issue anymore :(

Tried several lock/unlock operations followed by manual stx-openstack application applies. During my tests all the apply operations worked, and the only issue I faced was the one I reported here: https://bugs.launchpad.net/starlingx/+bug/1974221
That's just a corner-case error message issue though.

Not sure if we should wait for a sanity run after https://bugs.launchpad.net/starlingx/+bug/1973888 is fixed to re-check whether the stx-openstack issue is still reproducible.
Or maybe you could provide me with more information about your test scenario on the virtual environment. I saw you were using a DX and had at least one instance running that was probably migrated during the lock/unlock (it was rebooting). My virtual environment has fewer processor resources, so I did not bring any instances up before locking/unlocking...

Revision history for this message
Alexandru Dimofte (adimofte) wrote :

Some more information from: /var/log/armada/stx-openstack-apply_2022-05-22-08-45-38.log

...
2022-05-22 09:13:43.700 138 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:176
2022-05-22 09:14:43.774 138 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:176
2022-05-22 09:15:43.841 138 DEBUG armada.handlers.lock [-] Updating lock update_lock /usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py:176
2022-05-22 09:15:44.582 138 ERROR armada.handlers.wait [-] [chart=openstack-nova-api-proxy]: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-nova-api-proxy)). These pods were not ready=['nova-api-proxy-667468b59d-vzzsb']
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada [-] Chart deploy [openstack-nova-api-proxy] failed: armada.exceptions.k8s_exceptions.KubernetesWatchTimeoutException: Timed out waiting for pods (namespace=openstack, labels=(release_group=osh-openstack-nova-api-proxy)). These pods were not ready=['nova-api-proxy-667468b59d-vzzsb']
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada Traceback (most recent call last):
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 170, in handle_result
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada result = get_result()
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada return self.__get_result()
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada raise self._exception
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada result = self.fn(*self.args, **self.kwargs)
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 159, in deploy_chart
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada concurrency)
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 55, in execute
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada ch, cg_test_all_charts, prefix, known_releases)
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 267, in _execute
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada chart_wait.wait(timer)
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/wait.py", line 142, in wait
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada wait.wait(timeout=timeout)
2022-05-22 09:15:44.582 138 ERROR armada.handlers.armada ...

Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Thanks Alexandru.

Where can I find the test logs as well? I mean the actual automation script logs, to follow the test steps and understand the actions and expected outcomes...
That would be very helpful for going through your recently added logs.

Cheers

Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Hi, thanks for sharing those logs.
They were indeed a great help in understanding the full picture better. Apparently the majority of the tests passed during that execution, but some "key tests" did not. In fact, the first test to fail was "test_lock_unlock_standby_controller", which tried to lock and unlock controller-1, but stx-os could not reach the applied status before the timeout:

[2022-05-22 07:24:44,669] 290 WARNING MainThread container_helper.wait_for_apps_status:: ['stx-openstack'] did not reach status applied within 360s

That reapply did not fail within the timeout though; the last status update I can see is from [2022-05-22 07:24:34,605] and it was:
 applying | processing chart: osh-openstack-nginx-ports-control, overall completion: 10.0%

And around that time controller-1 was not stable:
sysinv 2022-05-22 07:22:47.882 835363 INFO sysinv.conductor.manager [-] Node(s) are in an unstable state. Defer audit.
sysinv 2022-05-22 07:23:29.696 835363 INFO sysinv.conductor.manager [-] Updating platform data for host: 4ddfe5d8-92ed-4df0-9912-2136257f3a81 with: {u'first_report': True}

What I could find in the armada log right before the reapply error:
2022-05-22 07:23:35.853 344 WARNING urllib3.connectionpool [-] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /apis/armada.process/v1/namespaces/kube-system/locks/locks.armada.process.lock

Not sure what caused that unstable status on controller-1, but the logs have a few leads on it:

[2022-05-22 07:18:31,993] 4808 WARNING MainThread host_helper.wait_for_tasks_affined:: /etc/platform/.task_affining_incomplete did not clear on controller-1

2022-05-22T07:22:02.753737 | log | 200.022 | controller-1 is now 'offline' host=controller-1.status=offline
2022-05-22T07:22:02.752373 | set | 400.005 | Communication failure detected with peer over port eno1 on host controller-0 | host=controller-0.network=oam
2022-05-22T07:22:02.751492 | log | 200.022 | controller-1 is now 'disabled' host=controller-1.state=disabled
2022-05-22T07:22:02.749402 | set | 200.004 | controller-1 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful host=controller-1
2022-05-22T07:22:02.748747 | log | 401.005 | Communication failure detected with peer over port eno1 on host controller-0 | host=controller-0.network=oam
2022-05-22T07:22:02.748030 | set | 400.005 | Communication failure detected with peer over port eno2 on host controller-0 | host=controller-0.network=cluster
2022-05-22T07:22:02.684878 | set | 200.005 | controller-1 experienced a persistent critical 'Management Network' communication failure.| host=controller-1.network=Management

Around the same time I could also see 3...


Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Pedro Almeida will be trying to reproduce the issue in a physical lab and, if he is able to, we will proceed with testing a possible solution for the evicted pods issue.

Changed in starlingx:
assignee: Thales Elero Cervi (tcervi) → nobody
Ghada Khalil (gkhalil)
tags: added: stx.cherrypickneeded
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Thales Elero Cervi (tcervi)
Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Since Pedro was also not able to reproduce this issue in a physical lab, during today's community call we decided to mark the bug as "can not reproduce" and wait for the result of our next stx sanity with stx-openstack to check it again.

Changed in starlingx:
status: Triaged → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)
Changed in starlingx:
status: Invalid → In Progress
Revision history for this message
Thales Elero Cervi (tcervi) wrote :

We were able to reproduce the issue internally and Pedro will be applying and testing the suggested fix.
Thanks, Pedro!

Changed in starlingx:
assignee: Thales Elero Cervi (tcervi) → nobody
Changed in starlingx:
assignee: nobody → Pedro Monteiro Azevedo de Moura Almeida (pmonteir)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to helm-charts (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to helm-charts (master)

Reviewed: https://review.opendev.org/c/starlingx/helm-charts/+/854006
Committed: https://opendev.org/starlingx/helm-charts/commit/b34b86880113b69840ad1a12f23d0dde62b52373
Submitter: "Zuul (22348)"
Branch: master

commit b34b86880113b69840ad1a12f23d0dde62b52373
Author: Rafael Falcão <email address hidden>
Date: Mon Aug 22 12:08:43 2022 -0300

    Add resources specification to fm-rest-api

    Some pods are being evicted due to some containers
    exceeding the usage of a specific resource. The goal
    of this change is to be able to specify values of
    limits and requests for fm-rest-api pods. In this
    way we can make sure that the system has the necessary
    resources to support the api.

    Test Plan:

    PASS: Check that with the 'enabled' flag set to 'false' no
          values of requests and limits are specified.
    PASS: Check that with the 'enabled' flag set to 'true' the
          default values of requests and limits takes place.
    PASS: Override the default value of requests/limits and set the
          'enabled' flag to 'true' and check that the new value
          takes place in the description of the pod.

    Partial-bug: 1970645

    Signed-off-by: Rafael Falcao <email address hidden>
    Change-Id: I8a247c09643303f80a61b989d4b82c3835b7e601
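
For readers following the test plan above, the values shape for this kind of toggle typically looks like the sketch below. This is only an illustration of the pattern: the key names and default amounts are assumptions based on common Helm chart conventions, not the exact contents of the fm-rest-api chart.

resources:
  enabled: true                  # when false, the chart renders no requests/limits at all
  api:
    requests:
      cpu: "100m"                # placeholder defaults; a deployment can override them
      memory: "128Mi"
      ephemeral-storage: "50Mi"
    limits:
      cpu: "2000m"
      memory: "1024Mi"
      ephemeral-storage: "256Mi"

With enabled set to true and values overridden, as in the last test-plan item, the overridden amounts are what show up in the description of the rendered pod.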

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to helm-charts (r/stx.7.0)

Fix proposed to branch: r/stx.7.0
Review: https://review.opendev.org/c/starlingx/helm-charts/+/854780

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/853693
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/e2d9f126ef81eafa3b8ee00fb62b52898aed2b5b
Submitter: "Zuul (22348)"
Branch: master

commit e2d9f126ef81eafa3b8ee00fb62b52898aed2b5b
Author: Pedro Almeida <email address hidden>
Date: Thu Aug 18 15:54:31 2022 -0300

    Adding ephemeral-storage request on pods

    Some pods are being evicted due to some containers exceeding the
    usage of ephemeral-storage: since there was no value on the manifest,
    it was set to 0, which means that basically any value would cause the
    pod to fail.

    Test Plan:
    PASS - Build stx-openstack tarball.
    PASS - Upload/apply stx-openstack and check that new values of
           ephemeral-storage took place.
    PASS - Remove/delete stx-openstack.

    Depends-On: https://review.opendev.org/c/starlingx/helm-charts/+/854006

    Closes-Bug: #1970645

    Signed-off-by: Pedro Almeida <email address hidden>
    Co-authored-by: Rafael Falcao <email address hidden>
    Change-Id: Idcb61c976820574d9ac771cd3bbc1f91f8651f54

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to helm-charts (r/stx.7.0)

Reviewed: https://review.opendev.org/c/starlingx/helm-charts/+/854780
Committed: https://opendev.org/starlingx/helm-charts/commit/58c69559f56459628db386b9846583580668290a
Submitter: "Zuul (22348)"
Branch: r/stx.7.0

commit 58c69559f56459628db386b9846583580668290a
Author: Rafael Falcão <email address hidden>
Date: Mon Aug 22 12:08:43 2022 -0300

    Add resources specification to fm-rest-api

    Some pods are being evicted due to some containers
    exceeding the usage of a specific resource. The goal
    of this change is to be able to specify values of
    limits and requests for fm-rest-api pods. In this
    way we can make sure that the system has the necessary
    resources to support the api.

    Test Plan:

    PASS: Check that with the 'enabled' flag set to 'false' no
          values of requests and limits are specified.
    PASS: Check that with the 'enabled' flag set to 'true' the
          default values of requests and limits takes place.
    PASS: Override the default value of requests/limits and set the
          'enabled' flag to 'true' and check that the new value
          takes place in the description of the pod.

    Partial-bug: 1970645

    Signed-off-by: Rafael Falcao <email address hidden>
    Change-Id: I8a247c09643303f80a61b989d4b82c3835b7e601
    (cherry picked from commit b34b86880113b69840ad1a12f23d0dde62b52373)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (r/stx.7.0)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (r/stx.7.0)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/854836
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/224b44c3cc4b3dead3a57080cc56b1f2519db863
Submitter: "Zuul (22348)"
Branch: r/stx.7.0

commit 224b44c3cc4b3dead3a57080cc56b1f2519db863
Author: Pedro Almeida <email address hidden>
Date: Thu Aug 18 15:54:31 2022 -0300

    Adding ephemeral-storage request on pods

    Some pods are being evicted due to some containers exceeding the
    usage of ephemeral-storage: since there was no value on the manifest,
    it was set to 0, which means that basically any value would cause the
    pod to fail.

    Test Plan:
    PASS - Build stx-openstack tarball.
    PASS - Upload/apply stx-openstack and check that new values of
           ephemeral-storage took place.
    PASS - Remove/delete stx-openstack.

    Depends-On: https://review.opendev.org/c/starlingx/helm-charts/+/854006

    Closes-Bug: #1970645

    Signed-off-by: Pedro Almeida <email address hidden>
    Co-authored-by: Rafael Falcao <email address hidden>
    Change-Id: Idcb61c976820574d9ac771cd3bbc1f91f8651f54
    (cherry picked from commit e2d9f126ef81eafa3b8ee00fb62b52898aed2b5b)

Ghada Khalil (gkhalil)
tags: added: in-r-stx70
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/859467
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/9fed9555eddbbb89bdbbc76b5f570a9245a56308
Submitter: "Zuul (22348)"
Branch: master

commit 9fed9555eddbbb89bdbbc76b5f570a9245a56308
Author: Rafael Falcão <email address hidden>
Date: Tue Sep 27 14:48:20 2022 -0300

    Adding ephemeral-storage request on pods (FluxCD)

    Some pods are being evicted due to some containers exceeding the
    usage of ephemeral-storage. This modifications has been applied by
    [1] in the armada application. This review aims to bring this
    update to the FluxCD structure.

    Test Plan:
    PASS - Build stx-openstack tarball.
    PASS - Upload/apply stx-openstack and check that new values of
           ephemeral-storage took place.
    PASS - Remove/delete stx-openstack.

    [1] - https://review.opendev.org/c/starlingx/openstack-armada-app/+/853693

    Partial-bug: 1970645

    Signed-off-by: Rafael Falcao <email address hidden>
    Change-Id: I56c106508091bdaaf63f441972c67970cb8835cc
