STX-O Master | memory threshold exceeded when app is applied

Bug #2052539 reported by Gabriel Calixto de Paula
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Thales Elero Cervi
Milestone: (none)

Bug Description

Brief Description
-----------------
After applying the STX-Openstack app, the 100.103 'Memory threshold exceeded' alarm is raised

Severity
--------
Major

Steps to Reproduce
------------------
- Apply the STX-Openstack app on a STX master deployment (see the sketch below)
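
For reference, a minimal reproduction sketch using the standard StarlingX application commands; the tarball file name below is an assumption, substitute the actual build artifact:

# Upload and apply the stx-openstack application (tarball name is hypothetical)
system application-upload stx-openstack-1.0-latest.tgz
system application-apply stx-openstack
# Check for the 100.103 alarm once the apply completes
fm alarm-list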

Expected Behavior
------------------
The app is applied without raising alarms

Actual Behavior
----------------
The app is applied and the 100.103 alarm is displayed

Reproducibility
---------------
Reproducible

System Configuration
--------------------
DX

Branch/Pull Time/Commit
-----------------------
STX master 20240204T070002Z
STX Openstack 2024-01-31

Last Pass
---------
Nov-28 sanity report; the alarm was not reproduced there

Timestamp/Logs
--------------
alarms:
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+-------------------------------------------------------------+----------------------+----------+----------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------------------------------------------------------+----------------------+----------+----------------+
| 100.103 | Memory threshold exceeded ; threshold 90.00%, actual 94.77% | host=controller-0. | critical | 2024-02-06T17: |
| | | memory=platform | | 10:28.506106 |
| | | | | |
+----------+-------------------------------------------------------------+----------------------+----------+----------------+

Test Activity
-------------
Sanity

Workaround
----------
N/A

tags: added: stx.distro.openstack
Ghada Khalil (gkhalil)
tags: added: stx.9.0
Changed in starlingx:
importance: Undecided → High
Changed in starlingx:
assignee: nobody → Thales Elero Cervi (tcervi)
importance: High → Medium
Revision history for this message
Thales Elero Cervi (tcervi) wrote :

I used kube-memory to trace "openstack" namespace pods resources usage and got the following:

+-----------+--------------------------+
| Namespace | Resident Set Size (MiB)  |
+-----------+--------------------------+
| openstack | 5431.63                  |
+-----------+--------------------------+
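
For anyone reproducing this measurement without the kube-memory tool, a rough equivalent check (assuming metrics-server is running) is shown below; note that kubectl reports working-set memory rather than RSS, so the numbers will not match exactly:

# Per-pod memory usage in the openstack namespace, highest first
kubectl top pods -n openstack --sort-by=memory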

The top pods by memory usage are:

keystone-api : ~560 (MiB)
maria-db-server : ~367 (MiB)
neutron-dhcp-agent : ~482 (MiB)
neutron-l3-agent : ~398 (MiB)
neutron-server : ~553 (MiB)

So I tried an old app tarball that I knew was not throwing this alarm (app built on 20231218T170059Z) and, to my surprise, the memory usage is pretty much the same (even a bit higher) and it DOES NOT trigger the memory threshold alarm:

+-----------+--------------------------+
| Namespace | Resident Set Size (MiB)  |
+-----------+--------------------------+
| openstack | 5509.31                  |
+-----------+--------------------------+

The top pods by memory usage are:

keystone-api : ~545 (MiB)
maria-db-server : ~384 (MiB)
neutron-dhcp-agent : ~476 (MiB)
neutron-l3-agent : ~397 (MiB)
neutron-server : ~554 (MiB)

Since the StarlingX ISO is the same on both tests (20240206T070059Z build), I will review the latest commits to stx/openstack-armada-app, but I am unsure whether anything in the app has changed that could cause this issue.

One point to note is that the majority of our docker images currently point to the "stable/2023.1" branches of the OpenStack repos, and those occasionally receive cherry-picked fixes. One new (broken) image could be causing a memory usage peak... but since I do not see any difference between the "latest" and the "20231218T170059Z" apps with regard to memory usage, this is probably not the case here.

Revision history for this message
Thales Elero Cervi (tcervi) wrote :

The only 3 changes made to the application since Dec 18th are the following:

e3b2509 Add platform label to pods in upstream repositories
a6401e8 Add platform label to pods in stx-openstack
abb61c3 Update app Zuul Check Jobs.
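
As a note on method, a commit list like the one above can be reproduced with a plain git query against a local clone of the app repository (the checkout path is an assumption):

# List commits to openstack-armada-app since the last known-good build
git -C openstack-armada-app log --oneline --since=2023-12-18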

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)
Changed in starlingx:
status: New → In Progress
Changed in starlingx:
importance: Medium → High
Revision history for this message
Thales Elero Cervi (tcervi) wrote :

Just for the sake of documentation: I double-checked the stx-* images built from the OpenStack "stable/2023.1" branches and, as of today, five images have received cherry-picks since 2023 Dec 18th. I pinned the following images to commit SHAs preceding 2023 Dec 18th:

* stx-cinder: e215be5fc6bcd4cdbc32883d4215a289986e4b01
* stx-horizon: 5ca2d4082900dc51ae85226d903cb0a397f3e225
* stx-keystone: ea8c8aa982235240cf58bd561132aefe33439506
* stx-neutron: 877e85e5a8b8668fc4baf7ee84b71225dadfa647
* stx-nova: 698421064b4604087634f8ea219795dad0b4928c

I tested stx-openstack using the images built from those commit SHAs and, as per my previous comment, this did not affect the alarm being raised, since the memory usage is basically the same.

The problem seems to be related to the introduction of the "app.starlingx.io/component" label in the stx-openstack helm charts, since it was added with "isApplication: false" as the default, which is probably misleading the platform metrics on platform memory usage.
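
A quick way to see how pods are being classified is to print that label directly; under this theory, pods whose component label resolves to "platform" are counted against platform memory:

# Show the app.starlingx.io/component label on each openstack pod
kubectl get pods -n openstack -L app.starlingx.io/component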

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/908672
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/4cd7a5d544c2b70acf66923834ab2da05efc9465
Submitter: "Zuul (22348)"
Branch: master

commit 4cd7a5d544c2b70acf66923834ab2da05efc9465
Author: Thales Elero Cervi <email address hidden>
Date: Fri Feb 9 15:04:32 2024 -0300

    Update "isApplication" label default value to true

    When the app.starlingx.io/component label was added to pods it was
    decided that it would be controlled by a values.yaml key
    labels:isApplication and its default value was set to "false" [1]
    and [2]. This should be changed to "true" since all pods created by
    the stx-openstack charts are related to the stx-openstack app.

    Additionally, according to LP #2052539 this new label was mixing memory
    usage metrics and wrongly triggering platform alarms for "Memory
    threshold exceeded".

    [1] https://review.opendev.org/c/starlingx/openstack-armada-app/+/903918
    [2] https://review.opendev.org/c/starlingx/openstack-armada-app/+/904128

    TEST PLAN:
    PASS - Build openstack-helm package
    PASS - Build openstack-helm-infra package
    PASS - Build stx-openstack application tarball
    PASS - Upload/Apply stx-openstack application (AIO-DX)
    PASS - No alarms triggered after application is applied

    Closes-Bug: 2052539

    Change-Id: I6e3e99b48c65d6c1730eca2f85b9d700dd6b1ef9
    Signed-off-by: Thales Elero Cervi <email address hidden>
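
With the fix merged, a minimal verification pass (mirroring the commit's test plan) could look like the following; it assumes the updated application tarball has already been uploaded:

# Re-apply the updated application and confirm the 100.103 alarm is gone
system application-apply stx-openstack
fm alarm-list
# Pods should now carry the application component label
kubectl get pods -n openstack -L app.starlingx.io/component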

Changed in starlingx:
status: In Progress → Fix Released