Debug info missing for memory alarm issues

Bug #1973815 reported by Cesar Bombonate
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Brief Description
-----------------
Add instrumentation or log capture so that a memory alarm can be analyzed and its root cause identified from the collect data captured at a field site.

Severity
--------
Minor

Steps to Reproduce
------------------
Allow a memory alarm to trigger.
Once the memory alarm subsides, attempt to determine the cause after the fact.
Current tools such as collect do not show any indication of the process at fault.

Expected Behavior
------------------
There should be some way to identify the process responsible for a memory alarm after the fact.

Actual Behavior
----------------
Currently there is no way to identify the process responsible for a memory alarm after the alarm has cleared and memory usage has normalized.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
One node system

Branch/Pull Time/Commit
-----------------------
N/A

Last Pass
---------
N/A

Timestamp/Logs
--------------
N/A

Test Activity
-------------
N/A

Workaround
----------
N/A

Tags: stx.fault
summary: - Implement enhancement to help triage memory alarms
+ Debug info missing for memory alarm issues
Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.fault
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (master)

Reviewed: https://review.opendev.org/c/starlingx/monitoring/+/840548
Committed: https://opendev.org/starlingx/monitoring/commit/2d6ba853b75b2fc32b698c2bc468a8254ebfe7a9
Submitter: "Zuul (22348)"
Branch: master

commit 2d6ba853b75b2fc32b698c2bc468a8254ebfe7a9
Author: Cesar Bombonate <email address hidden>
Date: Wed May 4 15:48:58 2022 -0400

    Implement enhancement to help triage memory alarms

    Implement the instrumentation/logging of the
    top 10 memory intensive processes.
    The goal is to have additional info in the collect
    logs to help determine why a memory alarm is raised.

    This change adds 3 new log lines to the collectd log file.
    These lines are written every 5 minutes, logging the top 10 memory
    rss intensive processes.

    To do this we walk the cgroups folders and capture the pids.
    We then query each pid to locate its rss memory consumption.

    The first line shows the top 10 memory rss processes for the platform,
    taking the pids found under cgroups. It excludes the cgroups/k8s-infra
    folder, so it only shows processes considered to be platform related.

    The second log line contains the top 10 pids under the cgroups/k8s-infra
    folder, which covers all of the pods and k8s related pids. We validate the
    namespace of each pod to exclude k8s-addon related pids, so it only shows
    processes in the k8s-system namespaces.

    For the third line we query all of the pids system wide and surface
    the top 10 memory intensive processes.

    Seeing duplicate process names is expected, as we are looking at each
    individual pid rather than aggregating by process name.

    For Reference.
    K8S_NAMESPACE_SYSTEM = [
    'kube-system', 'armada', 'cert-manager', 'portieris',
    'vault', 'notification', 'platform-deployment-manager',
    'flux-helm', 'metrics-server']

    K8S_NAMESPACE_ADDON = ['monitor', 'openstack']

    Test Plan:

    PASS: built image successfully.
    PASS: Installed image successfully.
    PASS: bootstrap and unlock successful
    PASS: build updated latest image.
    PASS: Check collectd service is loaded and active.
    PASS: Verify collectd CPU and Memory Consumption Growth
    PASS: Tested On Debian using the latest master Branch.

    Example Output:
    2022-06-27T17:38:50.953 controller-0 collectd[2607872]: info The top 10 memory rss processes for the platform are :
    [('ceph-mon', '950.58 MiB'), ('ceph-osd', '317.41 MiB'),
     ('sysinv-conducto', '215.15 MiB'), ('ceph-mgr', '198.13 MiB'),
     ('sysinv-api', '182.20 MiB'), ('kubelet', '158.93 MiB'),
     ('etcd', '154.39 MiB'), ('python', '131.34 MiB'),
     ('python', '131.33 MiB'), ('sysinv-api', '113.25 MiB')]
    2022-06-27T17:38:50.954 controller-0 collectd[2607872]: info The top 10 memory rss processes for the Kubernetes System are :
    [('java', '4.52 GiB'), ('java', '1.32 GiB'),
     ('java', '829.71 MiB'), ('java', '796.45 MiB'),
     ('kube-apiserver', '613.39 MiB'), ('java', '564.62 MiB'),
     ('node', '544.01 MiB'), ('helm-controller', '352.71 MiB'),
     ('metricbeat', '187.61 MiB'), ('me...

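For readers triaging from the logs above, here is a minimal, illustrative sketch of the approach the commit describes: walk the cgroup folders for pids, read each pid's rss from /proc, then report the top 10 per group. It is not the merged collectd plugin code. The paths (a cgroup v1 memory hierarchy with a k8s-infra subtree), the helper names (pids_under, rss_and_name, top10), and the use of print instead of the collectd logger are assumptions for illustration, and the per-pod namespace filtering against K8S_NAMESPACE_SYSTEM/K8S_NAMESPACE_ADDON is omitted.

    # Illustrative sketch only; assumed paths and helper names are not
    # taken from the real plugin.
    import os

    CGROUP_ROOT = '/sys/fs/cgroup/memory'            # assumption: cgroup v1 memory controller
    K8S_INFRA_DIR = os.path.join(CGROUP_ROOT, 'k8s-infra')

    def pids_under(path):
        """Collect all pids listed in cgroup.procs files below 'path'."""
        pids = set()
        for root, _dirs, files in os.walk(path):
            if 'cgroup.procs' in files:
                try:
                    with open(os.path.join(root, 'cgroup.procs')) as f:
                        pids.update(int(line) for line in f if line.strip())
                except OSError:
                    pass  # cgroup may disappear while we walk
        return pids

    def rss_and_name(pid):
        """Return (name, rss_bytes) for a pid from /proc, or None if it exited."""
        try:
            with open('/proc/%d/comm' % pid) as f:
                name = f.read().strip()
            with open('/proc/%d/status' % pid) as f:
                for line in f:
                    if line.startswith('VmRSS:'):
                        return name, int(line.split()[1]) * 1024  # kB -> bytes
        except (OSError, ValueError):
            return None
        return None

    def top10(pids):
        """Top 10 (name, rss) entries for a set of pids, largest first."""
        entries = [e for e in (rss_and_name(p) for p in pids) if e]
        return sorted(entries, key=lambda e: e[1], reverse=True)[:10]

    all_pids = pids_under(CGROUP_ROOT)
    k8s_pids = pids_under(K8S_INFRA_DIR)
    platform_pids = all_pids - k8s_pids

    print('platform top 10:', top10(platform_pids))
    print('k8s-infra top 10:', top10(k8s_pids))
    print('system-wide top 10:', top10(all_pids))

The sorted (name, size) pairs correspond to the tuples visible in the example log lines quoted above.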

Changed in starlingx:
status: In Progress → Fix Released