Debug info missing for memory alarm issues

Bug #1973815 reported by Cesar Bombonate
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Brief Description
-----------------
Add instrumentation or log capture so that a memory alarm can be analyzed and its root cause identified from the collect data captured at a field site.

Severity
--------
Minor

Steps to Reproduce
------------------
Allow a memory alarm to trigger.
Once the memory alarm subsides, attempt to determine the cause after the fact.
Current tools such as collect do not show any indication of the process at fault.

Expected Behavior
------------------
There should be some way to identify the process responsible for a memory alarm after the fact.

Actual Behavior
----------------
Currently there is no way to identify the process responsible for a memory alarm after the alarm has cleared and memory usage has normalized.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
One node system

Branch/Pull Time/Commit
-----------------------
N/A

Last Pass
---------
N/A

Timestamp/Logs
--------------
N/A

Test Activity
-------------
N/A

Workaround
----------
N/A

Tags: stx.fault
summary: - Implement enhancement to help triage memory alarms
+ Debug info missing for memory alarm issues
Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.fault
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (master)

Reviewed: https://review.opendev.org/c/starlingx/monitoring/+/840548
Committed: https://opendev.org/starlingx/monitoring/commit/2d6ba853b75b2fc32b698c2bc468a8254ebfe7a9
Submitter: "Zuul (22348)"
Branch: master

commit 2d6ba853b75b2fc32b698c2bc468a8254ebfe7a9
Author: Cesar Bombonate <email address hidden>
Date: Wed May 4 15:48:58 2022 -0400

    Implement enhancement to help triage memory alarms

    Implement the instrumentation/logging of the
    top 10 memory intensive processes.
    The goal is to have additional info in the collect
    logs to help determine why a memory alarm is raised.

    This change adds 3 new log lines to the collectd log file.
    These lines are written every 5 minutes, logging the top 10 memory
    rss intensive processes.

    To do this we walk the cgroups folders and capture the pids.
    We then query each pid to locate its rss memory consumption.

    The first line shows the top 10 memory rss processes for the platform,
    taking the pids found under cgroups. It excludes the cgroups/k8s-infra
    folder, so it only shows processes considered to be platform related.

    The second log line contains the top 10 pids under the cgroups/k8s-infra
    folder, which covers all of the pods and k8s related pids. We validate the
    namespace of each pod to exclude k8s-addon related pids, so it only shows
    processes in the k8s-system namespaces.

    For the third line we query all of the pids system wide and surface
    the top 10 memory intensive processes.

    Seeing duplicate process names is expected, as we are looking at each
    individual pid rather than aggregating by process name.

    For Reference.
    K8S_NAMESPACE_SYSTEM = [
    'kube-system', 'armada', 'cert-manager', 'portieris',
    'vault', 'notification', 'platform-deployment-manager',
    'flux-helm', 'metrics-server']

    K8S_NAMESPACE_ADDON = ['monitor', 'openstack']

    Test Plan:

    PASS: built image successfully.
    PASS: Installed image successfully.
    PASS: bootstrap and unlock successful
    PASS: build updated latest image.
    PASS: Check collectd service is loaded and active.
    PASS: Verify collectd CPU and Memory Consumption Growth
    PASS: Tested On Debian using the latest master Branch.

    Example Output:
    2022-06-27T17:38:50.953 controller-0 collectd[2607872]: info The top 10 memory rss processes for the platform are :
    [('ceph-mon', '950.58 MiB'), ('ceph-osd', '317.41 MiB'),
     ('sysinv-conducto', '215.15 MiB'), ('ceph-mgr', '198.13 MiB'),
     ('sysinv-api', '182.20 MiB'), ('kubelet', '158.93 MiB'),
     ('etcd', '154.39 MiB'), ('python', '131.34 MiB'),
     ('python', '131.33 MiB'), ('sysinv-api', '113.25 MiB')]
    2022-06-27T17:38:50.954 controller-0 collectd[2607872]: info The top 10 memory rss processes for the Kubernetes System are :
    [('java', '4.52 GiB'), ('java', '1.32 GiB'),
     ('java', '829.71 MiB'), ('java', '796.45 MiB'),
     ('kube-apiserver', '613.39 MiB'), ('java', '564.62 MiB'),
     ('node', '544.01 MiB'), ('helm-controller', '352.71 MiB'),
     ('metricbeat', '187.61 MiB'), ('me...

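For readers triaging from the logs above, here is a minimal, illustrative sketch of the approach the commit describes: walk the cgroup folders for pids, read each pid's rss from /proc, then report the top 10 per group. It is not the merged collectd plugin code. The paths (a cgroup v1 memory hierarchy with a k8s-infra subtree), the helper names (pids_under, rss_and_name, top10), and the use of print instead of the collectd logger are assumptions for illustration, and the per-pod namespace filtering against K8S_NAMESPACE_SYSTEM/K8S_NAMESPACE_ADDON is omitted.

    # Illustrative sketch only; assumed paths and helper names are not
    # taken from the real plugin.
    import os

    CGROUP_ROOT = '/sys/fs/cgroup/memory'            # assumption: cgroup v1 memory controller
    K8S_INFRA_DIR = os.path.join(CGROUP_ROOT, 'k8s-infra')

    def pids_under(path):
        """Collect all pids listed in cgroup.procs files below 'path'."""
        pids = set()
        for root, _dirs, files in os.walk(path):
            if 'cgroup.procs' in files:
                try:
                    with open(os.path.join(root, 'cgroup.procs')) as f:
                        pids.update(int(line) for line in f if line.strip())
                except OSError:
                    pass  # cgroup may disappear while we walk
        return pids

    def rss_and_name(pid):
        """Return (name, rss_bytes) for a pid from /proc, or None if it exited."""
        try:
            with open('/proc/%d/comm' % pid) as f:
                name = f.read().strip()
            with open('/proc/%d/status' % pid) as f:
                for line in f:
                    if line.startswith('VmRSS:'):
                        return name, int(line.split()[1]) * 1024  # kB -> bytes
        except (OSError, ValueError):
            return None
        return None

    def top10(pids):
        """Top 10 (name, rss) entries for a set of pids, largest first."""
        entries = [e for e in (rss_and_name(p) for p in pids) if e]
        return sorted(entries, key=lambda e: e[1], reverse=True)[:10]

    all_pids = pids_under(CGROUP_ROOT)
    k8s_pids = pids_under(K8S_INFRA_DIR)
    platform_pids = all_pids - k8s_pids

    print('platform top 10:', top10(platform_pids))
    print('k8s-infra top 10:', top10(k8s_pids))
    print('system-wide top 10:', top10(all_pids))

The sorted (name, size) pairs correspond to the tuples visible in the example log lines quoted above.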

Changed in starlingx:
status: In Progress → Fix Released