Alarm 100.103 "Platform Memory threshold exceeded" not cleared

Bug #1835545 reported by Peng Peng
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Al Bailey

Bug Description

Brief Description
-----------------
Alarm 100.103 "Platform Memory threshold exceeded" was raised but never cleared. The operation performed before the alarm was raised was a controller lock/unlock; it is not clear whether that is related to this issue.

Severity
--------
Minor

Steps to Reproduce
------------------
lock/unlock controller

TC-name:

Expected Behavior
------------------
100.103 alarm should be cleared

Actual Behavior
----------------
100.103 alarm not cleared
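
A minimal sketch of how the expected behavior could be checked automatically: poll the active alarm list until 100.103 disappears or a timeout expires. The `list_alarm_ids` callable is hypothetical; it stands in for whatever runs `fm alarm-list` and returns the Alarm ID column. The timeout and interval defaults are arbitrary assumptions, not values from this report.

```python
import time
from typing import Callable, Iterable


def wait_for_alarm_clear(
    list_alarm_ids: Callable[[], Iterable[str]],  # hypothetical: returns active alarm IDs
    alarm_id: str = "100.103",
    timeout_s: float = 1800.0,   # assumed value, tune per lab
    interval_s: float = 30.0,    # assumed value, tune per lab
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once alarm_id is absent from the active list, False on timeout."""
    waited = 0.0
    while True:
        if alarm_id not in set(list_alarm_ids()):
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Injecting `sleep` keeps the helper testable without real delays; a test run can pass `sleep=lambda s: None`.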

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Two node system

Lab-name: IP_5-6

Branch/Pull Time/Commit
-----------------------
stx master as of 20190705T013000Z

Last Pass
---------
Lab: SM_5_6
Load: 20190701T233000Z

Timestamp/Logs
--------------
[2019-07-05 10:50:49,771] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock controller-0'

[2019-07-05 10:51:44,716] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

[2019-07-05 11:06:20,217] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-07-05 11:06:22,435] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
| f91d4e52-10d7-4f9a-9c98-038b9f0bf51f | 400.001 | Service group cloud-services warning; dbmon(enabled-standby, ) | service_domain=controller.service_group=cloud-services.host=controller-0 | minor | 2019-07-05T11:05:07.436436 |
| 7913be65-6605-40b9-8f79-6e47e98e87c6 | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-07-05T09:50:53.399027 |
+--------------------------------------+----------+------------------------------------------------------------------------+--------------------------------------------------------------------------+----------+----------------------------+
controller-1:~$

[2019-07-05 11:10:48,204] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-07-05 11:10:49,778] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------+------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------+------------------------------+----------+----------------------------+
| 6ceff40c-9ba9-48fb-804e-2e08dac06611 | 100.103 | Platform Memory threshold exceeded ; threshold 80.00%, actual 80.02% | host=controller-1.numa=node1 | major | 2019-07-05T11:10:23.385929 |
| 7913be65-6605-40b9-8f79-6e47e98e87c6 | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-07-05T09:50:53.399027 |
+--------------------------------------+----------+------------------------------------------------------------------------+------------------------------+----------+----------------------------+

[2019-07-05 11:44:54,795] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-07-05 11:44:57,125] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+----------------------------------------------------------------------+------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+----------------------------------------------------------------------+------------------------------+----------+----------------------------+
| 77a7c5ee-1c10-42a5-a8cb-091677580ccf | 100.103 | Platform Memory threshold exceeded ; threshold 80.00%, actual 83.03% | host=controller-1.numa=node1 | major | 2019-07-05T11:44:23.388858 |
+--------------------------------------+----------+----------------------------------------------------------------------+------------------------------+----------+----------------------------+
controller-1:~$
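
For anyone post-processing logs like the ones above, here is a rough sketch of turning the `fm alarm-list` table into a list of dictionaries keyed by column header. It assumes no cell contains a `|` character, which holds for the rows shown here. The function name is illustrative only, and the border lines in the sample are shortened.

```python
from typing import Dict, List

SAMPLE = """\
+------+----------+-------------+-----------+----------+------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+------+----------+-------------+-----------+----------+------------+
| 77a7c5ee-1c10-42a5-a8cb-091677580ccf | 100.103 | Platform Memory threshold exceeded ; threshold 80.00%, actual 83.03% | host=controller-1.numa=node1 | major | 2019-07-05T11:44:23.388858 |
+------+----------+-------------+-----------+----------+------------+
controller-1:~$
"""


def parse_alarm_table(text: str) -> List[Dict[str, str]]:
    """Parse a '|'-delimited CLI table; the first '|' row is taken as the header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, shell prompts and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


alarms = parse_alarm_table(SAMPLE)
memory_alarms = [a for a in alarms if a["Alarm ID"] == "100.103"]
```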

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Was the stx-openstack application applied when the alarm was raised?
Is this the same issue reported in https://bugs.launchpad.net/starlingx/+bug/1818088?

Changed in starlingx:
status: New → Incomplete
Peng Peng (ppeng) wrote :

This alarm was not present when the stx-openstack application was first applied. Please see the log attached below.

After the first controller lock/unlock, the stx-openstack application was reapplied, but the apply got stuck at "processing chart: osh-openstack-ceph-rgw, overall completion: 42.0%" (LP-1833609). The 100.103 alarm was not raised at that point.

After another controller lock/unlock, 100.103 was raised.

Some logs:

[2019-07-05 09:12:32,765] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
[2019-07-05 09:12:34,598] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+------------------------------+-------------------------------+--------------------+---------------+----------------------------------------------------------------------------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+------------------------------+-------------------------------+--------------------+---------------+----------------------------------------------------------------------------------------------------------------+
| hello-kitty | 1.0 | hello-kitty | manifest.yaml | upload-failed | Upload of application hello-kitty (1.0) failed: Command '['helm-upload', 'starlingx', u'/scratch/apps/hello- |
| | | | | | kitty/1.0/charts/hello-kitty.tgz']' returned non-zero exit status 1 |
| | | | | | |
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-17-centos-stable- | armada-manifest | stx-openstack.yaml | applied | completed |
| | versioned | | | | |
| | | | | | |
+---------------------+-----------------------...

Eric MacDonald (rocksolidmtce) wrote :

Are you suggesting this is a stuck alarm or a persistent condition/alarm?

Peng Peng (ppeng) wrote :

I am not sure whether this alarm is related to the host lock/unlock or to the failed stx-openstack application reapply.

The stx-openstack application reapply failed after the first lock/unlock, but there was no 100.103 alarm. After running some more tests and another lock/unlock, 100.103 was raised.

If you want to see the full execution log, please let me know.

Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 for now given that the reporter is indicating the issue is reproducible (memory alarm after lock/unlock).

Changed in starlingx:
importance: Undecided → Medium
status: Incomplete → Triaged
assignee: nobody → Gerry Kopec (gerry-kopec)
Frank Miller (sensfan22)
tags: added: stx.2.0 stx.config
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Gerry Kopec (gerry-kopec) → Al Bailey (albailey1974)
Al Bailey (albailey1974) wrote :

It's not a stuck alarm. The value in the alarm changed from 80.02% to 83.03%.

This just means the steady state of this system was too high.
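
The "value changed" check can be sketched as follows, using the two reason texts quoted in the bug description. This is only a heuristic illustration, not how the alarm code itself works; the regular expression assumes the reason text format seen in this report, and the function names are made up for the example.

```python
import re
from typing import Optional, Tuple

# Matches "... threshold 80.00%, actual 83.03%" as seen in the alarm-list output.
_REASON = re.compile(r"threshold\s+(\d+(?:\.\d+)?)%,\s*actual\s+(\d+(?:\.\d+)?)%")


def parse_memory_reason(reason: str) -> Optional[Tuple[float, float]]:
    """Return (threshold_pct, actual_pct), or None if the text does not match."""
    match = _REASON.search(reason)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def reading_changed(earlier: str, later: str) -> bool:
    """True if the reported readings differ, i.e. the alarm text was re-evaluated."""
    first, second = parse_memory_reason(earlier), parse_memory_reason(later)
    if first is None or second is None:
        raise ValueError("reason text not in the expected format")
    return first != second


at_1110 = "Platform Memory threshold exceeded ; threshold 80.00%, actual 80.02%"
at_1144 = "Platform Memory threshold exceeded ; threshold 80.00%, actual 83.03%"
changed = reading_changed(at_1110, at_1144)
```

A changed reading suggests the condition is being re-measured and is still above the threshold, rather than an old alarm being left behind.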

In the last couple of weeks there have been several CPU improvements that should also reduce memory usage, e.g.:
The number of threads for rabbit was reduced (bare metal and containerized).
Some charts are now optional, so those processes are no longer running (radosgw).

I actually think this is fixed. WolfPass 1-2 has been running for quite some time, and has no alarms.

free
              total        used        free      shared  buff/cache   available
Mem:       97528392    85906252     2477424       66260     9144716     9791928
Swap:             0           0           0
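
As a worked example of reading the `free` output above, the sketch below parses the `Mem:` row and computes whole-host percentages. Note the assumption: these are host-wide ratios, while the alarm in this report is scoped to a NUMA node (`host=controller-1.numa=node1`), so they need not match the percentage shown in the alarm text. The function name is illustrative.

```python
from typing import Dict

FREE_OUTPUT = """\
              total        used        free      shared  buff/cache   available
Mem:       97528392    85906252     2477424       66260     9144716     9791928
Swap:             0           0           0
"""


def parse_free_mem(text: str) -> Dict[str, int]:
    """Map the header columns of `free` output to the values on the 'Mem:' row."""
    lines = text.splitlines()
    header = lines[0].split()
    for line in lines[1:]:
        if line.startswith("Mem:"):
            values = [int(v) for v in line.split()[1:]]
            return dict(zip(header, values))
    raise ValueError("no 'Mem:' row found")


mem = parse_free_mem(FREE_OUTPUT)
used_pct = 100.0 * mem["used"] / mem["total"]
available_pct = 100.0 * mem["available"] / mem["total"]
print(f"used: {used_pct:.2f}%  available: {available_pct:.2f}%")
```

In this sample, used + free + buff/cache adds up exactly to total, which is a quick sanity check that the row was parsed correctly.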

Al Bailey (albailey1974) wrote :

Tested today in a 2+3 lab. Have not observed any memory alarms after multiple lock/unlocks.

Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Al Bailey (albailey1974) wrote :

The fixes for threading and optional services have resulted in lower memory usage.
This issue can be considered resolved.

Changed in starlingx:
status: Triaged → Fix Committed
Al Bailey (albailey1974) wrote :
Changed in starlingx:
status: Fix Committed → Fix Released
Peng Peng (ppeng) wrote :

Have not seen this issue for a while.

tags: removed: stx.retestneeded