Allow raise of platform cpu and memory collectd alarms even with kube ApiException
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Low | John Kung | | |
Bug Description
Brief Description
-----------------
The 100.101 "Platform CPU threshold exceeded" alarm is not raised after running a CPU performance test with stress-ng pods.
The kube ApiException should not prevent platform collectd alarms from being raised.
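The intent can be sketched as follows. This is a minimal illustration, not the actual change made in the monitoring plugins: `sample_platform_cpu`, `read_platform_usage` and `list_pods` are hypothetical names, and the local `ApiException` class is a stand-in for the exception raised by the Kubernetes Python client so that the sketch runs without that dependency. The point is only that a failed kube query is logged and skipped, while the platform reading is still taken and compared against the threshold.

```python
class ApiException(Exception):
    """Stand-in for the Kubernetes client's ApiException (assumption)."""


def sample_platform_cpu(read_platform_usage, list_pods):
    """Return (platform_usage_percent, pods).

    pods is None when the kube query failed; the platform reading is
    taken regardless, so the alarm logic can still run.
    """
    try:
        pods = list_pods()
    except Exception as exc:  # broad on purpose: ApiException, SSL errors, ...
        # Skip per-pod accounting for this cycle instead of aborting it.
        print("kube query failed, continuing without pod data: %r" % (exc,))
        pods = None
    return read_platform_usage(), pods


def failing_list_pods():
    # Simulates the failure seen under load (hypothetical helper).
    raise ApiException("Reason: SSLError")


usage, pods = sample_platform_cpu(lambda: 99.98, failing_list_pods)
alarm_needed = usage > 95.0  # threshold value taken from the alarm text
print(usage, pods, alarm_needed)
```

Catching a broad `Exception` is a deliberate simplification here; a real plugin would likely narrow this to the specific exception types it expects.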
Severity
--------
Minor: System/Feature is usable with missing alarm. (Though the platform may be unresponsive/
Steps to Reproduce
------------------
kubectl create -f stressng-
(where the yaml is as follows:
apiVersion: v1
kind: Pod
metadata:
  name: stressng-
  namespace: kube-system
spec:
  containers:
  - name: stressng
    image: polinux/stress-ng
    imagePullPo
    command: [ "/bin/sh", "-c", "trap : TERM INT; sleep 1d & wait" ]
    securityCon
      privileged: true
  restartPolicy: Never
)
# log in to the pod
stress-ng -c 0 -l 100
Expected Behavior
------------------
The platform alarm should be raised after a debounce period of several minutes under sustained high CPU load.
+------
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+------
| 100.101 | Platform CPU threshold exceeded ; threshold 95.00%, actual 99.98% | host=controller-0 | critical | 2021-08-
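The debounce behaviour described above can be illustrated with a small sketch. It assumes a simple "N consecutive samples above threshold" rule; the class name `DebouncedAlarm`, the sample count, and the clear-on-first-good-sample behaviour are assumptions for illustration, not the plugin's actual logic. Only the 95.00% threshold comes from the alarm text.

```python
class DebouncedAlarm:
    """Raise only after `required_samples` consecutive readings above threshold."""

    def __init__(self, threshold, required_samples):
        self.threshold = threshold
        self.required_samples = required_samples
        self.count = 0        # consecutive over-threshold readings so far
        self.raised = False

    def update(self, value):
        """Feed one reading; return True while the alarm is raised."""
        if value > self.threshold:
            self.count += 1
        else:
            # Simplification: a single good reading resets and clears.
            self.count = 0
            self.raised = False
        if self.count >= self.required_samples:
            self.raised = True
        return self.raised


alarm = DebouncedAlarm(threshold=95.0, required_samples=3)
states = [alarm.update(v) for v in (99.9, 99.8, 40.0, 99.9, 99.9, 99.98)]
print(states)  # [False, False, False, False, False, True]
```

With periodic sampling, the required number of consecutive readings times the sampling interval gives the debounce time, which is why the alarm is expected only after several minutes of sustained load.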
Actual Behavior
----------------
Alarm is not raised, due to the following exception:
2021-07-
Reason: SSLError
[Errno 2] No such file or directory
2021-07-
2021-07-
Reproducibility
---------------
Intermittent; depending upon kube-api behaviour under load.
System Configuration
-------
Two node system
Branch/Pull Time/Commit
-------
Issue exists in current load, e.g. 2021-07-27
Last Pass
---------
N/A; Intermittent issue.
Timestamp/Logs
--------------
See the logs above.
Test Activity
-------------
Evaluation, Other - Stress Test
Workaround
----------
Remove the stress-ng load from the platform CPUs; a host-swact may also be required to recover.
Changed in starlingx: | |
assignee: | nobody → John Kung (john-kung) |
tags: | added: stx.6.0 stx.monitor |
Changed in starlingx: | |
importance: | Undecided → Medium |
importance: | Medium → Low |
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/monitoring/+/803803