Allow raise of platform cpu and memory collectd alarms even with kube ApiException

Bug #1939172 reported by John Kung
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: John Kung

Bug Description

Brief Description
-----------------
Unable to raise the 100.101 "Platform CPU threshold exceeded" alarm after performing a CPU performance test with stress-ng pods.

The kube ApiException should not prevent platform collectd alarms from being raised.

Severity
--------
Minor: System/Feature is usable, but the alarm is missing. (The platform itself may still be unresponsive or unusable due to the CPU load from the stress test.)

Steps to Reproduce
------------------

kubectl create -f stressng-kube-system-controller-0.yaml
(where the yaml is as follows:
apiVersion: v1
kind: Pod
metadata:
  name: stressng-kube-system-controller-0
  namespace: kube-system
spec:
  containers:
  - name: stressng
    image: polinux/stress-ng
    imagePullPolicy: IfNotPresent
    command: [ "/bin/sh", "-c", "trap : TERM INT; sleep 1d & wait" ]
    securityContext:
      privileged: true
  restartPolicy: Never
)

# log in to the pod, then run:
stress-ng -c 0 -l 100

Expected Behavior
------------------
The platform alarm should be raised after a debounce period of several minutes under sustained high CPU load.
+----------+--------------------------------------------------------------------+-------------------+----------+----------------------------+
| Alarm ID | Reason Text                                                        | Entity ID         | Severity | Time Stamp                 |
+----------+--------------------------------------------------------------------+-------------------+----------+----------------------------+
| 100.101  | Platform CPU threshold exceeded ; threshold 95.00%, actual 99.98%  | host=controller-0 | critical | 2021-08-05T22:41:31.450533 |
+----------+--------------------------------------------------------------------+-------------------+----------+----------------------------+
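
For illustration, the debounce mentioned above might look roughly like the sketch below. This is not the actual collectd plugin logic: the class name, the `required_samples` parameter, and the rule "change state only after N consecutive readings disagree with the current state" are all assumptions. Only the 95.00% threshold is taken from the alarm text.

```python
class DebouncedAlarm(object):
    """Hypothetical debounce: flip state only after N consecutive readings."""

    def __init__(self, threshold=95.0, required_samples=3):
        self.threshold = threshold                # percent, from the alarm text
        self.required_samples = required_samples  # assumed debounce length
        self.raised = False
        self._streak = 0  # consecutive readings that disagree with self.raised

    def update(self, value):
        """Feed one usage reading (percent); return True while alarm is raised."""
        over = value > self.threshold
        if over == self.raised:
            # Reading agrees with the current state, so any streak is broken.
            self._streak = 0
        else:
            self._streak += 1
            if self._streak >= self.required_samples:
                # Enough consecutive readings: raise or clear the alarm.
                self.raised = over
                self._streak = 0
        return self.raised
```

With this shape a short spike does not raise the alarm, and the alarm clears only after the load has stayed below the threshold for the same number of readings.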

Actual Behavior
----------------
Alarm is not raised, due to the following exception:
2021-07-08T18:52:57.000 controller-0 collectd[3466]: err Unhandled python exception in read callback: ApiException: (0)
Reason: SSLError
[Errno 2] No such file or directory
2021-07-08T18:52:57.000 controller-0 collectd[3466]: err Traceback (most recent call last):
2021-07-08T18:52:57.000 controller-0 collectd[3466]: err File "/opt/collectd/extensions/python/cpu.py", line 482, in read_func

Reproducibility
---------------
Intermittent; depends on kube-api behaviour under load.

System Configuration
--------------------
Two node system

Branch/Pull Time/Commit
-----------------------
Issue exists in current load, e.g. 2021-07-27

Last Pass
---------
N/A; Intermittent issue.

Timestamp/Logs
--------------
See logs above.

Test Activity
-------------
Evaluation, Other - Stress Test

Workaround
----------
Remove the stress-ng load from the platform CPUs; a host-swact may also be required to recover.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/monitoring/+/803803

Changed in starlingx:
status: New → In Progress
John Kung (john-kung)
Changed in starlingx:
assignee: nobody → John Kung (john-kung)
tags: added: stx.6.0 stx.monitor
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (master)

Reviewed: https://review.opendev.org/c/starlingx/monitoring/+/803803
Committed: https://opendev.org/starlingx/monitoring/commit/ecd744ba0a37a69a0c048d0b2fad098147450f7e
Submitter: "Zuul (22348)"
Branch: master

commit ecd744ba0a37a69a0c048d0b2fad098147450f7e
Author: John Kung <email address hidden>
Date: Fri Aug 6 14:49:43 2021 -0500

    Handle kube ApiException during collectd platform monitoring

    During stress test/high platform load it is possible that the
    kube-apiserver responds with a kube ApiException.

    As platform monitoring of cpu and memory should not be affected by
    unresponsive kube-api server, allow the kube ApiException to be handled
    and the remaining platform resource utilization monitoring to proceed.

    This could help identify the issue by allowing the raise of
    the platform alarm (e.g. 100.101 Platform CPU threshold exceeded,
    100.103 Memory threshold exceeded).

    Verified:
      o Platform CPU Alarm is raised with stress test
      o Platform CPU Alarm is raised with stress test
        and intermittent ApiException
      o Memory Alarm is raised with stress test
      o Memory Alarm is raised with stress test
        and intermittent ApiException
      o the above alarm conditions are cleared after
        debounce when stress condition is removed

    Closes-Bug: 1939172
    Signed-off-by: John Kung <email address hidden>
    Change-Id: I2c9c39a390af1d7ae752ad00db18384479cf6e99
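
The approach described in the commit message can be sketched as follows. This is a minimal illustration, not the merged change: `read_platform_cpu`, `should_alarm`, and the two callables passed in are hypothetical names, and the local `ApiException` class is a stand-in for the exception raised by the Kubernetes Python client so that the example runs without that library installed.

```python
class ApiException(Exception):
    """Stand-in (assumption) for the Kubernetes client's ApiException."""

    def __init__(self, status=0, reason=""):
        super(ApiException, self).__init__("(%s)\nReason: %s" % (status, reason))
        self.status = status
        self.reason = reason


def read_platform_cpu(get_pod_usage, get_platform_usage, log=print):
    """Return (platform_pct, pod_usage); pod_usage is None if kube-api failed.

    Both getters are hypothetical callables supplied by the caller.
    """
    pod_usage = None
    try:
        # The per-pod query depends on a responsive kube-apiserver.
        pod_usage = get_pod_usage()
    except ApiException as exc:
        # Log and continue instead of letting the exception abort the
        # whole read callback.
        log("kube ApiException ignored: %s" % exc.reason)

    # Platform usage does not need kube-api, so it is always sampled.
    platform_pct = get_platform_usage()
    return platform_pct, pod_usage


def should_alarm(platform_pct, threshold=95.0):
    """True when platform CPU usage exceeds the threshold (percent)."""
    return platform_pct > threshold
```

The point of the design is that the kube-api query is the only step wrapped in the handler, so a failure there costs the per-pod data for that sample but leaves the platform threshold check intact.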

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
importance: Medium → Low