Reviewed: https://review.opendev.org/c/starlingx/monitoring/+/792244
Committed: https://opendev.org/starlingx/monitoring/commit/fdc0d099fb0d65cbf8f037fe0cc9ac8125410284
Submitter: "Zuul (22348)"
Branch: f/centos8
commit 2ef5451f442482636db3c0c3641e8412821bd8c5
Author: Takamasa Takenaka <email address hidden>
Date: Thu Apr 22 12:28:37 2021 -0300
Format 2 lines ntpq data into 1 lines
The logic expected one line of data per entry in the
ntpq result, but each NTP server entry spanned 2 lines.
When a peer server was selected, the script checked
whether its refid was reliable, but it could not find
the refid because it was on the following line.
This fix merges each 2-line entry into 1 line.
The minor alarm "NTP cannot reach external time
source; syncing with peer controller only" is
removed because NTP does not prioritize an
external time source over a peer.
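The line-joining step described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `join_wrapped_entries` is hypothetical, and it assumes that a wrapped entry carries only the remote address on its first line, with the remaining columns (starting with refid) on the next line.

```python
def join_wrapped_entries(lines):
    """Merge ntpq entries that were split over two lines into single lines.

    Assumption: the first line of a wrapped entry holds exactly one token
    (the remote address) and the next line holds the remaining columns.
    """
    merged = []
    pending = None  # first half of a wrapped entry, if any
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        if pending is not None:
            # Second half of a wrapped entry: append it to the first half.
            merged.append(pending + " " + stripped)
            pending = None
        elif len(stripped.split()) == 1 and not stripped.startswith("="):
            # A lone token that is not the "====" separator row.
            pending = stripped
        else:
            merged.append(line)
    if pending is not None:
        merged.append(pending)  # keep an unmatched trailing line as-is
    return merged


# Made-up sample data, for illustration only.
sample = [
    "+192.168.204.3   10.10.10.1   3 u  12  64  377  0.1  0.2  0.3",
    "*2001:db8::1234:5678:9abc:def0",
    "                 .GPS.        1 u  20  64  377  1.0  0.5  0.1",
]
for entry in join_wrapped_entries(sample):
    print(entry)
```

Once each entry is on one line, the refid can be read as the second column of the selected peer's line.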
Closes-Bug: 1889101
Signed-off-by: Takamasa Takenaka <email address hidden>
Change-Id: Icc8316bb1a7041bf0351165c671ebf35b97fa3bc
commit d37490b81408ca53b1b8fd61992c6c9337dbcaed
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 20 10:03:07 2021 -0400
Add alarm audit to starlingx collectd fm notifier plugin
This update adds common plugin support for alarm state auditing.
The audit is able to detect and correct the following alarm
state errors:
Error Case                   Correction Action
-------------------------    -----------------
- stale alarm              ; delete alarm
- missing alarm            ; assert alarm
- alarm severity mismatch  ; refresh alarm
The common audit is enabled for the fm_notifier plugin that supports
alarm management for the following resources:
- CPU with alarm id 100.101
- Memory with alarm id 100.103
- Filesystem with alarm id 100.104
Other plugins may use this common audit in the future but only the
above resources have the audit enabled for them by this update.
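The three error cases and their corrections can be sketched as a comparison between the alarms the plugin expects and the alarms FM currently reports. This is a minimal sketch and not the plugin's code: the function name `audit_alarms`, the dictionary representation, and the entity strings are assumptions made for illustration.

```python
def audit_alarms(expected, actual):
    """Return the corrective actions needed to make FM match the plugin.

    expected: dict of entity -> severity the plugin believes should be raised
    actual:   dict of entity -> severity currently reported by FM
    Both structures are hypothetical stand-ins for the real alarm state.
    """
    actions = []
    for entity, severity in actual.items():
        if entity not in expected:
            # Stale alarm: FM has it but the plugin does not expect it.
            actions.append(("delete", entity, severity))
    for entity, severity in expected.items():
        if entity not in actual:
            # Missing alarm: the plugin expects it but FM does not have it.
            actions.append(("assert", entity, severity))
        elif actual[entity] != severity:
            # Severity mismatch: refresh the alarm with the expected severity.
            actions.append(("refresh", entity, severity))
    return actions


# Made-up state, for illustration only.
expected = {"host=controller-0.cpu": "major",
            "host=controller-0.memory": "critical"}
actual = {"host=controller-0.cpu": "critical",
          "host=controller-0.filesystem=/var": "major"}
for action in audit_alarms(expected, actual):
    print(action)
```

Returning a list of actions instead of applying them directly keeps the comparison easy to test without a running FM.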
Test Plan:
PASS: Verify stale alarm detection/correction handling
PASS: Verify missing alarm detection/correction handling
PASS: Verify alarm severity mismatch detection/correction handling
PASS: Verify host only audits its own specified alarms
PASS: Verify success path of monitoring a single and mix
of base and instance alarms of varying severity while
such alarm conditions come and go
PASS: Verify alarm audit of mix of base and instance alarms
over a collectd process restart
PASS: Verify audit handling of alarm that migrates from
major to critical to major to clear
PASS: Verify audit handling transition between alarm and
no alarm conditions
PASS: Verify soak of random cpu, memory and filesystem
overage alarm assertions and clears that also involve
manual alarm deletions, assertions and severity changes
that exercise new audit features
Regression:
PASS: Verify alarm and audit handling over Swact with mounted
filesystem that has active alarm
PASS: Verify collectd logs following a system install and
while alarms are managed during above soak
PASS: Verify behavior while FM is killed or stopped/started
PASS: Verify Standard system install with Sanity and Regression
PASS: Verify AIO DX/DC systems install with Sanity and Regression
Closes-Bug: 1925210
Change-Id: I1cafd17ad07ec769240de92ae4e67cb1357f0992
Signed-off-by: Eric MacDonald <email address hidden>
commit 14e1a9a82b017fb5d2fa3fd62b5c0943058ef0ee
Author: albailey <email address hidden>
Date: Wed Apr 7 17:52:02 2021 -0400
Bandit should only be installed in py3 env
Running tox for linters fails since the bandit being pulled
in is python3 only. This is similar to other bugs where a new
version is released which drops py2 support.
In this env, we only include bandit if we are testing and running
in py3.
Partial-Bug: 1922590
Change-Id: I11b7d974ae3b64e7846e1420521dee0d48128fc5
Signed-off-by: albailey <email address hidden>
commit 19460ecbd2c50b2d3fd8436d12066df5925f0fb4
Author: Gerry Kopec <email address hidden>
Date: Tue Apr 6 21:28:53 2021 -0400
Add platform namespaces to collectd
Add missing platform namespaces (armada, cert-manager, portieris, vault
and notification) to collectd kubernetes system list.
Change-Id: I341d802210388e5e1f3fd2d7a11fa0593c44fa68
Closes-Bug: 1922629
Signed-off-by: Gerry Kopec <email address hidden>
commit a2a2a88887c6f2761d6f6ec1f339cf949207c0b2
Author: Eric MacDonald <email address hidden>
Date: Tue Jan 26 07:58:05 2021 -0500
Avoid loading collectd's default plugins
The current opensource collectd rpm installs
several default plugins, some that overlap
starlingx developed plugins and others that
simply collect way too much data.
The plugins in question are:
/etc/collectd.d/90-default-plugins-syslog.conf
/etc/collectd.d/90-default-plugins-memory.conf
/etc/collectd.d/90-default-plugins-load.conf
/etc/collectd.d/90-default-plugins-interface.conf
/etc/collectd.d/90-default-plugins-cpu.conf
This update moves the value added starlingx
plugins to /etc/collectd.d/starlingx and
relies on another puppet update to change
collectd's plugin search path accordingly.
Test Plan:
PASS: Verify default plugins are not loaded
and their samples are not collected.
PASS: Verify patch apply and remove.
Note: this is a reboot-required patch
PASS: Verify the daily influxdb usage
drops by 80-85%.
Regression:
PASS: Verify collectd alarm/degrade regression soak
Change-Id: Ic7884ae69014fa274f0bd0515adec90b08747c67
Closes-Bug: 1905581
Signed-off-by: Eric MacDonald <email address hidden>
commit ea4b515f91f38523a22e877ebba9d552962153b2
Author: Eric MacDonald <email address hidden>
Date: Mon Jan 25 09:40:06 2021 -0500
Add node ready check to collectd plugins
This update adds a second collectd plugin
initialization enhancement. The first update
added a config complete gate:
https://review.opendev.org/c/starlingx/monitoring/+/736817
It turns out that not all plugins are ready to sample
immediately following the config complete state.
For example, FM on the active controller needs
time to start up before plugins can query their
alarms on startup. Also, some plugins need more
time than others.
To account for both cases this update adds a
thresholded node ready gate that can be tailored
to a plugin to hold off fm access and sampling
until its ready threshold is reached.
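A thresholded ready gate of this kind can be sketched as a per-plugin counter that holds off work for a configurable number of sample intervals. This is a minimal sketch under stated assumptions: the class name `NodeReadyGate`, its `ready()` method, and counting in sample intervals are hypothetical and are not taken from the plugin source.

```python
class NodeReadyGate:
    """Hold off FM access and sampling until a per-plugin threshold is met.

    Assumption: the gate is polled once per sample interval, and the
    threshold is the number of polls to wait before reporting ready.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0

    def ready(self):
        """Count one interval and report whether the threshold is reached."""
        if self.count >= self.threshold:
            return True
        self.count += 1
        return self.count >= self.threshold


# Each plugin gets its own gate, so slow starters can wait longer.
gate = NodeReadyGate(threshold=3)
print([gate.ready() for _ in range(4)])
```

A plugin's read callback would return early while `ready()` is false, so no FM queries or samples are issued before the threshold is reached.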
Test Plan:
PASS: Verify AIO SX and DX system install
PROG: Verify Storage system install
PASS: Verify AIO SX node lock and unlock
PASS: Verify AIO Standby controller lock and unlock
PASS: Verify Standard controller lock and unlock
PASS: Verify Compute and Storage node lock and unlock
PASS: Verify Dead-Office-Recovery (AIO DX)
PASS: Verify collectd sampling and logs
Partial-Bug: 1872979
Signed-off-by: Eric MacDonald <email address hidden>
Change-Id: I044d812542a4222214c7d13e231ac4024cca9800
commit 23489af038771da85d5443fd822dcae4699acf03
Author: Jim Gauld <email address hidden>
Date: Wed Nov 4 17:22:34 2020 -0500
Increase field widths of PID for schedtop
This increases the field width of TID, PID, and PPID to 7 characters for the
schedtop engineering tool. Newer systems support larger PIDs.
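The effect of the wider fields can be illustrated with a short formatting sketch. This is not schedtop's code: the function name `format_row` and the narrower width of 5 used for comparison are assumptions for illustration. The PID value 4194304 is the largest `pid_max` setting on 64-bit Linux, which needs 7 digits.

```python
def format_row(tid, pid, ppid, width):
    """Right-align TID, PID and PPID in columns of the given width."""
    return "{:>{w}} {:>{w}} {:>{w}}".format(tid, pid, ppid, w=width)


# With a narrower field (5 is an assumed example), a 7-digit PID
# overflows its column and pushes the following columns out of line.
print(format_row(1, 1, 1, 5))
print(format_row(1, 4194304, 1, 5))

# With 7-wide fields the columns stay aligned.
print(format_row(1, 1, 1, 7))
print(format_row(1, 4194304, 1, 7))
```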
Change-Id: I706b60d83e8ce341a7d07c4c067a74e7049acdad
Closes-Bug: 1902954
Signed-off-by: Jim Gauld <email address hidden>