Host in degraded state with no alarm

Bug #1925210 reported by Eric MacDonald
This bug affects 1 person
Affects     Status         Importance   Assigned to      Milestone
StarlingX   Fix Released   Low          Eric MacDonald

Bug Description

If the StarlingX collectd memory, CPU or filesystem usage plugin asserts a critical alarm that degrades the host, and that alarm is later lost or manually deleted, the host can be left in a degraded state with no alarm.

Severity: Minor - occurs only after a manual action or an upgrade that drops the FM database.

Steps to Reproduce: Delete a critical 100.101, 100.103 or 100.104 alarm
Expected Behavior: Alarm gets recreated
Actual Behavior: Host is degraded with no alarm

Reproducibility : 100% in the unlikely case this does occur.

System Configuration: Any

Branch/Pull Time/Commit: Any as of April 2021

Last Pass: Not tested

Timestamp/Logs
--------------
2021-04-20T12:14:36.217 [423480.00607] controller-1 mtcAgent hbs nodeClass.cpp (5369) collectd_notify_handler : Info : controller-1 collectd degrade state change ; clear -> assert (due to memory:host=controller-1.memory=total)
2021-04-20T12:14:36.217 [423480.00608] controller-1 mtcAgent inv mtcInvApi.cpp (1119) mtcInvApi_update_state : Info : controller-1 degraded (seq:197)

Test Activity: Feature Testing

Workaround: Restart collectd

OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/monitoring/+/787202

Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil) wrote :

screening: marking as low priority given this is a rare occurrence and there is a workaround.

Changed in starlingx:
importance: Undecided → Low
tags: added: stx.metal
OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (master)

Reviewed: https://review.opendev.org/c/starlingx/monitoring/+/787202
Committed: https://opendev.org/starlingx/monitoring/commit/d37490b81408ca53b1b8fd61992c6c9337dbcaed
Submitter: "Zuul (22348)"
Branch: master

commit d37490b81408ca53b1b8fd61992c6c9337dbcaed
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 20 10:03:07 2021 -0400

    Add alarm audit to starlingx collectd fm notifier plugin

    This update adds common plugin support for alarm state auditing.
    The audit is able to detect and correct the following alarm
    state errors:

       Error Case                ; Correction Action
       -----------------------   ; -----------------
     - stale alarm               ; delete alarm
     - missing alarm             ; assert alarm
     - alarm severity mismatch   ; refresh alarm

    The common audit is enabled for the fm_notifier plugin that supports
    alarm management for the following resources.

     - CPU with alarm id 100.101
     - Memory with alarm id 100.103
     - Filesystem with alarm id 100.104

    Other plugins may use this common audit in the future but only the
    above resources have the audit enabled for them by this update.

    Test Plan:

    PASS: Verify stale alarm detection/correction handling
    PASS: Verify missing alarm detection/correction handling
    PASS: Verify alarm severity mismatch detection/correction handling
    PASS: Verify host only audits its own specified alarms
    PASS: Verify success path of monitoring a single and mix
          of base and instance alarms of varying severity while
          such alarm conditions come and go
    PASS: Verify alarm audit of mix of base and instance alarms
          over a collectd process restart
    PASS: Verify audit handling of alarm that migrates from
          major to critical to major to clear
    PASS: Verify audit handling transition between alarm and
          no alarm conditions
    PASS: Verify soak of random cpu, memory and filesystem
          overage alarm assertions and clears that also involve
          manual alarm deletions, assertions and severity changes
          that exercise new audit features

    Regression:

    PASS: Verify alarm and audit handling over Swact with mounted
          filesystem that has active alarm
    PASS: Verify collectd logs following a system install and
          while alarms are managed during above soak
    PASS: Verify behavior while FM is killed or stopped/started
    PASS: Verify Standard system install with Sanity and Regression
    PASS: Verify AIO DX/DC systems install with Sanity and Regression

    Closes-Bug: 1925210
    Change-Id: I1cafd17ad07ec769240de92ae4e67cb1357f0992
    Signed-off-by: Eric MacDonald <email address hidden>
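
The three error cases in the commit message above amount to reconciling the alarms the plugin believes should be raised against the alarms FM actually holds. The following is a minimal sketch of that comparison only, not the code from the fix: the function name `audit_alarms`, the `(alarm id, entity)` key layout and the severity strings are assumptions made for illustration, and the sketch returns the corrective actions instead of calling FM.

```python
from typing import Dict, List, Tuple

# Alarm ids listed in the commit message (CPU, memory, filesystem).
AUDITED_ALARM_IDS = ("100.101", "100.103", "100.104")

# Hypothetical key layout: (alarm id, entity instance id).
AlarmKey = Tuple[str, str]


def audit_alarms(expected: Dict[AlarmKey, str],
                 actual: Dict[AlarmKey, str]) -> List[Tuple[str, AlarmKey, str]]:
    """Compare expected alarm state with the state found in FM.

    expected: alarms the plugin believes should be raised, key -> severity
    actual:   alarms currently present in FM for this host, key -> severity
    Returns a list of (action, key, severity) corrections.
    """
    actions = []

    # Stale alarm: present in FM but no longer expected -> delete it.
    for key, severity in actual.items():
        if key[0] in AUDITED_ALARM_IDS and key not in expected:
            actions.append(("delete", key, severity))

    for key, severity in expected.items():
        if key not in actual:
            # Missing alarm: expected but absent from FM -> assert it.
            actions.append(("assert", key, severity))
        elif actual[key] != severity:
            # Severity mismatch: re-raise with the expected severity.
            actions.append(("refresh", key, severity))

    return actions


if __name__ == "__main__":
    # The case from this bug: a critical memory alarm was deleted from FM
    # while the plugin still considers the condition active.
    expected = {("100.103", "host=controller-1.memory=total"): "critical"}
    actual = {}
    for action in audit_alarms(expected, actual):
        print(action)
```

In this sketch the deleted critical alarm from the bug description shows up as a single "assert" action, which matches the expected behavior that the alarm gets recreated.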

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/monitoring/+/792244

OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/monitoring/+/792244
Committed: https://opendev.org/starlingx/monitoring/commit/fdc0d099fb0d65cbf8f037fe0cc9ac8125410284
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 2ef5451f442482636db3c0c3641e8412821bd8c5
Author: Takamasa Takenaka <email address hidden>
Date: Thu Apr 22 12:28:37 2021 -0300

    Format 2 lines ntpq data into 1 lines

    The problem was logic expected one line data for
    ntpq result. But it was 2 lines for each ntp server
    entry. When peer server is selected, script checked
    refid if refid is reliable or not but it could not
    find because refid is in the following line.
    This fix formats 2 lines data into 1 line.

    The minor alarm "NTP cannot reach
    external time source; syncing with peer controller
    only" is removed because NTP does not prioritize
    external time source over peer.

    Closes-Bug: 1889101

    Signed-off-by: Takamasa Takenaka <email address hidden>
    Change-Id: Icc8316bb1a7041bf0351165c671ebf35b97fa3bc
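
The two-lines-into-one formatting described in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the code from that commit: it assumes a split entry starts with a line that holds only the remote field and that the remaining columns (including refid) follow on the next line, and the function name `join_wrapped_ntpq_lines` and the sample addresses are hypothetical.

```python
from typing import Iterable, List


def join_wrapped_ntpq_lines(lines: Iterable[str]) -> List[str]:
    """Merge peer entries that were split over two lines into one line each.

    Assumption: the first line of a split entry contains a single token
    (the remote field); the next non-blank line carries the other columns.
    """
    joined = []
    pending = None  # remote field waiting for its continuation line
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if pending is not None:
            # Continuation line: append it to the remote field.
            joined.append(pending + " " + stripped)
            pending = None
        elif len(stripped.split()) == 1:
            # Only the remote field on this line: wait for the rest.
            pending = stripped
        else:
            # Entry already fits on one line.
            joined.append(stripped)
    if pending is not None:
        # Trailing remote with no continuation: keep it as-is.
        joined.append(pending)
    return joined


if __name__ == "__main__":
    # Illustrative input only (documentation address ranges).
    sample = [
        "+2001:db8::1",
        "     192.0.2.10   2 u   33   64  377    0.512    0.104   0.087",
        "*192.0.2.20   .GPS.   1 u   12   64  377    1.203   -0.221   0.054",
    ]
    for entry in join_wrapped_ntpq_lines(sample):
        print(entry)
```

With each entry on one line, a later check can read the refid from the same line as the selected peer instead of looking at the following line.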

commit d37490b81408ca53b1b8fd61992c6c9337dbcaed
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 20 10:03:07 2021 -0400

    Add alarm audit to starlingx collectd fm notifier plugin

tags: added: in-f-centos8