Alarm "Platform CPU threshold exceeded; 90%" not cleared after win_2016 VM migration test

Bug #1793314 reported by Peng Peng
This bug affects 2 people
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      Eric MacDonald

Bug Description

Brief Description
-----------------
Boot a win_2016 image VM, live migrate and cold migrate it, then swact the hosts. After these operations, alarm 100.101 "Platform CPU threshold exceeded; 90%" does not clear within 300 secs.

Severity
--------
Major

Steps to Reproduce
------------------
1. Boot a win_2016 VM
2. Live migrate the VM
3. Cold migrate the VM
4. Swact the host
5. Check alarm-list

Expected Behavior
------------------
After 300 secs, there is no 100.101 alarm in the list.

Actual Behavior
----------------
The 100.101 alarm is still in the list after 300 secs.

Reproducibility
---------------
Reproducible (8/10)

System Configuration
--------------------
Two node system

Branch/Pull Time/Commit
-----------------------
master as of 2018-09-17_20-18-00

Timestamp/Logs
--------------
[2018-09-18 14:12:57,226] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-region-name RegionOne host-swact controller-1'

[2018-09-18 14:24:21,094] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-region-name RegionOne alarm-list --nowrap --uuid'
[2018-09-18 14:24:22,609] 382 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+--------------------------------------------------+-------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+--------------------------------------------------+-------------------+----------+----------------------------+
| 727f2677-f130-4cc0-889b-669d870df753 | 100.101 | Platform CPU threshold exceeded; 90%, actual 92% | host=controller-1 | major | 2018-09-18T14:11:18.199008 |
+--------------------------------------+----------+--------------------------------------------------+-------------------+----------+----------------------------+
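
The check performed by the test (no 100.101 alarm in the list after 300 secs) can be sketched roughly as below. This is only an illustration, not the actual test framework code: `parse_alarm_table`, `wait_for_alarm_clear` and the `get_output` callable are hypothetical names, and `get_output` is assumed to return the raw text printed by `fm alarm-list --nowrap --uuid` in the table layout shown above.

```python
import time


def parse_alarm_table(output):
    """Parse the ASCII table printed by `fm alarm-list` into a list of dicts.

    Assumes the layout shown in the log above: border lines start with '+',
    the first '|' row is the header, and every later '|' row is one alarm.
    """
    header = None
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip table borders and unrelated log lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def wait_for_alarm_clear(get_output, alarm_id, timeout=300, interval=10,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until alarm_id is absent from the alarm list or timeout expires.

    get_output is a hypothetical callable returning the alarm-list text.
    Returns True if the alarm cleared in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        alarms = parse_alarm_table(get_output())
        if not any(alarm.get("Alarm ID") == alarm_id for alarm in alarms):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With the output captured above, `wait_for_alarm_clear(..., "100.101")` would return False, which matches the failure reported here.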

Ghada Khalil (gkhalil) wrote :

Requested information from the reporter regarding how long it took the alarm to actually clear.

Changed in starlingx:
status: New → Incomplete
Ghada Khalil (gkhalil) wrote :

The alarm was still there after four hours. It remains stuck.

summary: - STX: Alarm "Platform CPU threshold exceeded; 90%" not clear in 300 secs
- during win_2016 VM migration test
+ Alarm "Platform CPU threshold exceeded; 90%" not cleared after win_2016
+ VM migration test
Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → Medium
assignee: nobody → Eric MacDonald (rocksolidmtce)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-integ (master)

Fix proposed to branch: master
Review: https://review.openstack.org/604183

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-integ (master)

Reviewed: https://review.openstack.org/604183
Committed: https://git.openstack.org/cgit/openstack/stx-integ/commit/?id=5142fac49806c8b823c50be52119c878841f0955
Submitter: Zuul
Branch: master

commit 5142fac49806c8b823c50be52119c878841f0955
Author: Eric MacDonald <email address hidden>
Date: Thu Sep 20 14:21:32 2018 -0400

    Make collectd alarm notifier retry alarm clear attempts that fail

    The Starling-X collectd alarm notification handler Fault Manager (FM)
    call to clear an alarm can lead to a stuck alarm if that FM request
    fails, say due to a concurrent swact operation, and the clear is not
    retried.

    The alarm will remain stuck until there is another same alarm assertion,
    followed by deassertion that leads to a successful clear.

    The fix is to execute a 'return' in the alarm clear failure path so
    that the alarm notifier's alarm manager control structure is not
    updated with the clear state so that the clear will be automatically
    retried on the next audit interval.

    Change-Id: Iddf4e0e7b99eab0bf0748230a25851419e7c06fa
    Closes-Bug: 1793314
    Signed-off-by: Eric MacDonald <email address hidden>
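
The retry behavior described in the commit message can be illustrated with the minimal sketch below. It is not the stx-integ code: the `AlarmNotifier` class, its `asserted` set and the `clear_fault` callable are hypothetical stand-ins for the notifier's alarm manager control structure and the FM clear request, and the real plugin may be organized quite differently.

```python
class AlarmNotifier:
    """Hypothetical stand-in for the collectd alarm notifier's alarm state."""

    def __init__(self, clear_fault):
        # clear_fault(alarm_id) -> bool stands in for the FM clear request;
        # it returns False when the request fails (e.g. during a swact).
        self.clear_fault = clear_fault
        self.asserted = set()  # alarms the notifier believes are raised

    def audit(self, alarm_id, reading, threshold):
        """Run once per audit interval for alarm_id."""
        if reading >= threshold:
            # Threshold exceeded: remember the alarm as asserted.
            self.asserted.add(alarm_id)
            return
        if alarm_id not in self.asserted:
            return  # nothing to clear
        if not self.clear_fault(alarm_id):
            # Clear request failed: return WITHOUT updating local state,
            # so the next audit sees the alarm as asserted and retries.
            return
        # Only record the clear once FM has accepted it.
        self.asserted.discard(alarm_id)
```

Without the early return, the local state would be marked clear even though FM still held the alarm, and no later audit would attempt the clear again, which is the stuck-alarm condition reported in this bug.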

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.metal
Ghada Khalil (gkhalil) wrote :

Targeting stx.2018.10 - the fix is simple and would avoid this alarm condition

tags: added: stx.2018.10
Ken Young (kenyis)
tags: added: stx.1.0
removed: stx.2018.10