Stuck configuration failure alarm after AIO DX install
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | Low | Eric MacDonald | |
Bug Description
An AIO DX install as a Distributed Cloud experienced a controller-1 configuration error that self-corrected on a re-enable retry.
However, the configuration failure alarm that was raised in the first enable attempt was not cleared in the second attempt, which succeeded.
Looking at the logs, I see that a clause in the in-service test handler qualified and switched the enable FSM to run the subfunction handler before the enable handler cleared the existing configuration failure alarm.
2020-02-
2020-02-
The issue is easy to fix. However, it's not clear how the in-service test clause qualified to trigger that action, which is the critical part of the investigation.
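The ordering problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the maintenance code itself: the `Host` class, the handler functions, the `clear_before_switch` flag and the alarm name are all hypothetical, and clearing the alarm before the FSM switch is just one possible shape of a fix, not necessarily the one that was released.

```python
from dataclasses import dataclass, field

CONFIG_ALARM = "config-failure"  # hypothetical alarm name, not a real FM alarm ID


@dataclass
class Host:
    """Toy stand-in for a controller tracked by the maintenance FSM."""
    alarms: set = field(default_factory=set)
    handler: str = "enable"  # which handler the FSM runs next


def insv_test_handler(host, clear_before_switch):
    """In-service test clause that hands the FSM to the subfunction handler."""
    if clear_before_switch:
        # Possible fix (assumption): clear the stale alarm before leaving the enable path.
        host.alarms.discard(CONFIG_ALARM)
    host.handler = "subfunction"


def enable_handler(host):
    """Enable handler; it only clears the alarm if it still owns the FSM."""
    if host.handler == "enable":
        host.alarms.discard(CONFIG_ALARM)


def retry_enable(clear_before_switch):
    """Second enable attempt in which the in-service test clause qualifies first."""
    host = Host(alarms={CONFIG_ALARM})  # alarm raised by the first, failed attempt
    insv_test_handler(host, clear_before_switch)
    enable_handler(host)
    return host.alarms
```

In this model, `retry_enable(False)` leaves the alarm set even though the retry succeeds, which mirrors the stuck alarm; `retry_enable(True)` returns an empty set. The model does not explain why the in-service test clause qualified in the first place, which remains the open question.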
Severity
--------
Minor with the following reasoning ...
The issue can only occur in an AIO system, and only after it has experienced a configuration failure that resolves on a subsequent re-enable retry. It appears to be a race condition between the enable handler and the in-service test handler that requires as many as 7 'if clause' conditions to be met, 2 of which should not have qualified. This is where the investigation will focus.
The alarm can be manually deleted through the fm CLI.
Steps to Reproduce
------------------
Create an AIO C1 configuration failure that eventually recovers on its own over enable retries.
Expected Behavior
------------------
No stuck alarm
Actual Behavior
----------------
Stuck configuration failure alarm
Reproducibility
---------------
Seen once
System Configuration
-------
AIO DX Distributed Cloud System Controller ; IPv6
[sysadmin@
+------
| Property | Value |
+------
| contact | None |
| created_at | 2020-02-
| description | None |
| distributed_
| https_enabled | False |
| location | None |
| name | dc-system-
| region_name | RegionOne |
| sdn_enabled | False |
| security_feature | spectre_meltdown_v1 |
| service_
| software_version | 20.02 |
| system_mode | duplex |
| system_type | All-in-one |
| timezone | UTC |
| updated_at | 2020-02-
| uuid | f38c692f-
| vswitch_type | none |
+------
Branch/Pull Time/Commit
-------
SW_VERSION="20.02"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID=
SRC_BUILD_ID="8"
JOB="WRCP_
BUILD_BY="jenkins"
BUILD_NUMBER="8"
BUILD_HOST=
BUILD_DATE=
Last Pass
---------
Every other time. Only ever seen once.
Timestamp/Logs
--------------
2020-02-
2020-02-
2020-02-
2020-02-
2020-02-
2020-02-
2020-02-
2020-02-
2020-02-
2020-02-
2020-02-
Test Activity
-------------
Feature Testing
Workaround
----------
Manually delete alarm
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
low priority / not gating -- double fault scenario; unlikely to occur