Stuck configuration failure alarm after AIO DX install

Bug #1864888 reported by Eric MacDonald
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Eric MacDonald

Bug Description

An AIO DX install, deployed as a Distributed Cloud system controller, experienced a controller-1 configuration error that self-corrected on a re-enable retry.

However, the configuration failure alarm raised during the first enable attempt was not cleared by the second, successful attempt.

Looking at the logs, I see that a clause in the in-service test handler qualified and switched the enable FSM to run the subfunction handler before the enable handler cleared the existing configuration failure alarm.

2020-02-24T21:01:35.567 [121293.00403] controller-0 mtcAgent hbs nodeClass.cpp (6190) allStateChange : Info : controller-1 unlocked-enabled-degraded (seq:43)
2020-02-24T21:01:35.578 [121293.00404] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (7240) insv_test_handler : Info : controller-1-worker ... running recovery enable

The issue is easy to fix. However, it is not clear how the in-service test clause qualified to trigger that action; determining that is the main part of the investigation.
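The handler race described above can be sketched roughly as follows. This is a minimal illustration with hypothetical names, not the actual mtcAgent code: the enable handler raises a configuration failure alarm on the failed first attempt and only clears it late in a successful retry, so if the in-service test handler preempts the FSM into the subfunction handler first, the clear step is never reached.

```cpp
// Hypothetical sketch of the reported race; names do not match mtcAgent.
enum class Handler { Enable, InsvTest, Subf };

struct Node {
    Handler fsm = Handler::Enable;
    bool config_alarm = false;   // configuration failure alarm state
    bool config_ok    = false;
};

// First enable attempt: configuration fails and the alarm is raised.
void enable_attempt_fail(Node &n) {
    n.config_alarm = true;       // alarm raised; an enable retry follows
}

// In-service test handler: if its qualification clauses pass, it switches
// the FSM straight to the subfunction enable handler.
void insv_test_preempt(Node &n) {
    n.fsm = Handler::Subf;       // "running recovery enable"
}

// Enable retry: configuration now succeeds, but the stale alarm is only
// cleared if the enable handler is still the one driving the FSM.
void enable_retry(Node &n) {
    n.config_ok = true;
    if (n.fsm == Handler::Enable)
        n.config_alarm = false;  // never reached after the preempt
}
```

With the preempt ordered before the retry's clear step, the node ends up enabled and correctly configured but with the alarm still raised, matching the reported symptom.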

Severity
--------
Minor with the following reasoning ...

The issue can only occur on an AIO system, and only after it has experienced a configuration failure that resolves on a subsequent re-enable retry. It appears to be a race condition between the enable handler and the in-service test handler that requires as many as seven 'if clause' conditions to be met, two of which should not have qualified. This is where the investigation will focus.

The alarm can be manually deleted through the fm CLI.

Steps to Reproduce
------------------
Create an AIO controller-1 configuration failure that eventually recovers on its own over enable retries.

Expected Behavior
------------------
No stuck alarm

Actual Behavior
----------------
Stuck configuration failure alarm

Reproducibility
---------------
Seen once

System Configuration
--------------------
AIO DX Distributed Cloud system controller; IPv6

[sysadmin@controller-0 ~(keystone_admin)]$ system show
+------------------------+--------------------------------------+
| Property | Value |
+------------------------+--------------------------------------+
| contact | None |
| created_at | 2020-02-24T19:58:53.410307+00:00 |
| description | None |
| distributed_cloud_role | systemcontroller |
| https_enabled | False |
| location | None |
| name | dc-system-controller |
| region_name | RegionOne |
| sdn_enabled | False |
| security_feature | spectre_meltdown_v1 |
| service_project_name | services |
| software_version | 20.02 |
| system_mode | duplex |
| system_type | All-in-one |
| timezone | UTC |
| updated_at | 2020-02-24T20:11:13.940456+00:00 |
| uuid | f38c692f-05f5-4886-93ec-3557882a007d |
| vswitch_type | none |
+------------------------+--------------------------------------+

Branch/Pull Time/Commit
-----------------------
SW_VERSION="20.02"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2020-02-22_11-13-25"
SRC_BUILD_ID="8"

JOB="WRCP_20.02_Build"
BUILD_BY="jenkins"
BUILD_NUMBER="8"
BUILD_HOST="yow-cgts4-lx.wrs.com"
BUILD_DATE="2020-02-22 11:15:22 -0500"

Last Pass
---------
Passed every other time; the failure has only ever been seen once.

Timestamp/Logs
--------------

2020-02-24T21:01:35.567 [121293.00402] controller-0 mtcAgent inv mtcInvApi.cpp ( 437) mtcInvApi_force_task : Info : controller-1 task clear (seq:42) (was:Enabling)
2020-02-24T21:01:35.567 fmAPI.cpp(490): Enqueue raise alarm request: UUID (bc117117-62c2-4a59-a863-2c1e47b6e879) alarm id (200.022) instant id (host=controller-1.state=enabled)
2020-02-24T21:01:35.567 [121293.00403] controller-0 mtcAgent hbs nodeClass.cpp (6190) allStateChange : Info : controller-1 unlocked-enabled-degraded (seq:43)
2020-02-24T21:01:35.578 [121293.00404] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (7240) insv_test_handler : Info : controller-1-worker ... running recovery enable
2020-02-24T21:01:35.578 [121293.00405] controller-0 mtcAgent hbs nodeClass.cpp (1796) alarm_compute_clear : Info : controller-1 major enable alarm clear
2020-02-24T21:01:35.578 [121293.00406] controller-0 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : controller-1 clearing 'Compute Function' alarm (200.013)
2020-02-24T21:01:35.578 fmAPI.cpp(512): Enqueue clear alarm request: alarm id (200.013), instant id (host=controller-1)
2020-02-24T21:01:35.578 [121293.00407] controller-0 mtcAgent |-| mtcSubfHdlrs.cpp ( 83) enable_subf_handler : Info : controller-1-worker Subf Enable FSM (from start)
2020-02-24T21:01:35.601 fmAlarmUtils.cpp(624): Sending FM raise alarm request: alarm_id (200.022), entity_id (host=controller-1.state=enabled)
2020-02-24T21:01:35.630 [121293.00408] controller-0 mtcAgent |-| mtcSubfHdlrs.cpp ( 113) enable_subf_handler : Info : controller-1-worker Subf Configured OK
2020-02-24T21:01:35.641 [121293.00409] controller-0 mtcAgent hdl mtcSubfHdlrs.cpp ( 194) enable_subf_handler : Info : controller-1-worker running out-of-service tests

Test Activity
-------------
Feature Testing

Workaround
----------
Manually delete the alarm via the fm CLI.

Tags: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

low priority / not gating -- double fault scenario; unlikely to occur

Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
tags: added: stx.metal
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

A fix for this issue has been submitted with the following details:

update: Add in-service test to clear stale config failure alarm
review: https://review.opendev.org/c/starlingx/metal/+/783388
commit: https://opendev.org/starlingx/metal/commit/031818e55bc255b59e486ebf6faadf4b784c93fe
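The fix direction named in the update line ("Add in-service test to clear stale config failure alarm") can be sketched as an audit pass: if a node's configuration has recovered but a configuration failure alarm is still raised, clear it. This is a hypothetical illustration under assumed names, not the submitted change itself.

```cpp
// Hypothetical sketch of a stale-alarm audit; names do not match mtcAgent.
struct NodeState {
    bool unlocked_enabled = true;
    bool config_ok        = true;   // configuration recovered on retry
    bool config_alarm     = true;   // stale alarm left from the failed attempt
};

// In-service test pass: returns true if it cleared a stale config alarm.
bool insv_clear_stale_config_alarm(NodeState &n) {
    if (n.unlocked_enabled && n.config_ok && n.config_alarm) {
        n.config_alarm = false;     // the real code would issue an FM clear here
        return true;
    }
    return false;
}
```

Because the check runs from the in-service test handler, it catches the alarm regardless of which handler won the original race.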

Changed in starlingx:
status: Triaged → Fix Released