Host stuck with mtce enable alarm/degrade if mtcAgent restarts late in graceful recovery
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Low | Eric MacDonald | | |
Bug Description
A host can get stuck in an unlocked-
The recovery_handler sets the host state to 'enabled' following the start of an 11 second heartbeat soak.
If the mtcAgent is then restarted during that 11 second soak, it loads the existing alarms when it comes back up, sees that the host is in the unlocked-enabled state, and carries on without clearing the alarm, which leaves it stuck.
Had the mtcAgent not been restarted, the 11 second heartbeat soak would have completed and the 'critical enable' alarm and degrade would have been cleared as expected.
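The race described above can be illustrated with a toy model. This is only a sketch and not the mtcAgent code, which is not shown in this report: the `Host` class and the functions `recovery_as_described` and `add_handler_as_described` are hypothetical names, and the alarm ID 200.004 is taken from the workaround commands in this report.

```python
from dataclasses import dataclass, field

ENABLE_ALARM = "200.004"  # alarm ID as it appears in the workaround in this report


@dataclass
class Host:
    """Hypothetical stand-in for the per-host state the agent tracks."""
    name: str
    admin: str = "unlocked"
    oper: str = "disabled"
    degraded: bool = True
    alarms: set = field(default_factory=lambda: {ENABLE_ALARM})


def recovery_as_described(host, agent_restarted_during_soak):
    """Model of the reported ordering: 'enabled' is set before the soak ends."""
    host.oper = "enabled"              # state changes while the soak is still running
    if agent_restarted_during_soak:
        return                         # process is gone; the steps below never run
    host.alarms.discard(ENABLE_ALARM)  # soak finished: clear the alarm ...
    host.degraded = False              # ... and the degrade


def add_handler_as_described(host):
    """Model of startup: an unlocked-enabled host is accepted, alarms untouched."""
    if host.admin == "unlocked" and host.oper == "enabled":
        return "in-service"
    return "needs-enable"


host = Host("compute-0")
recovery_as_described(host, agent_restarted_during_soak=True)
print(add_handler_as_described(host), sorted(host.alarms), host.degraded)
```

In this model the restarted agent accepts the host as in service while the alarm and degrade loaded from before the restart are still present, which is the stuck condition reported here.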
There are 2 issues that need to be addressed here.
1. The recovery_handler should not change the host state to 'enabled' until 'after' it clears the critical enable alarm.
2. If the mtcAgent comes up and the 'add_handler' FSM sees a host state/alarm conflict, such as a critical alarm raised against a host that is in the 'unlocked-enabled' state, then the add handler should prioritise the host state over the alarm, assume that the host alarm is stale, and clear it.
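The two fixes listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual change: `Host`, `finish_recovery` and `add_handler_reconcile` are hypothetical names, the alarm ID 200.004 comes from the workaround commands in this report, and clearing the degrade together with the stale alarm is an assumption of the sketch.

```python
from dataclasses import dataclass, field

ENABLE_ALARM = "200.004"  # alarm ID as it appears in the workaround in this report


@dataclass
class Host:
    """Hypothetical stand-in for the per-host state the agent tracks."""
    name: str
    admin: str = "unlocked"
    oper: str = "disabled"
    degraded: bool = False
    alarms: set = field(default_factory=set)


def finish_recovery(host):
    """Fix 1: clear the alarm and degrade first, only then report 'enabled'."""
    host.alarms.discard(ENABLE_ALARM)
    host.degraded = False
    host.oper = "enabled"


def add_handler_reconcile(host):
    """Fix 2: on startup, treat the host state as authoritative over the alarm."""
    conflict = (
        host.admin == "unlocked"
        and host.oper == "enabled"
        and ENABLE_ALARM in host.alarms
    )
    if conflict:
        host.alarms.discard(ENABLE_ALARM)  # alarm is assumed to be stale
        host.degraded = False              # assumption: degrade is cleared with it
    return conflict


stale = Host("compute-0", oper="enabled", degraded=True, alarms={ENABLE_ALARM})
print(add_handler_reconcile(stale), stale.alarms, stale.degraded)
```

With the ordering in `finish_recovery`, a restart before the soak completes would find the host still 'disabled' with its alarm raised, which is a consistent pair rather than a conflict. The check in `add_handler_reconcile` covers any conflict that still slips through.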
Inspecting the code in the recovery handler, it appears that there are two 11 second heartbeat soaks back to back. An additional fix is to remove the second one; it is not needed. One must have been added without realising that the other existed. Having both of them just makes the failure window for this issue wider.
Severity
--------
Minor: This condition is unlikely to occur because it requires 2 faults in close proximity:
1. heartbeat failure of a host
2. system fault that causes SM to restart the mtcAgent in a small 11 second recovery window.
Steps to Reproduce
------------------
Reboot an in-service host to cause graceful recovery of that host.
Wait for the rebooted host to recover to the point of this log and then kill mtcAgent.
2020-12-
2020-12-
2020-12-
2020-12-
2020-12-
2020-12-
After this one
2020-12-
Expected Behavior
------------------
No stuck alarm or degrade
Actual Behavior
----------------
Stuck alarm and degrade
Reproducibility
---------------
100% with the above stated conditions met
System Configuration
-------
Any system other than AIO SX
Branch/Pull Time/Commit
-------
starlingx/master at time of this issue creation
Last Pass
---------
Never observed or tested for.
Timestamp/Logs
--------------
See collect logs for this LP title @ https:/
Test Activity
-------------
Feature Testing
Workaround
----------
Lock and unlock the host.
Changed in starlingx: | |
assignee: | nobody → Eric MacDonald (rocksolidmtce) |
tags: | added: stx.metal |
Changed in starlingx: | |
importance: | Undecided → Low |
status: | New → Triaged |
Another workaround is to manually delete the alarm and restart mtcAgent:
> fm alarm-list --uuid
> fm alarm-delete <uuid of 200.004 alarm>
> sudo sm-restart-safe service mtc-agent