Host stuck with mtce enable alarm/degrade if mtcAgent restarts late in graceful recovery

Bug #1906728 reported by Eric MacDonald
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Eric MacDonald

Bug Description

A host can get stuck in an unlocked-enabled-degraded state, with a 200.004 Critical In-Service Failure alarm raised (stuck), if the mtcAgent process is restarted late in the graceful recovery handling that follows an in-service heartbeat failure.

The recovery_handler sets the host state to 'enabled' as soon as the 11 second heartbeat soak starts, rather than after it completes.

If the mtcAgent is then restarted during that 11 second soak, when it comes back up it loads the existing alarms, sees that the host is in the unlocked-enabled state, and leaves the critical enable alarm raised (stuck).

Had the mtcAgent not been restarted, the 11 second heartbeat soak would have completed and the 'critical enable' alarm and degrade would have been cleared as expected.
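
To make the failure window concrete, here is a minimal, self-contained C++ sketch of the ordering described above. The Host struct, its field names and the recovery_handler signature are illustrative assumptions for this report, not the actual mtcAgent code.

// Minimal model of the ordering described above. All names here are
// hypothetical; they do not match the real mtcAgent implementation.
#include <iostream>
#include <string>

struct Host {
    std::string admin_oper = "unlocked-disabled"; // recovering from heartbeat failure
    bool critical_enable_alarm = true;            // 200.004 raised on the failure
};

// Current ordering: the host is marked 'enabled' when the 11 second
// heartbeat soak starts; the alarm is only cleared after the soak completes.
void recovery_handler(Host& h, bool agent_restarted_during_soak)
{
    h.admin_oper = "unlocked-enabled";            // (1) state change first
    // ... 11 second heartbeat soak runs here ...
    if (agent_restarted_during_soak)
        return;                                   // process dies before step (2)
    h.critical_enable_alarm = false;              // (2) alarm cleared last
}

int main()
{
    Host h;
    recovery_handler(h, /*agent_restarted_during_soak=*/true);
    // Prints: unlocked-enabled alarm=raised (stuck)
    std::cout << h.admin_oper << " alarm="
              << (h.critical_enable_alarm ? "raised (stuck)" : "cleared") << "\n";
    return 0;
}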

There are 2 issues that need to be addressed here (both are sketched in code after the list).

1. The recovery_handler should not change the host state to 'enabled' until after it clears the critical enable alarm.

2. If the mtcAgent comes up and the 'add_handler' FSM sees a host state/alarm conflict, such as a critical alarm raised against a host that is in the 'unlocked-enabled' state, then the add handler should prioritise the host state over the alarm, assume the alarm is stale, and clear it.
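
A rough sketch of both corrections, again using hypothetical names rather than the real recovery_handler/add_handler code, and assuming only the ordering and reconciliation rules described above:

#include <iostream>
#include <string>

struct Host {
    std::string admin_oper = "unlocked-disabled";
    bool critical_enable_alarm = true;            // 200.004
};

// Fix 1: clear the critical enable alarm (and degrade) before publishing
// the 'enabled' state, so a restart can never observe enabled + alarm raised.
void recovery_handler_fixed(Host& h)
{
    // ... single 11 second heartbeat soak completes here ...
    h.critical_enable_alarm = false;              // clear the alarm first
    h.admin_oper = "unlocked-enabled";            // then declare the host enabled
}

// Fix 2: on mtcAgent startup, if a loaded alarm conflicts with the host
// state, trust the state and treat the alarm as stale.
void add_handler(Host& h)
{
    if (h.admin_oper == "unlocked-enabled" && h.critical_enable_alarm) {
        std::cout << "clearing stale critical enable alarm on "
                  << h.admin_oper << " host\n";
        h.critical_enable_alarm = false;
    }
}

int main()
{
    Host stuck;                         // state left behind by the failure window
    stuck.admin_oper = "unlocked-enabled";
    add_handler(stuck);                 // fix 2 reconciles it on process add

    Host recovering;
    recovery_handler_fixed(recovering); // fix 1 removes the window entirely
    return 0;
}

In both cases the invariant is the same: the unlocked-enabled state should only ever be observable together with a cleared (or reconciled) critical enable alarm.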

Inspecting the code in the recovery handler, it appears there are two 11 second heartbeat soaks back to back. An additional fix is to remove the second one; it is not needed. One was likely added without realising the other existed, and having both only widens the failure window for this issue.

Severity
--------
Minor: This condition is unlikely to occur because it requires 2 faults in close proximity:
 1. a heartbeat failure of a host
 2. a system fault that causes SM to restart the mtcAgent within the small 11 second recovery window.

Steps to Reproduce
------------------
Reboot an in-service host to cause graceful recovery of that host.
Wait for the rebooted host to recover to the point of the logs below, then kill the mtcAgent.

2020-12-03T21:56:50.255 [2483843.00424] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (2489) recovery_handler : Info : controller-1 Starting 11 sec Heartbeat Soak (with ready event)
2020-12-03T21:56:50.255 [2483843.00425] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 971) service_events : Info : controller-1 heartbeat minor clear (Mgmnt)
2020-12-03T21:56:50.255 [2483843.00426] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 971) service_events : Info : controller-1 heartbeat degrade clear (Mgmnt)
2020-12-03T21:56:50.255 [2483843.00427] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 971) service_events : Info : controller-1 heartbeat minor clear (Clstr)
2020-12-03T21:56:50.255 [2483843.00428] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 971) service_events : Info : controller-1 heartbeat degrade clear (Clstr)
2020-12-03T21:57:01.260 fmAPI.cpp(490): Enqueue raise alarm request: UUID (729508fa-f1f2-499d-b0be-bb7583c4f2f9) alarm id (200.022) instant id (host=controller-1.state=enabled)

Specifically, kill the mtcAgent after this log:

2020-12-03T21:57:01.260 [2483843.00429] controller-0 mtcAgent hbs nodeClass.cpp (6190) allStateChange : Info : controller-1 unlocked-enabled-available (seq:49)

Expected Behavior
------------------
No stuck alarm or degrade

Actual Behavior
----------------
Stuck alarm and degrade

Reproducibility
---------------
100% with the above stated conditions met

System Configuration
--------------------
Any system other than AIO SX

Branch/Pull Time/Commit
-----------------------
starlingx/master at time of this issue creation

Last Pass
---------
Never observed or tested for.

Timestamp/Logs
--------------
See collect logs for this LP title @ https://files.starlingx.kube.cengn.ca/

Test Activity
-------------
Feature Testing

Workaround
----------
Lock and unlock the host.

Tags: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Another workaround is to manually delete the alarm and restart the mtcAgent:

> fm alarm-list --uuid

> fm alarm-delete <uuid of 200.004 alarm>

> sudo sm-restart-safe service mtc-agent

Ghada Khalil (gkhalil)
tags: added: stx.metal
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

The following 2 merged updates address this issue:

update: Add in-service test to clear stale config failure alarm
review: https://review.opendev.org/c/starlingx/metal/+/783388
commit: https://opendev.org/starlingx/metal/commit/031818e55bc255b59e486ebf6faadf4b784c93fe

update: Fix Graceful Recovery handling while in Graceful Recovery handling
review: https://review.opendev.org/c/starlingx/metal/+/780976
commit: https://opendev.org/starlingx/metal/commit/5c83453fdf8775e5d776a02a2b5c06810d84cb55

Changed in starlingx:
status: Triaged → Fix Released