Temporary 200.004 critical alarm raised over unlock due to slow shutdown
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | Medium | Eric MacDonald | |
Bug Description
Brief Description
-----------------
A slower-than-usual shutdown sequence leading into a reboot of a remote host can trick the offline handler into seeing the node as offline before it actually is. In one example, a latent (slow to stop) mtcClient took 8 seconds to stop over an unlock reboot. The offline handler, which requests mtcAlive every 100 ms for at least 5 seconds, saw no mtcAlive responses from the host during that period. The algorithm restarts the countdown if it receives a mtcAlive response at any time during that 5-second offline countdown. Then, 2 seconds after the host was declared offline, a mtcAlive was received. Maintenance caught it and, noticing that the uptime in that message indicated the node had not rebooted, mtcAgent failed the node, raised the alarm and retried the full enable sequence. The retry reset failed because by that time the node was already resetting. The node recovered normally from the initial reboot and went enabled.
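The timing described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the mtcAgent implementation: the function and constant names are hypothetical, time is modelled as discrete 100 ms ticks, and the uptime comparison is a simplified stand-in for whatever check maintenance actually performs. The 100 ms request interval and the 5 second threshold are taken from the description.

```python
from typing import Iterable, Optional

# Values taken from the bug description; the names are hypothetical.
REQUEST_INTERVAL_MS = 100     # mtcAlive is requested every 100 ms
OFFLINE_THRESHOLD_MS = 5000   # continuous silence needed to declare offline


def offline_declared_at(alive_times_ms: Iterable[int],
                        horizon_ms: int) -> Optional[int]:
    """Return the time (ms) at which the host is declared offline, or None.

    alive_times_ms holds the times at which mtcAlive responses arrive,
    assumed here to fall on 100 ms tick boundaries.
    """
    alive = set(alive_times_ms)
    silent_ms = 0
    for now in range(REQUEST_INTERVAL_MS, horizon_ms + 1, REQUEST_INTERVAL_MS):
        if now in alive:
            silent_ms = 0  # any mtcAlive response restarts the countdown
        else:
            silent_ms += REQUEST_INTERVAL_MS
            if silent_ms >= OFFLINE_THRESHOLD_MS:
                return now
    return None


def looks_not_rebooted(reported_uptime_s: int, uptime_before_s: int) -> bool:
    """Simplified assumption: an uptime that did not reset means no reboot yet."""
    return reported_uptime_s >= uptime_before_s


if __name__ == "__main__":
    # mtcClient goes quiet at t=0 but sends one late mtcAlive at t=7000 ms.
    declared = offline_declared_at([7000], horizon_ms=10000)
    print("declared offline at (ms):", declared)
    print("late mtcAlive arrives (ms) after declaration:", 7000 - declared)
    # Hypothetical uptimes: 3600 s before the reboot request, 3607 s reported.
    print("host looks not rebooted:", looks_not_rebooted(3607, 3600))
```

In this model a response inside the 5 second window only pushes the declaration out, whereas a response that arrives after the window has expired lands on a host already considered offline, which is the case that raised the transient alarm.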
Severity
--------
Minor: transient alarm that auto clears.
An unexpected alarm was raised due to the mtcAlive trickery mentioned above. However, there was no impact to node recovery, as there was no real second reset/reboot of the node.
Steps to Reproduce
------------------
lock host that is very busy
Expected Behavior
------------------
No transient alarm
Actual Behavior
----------------
Transient alarm seen
Reproducibility
---------------
very low < 1%
System Configuration
-------
Any
Branch/Pull Time/Commit
-------
Any release of Debian prior to this bug report resolution
Last Pass
---------
Had not failed before.
Timestamp/Logs
--------------
2022-12-
Test Activity
-------------
Regression Testing
Workaround
----------
None required
Changed in starlingx: | |
assignee: | nobody → Eric MacDonald (rocksolidmtce) |
importance: | Undecided → Medium |
tags: | added: stx.9.0 stx.metal |
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/886289