Temporary 200.004 critical alarm raised over unlock due to slow shutdown

Bug #2024249 reported by Eric MacDonald
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
A slower than usual shutdown sequence leading into a reboot of a remote host can trick the offline handler into declaring the node offline before it actually is.

In one example, a latent mtcClient took 8 seconds to stop over an unlock reboot. The offline handler requests mtcAlive every 100 ms for at least 5 seconds and restarts the countdown if it receives a mtcAlive response at any time during that 5 second window. In this case it saw no mtcAlive responses from the host during that period and declared the host offline. Then, 2 seconds after the offline declaration, maintenance received a mtcAlive whose uptime indicated that the node had not rebooted, so mtcAgent failed the node, raised the alarm and retried the full enable sequence. The retry reset failed because the node was already resetting by that time. The node recovered normally from the initial reboot and went enabled.
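
The countdown behavior described above can be modeled with a short simulation. This is a simplified sketch in Python, not the mtce implementation (which is C++): the function name `offline_declared_at_ms`, the `horizon_ms` parameter and the 6200 ms arrival time of the late mtcAlive are assumptions chosen for illustration. It counts only nominal 100 ms poll intervals, so a threshold of 42 comes out at 4.2 s rather than the roughly 5 s seen in practice.

```python
from typing import Iterable, Optional

POLL_INTERVAL_MS = 100  # from the report: mtcAlive is requested every 100 ms


def offline_declared_at_ms(
    alive_times_ms: Iterable[int],
    threshold: int,
    horizon_ms: int = 60_000,  # hypothetical cap on how long the model runs
) -> Optional[int]:
    """Return the time (ms) at which the host is declared offline.

    alive_times_ms: times at which a mtcAlive response arrives.
    threshold: number of consecutive silent polls needed to declare offline.
    Returns None if the host is never declared offline within horizon_ms.
    """
    alive = list(alive_times_ms)
    misses = 0
    for t in range(POLL_INTERVAL_MS, horizon_ms + 1, POLL_INTERVAL_MS):
        # Did any mtcAlive arrive since the previous poll?
        if any(t - POLL_INTERVAL_MS < a <= t for a in alive):
            misses = 0  # a response restarts the countdown
        else:
            misses += 1
            if misses >= threshold:
                return t
    return None


# Illustrative timing: one late mtcAlive arrives 6.2 s after the host goes quiet.
late_alive = [6200]
print(offline_declared_at_ms(late_alive, threshold=42))  # declared before the late mtcAlive
print(offline_declared_at_ms(late_alive, threshold=90))  # late mtcAlive restarts the countdown
```

In this model the smaller threshold declares the host offline before the late mtcAlive arrives, which is the condition that led to the transient alarm. With the larger threshold the late response lands inside the window and restarts the countdown, so offline is declared later but only after a full silent period.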

Severity
--------
Minor: transient alarm that auto clears.

An unexpected alarm was raised due to the mtcAlive timing issue described above. However, node recovery was not impacted because no second reset/reboot of the node actually occurred.

Steps to Reproduce
------------------
lock host that is very busy

Expected Behavior
------------------
No transient alarm

Actual Behavior
----------------
Transient alarm seen

Reproducibility
---------------
Very low (< 1%)

System Configuration
--------------------
Any

Branch/Pull Time/Commit
-----------------------
Any release of Debian prior to this bug report resolution

Last Pass
---------
Had not failed before.

Timestamp/Logs
--------------
2022-12-01T19:35:45.093 [76269.00915] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1122) enable_handler :Error : compute-3 uptime is more than 1440 seconds ; host did not reboot
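
When checking other systems for the same symptom, the message portion of this log line can be matched with a regular expression. The sketch below is Python and assumes the message wording is exactly as in the single line shown above; the helper name `parse_no_reboot` is hypothetical and the pattern has not been checked against other mtcAgent log variants.

```python
import re
from typing import Optional, Tuple

# Matches only the message text seen in the log line above.
NO_REBOOT_RE = re.compile(
    r"(?P<host>\S+) uptime is more than (?P<seconds>\d+) seconds ; host did not reboot"
)


def parse_no_reboot(line: str) -> Optional[Tuple[str, int]]:
    """Return (hostname, uptime_seconds) if the line reports a host that did not reboot."""
    match = NO_REBOOT_RE.search(line)
    if match is None:
        return None
    return match.group("host"), int(match.group("seconds"))


sample = (
    "2022-12-01T19:35:45.093 [76269.00915] controller-0 mtcAgent hdl "
    "mtcNodeHdlrs.cpp (1122) enable_handler :Error : compute-3 uptime is "
    "more than 1440 seconds ; host did not reboot"
)
print(parse_no_reboot(sample))
```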

Test Activity
-------------
Regression Testing

Workaround
----------
None required

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/886289

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/886289
Committed: https://opendev.org/starlingx/metal/commit/d863aea1729f0aa6cb17bbf8aec5f9fd3e9cc371
Submitter: "Zuul (22348)"
Branch: master

commit d863aea1729f0aa6cb17bbf8aec5f9fd3e9cc371
Author: Eric MacDonald <email address hidden>
Date: Fri Jun 16 18:11:53 2023 +0000

    Increase mtce host offline threshold to handle slow host shutdown

    Mtce polls/queries the remote host for mtcAlive messages
    for 42 x 100 ms intervals over unlock or host failed cases.
    Absence of mtcAlive during this (~5 sec) period indicates
    the node is offline.

    However, in the rare case where shutdown is slow, 5 seconds
    is not long enough. Rare cases have been seen where 7 or 8
    second wait time is required to properly declare offline.

    To avoid the rare transient 200.004 host alarm over an
    unlock operation, this update increases the mtce host
    offline window from 5 to 10 seconds (approx) by modifying
    the mtce configuration file offline threshold from 42 to 90.

    Test Plan:

    PASS: Verify unchallenged failed to offline period to be ~10 secs
    PASS: Verify algorithm restarts if there is mtcAlive received
          anytime during the polls/queries (challenge) window.
    PASS: Verify challenge handling leads to a longer but
          successful offline declaration.
    PASS: Verify above handling for both unlock and spontaneous
          failure handling cases.

    Closes-Bug: 2024249
    Change-Id: Ice41ed611b4ba71d9cf8edbfe98da4b65dcd05cf
    Signed-off-by: Eric MacDonald <email address hidden>
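
The arithmetic behind the threshold change in the commit message above can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and the calculation uses only the nominal 100 ms poll interval, so the results (4.2 s and 9.0 s) are lower bounds on the approximate 5 and 10 second windows the commit message quotes.

```python
POLL_INTERVAL_MS = 100  # from the commit message: polls at 100 ms intervals


def nominal_offline_window_s(threshold: int, poll_interval_ms: int = POLL_INTERVAL_MS) -> float:
    """Nominal silent time, in seconds, required before declaring a host offline."""
    return threshold * poll_interval_ms / 1000


def covers_shutdown(threshold: int, shutdown_s: float) -> bool:
    """True if the nominal window is at least as long as the observed shutdown time."""
    return nominal_offline_window_s(threshold) >= shutdown_s


for threshold in (42, 90):
    window = nominal_offline_window_s(threshold)
    # 8 s is the slowest shutdown mentioned in the report and the commit message.
    print(threshold, window, covers_shutdown(threshold, shutdown_s=8))
```

Under this nominal model only the threshold of 90 covers the 7 to 8 second shutdowns mentioned in the commit message.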

Changed in starlingx:
status: In Progress → Fix Released
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
importance: Undecided → Medium
tags: added: stx.9.0 stx.metal