Comment 2 for bug 2024249

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/886289
Committed: https://opendev.org/starlingx/metal/commit/d863aea1729f0aa6cb17bbf8aec5f9fd3e9cc371
Submitter: "Zuul (22348)"
Branch: master

commit d863aea1729f0aa6cb17bbf8aec5f9fd3e9cc371
Author: Eric MacDonald <email address hidden>
Date: Fri Jun 16 18:11:53 2023 +0000

    Increase mtce host offline threshold to handle slow host shutdown

    During unlock or host failure handling, mtce polls the remote
    host for mtcAlive messages over 42 x 100 ms intervals. Absence
    of mtcAlive throughout this (~5 sec) period indicates that
    the node is offline.

    However, in the rare case where shutdown is slow, 5 seconds
    is not long enough. Cases have been seen where a 7 or 8
    second wait is required to properly declare the host offline.

    To avoid the rare transient 200.004 host alarm over an
    unlock operation, this update increases the mtce host
    offline window from 5 to 10 seconds (approx) by raising the
    offline threshold in the mtce configuration file from 42 to 90.

    Test Plan:

    PASS: Verify the unchallenged failed-to-offline period is ~10 secs
    PASS: Verify the algorithm restarts if an mtcAlive is received
          at any time during the polling (challenge) window.
    PASS: Verify challenge handling leads to a longer but
          successful offline declaration.
    PASS: Verify the above handling for both the unlock and
          spontaneous failure cases.

    Closes-Bug: 2024249
    Change-Id: Ice41ed611b4ba71d9cf8edbfe98da4b65dcd05cf
    Signed-off-by: Eric MacDonald <email address hidden>
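
For readers unfamiliar with the mechanism, the offline window and the challenge (restart) behaviour described in the commit message can be sketched roughly as below. This is an illustrative model only, not the mtce implementation: the function and constant names are hypothetical, and the only values taken from the commit message are the 100 ms polling interval and the thresholds 42 and 90.

```python
from typing import Iterable, Optional

# Polling interval stated in the commit message; the name is hypothetical.
POLL_INTERVAL_MS = 100


def offline_window_secs(threshold: int, interval_ms: int = POLL_INTERVAL_MS) -> float:
    """Length of an unchallenged offline window, in seconds."""
    return threshold * interval_ms / 1000.0


def polls_until_offline(alive_seen: Iterable[bool], threshold: int) -> Optional[int]:
    """Return the 1-based poll at which the host would be declared offline.

    alive_seen yields one boolean per poll: True if an mtcAlive message
    arrived during that interval. Any mtcAlive restarts the count (the
    "challenge"), so offline is declared only after `threshold`
    consecutive silent polls. Returns None if that never happens.
    """
    missed = 0
    for poll, alive in enumerate(alive_seen, start=1):
        if alive:
            missed = 0  # challenged: restart the offline count
        else:
            missed += 1
            if missed >= threshold:
                return poll
    return None


if __name__ == "__main__":
    print(offline_window_secs(42))  # old threshold
    print(offline_window_secs(90))  # new threshold
    # A single mtcAlive on poll 50 delays, but does not prevent, the declaration.
    challenged = [False] * 49 + [True] + [False] * 90
    print(polls_until_offline(challenged, 90))
```

Under this model the old threshold gives a 4.2 second window (the "~5 sec" above) and the new one gives 9 seconds (the "~10 secs"), which covers the 7 to 8 second shutdowns that were observed.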