Comment 2 for bug 2016168

OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/880343
Committed: https://opendev.org/starlingx/ha/commit/a85ffc695ed7a1f42f39ecfe0f76e54db958389a
Submitter: "Zuul (22348)"
Branch: master

commit a85ffc695ed7a1f42f39ecfe0f76e54db958389a
Author: Bin Qian <email address hidden>
Date: Thu Apr 13 16:44:05 2023 +0000

    Shorten rabbit failure recovery delay

    In rare cases, when the system is running slowly with significant
    scheduling delay, the rabbit disable action times out repeatedly. As a
    last resort, sm reboots the impacted controller for recovery once the
    failure count reaches MAX_TRANSITION_FAILURES. Because the rabbit
    service disable timeout is set to 60 seconds, this results in a
    significant delay before the recovery reboot.

    This change lowers MAX_TRANSITION_FAILURES for the rabbit service from
    16 to 5 to shorten the recovery delay after a rabbit failure.

    TCs passed:
        Install a DX system
        Observed service group recovery escalated to reboot after 5 forced
        rabbit disable failures.

    Closes-bug: 2016168
    Signed-off-by: Bin Qian <email address hidden>
    Change-Id: I660a64f0e78b6564456eb26245b672d2549f9a3b
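
As a rough illustration of the delay the commit message describes, the sketch below estimates the time spent before the reboot escalation for the old and new values of MAX_TRANSITION_FAILURES. It assumes every failed disable attempt consumes the full 60-second timeout and that retries follow one another with no extra wait; neither assumption is confirmed by the commit, and the function name is hypothetical, so treat the figures as an approximation rather than measured behaviour.

```python
# Rough estimate only. Assumption (not stated in the commit): each failed
# disable attempt takes the full timeout and retries run back to back.
DISABLE_TIMEOUT_SECONDS = 60  # rabbit disable timeout quoted in the commit message


def estimated_delay_seconds(max_transition_failures: int,
                            timeout_seconds: int = DISABLE_TIMEOUT_SECONDS) -> int:
    """Approximate time spent in failed disable attempts before reboot."""
    return max_transition_failures * timeout_seconds


old_delay = estimated_delay_seconds(16)  # value before this change
new_delay = estimated_delay_seconds(5)   # value after this change

print(f"before: {old_delay} s (~{old_delay // 60} min)")
print(f"after:  {new_delay} s (~{new_delay // 60} min)")
```

Under these assumptions the estimate drops from roughly 16 minutes to roughly 5 minutes before sm escalates to a reboot.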