Failure of rabbit results in long delay of recovery

Bug #2016168 reported by Bin Qian
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bin Qian

Bug Description

Brief Description
-----------------
When rabbit fails to disable, it takes a long time before SM initiates a reboot to recover the system.

Severity
--------
Major: the system could be out of service for a long time.

Steps to Reproduce
------------------
The root cause is that the system is running slowly with significant scheduling delay. Reproducing the issue requires reproducing that condition, in which case the rabbit disable action times out repeatedly.

Expected Behavior
------------------
As a final resort, SM should reboot the impacted controller to recover the system within a reasonably short period of time.

Actual Behavior
----------------
It takes a very long time (40+ minutes) before SM reboots the impacted controller.

Reproducibility
---------------
This issue is always reproducible when the system runs very slowly with significant scheduling delay.

System Configuration
--------------------
DX

Tags: stx.9.0 stx.ha
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/880343

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.ha
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
importance: Undecided → Medium
tags: added: stx.9.0
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/880343
Committed: https://opendev.org/starlingx/ha/commit/a85ffc695ed7a1f42f39ecfe0f76e54db958389a
Submitter: "Zuul (22348)"
Branch: master

commit a85ffc695ed7a1f42f39ecfe0f76e54db958389a
Author: Bin Qian <email address hidden>
Date: Thu Apr 13 16:44:05 2023 +0000

    Shorten rabbit failure recovery delay

    In rare cases, when system running slowly with significant scheduling
    delay, rabbit disable action timeout continually. As final resort sm
    reboots the impacted controller for recovery after failure count reaches
    MAX_TRANSITION_FAILURES. As rabbit service disable timeout is set to 60
    seconds, this result a significant delay before reboot for recovery.

    This change updates MAX_TRANSITION_FAILURES of rabbit service from
    16 to 5 to reduce the delay of recovery of rabbit failure.

    TCs passed:
        Install a DX system
        Observed service group recovery escalated to reboot after 5 forced
        rabbit disable failure.

    Closes-bug: 2016168
    Signed-off-by: Bin Qian <email address hidden>
    Change-Id: I660a64f0e78b6564456eb26245b672d2549f9a3b
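
As a rough illustration of why lowering the threshold shortens recovery, the sketch below computes only the time spent waiting on disable timeouts before escalation, using the 60-second timeout and the old and new MAX_TRANSITION_FAILURES values from the commit message. The function name and the model are assumptions for illustration, not SM code; any retry intervals or other per-attempt overhead are ignored, so the real delay (40+ minutes was reported) can be considerably longer than this lower bound.

```python
# Hypothetical model, not taken from the StarlingX SM source: assume each
# failed disable attempt costs at least one full timeout, and that SM
# escalates to a reboot once the failure count reaches the threshold.

DISABLE_TIMEOUT_S = 60  # rabbit disable timeout stated in the commit message


def min_escalation_delay_s(max_transition_failures: int,
                           timeout_s: int = DISABLE_TIMEOUT_S) -> int:
    """Lower bound on seconds spent in timeouts before escalation."""
    return max_transition_failures * timeout_s


before_s = min_escalation_delay_s(16)  # 16 * 60 = 960 seconds (16 minutes)
after_s = min_escalation_delay_s(5)    # 5 * 60 = 300 seconds (5 minutes)

print(f"old threshold: at least {before_s / 60:.0f} min in timeouts")
print(f"new threshold: at least {after_s / 60:.0f} min in timeouts")
```

Under this simplified model, the change cuts the timeout-only portion of the delay from 16 minutes to 5 minutes.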

Changed in starlingx:
status: In Progress → Fix Released