Standby controller does not become active after an sm failure on the active controller

Bug #2012519 reported by Eliud Kyale
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Eliud Kyale

Bug Description

Brief Description
-----------------

This issue was found while testing a persistent failure of a pmond-managed critical service on the active controller, to verify that the standby controller automatically becomes the new active controller.

sm was crashed on controller-0 (active) as follows:

sudo mv /etc/init.d/sm /etc/init.d/sm.orig
cat /var/run/sm.pid | sudo xargs kill

During the test, the standby controller did not become the active controller automatically. Controller-1 became the active controller only after /etc/init.d/sm was manually restored on controller-0.

Severity
--------
Critical: System/Feature is not usable after the defect

Steps to Reproduce
------------------
Run the following on the active controller (repeated multiple times):

sudo mv /etc/init.d/sm /etc/init.d/sm.orig
cat /var/run/sm.pid | sudo xargs kill

Expected Behavior
------------------
The standby controller should become active automatically.

Actual Behavior
----------------
The floating IP is lost and the system is unusable.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
DX+worker-0

Branch/Pull Time/Commit
-----------------------
https://opendev.org/starlingx/ha/commit/b65eb7b2f6b350003225f37bbcd60708801c6f59

Last Pass
---------
N/A

Test Activity
-------------
Other

Workaround
----------
N/A

Tags: stx.9.0 stx.ha
Changed in starlingx:
status: New → In Progress
Eliud Kyale (ekyale)
Changed in starlingx:
assignee: nobody → Eliud Kyale (ekyale)
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/878274
Committed: https://opendev.org/starlingx/ha/commit/9e2ff8241103c1679ef7f6e5098bc95275db39fe
Submitter: "Zuul (22348)"
Branch: master

commit 9e2ff8241103c1679ef7f6e5098bc95275db39fe
Author: Kyale, Eliud <email address hidden>
Date: Wed Mar 22 10:06:37 2023 -0400

    Add failover state of peer to heartbeat msg

    - add failover state to heartbeat message ( 4 bits )
    - add logic to survived_state to use peer's
      failover state to determine whether to exit survived state
      and enter normal state
    - throttle peer is normal events with a threshold of 10
      used to ensure the peer is normal and stable
    - change fsm->send_event() log to debug from info log level
    - a few logging improvements; debug send_event logs
    - update copyright year 2023

    Test plan:
    PASS - AIO-DX: iso install
    PASS - AIO-DX: crash the sm as indicated in bug
                   and observe swact to standby
    PASS - AIO-DX: manual swact
    PASS - AIO-DX: power off active controller
    PASS - AIO-SX: install and basic sanity check
    PASS - AIO-SX: upgrade test to verify sm heartbeat
                   messages changes still function when
                   controllers are running different loads

    Closes-Bug: 2012519

    Signed-off-by: Kyale, Eliud <email address hidden>
    Change-Id: I1f86dcb8c9d9dbaf436b9240867f61adc405e88c
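
The mechanism described in the commit message above can be illustrated with a small sketch. This is not the sm source code: the state names and values, the placement of the 4-bit field in the low nibble of a flags byte, the helper names, and the choice to reset the counter when the peer reports a non-normal state are all assumptions made for illustration. Only the 4-bit width and the threshold of 10 come from the commit message.

```python
from dataclasses import dataclass
from enum import IntEnum


class FailoverState(IntEnum):
    # Hypothetical state values; the real sm enumeration may differ.
    INITIAL = 0
    NORMAL = 1
    FAIL_PENDING = 2
    FAILED = 3
    SURVIVED = 4


FAILOVER_STATE_MASK = 0x0F   # 4 bits, as stated in the commit message
PEER_NORMAL_THRESHOLD = 10   # threshold stated in the commit message


def pack_flags(other_flags: int, state: FailoverState) -> int:
    """Put the failover state in the low nibble of one byte (assumed layout)."""
    return ((other_flags << 4) & 0xF0) | (int(state) & FAILOVER_STATE_MASK)


def unpack_state(flags: int) -> int:
    """Return the raw 4-bit failover state carried in a heartbeat byte."""
    return flags & FAILOVER_STATE_MASK


@dataclass
class SurvivedState:
    """Tracks how many 'peer is normal' heartbeats have been seen."""
    peer_normal_count: int = 0

    def on_heartbeat(self, flags: int) -> bool:
        """Return True once the peer has looked normal often enough
        to leave the survived state and enter the normal state."""
        if unpack_state(flags) == FailoverState.NORMAL:
            self.peer_normal_count += 1
        else:
            # Assumption: any non-normal report restarts the count.
            self.peer_normal_count = 0
        return self.peer_normal_count >= PEER_NORMAL_THRESHOLD


if __name__ == "__main__":
    survived = SurvivedState()
    heartbeat = pack_flags(0, FailoverState.NORMAL)
    for n in range(1, PEER_NORMAL_THRESHOLD + 1):
        if survived.on_heartbeat(heartbeat):
            print(f"exit survived state after {n} normal heartbeats")
```

Counting consecutive reports before acting is one simple way to avoid leaving the survived state on a single, possibly transient, "normal" heartbeat from the peer.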

Changed in starlingx:
status: In Progress → Fix Released
tags: added: stx.9.0 stx.ha
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium