Standby controller does not become active when sm fails on the active controller
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | Medium | Eliud Kyale | |
Bug Description
Brief Description
-----------------
This issue was found while testing a persistent failure of a pmond-managed critical service on the active controller, to verify that the standby controller automatically becomes the new active controller without any issues.
sm was crashed on controller-0 (active) with:
sudo mv /etc/init.d/sm /etc/init.d/sm.orig
cat /var/run/sm.pid | sudo xargs kill
During the test, the standby controller did not become the active controller automatically. Controller-1 only became the active controller after /etc/init.d/sm was manually restored on controller-0.
Severity
--------
Critical: System/Feature is not usable after the defect
Steps to Reproduce
------------------
<Repeated multiple times>
sudo mv /etc/init.d/sm /etc/init.d/sm.orig
cat /var/run/sm.pid | sudo xargs kill
Expected Behavior
------------------
Standby controller should become active
Actual Behavior
----------------
The floating IP is lost and the system is unusable.
Reproducibility
---------------
Intermittent
System Configuration
-------
DX+worker-0
Branch/Pull Time/Commit
-------
https:/
Last Pass
---------
N/A
Test Activity
-------------
Other
Workaround
----------
N/A
Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Eliud Kyale (ekyale)
tags: added: stx.9.0 stx.ha
Changed in starlingx:
importance: Undecided → Medium
Reviewed: https://review.opendev.org/c/starlingx/ha/+/878274
Committed: https://opendev.org/starlingx/ha/commit/9e2ff8241103c1679ef7f6e5098bc95275db39fe
Submitter: "Zuul (22348)"
Branch: master
commit 9e2ff8241103c1679ef7f6e5098bc95275db39fe
Author: Kyale, Eliud <email address hidden>
Date: Wed Mar 22 10:06:37 2023 -0400
Add failover state of peer to heartbeat msg
- add failover state to heartbeat message (4 bits)
- add logic to survived_state to use peer's
  failover state to determine whether to exit survived state
  and enter normal state
- throttle "peer is normal" events with a threshold of 10,
  used to ensure the peer is normal and stable
- change fsm->send_event() log to debug from info log level
- a few logging improvements; debug send_event logs
- update copyright year 2023
Test plan:
PASS - AIO-DX: iso install
PASS - AIO-DX: crash the sm as indicated in bug
       and observe swact to standby
PASS - AIO-DX: manual swact
PASS - AIO-DX: power off active controller
PASS - AIO-SX: install and basic sanity check
PASS - AIO-SX: upgrade test to verify sm heartbeat
       messages changes still function when
       controllers are running different loads
Closes-Bug: 2012519
Signed-off-by: Kyale, Eliud <email address hidden>
Change-Id: I1f86dcb8c9d9dbaf436b9240867f61adc405e88c
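
The commit message above describes two mechanisms: a 4-bit failover state carried in the heartbeat message, and a threshold of 10 "peer is normal" events before the survived state is exited. The Python sketch below only illustrates that idea and is not the sm implementation. The state names and values, the position of the 4 bits within a flags byte, and the choice to count consecutive events (resetting on any non-normal report) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class FailoverState(IntEnum):
    # Hypothetical names and values; the real sm enum may differ.
    INITIAL = 0
    NORMAL = 1
    FAIL_PENDING = 2
    FAILED = 3
    SURVIVED = 4


FAILOVER_STATE_MASK = 0x0F   # 4 bits, as stated in the commit message
PEER_NORMAL_THRESHOLD = 10   # threshold stated in the commit message


def pack_flags(other_flags: int, state: FailoverState) -> int:
    """Place the failover state in the low 4 bits of one flags byte.

    The bit position is an assumption; other_flags occupies the high 4 bits.
    """
    return ((other_flags << 4) & 0xF0) | (int(state) & FAILOVER_STATE_MASK)


def unpack_state(flags: int) -> FailoverState:
    """Extract the peer's failover state from the flags byte.

    Raises ValueError if the 4-bit value is not a known state.
    """
    return FailoverState(flags & FAILOVER_STATE_MASK)


@dataclass
class SurvivedState:
    """Tracks how many times in a row the peer has reported NORMAL."""
    normal_count: int = 0

    def on_heartbeat(self, flags: int) -> bool:
        """Return True once the peer looks normal and stable.

        Assumption: only consecutive NORMAL reports count, so any other
        state resets the counter.
        """
        if unpack_state(flags) == FailoverState.NORMAL:
            self.normal_count += 1
        else:
            self.normal_count = 0
        return self.normal_count >= PEER_NORMAL_THRESHOLD


# Usage: feed ten NORMAL heartbeats and record the decision after each one.
survived = SurvivedState()
normal_flags = pack_flags(0, FailoverState.NORMAL)
decisions = [survived.on_heartbeat(normal_flags) for _ in range(10)]
```

Counting events before leaving the survived state means a single heartbeat from a peer that is still flapping does not trigger the transition back to normal, which matches the stated goal of ensuring the peer is "normal and stable".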