Maintenance does not give an unhealthy Service Management (SM) process enough time to attempt self-recovery before failing and rebooting the controller it runs on.
Severity: Major
Impact: Maintenance fails the standby controller while SM is still trying to self-recover.
Steps to Reproduce
------------------
Create a situation that results in an SM heartbeat loss of more than 2 seconds between controllers.
Expected Behavior
------------------
SM has enough time to attempt self-recovery before maintenance fails the node, as sketched below.
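A minimal sketch of the kind of grace logic expected, assuming a hypothetical hold-off slightly longer than SM's ~5 second recovery audit (this illustrates the timing intent only, not the proposed fix):

    // Hypothetical debounce, not the actual fix: maintenance holds off
    // failing a node that SM reports unhealthy until SM's ~5 second
    // recovery audit has had a chance to clear the condition.
    #include <ctime>

    constexpr time_t SM_RECOVERY_GRACE_SECS = 6;  // assumed: > 5 sec audit

    bool should_fail_node(time_t unhealthy_since, time_t now)
    {
        // Fail only if the unhealthy report persists past the grace window,
        // i.e. SM's self-recovery did not clear it in time.
        return (now - unhealthy_since) > SM_RECOVERY_GRACE_SECS;
    }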
Actual Behavior
----------------
Some SM heartbeat failure modes allow the standby controller's unhealthy notification to reach maintenance on the active controller, which reacts by failing the node immediately.
Reproducibility
---------------
100% of the time with certain failure modes, i.e. when the links don't drop but the heartbeat still fails for more than 2 seconds.
System Configuration
--------------------
Duplex system
Branch/Pull Time/Commit
-----------------------
Aug 24, 2020
Last Pass
---------
Test Escape
Timestamp/Logs
--------------
# Standby controller experiences heartbeat loss to active controller
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.803] log<318> INFO: sm[88213]: sm_cluster_hbs_info_msg.cpp(110): Cluster Info: 0 secs ago, from c0: 3 enabled, 3 reachable, from c1: 3 enabled, 3 reachable
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.803] log<319> INFO: sm[88213]: sm_cluster_hbs_info_msg.cpp(110): Cluster Info: 0 secs ago, from c0: 3 enabled, 3 reachable, from c1: 3 enabled, 3 reachable
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.995] log<320> INFO: sm[88213]: sm_failover_fsm.cpp(107): send_event fail-pending-timeout
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.996] log<321> INFO: sm[88213]: sm_failover_ss.c(319): get survivor controller-1 standby, controller-0 active
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.996] log<322> INFO: sm[88213]: sm_failover_ss.c(352): Loss of heartbeat ALL
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.996] log<323> INFO: sm[88213]: sm_failover_ss.c(478): host reaches 3 nodes, peer reaches 3 nodes, peer will be survivor
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.996] log<324> INFO: sm[88213]: sm_failover_ss.c(520): Host from standby to failed, Peer from active to active.
# This log shows SM touching the /var/run/.sm_unhealthy failed flag file.
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.001] log<327> INFO: sm[88213]: sm_failover.c(588): To disable controller-1
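# For context, raising this flag file is just a touch; a minimal sketch, assuming only that the file's existence is what signals the unhealthy state (the real logic is in sm_failover.c):

    // Illustrative only: raise the node-unhealthy flag file. The path comes
    // from the comment above; the function name is an assumption.
    #include <fcntl.h>
    #include <unistd.h>

    static const char sm_unhealthy_flag[] = "/var/run/.sm_unhealthy";

    static void sm_raise_unhealthy_flag(void)
    {
        // An empty file is assumed sufficient; only its existence matters.
        int fd = open(sm_unhealthy_flag, O_CREAT | O_WRONLY, 0644);
        if (fd >= 0)
            close(fd);
    }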
# Standby controller enters failure mode and enables recovery audit
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.002] log<332> ERROR: sm[88213]: sm_node_api.cpp(749): Node controller-1 is entering to failure mode.
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.003] log<344> INFO: sm[88213]: sm_failover_fsm.cpp(68): Exiting fail-pending state
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.003] log<345> INFO: sm[88213]: sm_failover_fsm.cpp(62): Entering failed state
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.003] log<346> ERROR: sm[88213]: sm_failover_failed_state.cpp(60): ********************************************************
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.003] log<347> ERROR: sm[88213]: sm_failover_failed_state.cpp(61): Entering Failover Failed state ; recovery audit started
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.003] log<348> ERROR: sm[88213]: sm_failover_failed_state.cpp(62): ********************************************************
# However, mtcAgent had already learned that controller-1's SM declared itself unhealthy, so it failed and rebooted the node, as in the customer case.
2020-08-19T17:20:18.912 [18613.00559] controller-0 mtcAgent hbs nodeClass.cpp (4061) set_mtce_flags :Error : controller-1 reported unhealthy by SM (Mgmnt)
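# A hedged sketch of the maintenance-side reaction in set_mtce_flags; the flag bit and fail helper below are hypothetical names, not the mtcAgent API:

    // Illustrative only: maintenance reacts to an SM-unhealthy report by
    // failing the node immediately. FLAG_SM_UNHEALTHY and fail_node() are
    // assumed names.
    struct NodeInfo { const char* name; };

    constexpr unsigned int FLAG_SM_UNHEALTHY = 0x1;   // hypothetical bit

    static void fail_node(NodeInfo&)  // stand-in for the real fail/reboot path
    {
    }

    void set_mtce_flags_sketch(NodeInfo& node, unsigned int flags)
    {
        if (flags & FLAG_SM_UNHEALTHY)
        {
            // The bug: no grace period here for SM's recovery audit, which
            // would clear the condition about 5 seconds later.
            fail_node(node);
        }
    }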
# Recovery audit fires 5 seconds later ; on its first audit interval it clears the unhealthy flag file and restarts itself
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.010] log<423> INFO: sm[88213]: sm_failover_fsm.cpp(107): send_event failed-recovery-audit
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.010] log<424> INFO: sm[88213]: sm_failover_failed_state.cpp(162): Oam:missing-heartbeat ; Mgmt:missing-heartbeat ; Cluster:missing-heartbeat ; recovery criteria met: Yes
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.010] log<425> INFO: sm[88213]: sm_failover_failed_state.cpp(178): ************************************
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.010] log<426> INFO: sm[88213]: sm_failover_failed_state.cpp(179): ** Failover Failed state recovery **
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.010] log<427> INFO: sm[88213]: sm_failover_failed_state.cpp(180): ************************************
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.010] log<428> INFO: sm[88213]: sm_failover_failed_state.cpp(83): Issuing controlled process restart ; pmon-restart hbsAgent
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.011] log<429> INFO: sm[88213]: sm_failover_failed_state.cpp(83): Issuing controlled process restart ; pmon-restart sm
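# sm shuts down in response to the pmon-restart ; its sm_trap thread exits with it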
2020-08-19T17:20:27.000 controller-1 sm: debug time[2277.948] log<3> INFO: sm_trap[88224]: sm_trap_thread.c(509): Trap connection has been broken, exiting.
2020-08-19T17:20:27.000 controller-1 sm: debug time[2277.948] log<4> INFO: sm_trap[88224]: sm_trap_thread.c(672): Shutting down.
2020-08-19T17:20:27.000 controller-1 sm: debug time[2277.949] log<5> INFO: sm_trap[88224]: sm_trap_thread.c(681): Shutdown complete.
But it's too late; mtcAgent has already failed controller-1.
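For reference, a hedged sketch of the recovery sequence visible in the logs above: clear the unhealthy flag file and issue controlled pmon restarts of hbsAgent and sm (identifiers are illustrative, not the sm_failover_failed_state.cpp implementation):

    // Illustrative sketch of the failed-state recovery audit behavior seen
    // in the logs above; the pmon interaction is simplified to a stand-in.
    #include <cstdio>
    #include <unistd.h>

    static const char sm_unhealthy_flag[] = "/var/run/.sm_unhealthy";

    // Stand-in for requesting a controlled process restart from pmon.
    static void request_pmon_restart(const char* process)
    {
        std::printf("Issuing controlled process restart ; pmon-restart %s\n",
                    process);
    }

    void failed_state_recovery(void)
    {
        unlink(sm_unhealthy_flag);          // clear the unhealthy flag file
        request_pmon_restart("hbsAgent");   // then restart the heartbeat agent
        request_pmon_restart("sm");         // and SM itself
    }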
Test Activity
-------------
Other
Workaround
----------
None
Fix proposed to branch: master
Review: https://review.opendev.org/747805