This looks like an expected scenario under the blind-guess survivor selection protocol. Despite some noise from a few service failures (presumably related to the pulled cables), SM on the active controller (controller-0) issued a reboot request to mtcAgent at 2019-09-19T17:27:19.286, and controller-1 rebooted as expected. At 2019-09-19T17:34:13, after controller-1 booted back up, mtcAgent decided to perform graceful recovery, so it sent a second reboot request to the mtcClient on controller-1 and an 'unlocked-disabled' request to SM. After the second reboot, controller-1 went standby successfully at 2019-09-19T17:39:42. Significant logs are listed below. @Eric MacDonald, please confirm that the second reboot is expected as part of graceful recovery.

SM event log:

| 2019-09-19T17:26:08.389 | 333 | interface-scn | oam-interface | enabled | disabled | eno1 disabled |
| 2019-09-19T17:26:10.424 | 334 | interface-scn | cluster-host-interface | enabled | disabled | enp24s0f0.187 disabled |
| 2019-09-19T17:26:10.425 | 335 | interface-scn | management-interface | enabled | disabled | enp24s0f0.186 disabled |
| 2019-09-19T17:27:09.749 | 336 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed |
| 2019-09-19T17:27:19.310 | 377 | service-scn | etcd | enabled-active | disabled | process (pid=943968) failed |
| 2019-09-19T17:27:19.393 | 378 | service-scn | registry-token-server | enabled-active | disabled | process (pid=943102) failed |
| 2019-09-19T17:27:19.393 | 379 | service-scn | docker-distribution | enabled-active | disabled | process (pid=944125) failed |
| 2019-09-19T17:27:20.557 | 391 | service-scn | docker-distribution | enabling | disabling | enable failed |
| 2019-09-19T17:27:20.956 | 394 | service-group-scn | controller-services | active | active-failed | docker-distribution(disabled, failed) |
SM failover FSM logs (controller-0):

2019-09-19T14:58:34.000 controller-0 sm: debug time[47104.217] log<3118> INFO: sm[107746]: sm_failover_ss.c(520): Host from active to active, Peer from standby to standby.
2019-09-19T14:58:34.000 controller-0 sm: debug time[47104.217] log<3119> INFO: sm[107746]: sm_failover_fsm.cpp(68): Exiting survived state
2019-09-19T14:58:34.000 controller-0 sm: debug time[47104.217] log<3120> INFO: sm[107746]: sm_failover_fsm.cpp(62): Entering normal state
2019-09-19T14:58:34.000 controller-0 sm: debug time[47104.249] log<3121> INFO: sm[107746]: sm_failover_fsm.cpp(107): send_event interface-state-changed
2019-09-19T17:26:09.000 controller-0 sm: debug time[55957.679] log<3709> INFO: sm[107746]: sm_failover_fsm.cpp(107): send_event interface-state-changed
2019-09-19T17:26:09.000 controller-0 sm: debug time[55957.679] log<3710> INFO: sm[107746]: sm_failover_normal_state.cpp(66): Blind guess pre failure host state standby
2019-09-19T17:26:09.000 controller-0 sm: debug time[55957.679] log<3712> INFO: sm[107746]: sm_failover_fsm.cpp(68): Exiting normal state
2019-09-19T17:26:09.000 controller-0 sm: debug time[55957.679] log<3713> INFO: sm[107746]: sm_failover_fsm.cpp(62): Entering fail-pending state
2019-09-19T17:26:10.000 controller-0 sm: debug time[55959.257] log<3721> INFO: sm[107746]: sm_failover_fsm.cpp(107): send_event interface-state-changed
2019-09-19T17:26:12.000 controller-0 sm: debug time[55961.275] log<3725> INFO: sm[107746]: sm_failover_fsm.cpp(107): send_event fail-pending-timeout
2019-09-19T17:26:12.000 controller-0 sm: debug time[55961.275] log<3726> INFO: sm[107746]: sm_failover_fail_pending_state.cpp(165): Pre failure host state standby, peer state active
2019-09-19T17:26:12.000 controller-0 sm: debug time[55961.275] log<3727> INFO: sm[107746]: sm_failover_fail_pending_state.cpp(144): set stayfailed flag file
2019-09-19T17:26:12.000 controller-0 sm: debug time[55961.276] log<3752> INFO: sm[107746]: sm_failover_fail_pending_state.cpp(156): wait for 45 secs
2019-09-19T17:26:57.000 controller-0 sm: debug time[56006.276] log<4035> INFO: sm[107746]: sm_failover_fail_pending_state.cpp(115): stayfailed flag file is removed
2019-09-19T17:26:57.000 controller-0 sm: debug time[56006.276] log<4041> INFO: sm[107746]: sm_failover_fsm.cpp(68): Exiting fail-pending state
2019-09-19T17:26:57.000 controller-0 sm: debug time[56006.276] log<4042> INFO: sm[107746]: sm_failover_fsm.cpp(62): Entering survived state
2019-09-19T17:26:57.000 controller-0 sm: debug time[56006.276] log<9> INFO: worker[107757]: sm_failover_fail_pending_state.cpp(34): To send mtce api to disable peer
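For context, the failover FSM transitions above (normal -> fail-pending -> survived) can be summarized with a minimal sketch. The state and event names come straight from the sm_failover_fsm.cpp logs; the class shape and transition conditions here are my assumptions, not the actual implementation:

    #include <iostream>
    #include <string>

    enum class FailoverState { Normal, FailPending, Survived };

    static const char* state_name(FailoverState s) {
        switch (s) {
            case FailoverState::Normal:      return "normal";
            case FailoverState::FailPending: return "fail-pending";
            case FailoverState::Survived:    return "survived";
        }
        return "?";
    }

    struct FailoverFsm {
        FailoverState state = FailoverState::Normal;

        void set_state(FailoverState next) {
            std::cout << "Exiting " << state_name(state) << " state\n";
            state = next;
            std::cout << "Entering " << state_name(state) << " state\n";
        }

        void send_event(const std::string& event) {
            std::cout << "send_event " << event << "\n";
            if (state == FailoverState::Normal && event == "interface-state-changed") {
                // an interface went down: suspect a failure, enter fail-pending
                set_state(FailoverState::FailPending);
            } else if (state == FailoverState::FailPending && event == "fail-pending-timeout") {
                // per the logs, the real code sets a "stayfailed" flag file here,
                // waits 45 secs, and survives once the flag file has been removed
                set_state(FailoverState::Survived);
            } else if (state == FailoverState::Survived && event == "interface-state-changed") {
                // interfaces recovered: back to normal
                set_state(FailoverState::Normal);
            }
        }
    };

    int main() {
        FailoverFsm fsm;
        fsm.send_event("interface-state-changed"); // normal -> fail-pending
        fsm.send_event("fail-pending-timeout");    // fail-pending -> survived
        fsm.send_event("interface-state-changed"); // survived -> normal
    }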
mtcClient logs (controller-1):

2019-09-19T17:22:07.828 [89692.00087] controller-1 mtcClient msg mtcCompMsg.cpp ( 290) mtc_service_command : Info : start controller host services request posted (Mgmnt)
2019-09-19T17:22:07.829 [89692.00088] controller-1 mtcClient msg mtcCompMsg.cpp ( 410) mtc_service_command : Info : start controller host services reply send (Mgmnt)
2019-09-19T17:22:07.829 [89692.00089] controller-1 mtcClient --- mtcNodeComp.cpp (1408) _launch_all_scripts : Info : Sorted Host Services File List: 2
2019-09-19T17:22:07.829 [89692.00090] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/services.d/controller/ceph.sh start
2019-09-19T17:22:07.829 [89692.00091] controller-1 mtcClient --- mtcNodeComp.cpp (1415) _launch_all_scripts : Info : ... /etc/services.d/controller/mtcTest start
2019-09-19T17:22:07.837 [89692.00092] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/services.d/controller/mtcTest (0.007 secs)
2019-09-19T17:22:08.299 [89692.00093] controller-1 mtcClient --- mtcNodeComp.cpp (1787) daemon_sigchld_hdlr : Info : PASSED: /etc/services.d/controller/ceph.sh (0.470 secs)
2019-09-19T17:22:08.349 [89692.00094] controller-1 mtcClient --- mtcNodeComp.cpp ( 732) _manage_services_scripts: Info : Host Services Complete ; all passed
2019-09-19T17:27:19.286 [89692.00095] controller-1 mtcClient msg mtcCompMsg.cpp ( 257) mtc_service_command : Info : reboot command received (Mgmnt)
2019-09-19T17:27:19.286 [89692.00096] controller-1 mtcClient msg mtcCompMsg.cpp ( 410) mtc_service_command : Info : reboot reply send (Mgmnt)
2019-09-19T17:27:19.286 [89692.00097] controller-1 mtcClient msg mtcCompMsg.cpp ( 473) mtc_service_command : Info : Reboot (Mgmnt)
2019-09-19T17:27:19.286 [997024.00098] controller-1 mtcClient com nodeUtil.cpp (1029) fork_sysreq_reboot : Info : *** Failsafe Reset Thread ***
2019-09-19T17:27:19.286 [89692.00098] controller-1 mtcClient com nodeUtil.cpp (1083) fork_sysreq_reboot : Info : Forked Fail-Safe (Backup) Reboot Action
2019-09-19T17:27:20.285 [997024.00099] controller-1 mtcClient com nodeUtil.cpp (1058) fork_sysreq_reboot : Info : sysrq reset in 120 seconds
2019-09-19T17:27:25.286 [997024.00100] controller-1 mtcClient com nodeUtil.cpp (1058) fork_sysreq_reboot : Info : sysrq reset in 115 seconds
2019-09-19T17:27:30.286 [997024.00101] controller-1 mtcClient com nodeUtil.cpp (1058) fork_sysreq_reboot : Info : sysrq reset in 110 seconds
2019-09-19T17:34:13.758 [89613.00087] controller-1 mtcClient msg mtcCompMsg.cpp ( 257) mtc_service_command : Info : reboot command received (Mgmnt)
2019-09-19T17:34:13.758 [89613.00088] controller-1 mtcClient msg mtcCompMsg.cpp ( 410) mtc_service_command : Info : reboot reply send (Mgmnt)
2019-09-19T17:34:13.758 [89613.00089] controller-1 mtcClient msg mtcCompMsg.cpp ( 473) mtc_service_command : Info : Reboot (Mgmnt)
2019-09-19T17:34:13.758 [115512.00090] controller-1 mtcClient com nodeUtil.cpp (1029) fork_sysreq_reboot : Info : *** Failsafe Reset Thread ***
2019-09-19T17:34:13.759 [89613.00090] controller-1 mtcClient com nodeUtil.cpp (1083) fork_sysreq_reboot : Info : Forked Fail-Safe (Backup) Reboot Action
2019-09-19T17:34:14.759 [115512.00091] controller-1 mtcClient com nodeUtil.cpp (1058) fork_sysreq_reboot : Info : sysrq reset in 120 seconds
2019-09-19T17:34:19.763 [115512.00092] controller-1 mtcClient com nodeUtil.cpp (1058) fork_sysreq_reboot : Info : sysrq reset in 115 seconds
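The two '*** Failsafe Reset Thread ***' entries show the backstop pattern in fork_sysreq_reboot (nodeUtil.cpp): before the graceful reboot runs, mtcClient forks a watchdog process that counts down (120 seconds here) and then forces an immediate kernel reset through magic sysrq, so a hung graceful reboot cannot leave the node half-alive. A rough sketch of that pattern, assuming the standard Linux /proc sysrq interface; the real implementation certainly differs in detail:

    #include <unistd.h>
    #include <fcntl.h>
    #include <cstdio>

    // count down, then force an immediate kernel reset via magic sysrq
    static void sysrq_reset_after(int secs) {
        for (; secs > 0; secs -= 5) {
            printf("sysrq reset in %d seconds\n", secs); // matches the log cadence
            sleep(5);
        }
        int fd = open("/proc/sys/kernel/sysrq", O_WRONLY);
        if (fd >= 0) { write(fd, "1", 1); close(fd); }   // make sure sysrq is enabled
        fd = open("/proc/sysrq-trigger", O_WRONLY);
        if (fd >= 0) { write(fd, "b", 1); close(fd); }   // 'b' = reboot immediately
    }

    int main() {
        if (fork() == 0) {
            // child: the "Failsafe Reset Thread" backstop from the logs
            sysrq_reset_after(120);
            _exit(0);
        }
        // parent: "Forked Fail-Safe (Backup) Reboot Action"; proceed with the
        // graceful reboot -- if it completes first, the backstop never fires
        execl("/sbin/reboot", "reboot", (char*)nullptr);
        return 1; // reached only if exec fails
    }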
mtcAgent logs (controller-0):

2019-09-19T17:34:13.533 [3163376.00615] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 971) service_events : Info : controller-1 heartbeat loss (Clstr)
2019-09-19T17:34:13.533 [3163376.00616] controller-0 mtcAgent hbs nodeClass.cpp (4696) manage_heartbeat_failure:Error : controller-1 Clstr *** Heartbeat Loss ***
2019-09-19T17:34:13.533 [3163376.00617] controller-0 mtcAgent hbs nodeClass.cpp (4707) manage_heartbeat_failure:Error : controller-1 Clstr network heartbeat failure
2019-09-19T17:34:13.533 [3163376.00618] controller-0 mtcAgent inv mtcInvApi.cpp (1079) mtcInvApi_update_state : Info : controller-1 failed (seq:36)
2019-09-19T17:34:13.533 [3163376.00619] controller-0 mtcAgent hbs nodeClass.cpp (4716) manage_heartbeat_failure: Warn : controller-1 restarting graceful recovery
2019-09-19T17:34:13.533 [3163376.00620] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1637) recovery_handler : Info : controller-1 Graceful Recovery (uptime was 284)
2019-09-19T17:34:13.534 [3163376.00621] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-1 stop host service sent to controller-0 hbsAgent
2019-09-19T17:34:13.534 [3163376.00622] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-1 stop host service sent to controller-1 hbsAgent
2019-09-19T17:34:13.534 [3163376.00623] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1665) recovery_handler :Error : controller-1 Graceful Recovery Failed (retries=3)
2019-09-19T17:34:13.534 [3163376.00624] controller-0 mtcAgent |-| nodeClass.cpp (7325) force_full_enable : Info : controller-1 Forcing Full Enable Sequence
2019-09-19T17:34:13.534 [3163376.00625] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 971) service_events : Info : controller-1 heartbeat minor clear (Mgmnt)
2019-09-19T17:34:13.534 [3163376.00626] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp ( 560) enable_handler :Error : controller-1 Main Enable FSM (from failed)
2019-09-19T17:34:13.534 [3163376.00627] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-1 stop host service sent to controller-0 hbsAgent
2019-09-19T17:34:13.534 [3163376.00628] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-1 stop host service sent to controller-1 hbsAgent
2019-09-19T17:34:13.534 [3163376.00629] controller-0 mtcAgent hbs nodeClass.cpp (4588) hbs_minor_clear : Warn : controller-1 MNFA host tally (Clstr:0)
2019-09-19T17:34:13.534 [3163376.00630] controller-0 mtcAgent vim mtcVimApi.cpp ( 251) mtcVimApi_state_change :Error : controller-1 {"state-change": {"administrative":"unlocked","operational":"disabled","availability":"failed","subfunction_oper":"disabled","subfunction_avail":"failed"},"hostname":"controller-1","uuid":"8ab8e0f8-98f3-41e0-af40-41c90d7a7db5","subfunctions":"controller,worker,lowlatency","personality":"controller"}
2019-09-19T17:34:13.534 [3163376.00631] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 971) service_events : Info : controller-1 heartbeat degrade clear (Mgmnt)
2019-09-19T17:34:13.534 [3163376.00632] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 971) service_events : Info : controller-1 heartbeat minor clear (Clstr)
2019-09-19T17:34:13.534 [3163376.00633] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp ( 756) enable_handler : Info : controller-1 Main Enable FSM (from start)
2019-09-19T17:34:13.534 [3163376.00634] controller-0 mtcAgent --- mtcWorkQueue.cpp ( 750) workQueue_purge : Warn : controller-1 purging 5 items from workQueue
2019-09-19T17:34:13.534 [3163376.00635] controller-0 mtcAgent --- mtcWorkQueue.cpp ( 764) workQueue_purge : Warn : controller-1 mtcInvApi_update_state seq:36 ... did not complete (Wait)
2019-09-19T17:34:13.534 [3163376.00636] controller-0 mtcAgent --- mtcWorkQueue.cpp ( 758) workQueue_purge : Warn : controller-1 mtcInvApi_force_states seq:37 ... was not executed
2019-09-19T17:34:13.534 [3163376.00637] controller-0 mtcAgent --- mtcWorkQueue.cpp ( 758) workQueue_purge : Warn : controller-1 mtcInvApi_subf_states seq:38 ... was not executed
2019-09-19T17:34:13.534 [3163376.00638] controller-0 mtcAgent --- mtcWorkQueue.cpp ( 758) workQueue_purge : Warn : controller-1 mtcInvApi_force_states seq:39 ... was not executed
2019-09-19T17:34:13.534 [3163376.00639] controller-0 mtcAgent --- mtcWorkQueue.cpp ( 758) workQueue_purge : Warn : controller-1 mtcVimApi_state_change seq:40 ... was not executed
2019-09-19T17:34:13.534 [3163376.00640] controller-0 mtcAgent mgr mtcSmgrApi.cpp ( 209) mtcSmgrApi_request : Info : controller-1 sending 'unlocked-disabled' request to HA Service Manager
2019-09-19T17:34:13.757 [3163376.00641] controller-0 mtcAgent mgr mtcSmgrApi.cpp ( 300) mtcSmgrApi_request : Info : controller-1 is 'disabled' to Service Management
2019-09-19T17:34:13.757 [3163376.00642] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 971) service_events : Info : controller-1 heartbeat degrade clear (Clstr)
2019-09-19T17:34:13.757 [3163376.00643] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-1 stop host service sent to controller-0 hbsAgent
2019-09-19T17:34:13.758 [3163376.00644] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-1 stop host service sent to controller-1 hbsAgent
2019-09-19T17:34:13.758 [3163376.00645] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 971) service_events : Info : controller-1 heartbeat minor clear (Mgmnt)
2019-09-19T17:34:13.758 [3163376.00646] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp ( 950) enable_handler : Info : controller-1 reboot
2019-09-19T17:34:13.758 [3163376.00647] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp ( 91) calc_reset_prog_timeout : Info : controller-1 Reboot/Reset progression has 60 sec 'wait for offline' timeout
2019-09-19T17:34:13.758 [3163376.00648] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp ( 95) calc_reset_prog_timeout : Info : controller-1 ... sources - mgmnt:Yes clstr:Yes bmc:Yes

SM event showing controller-1 reaching standby after the second reboot:

| 2019-09-19T17:39:42.008 | 169 | service-group-scn | controller-services | standby-degraded | standby |
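Regarding the question above: the mtcAgent logs read like a bounded-retry escalation, where heartbeat loss first drives graceful recovery and, once the retry budget is exhausted, falls through to force_full_enable, whose full enable sequence includes the second reboot. A minimal sketch of that decision flow: the retry limit comes from the '(retries=3)' log and the function names mirror the log sources, but the control flow here is my assumption, pending Eric's confirmation:

    #include <iostream>
    #include <string>

    const int MAX_GRACEFUL_RECOVERY_RETRIES = 3; // "(retries=3)" in the logs

    struct Node {
        std::string name;
        int recovery_attempts = 0;
    };

    // hypothetical stand-in for nodeClass.cpp force_full_enable
    void force_full_enable(Node& n) {
        std::cout << n.name << " Forcing Full Enable Sequence\n";
        // per the logs, the real agent then purges the work queue, tells SM the
        // node is 'unlocked-disabled', sends the (second) reboot command, and
        // restarts the enable FSM from the beginning
    }

    // hypothetical stand-in for the heartbeat-loss handling path
    void manage_heartbeat_failure(Node& n) {
        n.recovery_attempts++;
        if (n.recovery_attempts <= MAX_GRACEFUL_RECOVERY_RETRIES) {
            std::cout << n.name << " restarting graceful recovery\n";
        } else {
            std::cout << n.name << " Graceful Recovery Failed (retries="
                      << MAX_GRACEFUL_RECOVERY_RETRIES << ")\n";
            force_full_enable(n);
        }
    }

    int main() {
        Node c1{"controller-1"};
        for (int i = 0; i < 4; ++i)
            manage_heartbeat_failure(c1); // the fourth loss exhausts the budget
    }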