AIO-DX+ worker: After cable pull test/split brain test, controller-1 remains in a failed state

Bug #1844717 reported by Anujeyan Manokeran
This bug affects 1 person

Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Eric MacDonald
Milestone: —

Bug Description

Brief Description
-----------------
A cable pull test (OAM + cluster + management) was performed on an AIO-DX + worker configuration. The cluster and management networks are provisioned on the same interface, enp24s0f0 (as VLANs over pxeboot0). The active controller, controller-1, rebooted and controller-0 became active. Controller-1 was then rebooted multiple times with the cluster network failure alarm raised. However, the cluster network on controller-1 was pingable from controller-0, so there was no communication issue, yet the alarm was never cleared and controller-1 remained in a failed state.
system host-if-list controller-1
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| uuid | name | class | type | vlan id | ports | uses i/f | used by i/f | attributes |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| 221e37b9-0cb7-41bc-8e96-dea5f9716ad9 | mgmt0 | platform | vlan | 186 | [] | [u'pxeboot0'] | [] | MTU=1500 |
| 24cd614e-bc22-46f1-802f-66a8be6d9b19 | cluster0 | platform | vlan | 187 | [] | [u'pxeboot0'] | [] | MTU=1500 |
| 29779448-8151-45b3-b229-9d78bfc35159 | pxeboot0 | platform | ethernet | None | [u'enp24s0f0'] | [] | [u'mgmt0', | MTU=9216 |
| | | | | | | | u'cluster0'] | |
| | | | | | | | | |
| 5ff56ca6-e675-492b-9704-4521afbad331 | data0 | data | ethernet | None | [u'enp175s0f0'] | [] | [] | MTU=1500,accelerated=True |
| 62749374-4856-4a94-8e78-af35834b6561 | oam0 | platform | ethernet | None | [u'eno1'] | [] | [] | MTU=1500 |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
  fm alarm-list
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-09-19T17: |
| | | | | 55:22.213177 |
| | | | | |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller-1.ntp= | minor | 2019-09-19T17: |
| | | 2607:4100:2:ff::2 | | 55:22.208122 |
| | | | | |
| 100.114 | NTP address 64:ff9b::9fcb:848 is not a valid or a reachable NTP server. | host=controller-1.ntp= | minor | 2019-09-19T17: |
| | | 64:ff9b::9fcb:848 | | 55:22.203294 |
| | | | | |
| 100.114 | NTP address 64:ff9b::9538:7910 is not a valid or a reachable NTP server. | host=controller-1.ntp= | minor | 2019-09-19T17: |
| | | 64:ff9b::9538:7910 | | 55:22.201814 |
| | | | | |
| 100.114 | NTP address 64:ff9b::d173:b566 is not a valid or a reachable NTP server. | host=controller-0.ntp= | minor | 2019-09-19T17: |
| | | 64:ff9b::d173:b566 | | 43:42.434742 |
| | | | | |
| 200.010 | compute-1 access to board management module has failed. | host=compute-1 | warning | 2019-09-19T17: |
| | | | | 28:37.345704 |
| | | | | |
| 200.010 | compute-0 access to board management module has failed. | host=compute-0 | warning | 2019-09-19T17: |
| | | | | 28:37.340704 |
| | | | | |
| 200.010 | compute-2 access to board management module has failed. | host=compute-2 | warning | 2019-09-19T17: |
| | | | | 28:37.285653 |
| | | | | |
| 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-09-19T17: |
| | | | | 28:37.180633 |
| | | | | |
| 200.010 | controller-1 access to board management module has failed. | host=controller-1 | warning | 2019-09-19T17: |
| | | | | 28:36.274734 |
| | | | | |
| 200.009 | controller-1 experienced a persistent critical 'Cluster-host Network' communication | host=controller-1. | critical | 2019-09-19T17: |
| | failure. | network=Cluster-host | | 28:33.043678 |
| | | | | |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but | service_domain= | major | 2019-09-19T17: |
| | only 1 active member available | controller. | | 26:37.078781 |
| | | service_group= | | |
| | | directory-services | | |
| | | | | |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 | service_domain= | major | 2019-09-19T17: |
| | active member available | controller. | | 26:37.070086 |
| | | service_group=web- | | |
| | | services | | |
| | | | | |
| 400.002 | Service group storage-services loss of redundancy; expected 2 active members but | service_domain= | major | 2019-09-19T17: |
| | only 1 active member available | controller. | | 26:37.064781 |
| | | service_group=storage- | | |
| | | services | | |
| | | | | |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller-0.ntp= | minor | 2019-09-19T02: |
| | | 2607:4100:2:ff::2 | | 03:43.611747 |
| | | | | |
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+

ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.107 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.092 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.077 ms
64 bytes from controller-1 (face::4): icmp_seq=6 ttl=64 time=0.066 ms
64 bytes from controller-1 (face::4): icmp_seq=7 ttl=64 time=0.128 ms

system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-2 | worker | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | disabled | failed |
| 4 | compute-0 | worker | unlocked | enabled | available |
| 5 | compute-1 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ ping controller-1
ping: controller-1: Name or service not known
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1-cluster-host
PING controller-1-cluster-host(controller-1-cluster-host (feed:beef::4)) 56 data bytes
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=1 ttl=64 time=0.099 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=2 ttl=64 time=0.164 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=3 ttl=64 time=0.165 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=4 ttl=64 time=0.178 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=5 ttl=64 time=0.073 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=6 ttl=64 time=0.098 ms

--- controller-1-cluster-host ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4999ms
rtt min/avg/max/mdev = 0.073/0.129/0.178/0.042 ms
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.100 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.138 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.128 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.097 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.104 ms

Severity
--------
Major

Steps to Reproduce
------------------
1. Verify the health of the system and check for any alarms.
2. Pull the OAM, cluster, and management interface cables together. Interface enp24s0f0 is provisioned with both the cluster and management networks.
3. Wait 30 seconds and reinsert the cables.
4. Verify the swact and alarms (see the sketch below). As described above, controller-1 went to a failed state and rebooted, the cluster network failure alarm was never cleared, and controller-1 remained failed indefinitely.
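A minimal sketch of the checks behind steps 1 and 4, using the same CLI commands captured in this report; the hostnames and interface names are specific to this lab, and the openrc path is an assumption about a typical StarlingX controller:

  # Step 1: baseline health check from a keystone_admin session on the active controller
  source /etc/platform/openrc   # assumed credentials file; the prompts above show (keystone_admin)
  fm alarm-list
  system host-list

  # Step 4: after reinserting the cables, confirm the cluster-host network is reachable
  # again and watch whether the 200.009 alarm clears and controller-1 leaves "failed"
  ping6 -c 5 controller-1-cluster-host
  fm alarm-list | grep 200.009
  system host-list
  system host-if-list controller-1
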
System Configuration
--------------------
AIO-DX + worker nodes with IPv6 configuration, lab wolfpass-08-12

Expected Behavior
------------------
The failure recovers and controller-1 returns to service after the cables are reinserted.

Actual Behavior
----------------
As described above, controller-1 never recovered.
Reproducibility
---------------
Only able to test once; the lab did not recover and a reinstall was required.
Load
----
Build was on 2019-09-18_20-00-00
Last Pass
---------
Timestamp/Logs
--------------
2019-09-19T17:28:33.
Test Activity
-------------
Regression test

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - standby controller does not recover after split brain/cable pull test

summary: - AIO-DX+ worker:After cable pull test cluster network failure alarm was
- never cleared when cluster network is back and ping able
+ AIO-DX+ worker: After cable pull test/split brain test, controller-1
+ remains in a failed state
tags: added: stx.3.0 stx.ha
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Bin Qian (bqian20)
Revision history for this message
Bin Qian (bqian20) wrote :
Download full text (14.7 KiB)

This looks more like an expected scenario, arising from the blind-guess survivor selection protocol.
Despite some noise from a few service failures (presumably related to the pulled cables), SM on the active controller (controller-0) issued a reboot request to mtcAgent at 2019-09-19T17:27:19.286, and controller-1 then rebooted as expected.
At 2019-09-19T17:34:13, after controller-1 booted up, mtcAgent decided to perform a graceful recovery, so it sent a second reboot request to mtcClient on controller-1 and an unlocked-disabled request to SM.
After the second reboot, controller-1 went standby successfully at 2019-09-19T17:39:42.
Significant logs are listed below.
@Eric MacDonald, please confirm the second reboot is expected as part of graceful recovery.

| 2019-09-19T17:26:08.389 | 333 | interface-scn | oam-interface | enabled | disabled | eno1 disabled
| 2019-09-19T17:26:10.424 | 334 | interface-scn | cluster-host-interface | enabled | disabled | enp24s0f0.187 disabled
| 2019-09-19T17:26:10.425 | 335 | interface-scn | management-interface | enabled | disabled | enp24s0f0.186 disabled
| 2019-09-19T17:27:09.749 | 336 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed

| 2019-09-19T17:27:19.310 | 377 | service-scn | etcd | enabled-active | disabled | process (pid=943968) failed
| 2019-09-19T17:27:19.393 | 378 | service-scn | registry-token-server | enabled-active | disabled | process (pid=943102) failed
| 2019-09-19T17:27:19.393 | 379 | service-scn | docker-distribution | enabled-active | disabled | process (pid=944125) failed
| 2019-09-19T17:27:20.557 | 391 | service-scn | docker-distribution | enabling | disabling | enable failed
| 2019-09-19T17:27:20.956 | 394 | service-group-scn | controller-services | active | active-failed | docker-distribution(disabled, failed)

2019-09-19T14:58:34.000 controller-0 sm: debug time[47104.217] log<3118> INFO: sm[107746]: sm_failover_ss.c(520): Host from active to active, Peer from standby to standby.
2019-09-19T14:58:34.000 controller-0 sm: debug time[47104.217] log<3119> INFO: sm[107746]: sm_failover_fsm.cpp(68): Exiting survived state
2019-09-19T14:58:34.000 controller-0 sm: debug time[47104.217] log<3120> INFO: sm[107746]: sm_failover_fsm.cpp(62): Entering normal state
2019-09-19T14:58:34.000 controller-0 sm: debug time[47104.249] log<3121> INFO: sm[107746]: sm_failover_fsm.cpp(107): send_event interface-state-changed
2019-09-19T17:26:09.000 controller-0 sm: debug time[55957.679] log<3709> INFO: sm[107746]: sm_failover_fsm.cpp(107): send_event interface-state...
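
A small sketch of how these SM transitions could be pulled from the logs; the file paths are assumptions about a standard StarlingX controller (or the matching files inside a collect bundle), not quoted from this report:

  # Pipe-delimited state-change records (interface-scn / service-scn / service-group-scn)
  grep -E "interface-scn|service-scn|service-group-scn" /var/log/sm-customer.log | grep "2019-09-19T17:2"

  # Failover FSM decisions around the cable pull and the reboot request to mtcAgent
  grep "sm_failover" /var/log/sm.log | grep "2019-09-19T17:2"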

Changed in starlingx:
assignee: Bin Qian (bqian20) → nobody
assignee: nobody → Bin Qian (bqian20)
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :
Download full text (8.9 KiB)

Looking at the logs, I can see that the second reboot occurred as a result of the cluster network heartbeat failure immediately following the recovery of controller-1 after the initial reboot.

Here is the log analysis:

Heartbeat Loss due to cable pull

2019-09-19T17:26:37.186 [3163376.00210] controller-0 mtcAgent hbs nodeClass.cpp (4696) manage_heartbeat_failure:Error : controller-1 Mgmnt *** Heartbeat Loss ***
2019-09-19T17:26:37.188 [3163376.00219] controller-0 mtcAgent hbs nodeClass.cpp (4696) manage_heartbeat_failure:Error : controller-1 Clstr *** Heartbeat Loss ***
2019-09-19T17:26:42.189 [3163376.00304] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1846) recovery_handler : Warn : controller-1 Loss Of Communication for 5 seconds ; disabling host
2019-09-19T17:26:42.189 [3163376.00305] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1847) recovery_handler : Warn : controller-1 ... stopping host services
2019-09-19T17:26:42.189 [3163376.00306] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1848) recovery_handler : Warn : controller-1 ... continuing with graceful recovery
2019-09-19T17:26:42.189 [3163376.00310] controller-0 mtcAgent hbs nodeClass.cpp (1672) alarm_enabled_failure :Error : controller-1 critical enable failure
2019-09-19T17:26:42.205 [3163376.00318] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1954) recovery_handler : Info : controller-1 Graceful Recovery Wait (1200 secs) (uptime was 0)

Cable reinserted, resulting in observed mtcAlive.

2019-09-19T17:27:19.267 [3163376.00338] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (2016) recovery_handler : Info : controller-1 regained MTCALIVE from host that did not reboot (uptime:9231)
2019-09-19T17:27:19.267 [3163376.00339] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (2017) recovery_handler : Info : controller-1 ... uptimes before:0 after:9231
2019-09-19T17:27:19.267 [3163376.00340] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (2018) recovery_handler : Info : controller-1 ... exiting graceful recovery
2019-09-19T17:27:19.267 [3163376.00341] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (2019) recovery_handler : Info : controller-1 ... forcing full enable with reset
2019-09-19T17:27:19.267 [3163376.00342] controller-0 mtcAgent |-| nodeClass.cpp (7325) force_full_enable : Info : controller-1 Forcing Full Enable Sequence
2019-09-19T17:27:19.267 [3163376.00343] controller-0 mtcAgent hbs nodeClass.cpp (5941) allStateChange : Info : controller-1 unlocked-disabled-failed (seq:12)
2019-09-19T17:27:19.277 [3163376.00344] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp ( 560) enable_handler :Error : controller-1 Main Enable FSM (from failed)
2019-09-19T17:27:19.277 [3163376.00345] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-1 stop host service sent to controller-0 hbsAgent
2019-09-19T17:27:19.277 [3163376.00346] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-1 stop host service sent to controller-1 hbsAgent
2019-09-19T17:27:19.283 [107366.01375] controller-0 hbsAgent hbs nodeClass.cpp (7505) mon_host ...

Read more...
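
A sketch of how the sequence above can be traced in the maintenance logs; /var/log/mtcAgent.log on controller-0 (or its copy in a collect bundle) is an assumed location, not quoted from this report:

  # Heartbeat loss, the graceful recovery window, the regained mtcAlive, and the forced full enable
  grep -E "Heartbeat Loss|Graceful Recovery|MTCALIVE|Forcing Full Enable" /var/log/mtcAgent.log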

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Does this issue happen on systems that don't use VLANs for management and cluster on top of a single interface?

That would help to understand the impact and scope, for prioritization and for deciding who investigates.

Yang Liu (yliu12)
tags: added: stx.retestneeded
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Bin Qian (bqian20) → Eric MacDonald (rocksolidmtce)
tags: added: stx.metal
removed: stx.ha
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Waiting on retest result.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The test team will re-test as per Eric's suggestion. Eric will do the initial investigation when the issue is reproduced.

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

It was tested on a system without VLANs, and a different issue was observed: https://bugs.launchpad.net/starlingx/+bug/1847657

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Debug is on hold until patch testing of the surrogate issue https://bugs.launchpad.net/starlingx/+bug/1847657 is complete.

Revision history for this message
Bill Zvonar (billzvonar) wrote :

Jeyan, can you please re-test this LP as well now?

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

Verified in load 2019-10-17_20-00-00 on wolfpass 8-12. Could not reproduce this issue.

tags: removed: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Closing since Jeyan confirmed that the issue is not reproducible. Thanks Jeyan!

Changed in starlingx:
status: Triaged → Invalid