Maintenance fails controller with unhealthy SM immediately

Bug #1892789 reported by Eric MacDonald
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

Maintenance does not give an unhealthy Service Management (SM) process enough time to attempt self-recovery before failing and rebooting the controller it runs on.

Severity: Major

Impact: Maintenance fails the standby controller while its SM process is still trying to self-recover.

Steps to Reproduce
------------------
Create a situation that results in an SM heartbeat loss of more than 2 seconds between the controllers.

Expected Behavior
------------------
SM is given enough time to attempt self-recovery before maintenance fails the node.

Actual Behavior
----------------
Some SM heartbeat failure modes allow the standby controller's unhealthy notification to reach maintenance on the active controller, which reacts by failing the node immediately.

Reproducibility
---------------
100% of the time with certain failure modes, i.e. when the links do not drop but heartbeat still fails for more than 2 seconds.

System Configuration
--------------------
Duplex system

Branch/Pull Time/Commit
-----------------------
Aug 24, 2020

Last Pass
---------
Test Escape

Timestamp/Logs
--------------

# Standby controller experiences heartbeat loss to active controller
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.803] log<318> INFO: sm[88213]: sm_cluster_hbs_info_msg.cpp(110): Cluster Info: 0 secs ago, from c0: 3 enabled, 3 reachable, from c1: 3 enabled, 3 reachable
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.803] log<319> INFO: sm[88213]: sm_cluster_hbs_info_msg.cpp(110): Cluster Info: 0 secs ago, from c0: 3 enabled, 3 reachable, from c1: 3 enabled, 3 reachable
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.995] log<320> INFO: sm[88213]: sm_failover_fsm.cpp(107): send_event fail-pending-timeout
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.996] log<321> INFO: sm[88213]: sm_failover_ss.c(319): get survivor controller-1 standby, controller-0 active
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.996] log<322> INFO: sm[88213]: sm_failover_ss.c(352): Loss of heartbeat ALL
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.996] log<323> INFO: sm[88213]: sm_failover_ss.c(478): host reaches 3 nodes, peer reaches 3 nodes, peer will be survivor
2020-08-19T17:20:16.000 controller-1 sm: debug time[2266.996] log<324> INFO: sm[88213]: sm_failover_ss.c(520): Host from standby to failed, Peer from active to active.
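
# The survivor selection above compares how many cluster nodes each controller can still
# reach; with an equal count (3 vs 3 here) the active controller is kept and the standby
# schedules itself into the failed state. Below is a minimal illustrative sketch of that
# decision, derived only from the log output; the names are hypothetical and this is not
# the actual sm_failover_ss.c code.

    // Illustrative only: the survivor rule visible in the log above.
    // Equal cluster reachability means the active controller is kept and
    // a standby host schedules itself into the failed state.
    #include <cstdio>

    enum class Survivor { Host, Peer };

    Survivor select_survivor(int host_reachable, int peer_reachable, bool host_is_active)
    {
        if (host_reachable > peer_reachable) return Survivor::Host;  // host sees more nodes
        if (peer_reachable > host_reachable) return Survivor::Peer;  // peer sees more nodes
        return host_is_active ? Survivor::Host : Survivor::Peer;     // tie: active survives
    }

    int main()
    {
        // controller-1 (standby) vs controller-0 (active), both reach 3 nodes
        Survivor s = select_survivor(3, 3, /*host_is_active=*/false);
        std::printf("survivor: %s\n", s == Survivor::Peer ? "peer (active)" : "host");
    }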

# This log represents SM touching the /var/run/.sm_unhealthy failure flag file.
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.001] log<327> INFO: sm[88213]: sm_failover.c(588): To disable controller-1
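
# "To disable controller-1" is the point where SM raises its node-unhealthy flag file.
# A minimal sketch of raising and clearing such a flag follows, assuming only the
# /var/run/.sm_unhealthy path named above; the helper names are hypothetical and this is
# not the real sm_failover.c code.

    // Hypothetical helpers showing how a node-health flag file such as
    // /var/run/.sm_unhealthy can be raised ("touched") and cleared.
    #include <cerrno>
    #include <cstdio>
    #include <fcntl.h>
    #include <unistd.h>

    static const char* kUnhealthyFlag = "/var/run/.sm_unhealthy";

    bool raise_unhealthy_flag()
    {
        int fd = open(kUnhealthyFlag, O_CREAT | O_WRONLY, 0644);  // create if missing
        if (fd < 0) {
            std::fprintf(stderr, "cannot create %s: errno %d\n", kUnhealthyFlag, errno);
            return false;
        }
        close(fd);
        return true;
    }

    void clear_unhealthy_flag()
    {
        unlink(kUnhealthyFlag);  // absence of the file means healthy
    }

    int main()
    {
        if (raise_unhealthy_flag())
            clear_unhealthy_flag();
    }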

# Standby controller enters failure mode and enables recovery audit
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.002] log<332> ERROR: sm[88213]: sm_node_api.cpp(749): Node controller-1 is entering to failure mode.
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.003] log<344> INFO: sm[88213]: sm_failover_fsm.cpp(68): Exiting fail-pending state
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.003] log<345> INFO: sm[88213]: sm_failover_fsm.cpp(62): Entering failed state
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.003] log<346> ERROR: sm[88213]: sm_failover_failed_state.cpp(60): ********************************************************
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.003] log<347> ERROR: sm[88213]: sm_failover_failed_state.cpp(61): Entering Failover Failed state ; recovery audit started
2020-08-19T17:20:16.000 controller-1 sm: debug time[2267.003] log<348> ERROR: sm[88213]: sm_failover_failed_state.cpp(62): ********************************************************

# However, mtcAgent had already learned that controller-1's SM had declared itself unhealthy, so it failed and rebooted the node, as in the customer case.
2020-08-19T17:20:18.912 [18613.00559] controller-0 mtcAgent hbs nodeClass.cpp (4061) set_mtce_flags :Error : controller-1 reported unhealthy by SM (Mgmnt)
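
# Before the fix, mtcAgent acted on the very first mtcAlive message carrying SM's
# unhealthy flag. The sketch below is a simplified, hypothetical illustration of that
# immediate reaction, not the actual nodeClass.cpp set_mtce_flags code.

    // Simplified, hypothetical sketch of the pre-fix reaction: the first
    // mtcAlive message carrying SM's unhealthy flag fails the node.
    #include <cstdio>

    struct Node { const char* name; };

    static void fail_and_reboot(Node& node)  // stand-in for the real failure path
    {
        std::printf("%s reported unhealthy by SM ; failing node\n", node.name);
    }

    void handle_sm_health_flag(Node& node, bool sm_reported_unhealthy)
    {
        if (sm_reported_unhealthy) {
            // No debounce: SM gets no chance to run its recovery audit.
            fail_and_reboot(node);
        }
    }

    int main()
    {
        Node c1{"controller-1"};
        handle_sm_health_flag(c1, /*sm_reported_unhealthy=*/true);  // one report is enough
    }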

# Recovery audit attempts recovery within 5 seconds ; after the first audit interval it clears the unhealthy flag file and restarts itself
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.010] log<423> INFO: sm[88213]: sm_failover_fsm.cpp(107): send_event failed-recovery-audit
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.010] log<424> INFO: sm[88213]: sm_failover_failed_state.cpp(162): Oam:missing-heartbeat ; Mgmt:missing-heartbeat ; Cluster:missing-heartbeat ; recovery criteria met: Yes
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.010] log<425> INFO: sm[88213]: sm_failover_failed_state.cpp(178): ************************************
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.010] log<426> INFO: sm[88213]: sm_failover_failed_state.cpp(179): ** Failover Failed state recovery **
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.010] log<427> INFO: sm[88213]: sm_failover_failed_state.cpp(180): ************************************
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.010] log<428> INFO: sm[88213]: sm_failover_failed_state.cpp(83): Issuing controlled process restart ; pmon-restart hbsAgent
2020-08-19T17:20:21.000 controller-1 sm: debug time[2272.011] log<429> INFO: sm[88213]: sm_failover_failed_state.cpp(83): Issuing controlled process restart ; pmon-restart sm
2020-08-19T17:20:27.000 controller-1 sm: debug time[2277.948] log<3> INFO: sm_trap[88224]: sm_trap_thread.c(509): Trap connection has been broken, exiting.
2020-08-19T17:20:27.000 controller-1 sm: debug time[2277.948] log<4> INFO: sm_trap[88224]: sm_trap_thread.c(672): Shutting down.
2020-08-19T17:20:27.000 controller-1 sm: debug time[2277.949] log<5> INFO: sm_trap[88224]: sm_trap_thread.c(681): Shutdown complete.

But it's too late ; mtcAgent has already failed controller-1.
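
For reference, the recovery audit that fires at 17:20:21 runs on a short interval, re-checks the heartbeat state, and when the recovery criteria are met it clears the unhealthy flag file and issues the pmon-restart commands seen in the log. The sketch below is a hedged illustration of that loop; the 5 second interval comes from the comment above, while the criteria check and control flow are assumptions rather than the real sm_failover_failed_state.cpp code.

    // Illustrative recovery-audit loop.  The 5 second interval matches the
    // log comment above; the criteria check and restart commands are a
    // simplified stand-in for the real failed-state audit.
    #include <cstdlib>
    #include <unistd.h>

    static bool recovery_criteria_met()
    {
        return true;  // the real audit re-checks oam/mgmt/cluster heartbeat state
    }

    int main()
    {
        for (;;) {
            sleep(5);                                  // audit interval
            if (!recovery_criteria_met())
                continue;
            unlink("/var/run/.sm_unhealthy");          // clear the unhealthy flag file
            std::system("pmon-restart hbsAgent");      // controlled restarts, as in the log
            std::system("pmon-restart sm");            // SM restarts itself last
            break;
        }
    }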

Test Activity
-------------
Other

Workaround
----------
None

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/747805

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.metal
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - recovery robustness fix

tags: added: stx.5.0
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/747805
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=08c8c657951a128c6c4a08ddc014625924bc0279
Submitter: Zuul
Branch: master

commit 08c8c657951a128c6c4a08ddc014625924bc0279
Author: Eric MacDonald <email address hidden>
Date: Mon Aug 24 19:31:42 2020 -0400

    Debounce SM Unhealthy state notification

    Maintenance doesn't give an unhealthy Service Mgmnt (SM)
    process enough time to attempt a self recovery before
    failing and rebooting the controller it runs on.

    This update adds a small debounce to delay mtce's
    reaction to SM's unhealthy state notification.

    Only if the failure state persists for longer than 6
    back-to-back mtcAlive messages, approximately 30 secs,
    will maintenance fail and recover the node through
    reboot.

    Change-Id: Ica1b0925f0c767001d80e6a3b9928a6761b0c00f
    Closes-Bug: #1892789
    Signed-off-by: Eric MacDonald <email address hidden>
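
The fix debounces the unhealthy report rather than acting on it immediately. Below is a sketch of the counting logic described in the commit message; the threshold of 6 back-to-back mtcAlive messages comes from the commit, while the names and structure are illustrative rather than the actual nodeClass.cpp change.

    // Illustrative debounce of SM's unhealthy report, per the commit message:
    // only fail the node if the unhealthy state persists across more than
    // 6 back-to-back mtcAlive messages (roughly 30 seconds).
    #include <cstdio>

    constexpr int kUnhealthyDebounceThreshold = 6;

    struct Node {
        const char* name;
        int unhealthy_count = 0;  // consecutive mtcAlive messages reporting unhealthy
    };

    static void fail_and_reboot(Node& node)  // stand-in for the real failure path
    {
        std::printf("%s: SM unhealthy persisted ; failing node\n", node.name);
    }

    void handle_sm_health_flag(Node& node, bool sm_reported_unhealthy)
    {
        if (!sm_reported_unhealthy) {
            node.unhealthy_count = 0;          // SM recovered ; forget the failure
            return;
        }
        if (++node.unhealthy_count > kUnhealthyDebounceThreshold) {
            fail_and_reboot(node);             // debounce expired ; recover via reboot
            node.unhealthy_count = 0;
        }
    }

    int main()
    {
        Node c1{"controller-1"};
        for (int i = 0; i < 7; ++i)            // 7 consecutive unhealthy reports
            handle_sm_health_flag(c1, true);   // node fails only on the 7th
    }

Resetting the counter on the first healthy mtcAlive is what gives SM's 5 second recovery audit room to clear the flag before maintenance ever reacts.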

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/750805

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/750806

Changed in starlingx:
status: Fix Released → In Progress
status: In Progress → Fix Released