SM does not recover following full inter-controller comm loss

Bug #1883004 reported by Eric MacDonald
This bug affects 1 person
Affects     Status         Importance   Assigned to      Milestone
StarlingX   Fix Released   High         Eric MacDonald

Bug Description

Brief Description
-----------------
SM monitors the connectivity and health of its peer controller over the OAM, management and (if provisioned) cluster host networks.

Normal communication loss over individual links is handled automatically and correctly. However, if SM sees ALL of its links to its peer go CARRIER DOWN simultaneously, there is a race-to-failure condition that leads to BOTH controllers declaring themselves UNHEALTHY and BOTH shutting down.

SM does not currently recover automatically from an UNHEALTHY shutdown once its monitored links recover. This leaves the system in an unmanaged state until one of the controllers is rebooted manually.

Severity
--------
Critical: system is not managed and requires manual intervention to recover.

Steps to Reproduce
------------------
Power off the switch that provides link connectivity between the controllers.
The outage must include all monitored interfaces, and all of the links must go CARRIER DOWN virtually simultaneously.

Expected Behavior
------------------
SM should recover once inter-controller connectivity recovers.

Actual Behavior
----------------
SM on both controllers remains in shutdown state until one of the controllers is rebooted manually.

Reproducibility
---------------
Difficult to reproduce because it depends on timing.
When all links go CARRIER DOWN virtually simultaneously (see Steps to Reproduce), the issue is reproducible 100% of the time.

System Configuration
--------------------
Any duplex system type

Branch/Pull Time/Commit
-----------------------
StarlingX master branch as of the date of this issue creation.

Last Pass
---------
Not currently tested.
The lack of recovery was the expected behaviour at the time, but it is considered unacceptable.
This issue was created to request and track an improvement.

Timestamp/Logs
--------------
2020-04-17T12:02:30.000 controller-0 sm: debug time[58369.189] log<1470> INFO: sm[97337]: sm_node_utils.c(458): Node enable: blocked. node unhealthy file /var/run/.sm_node_unhealthy found
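
The log line above shows node enable being blocked because the marker file /var/run/.sm_node_unhealthy is present. A minimal sketch for checking that condition on a controller follows. It assumes a plain file-existence test is enough to tell whether the node is blocked; that is inferred from the log message, not taken from the SM source, and the function name is hypothetical.

```python
import os

# Path copied from the SM log line above.
UNHEALTHY_FLAG = "/var/run/.sm_node_unhealthy"


def node_enable_blocked(flag_path: str = UNHEALTHY_FLAG) -> bool:
    """Return True if the unhealthy marker file exists.

    Hypothetical helper: assumes the presence of the file alone
    is what blocks node enable, as the log message suggests.
    """
    return os.path.exists(flag_path)


if __name__ == "__main__":
    state = "blocked" if node_enable_blocked() else "not blocked"
    print(f"Node enable: {state} ({UNHEALTHY_FLAG})")
```

Running this on both controllers after the links recover would show whether each one is still held in the unhealthy state described in this report.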

Test Activity
-------------
Evaluation

Workaround
----------
Reboot one or both controllers

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/735219

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
tags: added: stx.4.0 stx.metal
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/735219
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=630a777cbb894501cb019c917c1be8288e7a7c36
Submitter: Zuul
Branch: master

commit 630a777cbb894501cb019c917c1be8288e7a7c36
Author: Eric MacDonald <email address hidden>
Date: Thu Jun 11 15:32:47 2020 -0400

    Add unhealthy state recovery audit to service management (sm)

    Service Management (SM) monitors connectivity and health of
    its peer controller over the OAM, Mgmt and (if provisioned)
    Cluster-Host networks.

    If SM sees all the links to its peer go 'carrier down' virtually
    simultaneously, it is possible that both controllers might
    simultaneously declare themselves unhealthy and both go
    disabled; i.e. shutdown all services with no automatic recovery.

    This update adds an 'Unhealthy State Recovery Audit' to SM which
    forces a self restart when all of its monitored links recover
    for cases where both controllers go unhealthy-shutdown or both
    controllers remain active in split-brain.

    Test Plan:

    PASS: Verify AIO SX install
    PASS: Verify Standard system install and unhealthy state recovery
    PASS: Verify single link failure end to end behavior
    PASS: Verify 2 of 3 link failure end to end behavior
    PASS: Verify all link failure end to end behavior
    PASS: Verify SM and Mtce heartbeat recovery over unhealthy state recovery
    PASS: Verify swact back and forth following a recovery
    PASS: Verify process restart as part of unhealthy state recovery
    PASS: Verify AIO DX install and unhealthy state recovery

    Change-Id: Ie906eaf04bec607328b7e0af09b37fa0558e3bbe
    Closes-Bug: 1883004
    Signed-off-by: Eric MacDonald <email address hidden>
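
The 'Unhealthy State Recovery Audit' described in the commit message above can be sketched roughly as follows. This is an illustration of the idea only, not the code merged in review 735219: the class name, the `restart` hook, the link names and the optional `required_passes` debounce are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass
class RecoveryAudit:
    """Hypothetical sketch of an unhealthy state recovery audit."""

    restart: Callable[[], None]  # hypothetical hook that restarts the SM process
    required_passes: int = 1     # assumed debounce; 1 means act on the first good audit
    _passes: int = 0

    def run(self, failed: bool, links_up: Mapping[str, bool]) -> bool:
        """Run one audit pass; return True if a self restart was requested.

        `failed` stands for "this controller is in unhealthy shutdown or
        split-brain". `links_up` maps each monitored link to its state.
        """
        all_up = bool(links_up) and all(links_up.values())
        if not failed or not all_up:
            # Nothing to recover from, or the links are not all back yet.
            self._passes = 0
            return False
        self._passes += 1
        if self._passes < self.required_passes:
            return False
        self._passes = 0
        self.restart()
        return True


if __name__ == "__main__":
    audit = RecoveryAudit(restart=lambda: print("restarting sm"))
    links = {"oam": True, "mgmt": True, "cluster-host": True}
    print(audit.run(failed=True, links_up=links))
```

The key design point taken from the commit message is that recovery is gated on all monitored links being back, so a controller does not restart into the same race while connectivity is still partially down.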

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ha/+/792251

OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/792251
Committed: https://opendev.org/starlingx/ha/commit/85bab5d2b394114feabe524504339a55eb8904e0
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9f70df63fd0d83bf0f94d1b9ac2f98516d5971c8
Author: Bin Qian <email address hidden>
Date: Fri May 7 16:36:23 2021 -0400

    Fix no swact for failure of critical services

    This fix is to ensure keeping service failure counting over successful
    audit.

    When the service enabled audit completes successfully, SM resets the service
    failure state. However it should not reset the service fail-count.
    The fail-count should only be reset after the grace period.

    Closes-Bug: 1893669
    Change-Id: I6996fe3f1c08c38da6f26243aee2b95b083069f0
    Signed-off-by: Bin Qian <email address hidden>
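
The fail-count behaviour described in the commit message above (a successful audit clears the failure state but not the fail-count, which is reset only after the grace period) might look like the sketch below. The class, its field names and the 300-second default are assumptions for illustration and do not come from the SM source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceFailureTracker:
    """Hypothetical sketch of fail-count handling across audits."""

    grace_period_s: float = 300.0          # assumed value, not from the source
    fail_count: int = 0
    failed: bool = False
    last_failure_at: Optional[float] = None

    def on_failure(self, now: float) -> None:
        """Record a service failure observed at time `now` (seconds)."""
        self._expire(now)
        self.failed = True
        self.fail_count += 1
        self.last_failure_at = now

    def on_audit_success(self, now: float) -> None:
        """A successful audit clears the failure state but keeps the count."""
        self.failed = False
        self._expire(now)

    def _expire(self, now: float) -> None:
        # The count is reset only once the grace period has elapsed
        # since the most recent failure.
        if (self.last_failure_at is not None
                and now - self.last_failure_at >= self.grace_period_s):
            self.fail_count = 0
            self.last_failure_at = None


if __name__ == "__main__":
    tracker = ServiceFailureTracker(grace_period_s=10.0)
    tracker.on_failure(now=0.0)
    tracker.on_audit_success(now=1.0)
    print(tracker.failed, tracker.fail_count)
```

Keeping the count across successful audits lets repeated failures of a critical service accumulate toward whatever threshold triggers a swact, instead of being wiped by each passing audit.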

commit 0b99b594f83b7c626cc0c4f7dc970ce373a7b748
Author: Bin Qian <email address hidden>
Date: Tue May 4 11:33:43 2021 -0400

    Fix AIO-DX failover issues

    This fix is to fix AIO unexpected failover behaviors.
    1. active controller reboots itself when standby controller
       reboot/lost power
    2. standby controller becomes degraded after active controller
       reboot/lost power

    Closes-bug: 1927133
    Change-Id: If3c9f6251f689a89cd206c672092ba296f00bd6b
    Signed-off-by: Bin Qian <email address hidden>

commit cb5fa9510f3ebda66f9850ac697e542bf041ce8c
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 27 09:43:00 2021 -0400

    Remove hbsAgent restart in failover failure recovery handling

    A forced reboot of the active controller in an AIO DC system
    puts SM into a failover failure recovery loop that prevents
    maintenance from detecting the heartbeat failure of the just-
    rebooted controller.

    The SM's failover failure recovery handling algorithm includes
    a self (sm process) restart preceded by a restart of the
    hbsAgent, both added by the following update last year.

    update: Add unhealthy state recovery audit to service management (sm)
    review: https://review.opendev.org/c/starlingx/ha/+/735219

    The self restart of SM was and is required in this case. However,
    the restart of the hbsAgent was only included as a safety measure,
    at the time, to ensure SM received updated cluster state info. The
    hbsAgent restart was only added at that time with the longer term
    intention to have it removed once the hbsAgent cluster state change
    notification improvement was implemented. That change is now
    implemented and merged by the following update.

    update: Mtce heartbeat cluster state change notification improvement
    review: https://review.opendev.org/c/starlingx/metal/+/769936

    Testing of the fix for the following issue in an AIO DC system
    resulted in the takeover controller not detecting a heartbeat loss
    of the just rebooted standby controller.

    title: Force active controller reboot results in a second reboot
    issue: https://bugs.launchpad.net/starlingx/+bug/1922584

    The hbsAgent is not able to detect the heartbeat loss of the just-
    booted controller because SM keeps re...

tags: added: in-f-centos8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ha (f/centos8)

Change abandoned by "Bart Wensley <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ha/+/765061
Reason: This patch has been idle for more than six months. I am abandoning it to keep the review queue sane. If you are still interested in working on this patch, please unabandon it and upload a new patchset.
