SM does not recover following full inter-controller comm loss
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | High | Eric MacDonald |
Bug Description
Brief Description
-----------------
SM monitors the connectivity and health of its peer controller over the OAM, management and (if provisioned) cluster host networks.
Normal communication loss over individual links is handled automatically and correctly. If, however, SM sees ALL of its links to its peer go CARRIER DOWN simultaneously, there is a race-to-failure condition that leads to BOTH controllers declaring themselves UNHEALTHY and BOTH shutting down.
SM does not currently recover automatically from an UNHEALTHY shutdown once its monitored links recover. This leaves the system in an unmanaged state until one of the controllers is manually rebooted.
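The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not SM code: the `Controller` class, the `State` enum, the `switch_power` helper and the link names are all hypothetical, and the real race between the two SM instances is reduced to a single health evaluation per controller.

```python
from enum import Enum


class State(Enum):
    ENABLED = "enabled"
    UNHEALTHY = "unhealthy"


class Controller:
    """Toy stand-in for SM on one controller (hypothetical, not SM code)."""

    def __init__(self, name, links):
        self.name = name
        # Monitored network name -> carrier up (True) or down (False).
        self.carrier = {link: True for link in links}
        self.state = State.ENABLED

    def evaluate(self):
        # Losing some links is tolerated; losing every link at once is not.
        if self.state is State.ENABLED and not any(self.carrier.values()):
            self.state = State.UNHEALTHY
        # There is deliberately no transition out of UNHEALTHY here,
        # mirroring the behaviour reported in this bug.
        return self.state


def switch_power(controllers, powered):
    """Set carrier on every monitored link of every controller, then audit."""
    for controller in controllers:
        for link in controller.carrier:
            controller.carrier[link] = powered
        controller.evaluate()


LINKS = ("oam", "mgmt", "cluster-host")  # assumed names for illustration
c0 = Controller("controller-0", LINKS)
c1 = Controller("controller-1", LINKS)

switch_power([c0, c1], powered=False)  # all links go CARRIER DOWN together
states_after_loss = (c0.state, c1.state)

switch_power([c0, c1], powered=True)  # connectivity returns
states_after_restore = (c0.state, c1.state)
```

In this model both controllers are UNHEALTHY after the outage and remain UNHEALTHY after the links return, which is the stuck condition this report asks to have improved.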
Severity
--------
Critical: system is not managed and requires manual intervention to recover.
Steps to Reproduce
------------------
Power off the switch that provides link connectivity between the controllers.
The outage must include all monitored interfaces, and the links must all go CARRIER DOWN virtually simultaneously.
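When reproducing, it may help to confirm on each controller that every monitored interface really lost carrier. The helper below is a hedged sketch, not part of SM: it reads the standard Linux sysfs attribute /sys/class/net/<interface>/carrier, and the function names and the `sysfs_root` parameter are introduced here only for illustration. The interface names to pass in depend on the deployment and are not given in this report.

```python
from pathlib import Path


def carrier_states(interfaces, sysfs_root="/sys/class/net"):
    """Return {interface: True/False/None}; None means carrier was unreadable."""
    states = {}
    for name in interfaces:
        try:
            raw = (Path(sysfs_root) / name / "carrier").read_text().strip()
            states[name] = raw == "1"
        except OSError:
            # Interface missing, or administratively down (the read fails).
            states[name] = None
    return states


def all_carrier_down(interfaces, sysfs_root="/sys/class/net"):
    """True only if every listed interface explicitly reports carrier 0."""
    states = carrier_states(interfaces, sysfs_root)
    return bool(states) and all(state is False for state in states.values())
```

For example, `all_carrier_down(["eth0", "eth1"])` (hypothetical interface names) returns True only when both interfaces report carrier 0; an interface whose carrier state cannot be read is reported as None and is not counted as down.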
Expected Behavior
------------------
SM should recover once inter-controller connectivity recovers.
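One way to picture the requested behaviour is a controller that leaves the UNHEALTHY state after its links have been back for a few consecutive audits. This is a minimal sketch under assumptions, not the fix that was implemented: the `RecoveringController` class, the `holdoff_ticks` debounce and the choice to recover when at least one link has carrier are all hypothetical.

```python
class RecoveringController:
    """Hypothetical model: leaves UNHEALTHY once links stay up long enough."""

    def __init__(self, links, holdoff_ticks=3):
        # Monitored network name -> carrier up (True) or down (False).
        self.carrier = {link: True for link in links}
        self.state = "enabled"
        self.holdoff_ticks = holdoff_ticks  # assumed debounce, not an SM setting
        self._stable = 0

    def tick(self):
        """Run one audit pass and return the resulting state."""
        any_up = any(self.carrier.values())
        if self.state == "enabled":
            if not any_up:
                self.state = "unhealthy"
                self._stable = 0
        else:
            # Count consecutive audits with at least one link up;
            # any audit with all links down resets the count.
            self._stable = self._stable + 1 if any_up else 0
            if self._stable >= self.holdoff_ticks:
                self.state = "enabled"
        return self.state


ctrl = RecoveringController(("oam", "mgmt"), holdoff_ticks=2)
for link in ctrl.carrier:
    ctrl.carrier[link] = False
state_after_loss = ctrl.tick()       # every link down

ctrl.carrier["mgmt"] = True
state_first_audit = ctrl.tick()      # link back, hold-off not yet met
state_second_audit = ctrl.tick()     # hold-off met
```

The hold-off is there so that a briefly flapping link does not bounce the controller in and out of service; whether the real recovery should wait for one link or for all monitored links is not stated in this report.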
Actual Behavior
----------------
SM on both controllers remains in shutdown state until one of the controllers is rebooted manually.
Reproducibility
---------------
Difficult to reproduce because it depends on timing.
When the timing is precise (see above), the issue is reproducible 100% of the time.
System Configuration
-------
Any duplex system type
Branch/Pull Time/Commit
-------
StarlingX master branch as of the date of this issue creation.
Last Pass
---------
Not currently tested.
The lack of recovery was the expected behaviour, but it is considered unacceptable.
This issue is created to request and track an improvement update.
Timestamp/Logs
--------------
2020-04-
Test Activity
-------------
Evaluation
Workaround
----------
Reboot one or both controllers
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Changed in starlingx:
importance: Undecided → High
tags: added: stx.4.0 stx.metal
Fix proposed to branch: master
Review: https://review.opendev.org/735219