Need improved Service Management (SM) process stall handling
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Medium | Bin Qian | | |
Bug Description
In a duplex system, Service Management (SM) runs active on one controller and standby on the other.
Each instance exchanges heartbeats with the other.
In one complex scenario, if SM fails to maintain heartbeat with its peer and also loses contact with, and its view of, all the other nodes in the system, it stands down, assuming it is at fault.
Manual recovery action is required if this happens to both controllers.
This behavior was observed to be triggered by a significant SM process stall.
This bug report calls for improved handling of this scenario.
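One way to picture the scenario described above is a process that notices its own stall before it trusts its heartbeat state. The following is only a minimal sketch under stated assumptions, not the fix that was implemented in SM: the `StallDetector` class, the `decide` function, the 5 second threshold, and the "revalidate" outcome are all hypothetical names and values chosen for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class StallDetector:
    """Reports a local stall when the gap between two ticks is too large."""

    stall_threshold_s: float = 5.0  # hypothetical: gap treated as a stall
    clock: Callable[[], float] = time.monotonic
    _last_tick: Optional[float] = field(default=None, init=False)

    def tick(self) -> float:
        """Record a tick and return the stall duration (0.0 if none)."""
        now = self.clock()
        last = self._last_tick
        self._last_tick = now
        if last is None:
            return 0.0
        gap = now - last
        return gap if gap >= self.stall_threshold_s else 0.0


def decide(stalled_for: float, peer_heartbeat_ok: bool,
           other_nodes_visible: bool) -> str:
    """Assumed policy: distrust heartbeat state gathered across a stall."""
    if stalled_for > 0.0:
        # Heartbeats were missed because this process was not running,
        # so re-sample peer and node state before taking any action.
        return "revalidate"
    if not peer_heartbeat_ok and not other_nodes_visible:
        # The behavior described in this report: assume local fault.
        return "stand_down"
    return "stay"
```

The monotonic clock is used so that wall-clock adjustments are not mistaken for a stall. With this assumed policy, a 15 second gap between ticks leads to "revalidate" rather than an immediate stand-down.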
Severity
--------
Major: Rare case can cause outage that requires manual action to recover.
Steps to Reproduce
------------------
Create a 15 second SM process stall on the active controller.
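The report does not say how the stall was created. One possible way, shown here purely as an assumption, is to suspend the process with SIGSTOP and resume it with SIGCONT. The `stall_process` helper is hypothetical, and finding the PID of the SM process is left to the reader.

```python
import os
import signal
import time


def stall_process(pid: int, seconds: float) -> None:
    """Suspend `pid` with SIGSTOP, wait, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume the process, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)


# Example (sm_pid is a placeholder for the SM process ID):
# stall_process(sm_pid, 15)
```

Sending signals to another user's process requires sufficient privileges, so this would typically be run as root on the active controller.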
Expected Behavior
------------------
System recovers automatically
Actual Behavior
----------------
System recovery requires manual action
Reproducibility
---------------
100% reproducible with very long stalls.
System Configuration
-------
Multi-node system
Branch/Pull Time/Commit
-------
Current
Last Pass
---------
Never observed.
Timestamp/Logs
--------------
Not required. The issue is understood and a fix is being implemented.
Test Activity
-------------
Developer Testing
Workaround
----------
Manually reboot one or both of the controllers.
Changed in starlingx: | |
assignee: | nobody → Eric MacDonald (rocksolidmtce) |
tags: | added: stx.5.0 stx.ha |
Changed in starlingx: | |
importance: | Undecided → Medium |
Changed in starlingx: | |
assignee: | Eric MacDonald (rocksolidmtce) → Bin Qian (bqian20) |
Change abandoned by Bin Qian (<email address hidden>) on branch: master
Review: https://review.opendev.org/752521
Reason: meant to update review https://review.opendev.org/#/c/751421