Mtce heartbeat backoff not restored if mtcAgent restarts
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | Medium | Eric MacDonald | |
Bug Description
In the case of a Multi Node Failure Avoidance (MNFA) event, the mtcAgent (Maintenance) sends a 'back-off' request to the hbsAgent (Heartbeat service) while there appears to be a networking issue that affects a number of hosts.
This 'back-off' request tells the heartbeat service to slow down by a factor of 4; for example, a 100 ms period would change to a 400 ms period while in MNFA mode. When the MNFA condition resolves, the mtcAgent sends a heartbeat 'recovery' command telling the heartbeat service to restore the heartbeat interval to normal, that is, the configured interval.
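The interval arithmetic can be sketched as follows. This is a minimal Python model, not the mtce source: the class and method names (`HeartbeatTimer`, `backoff`, `recover`) are hypothetical, and only the factor of 4 and the 100 ms example period come from the description above.

```python
BACKOFF_FACTOR = 4  # from the description: back-off slows the heartbeat by 4x


class HeartbeatTimer:
    """Hypothetical model of the heartbeat period kept by the heartbeat service."""

    def __init__(self, configured_ms):
        self.configured_ms = configured_ms  # the normal, configured interval
        self.period_ms = configured_ms      # the interval currently in use

    def backoff(self):
        # 'back-off' request: derive from the configured value so that a
        # repeated request cannot compound the slowdown.
        self.period_ms = self.configured_ms * BACKOFF_FACTOR

    def recover(self):
        # 'recovery' command: return to the configured interval.
        self.period_ms = self.configured_ms


timer = HeartbeatTimer(configured_ms=100)
timer.backoff()
backed_off_ms = timer.period_ms  # period while in MNFA mode
timer.recover()
restored_ms = timer.period_ms    # period after the MNFA condition resolves
```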
However, if the mtcAgent process is restarted while in MNFA mode, the knowledge that the heartbeat service is running at a reduced rate is lost and the rate is not restored, at least until the hbsAgent is restarted. If the hbsAgent is then restarted on one controller but not the other, a condition arises where one controller is heartbeating at a 4x slower rate than the other.
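The failure sequence can be illustrated with a small simulation. It is a sketch under stated assumptions: `MtcAgent`, `HbsAgent` and their methods are hypothetical stand-ins, a process restart is modeled as creating a fresh object, and the guard in `exit_mnfa` is an assumed reading of "the knowledge ... is lost". Which controller has its hbsAgent restarted is chosen arbitrarily here.

```python
BACKOFF_FACTOR = 4
CONFIGURED_MS = 100  # example period from the description


class HbsAgent:
    """Hypothetical heartbeat agent; a restart is modeled as a new instance."""

    def __init__(self):
        self.period_ms = CONFIGURED_MS

    def backoff(self):
        self.period_ms = CONFIGURED_MS * BACKOFF_FACTOR

    def recover(self):
        self.period_ms = CONFIGURED_MS


class MtcAgent:
    """Hypothetical maintenance agent that tracks MNFA only in memory."""

    def __init__(self, hbs_agents):
        self.hbs_agents = hbs_agents  # dict: controller name -> HbsAgent
        self.mnfa_active = False      # assumption: not persisted across restarts

    def enter_mnfa(self):
        self.mnfa_active = True
        for hbs in self.hbs_agents.values():
            hbs.backoff()

    def exit_mnfa(self):
        # Assumption: recovery is only sent if this process knows it backed off.
        if self.mnfa_active:
            for hbs in self.hbs_agents.values():
                hbs.recover()
            self.mnfa_active = False


agents = {"controller-0": HbsAgent(), "controller-1": HbsAgent()}
mtc = MtcAgent(agents)
mtc.enter_mnfa()                    # both agents now run at the back-off rate

mtc = MtcAgent(agents)              # mtcAgent restarts while in MNFA mode
mtc.exit_mnfa()                     # no recovery is sent: the state was lost

agents["controller-0"] = HbsAgent() # hbsAgent restarted on one controller only
periods = {name: hbs.period_ms for name, hbs in agents.items()}
```

In this model the restarted hbsAgent returns to the configured period while the other stays at the back-off period, which is the rate mismatch described above.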
This leads to a condition where the hbsClient detects that one controller is not providing cluster information at the same rate as the other and, if the data is missing twice or more in a row, inserts [0:0] cluster info into the bounced response for that time-slot ; [enabled:
As a result, SM gets an apparently faulty view of the cluster information from the controller that is running at the back-off rate.
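The "twice or more in a row" rule can be sketched as below. This is an illustrative model only, not the hbsClient code: the function name `bounced_cluster_info` is hypothetical, the `"[2:2]"` value is a made-up placeholder for valid cluster info, and reusing the last value on a single miss is an assumption, since the report only states what happens after two or more consecutive misses.

```python
EMPTY_INFO = "[0:0]"  # from the report: inserted when data is missing
MISS_THRESHOLD = 2    # from the report: "twice or more in a row"


def bounced_cluster_info(slots):
    """Return the cluster info placed in the bounced response for each slot.

    `slots` holds, per time-slot, the info received from one controller,
    or None when nothing arrived in that slot.
    """
    out = []
    last_seen = None
    misses = 0
    for info in slots:
        if info is not None:
            misses = 0
            last_seen = info
            out.append(info)
        else:
            misses += 1
            if misses >= MISS_THRESHOLD:
                out.append(EMPTY_INFO)
            else:
                # Assumption: a single miss is tolerated by reusing the last value.
                out.append(last_seen)
    return out


# A controller running at the 4x back-off rate delivers info in only one
# slot out of four; a controller at the normal rate delivers in every slot.
slow = ["[2:2]", None, None, None, "[2:2]", None, None, None]
normal = ["[2:2]"] * 8

slow_view = bounced_cluster_info(slow)
normal_view = bounced_cluster_info(normal)
```

Under these assumptions, half of the slots for the slow controller end up carrying [0:0], while the view of the controller at the normal rate is unchanged.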
Severity
--------
Major: Minor Log flooding in hbsAgent.log, hbsClient.log and sm.log
Could affect SM's ability to select the proper controller in the event of an inter-controller networking failure. Fortunately, the controller running at the back-off rate is the one that is showing cluster errors, so SM 'should' not select it. However, there are many potential error conditions, and this inaccurate data creates what could be seen as an untested double-fault scenario.
Steps to Reproduce
------------------
Trigger MNFA mode and, before it recovers, restart the mtcAgent and then the hbsAgent on one of the controllers.
Expected Behavior
------------------
Heartbeat rate is restored from the back-off rate to the configured interval.
Actual Behavior
----------------
Heartbeat rate stays at the back-off rate; the configured interval is not restored.
Reproducibility
---------------
100% of the time when the conditions that create the issue occur
System Configuration
-------
Any system with more than just 2 controllers/hosts
Branch/Pull Time/Commit
-------
Date of this issue creation
Last Pass
---------
Test escape, potential for issue never identified or observed.
Timestamp/Logs
--------------
2020-06-
2020-06-
2020-06-
2020-06-
2020-06-
2020-06-
2020-06-
2020-06-
2020-06-
2020-06-
Test Activity
-------------
Feature Testing
Workaround
----------
Restart the hbsAgent on the controller that is being reported as missing cluster info; controller-1 in the log examples above.
Changed in starlingx: | |
assignee: | nobody → Eric MacDonald (rocksolidmtce) |
tags: | added: stx.metal |
Changed in starlingx: | |
importance: | Undecided → Medium |
tags: | added: stx.5.0 |
Changed in starlingx: | |
status: | New → Triaged |
Fix proposed to branch: master
Review: https://review.opendev.org/737558