StarlingX

Mtce heartbeat backoff not restored if mtcAgent restarts

Bug #1884556 reported by Eric MacDonald on 2020-06-22

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	Medium	Eric MacDonald

Bug Description

In the case of a Multi Node Failure Avoidance (MNFA) event, mtcAgent (Maintenance ) sends a 'back-off' request to the hbsAgent (Heartbeat service) while there appears to be a networking issue that affects a number of hosts.

This 'back-off' request tells the heartbeat service to slow down by a factor of 4 ; what was say a 100 ms period would change to a 400 ms period while in MNFA mode. When the MNFA condition resolves the mtcAgent sends a heartbeat 'recovery' command to the heartbeat service telling it to restore the heartbeat interval back to normal; the configured interval.

However, if the mtcAgent process is 'restarted' while in MNFA mode, the knowledge that the heartbeat service is running at a reduced rate is lost and not restored ; at least until the hbsAgent is restarted. If then the hbsAgent is 'restarted' on one controller but not the other a condition arises where one controller is heart-beating at a 4x slower rate than the other.

This then leads to a condition where the hbsClient detects that one controller is not providing cluster information at the same rate as the other and inserts [0:0] cluster info into the bounced response for that time-slot if this missing data occurs twice or more in a row ; [enabled:receiving].

This then leads to SM getting an apparent faulty view of the cluster information from the controller that is running at the back-off rate.

Severity
--------
Major: Minor Log flooding in hbsAgent.log, hbsClient.log and sm.log

Could affect SM's ability to select the proper controller in the event of a inter-controller networking failure. Fortunately the controller running at the back-off rate is the one that is showing cluster errors so SM 'should' not select it but there are many potential error conditions and this inaccurate data creates what could be seen as an untested double fault scenario.

Steps to Reproduce
------------------
Trigger MNFA mode and before it restores restart the mtcAgent and then the hbsAgent on one of the controllers.

Expected Behavior
------------------
Heartbeat back-off rate is restored

Actual Behavior
----------------
Heartbeat back-off rate is not restored.

Reproducibility
---------------
100% of the time when the conditions that create the issue occur

System Configuration
--------------------
Any system with more than just 2 controllers/hosts

Branch/Pull Time/Commit
-----------------------
Date of this issue creation

Last Pass
---------
Test escape, potential for issue never identified or observed.

Timestamp/Logs
--------------
2020-06-11T13:47:21.000 controller-0 sm: debug time[56175.101] log<1601> INFO: sm[2251662]: sm_cluster_hbs_info_msg.cpp(110): Cluster hbs last updated 0 secs ago, storage-0 is not provisioned, from controller-0: 3 nodes enabled, 3 nodes reachable, from controller-1: 3 nodes enabled, 3 nodes reachable
2020-06-11T13:47:21.000 controller-0 sm: debug time[56175.312] log<1602> INFO: sm[2251662]: sm_cluster_hbs_info_msg.cpp(110): Cluster hbs last updated 0 secs ago, storage-0 is not provisioned, from controller-0: 3 nodes enabled, 3 nodes reachable, from controller-1: 0 nodes enabled, 0 nodes reachable
2020-06-11T13:47:21.000 controller-0 sm: debug time[56175.522] log<1603> INFO: sm[2251662]: sm_cluster_hbs_info_msg.cpp(110): Cluster hbs last updated 0 secs ago, storage-0 is not provisioned, from controller-0: 3 nodes enabled, 3 nodes reachable, from controller-1: 3 nodes enabled, 3 nodes reachable
2020-06-11T13:47:21.000 controller-0 sm: debug time[56175.732] log<1604> INFO: sm[2251662]: sm_cluster_hbs_info_msg.cpp(110): Cluster hbs last updated 0 secs ago, storage-0 is not provisioned, from controller-0: 3 nodes enabled, 3 nodes reachable, from controller-1: 0 nodes enabled, 0 nodes reachable
2020-06-11T13:47:21.000 controller-0 sm: debug time[56175.943] log<1605> INFO: sm[2251662]: sm_cluster_hbs_info_msg.cpp(110): Cluster hbs last updated 0 secs ago, storage-0 is not provisioned, from controller-0: 3 nodes enabled, 3 nodes reachable, from controller-1: 3 nodes enabled, 3 nodes reachable
2020-06-11T13:47:22.000 controller-0 sm: debug time[56176.153] log<1606> INFO: sm[2251662]: sm_cluster_hbs_info_msg.cpp(110): Cluster hbs last updated 0 secs ago, storage-0 is not provisioned, from controller-0: 3 nodes enabled, 3 nodes reachable, from controller-1: 0

2020-06-11T13:44:42.381 [1045702.00308] controller-1 hbsClient hbs hbsClient.cpp ( 962) _service_pulse_request : Warn : controller-1 Mgmnt cluster info cleared (5)
2020-06-11T13:44:42.802 [1045702.00309] controller-1 hbsClient hbs hbsClient.cpp ( 962) _service_pulse_request : Warn : controller-1 Mgmnt cluster info cleared (5)
2020-06-11T13:44:43.223 [1045702.00310] controller-1 hbsClient hbs hbsClient.cpp ( 962) _service_pulse_request : Warn : controller-1 Mgmnt cluster info cleared (5)
2020-06-11T13:44:43.644 [1045702.00311] controller-1 hbsClient hbs hbsClient.cpp ( 962) _service_pulse_request : Warn : controller-1 Mgmnt cluster info cleared (5)

Test Activity
-------------
Feature Testing

Workaround
----------
Restart the hbsAgent on the controller that is being reported as missing cluster info ; controller-1 in the above logs examples

Tags:

Eric MacDonald (rocksolidmtce) on 2020-06-22

Changed in starlingx:
assignee:	nobody → Eric MacDonald (rocksolidmtce)

Ghada Khalil (gkhalil) on 2020-06-22

tags:	added: stx.metal
Changed in starlingx:
importance:	Undecided → Medium
tags:	added: stx.5.0
Changed in starlingx:
status:	New → Triaged

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-06-23: Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/737558

Changed in starlingx:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-06-24: Fix merged to metal (master)

Reviewed: https://review.opendev.org/737558
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=4267d467869fcdbf2529bc1afa3d1c9958a2a6da
Submitter: Zuul
Branch: master

commit 4267d467869fcdbf2529bc1afa3d1c9958a2a6da
Author: Eric MacDonald <email address hidden>
Date: Tue Jun 23 09:23:51 2020 -0400

Force heartbeat period reset on mtcAgent process startup

    In the case of a Multi Node Failure Avoidance (MNFA) event,
    mtcAgent (Maintenance) sends a 'back-off' request to the
    hbsAgent (Heartbeat) while there appears to be a networking
    issue that affects a number of hosts.

    This 'back-off' request tells the heartbeat service to slow
    down by a factor of 4; what was say a 100 ms period would
    change to a 400 ms period while in MNFA mode. When the MNFA
    condition resolves the mtcAgent sends a heartbeat 'recovery'
    command to the heartbeat service telling it to restore the
    heartbeat interval back the configured interval.

    However, if the mtcAgent process is 'restarted' while in
    MNFA mode, the knowledge that the heartbeat service was
    running at a reduced rate is lost and not restored.

This update forces the heartbeat rate to be set back to the
configured rate when the mtcAgent starts up.

Note that MNFA mode is not and should not be preserved over
mtcAgent process restart. If after restart a MNFA event.

    Change-Id: I254ef86c453cb2d40cbeda859bd7477ac28942bc
    Closes-Bug: 1884556
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2020-06-28:

Changing the release tag to stx.4.0 since the fix was merged in stx master for that release.

tags:

added: stx.4.0
removed: stx.5.0

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.