Brief Description
-----------------
When Multi-Node Failure Avoidance (MNFA) occurs, maintenance commands the heartbeat agent to slow down by a factor of 4. When MNFA resolves the modified heartbeat period is not always restored.
Severity
--------
Major: Requires a MNFA condition to trigger. The heartbeat cluster info sent to SM is not accurate and 'could' (although not yet seen to) affect SM's uncontrolled swact decision making in the presence of inter controller communication failures.
Steps to Reproduce
------------------
Create and then resolve a MNFA condition
Expected Behavior
------------------
The heartbeat period should be restored once the MNFA condition is resolved
Actual Behavior
----------------
... but sometimes it's not.
Reproducibility
---------------
Intermittent: Happens most of the time but requires a MNFA event to trigger.
System Configuration
--------------------
Multi-node system
Branch/Pull Time/Commit
-----------------------
Aug 27, 2020
Last Pass
---------
Unknown at this time. Is a Test Escape.
Timestamp/Logs
--------------
11720 2020-08-04T15:33:26.288 | Mgmnt: 4 | Mon | Mis | Max | Deg | Fail | Pulses Tot | Pulses | Enabled ( 400) |
11721 2020-08-04T15:33:26.288 +--------------+-----+-------+-------+-------+-------+------------+----------+-----------------+
11722 2020-08-04T15:33:26.288 | controller-0 | n | 0 | 0 | 0 | 0 | 0 | 0 | 400 msec
11723 2020-08-04T15:33:26.288 | controller-1 | Y | 0 | 0 | 2 | 8 | 2441a9 | 22d56 | 400 msec
11724 2020-08-04T15:33:26.288 | worker-0 | Y | 0 | 0 | 2 | 9 | 244dd8 | 2c8b8 | 400 msec
11725 2020-08-04T15:33:26.288 | worker-1 | Y | 0 | 0 | 2 | 9 | 2439bf | 2c8b8 | 400 msec
11726 2020-08-04T15:33:26.288 +--------------+-----+-------+-------+-------+-------+------------+----------+-----------------+
11727 2020-08-04T15:33:26.288 | Clstr: 4 | Mon | Mis | Max | Deg | Fail | Pulses Tot | Pulses | Enabled ( 400) |
11728 2020-08-04T15:33:26.288 +--------------+-----+-------+-------+-------+-------+------------+----------+-----------------+
11729 2020-08-04T15:33:26.289 | controller-0 | n | 0 | 0 | 0 | 0 | 0 | 0 | 400 msec
11730 2020-08-04T15:33:26.289 | controller-1 | Y | 0 | 0 | 3 | 0 | 244346 | 22d56 | 400 msec
11731 2020-08-04T15:33:26.289 | worker-0 | Y | 0 | 0 | 1 | 2 | 245044 | 2c8b8 | 400 msec
11732 2020-08-04T15:33:26.289 | worker-1 | Y | 0 | 0 | 1 | 2 | 243c2b | 2c8b8 | 400 msec
11733 2020-08-04T15:33:26.289 +--------------+-----+-------+-------+-------+-------+------------+----------+-----------------+
This is where mtcAgent goes into MNFA.
6350 2020-08-03T18:56:59.768 [1629233.00798] controller-0 mtcAgent --- mtcNodeMnfa.cpp ( 202) mnfa_enter : Warn : MNFA ENTER --> Entering Multi-Node Failure Avoidance
6351 2020-08-03T18:56:59.768 fmAPI.cpp(490): Enqueue raise alarm request: UUID (54fb9b66-3141-4fa7-8cad-6bc9a67c1bd2) alarm id (200.020) instant id (host=controller-0.event=mnfa_enter)
6352 2020-08-03T18:56:59.768 [1629233.00799] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-0 heartbeat backoff sent to controller-0 hbsAgent
6353 2020-08-03T18:56:59.768 [1629233.00800] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-0 heartbeat backoff sent to controller-1 hbsAgent
hbsAgent.log
2020-08-04T16:05:00.095 [93621.963289] controller-1 hbsAgent hbs hbsCluster.cpp ( 841) hbs_cluster_inject : Warn : controller-1 injected 0:0 into controller-0 Mgmnt history (entry 10)
2020-08-04T16:05:00.095 [93621.963290] controller-1 hbsAgent hbs hbsCluster.cpp ( 841) hbs_cluster_inject : Warn : controller-1 injected 0:0 into controller-0 Clstr history (entry 10)
2020-08-04T16:05:00.095 [93621.963291] controller-1 hbsAgent hbs hbsCluster.cpp ( 631) hbs_cluster_send : Info : cluster state notification Reason: missing peer controller view
2020-08-04T16:05:00.095 Cluster Vault : C1 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.095 Cluster Vault : C1 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.095 Cluster Vault : C0 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [0:0]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.095 Cluster Vault : C0 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [0:0]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.200 [93621.963292] controller-1 hbsAgent hbs hbsCluster.cpp ( 841) hbs_cluster_inject : Warn : controller-1 injected 0:0 into controller-0 Mgmnt history (entry 11)
2020-08-04T16:05:00.200 [93621.963293] controller-1 hbsAgent hbs hbsCluster.cpp ( 841) hbs_cluster_inject : Warn : controller-1 injected 0:0 into controller-0 Clstr history (entry 11)
2020-08-04T16:05:00.200 [93621.963294] controller-1 hbsAgent hbs hbsCluster.cpp ( 631) hbs_cluster_send : Info : cluster state notification Reason: missing peer controller view
2020-08-04T16:05:00.200 Cluster Vault : C1 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.200 Cluster Vault : C1 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.200 Cluster Vault : C0 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [0:0] [0:0]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.200 Cluster Vault : C0 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [0:0] [0:0]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.305 [93621.963295] controller-1 hbsAgent hbs hbsCluster.cpp ( 631) hbs_cluster_send : Info : cluster state notification Reason: receiving peer controller view
2020-08-04T16:05:00.305 Cluster Vault : C1 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.305 Cluster Vault : C1 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.305 Cluster Vault : C0 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.305 Cluster Vault : C0 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.399 [93621.963296] controller-1 hbsAgent hbs hbsCluster.cpp ( 631) hbs_cluster_send : Info : cluster state notification Reason: query
2020-08-04T16:05:00.399 Cluster Vault : C1 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.399 Cluster Vault : C1 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.399 Cluster Vault : C0 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.399 Cluster Vault : C0 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.516 [93621.963297] controller-1 hbsAgent hbs hbsCluster.cpp ( 841) hbs_cluster_inject : Warn : controller-1 injected 0:0 into controller-0 Mgmnt history (entry 11)
2020-08-04T16:05:00.516 [93621.963298] controller-1 hbsAgent hbs hbsCluster.cpp ( 841) hbs_cluster_inject : Warn : controller-1 injected 0:0 into controller-0 Clstr history (entry 11)
2020-08-04T16:05:00.516 [93621.963299] controller-1 hbsAgent hbs hbsCluster.cpp ( 631) hbs_cluster_send : Info : cluster state notification Reason: missing peer controller view
2020-08-04T16:05:00.516 Cluster Vault : C1 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.516 Cluster Vault : C1 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.516 Cluster Vault : C0 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [0:0]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.516 Cluster Vault : C0 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [0:0]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
Test Activity
-------------
Developer Testing
Workaround
----------
Swact controllers or restart mtcAgent
stx.5.0 / medium priority - issue related to h/w monitoring/recovery