Mtce heartbeat period not restored after MNFA

Bug #1893212 reported by Eric MacDonald
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
When Multi-Node Failure Avoidance (MNFA) occurs, maintenance commands the heartbeat agent to slow down by a factor of 4. When MNFA resolves, the modified heartbeat period is not always restored.
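
As a rough illustration of the intended behavior (not the actual StarlingX implementation), the sketch below models the backoff and restore of the heartbeat period; all names are hypothetical, and only the factor of 4 comes from this report:

    // Minimal sketch of the intended backoff/restore behavior.
    // All names here are hypothetical; only the 4x factor is from the report.
    struct HeartbeatControl
    {
        static const int BACKOFF_FACTOR = 4;

        int  base_period_msec;  // normal heartbeat period
        int  period_msec;       // period currently in effect
        bool backoff_active;    // true while the MNFA backoff is applied

        explicit HeartbeatControl(int base)
            : base_period_msec(base), period_msec(base), backoff_active(false) {}

        // MNFA enter: slow the heartbeat down by a factor of 4.
        void mnfa_enter(void)
        {
            backoff_active = true;
            period_msec = base_period_msec * BACKOFF_FACTOR;
        }

        // MNFA exit: restore the base period. This restore is the step
        // that this bug shows sometimes not happening.
        void mnfa_exit(void)
        {
            if (backoff_active)
            {
                backoff_active = false;
                period_msec = base_period_msec;
            }
        }
    };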

Severity
--------
Major: requires an MNFA condition to trigger. The heartbeat cluster info sent to SM is not accurate and could (although this has not yet been observed to) affect SM's uncontrolled-swact decision making in the presence of inter-controller communication failures.

Steps to Reproduce
------------------
Create and then resolve an MNFA condition (i.e., cause enough hosts to fail heartbeat at the same time to cross the MNFA threshold, then allow them to recover).

Expected Behavior
------------------
The heartbeat period should be restored once the MNFA condition is resolved.

Actual Behavior
----------------
The heartbeat period is sometimes left at the backed-off (4x) rate after the MNFA condition resolves.

Reproducibility
---------------
Intermittent: happens most of the time, but requires an MNFA event to trigger.

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
Aug 27, 2020

Last Pass
---------
Unknown at this time; this is a test escape.

Timestamp/Logs
--------------
The hbsAgent heartbeat stats below, captured well after the MNFA event, show every monitored host still at the backed-off 400 msec heartbeat period:
11720 2020-08-04T15:33:26.288 | Mgmnt: 4 | Mon | Mis | Max | Deg | Fail | Pulses Tot | Pulses | Enabled ( 400) |
11721 2020-08-04T15:33:26.288 +--------------+-----+-------+-------+-------+-------+------------+----------+-----------------+
11722 2020-08-04T15:33:26.288 | controller-0 | n | 0 | 0 | 0 | 0 | 0 | 0 | 400 msec
11723 2020-08-04T15:33:26.288 | controller-1 | Y | 0 | 0 | 2 | 8 | 2441a9 | 22d56 | 400 msec
11724 2020-08-04T15:33:26.288 | worker-0 | Y | 0 | 0 | 2 | 9 | 244dd8 | 2c8b8 | 400 msec
11725 2020-08-04T15:33:26.288 | worker-1 | Y | 0 | 0 | 2 | 9 | 2439bf | 2c8b8 | 400 msec
11726 2020-08-04T15:33:26.288 +--------------+-----+-------+-------+-------+-------+------------+----------+-----------------+
11727 2020-08-04T15:33:26.288 | Clstr: 4 | Mon | Mis | Max | Deg | Fail | Pulses Tot | Pulses | Enabled ( 400) |
11728 2020-08-04T15:33:26.288 +--------------+-----+-------+-------+-------+-------+------------+----------+-----------------+
11729 2020-08-04T15:33:26.289 | controller-0 | n | 0 | 0 | 0 | 0 | 0 | 0 | 400 msec
11730 2020-08-04T15:33:26.289 | controller-1 | Y | 0 | 0 | 3 | 0 | 244346 | 22d56 | 400 msec
11731 2020-08-04T15:33:26.289 | worker-0 | Y | 0 | 0 | 1 | 2 | 245044 | 2c8b8 | 400 msec
11732 2020-08-04T15:33:26.289 | worker-1 | Y | 0 | 0 | 1 | 2 | 243c2b | 2c8b8 | 400 msec
11733 2020-08-04T15:33:26.289 +--------------+-----+-------+-------+-------+-------+------------+----------+-----------------+

This is where mtcAgent enters MNFA and commands the heartbeat backoff:

6350 2020-08-03T18:56:59.768 [1629233.00798] controller-0 mtcAgent --- mtcNodeMnfa.cpp ( 202) mnfa_enter : Warn : MNFA ENTER --> Entering Multi-Node Failure Avoidance
6351 2020-08-03T18:56:59.768 fmAPI.cpp(490): Enqueue raise alarm request: UUID (54fb9b66-3141-4fa7-8cad-6bc9a67c1bd2) alarm id (200.020) instant id (host=controller-0.event=mnfa_enter)
6352 2020-08-03T18:56:59.768 [1629233.00799] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-0 heartbeat backoff sent to controller-0 hbsAgent
6353 2020-08-03T18:56:59.768 [1629233.00800] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-0 heartbeat backoff sent to controller-1 hbsAgent

hbsAgent.log (controller-1):

2020-08-04T16:05:00.095 [93621.963289] controller-1 hbsAgent hbs hbsCluster.cpp ( 841) hbs_cluster_inject : Warn : controller-1 injected 0:0 into controller-0 Mgmnt history (entry 10)
2020-08-04T16:05:00.095 [93621.963290] controller-1 hbsAgent hbs hbsCluster.cpp ( 841) hbs_cluster_inject : Warn : controller-1 injected 0:0 into controller-0 Clstr history (entry 10)
2020-08-04T16:05:00.095 [93621.963291] controller-1 hbsAgent hbs hbsCluster.cpp ( 631) hbs_cluster_send : Info : cluster state notification Reason: missing peer controller view
2020-08-04T16:05:00.095 Cluster Vault : C1 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.095 Cluster Vault : C1 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.095 Cluster Vault : C0 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [0:0]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.095 Cluster Vault : C0 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [0:0]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.200 [93621.963292] controller-1 hbsAgent hbs hbsCluster.cpp ( 841) hbs_cluster_inject : Warn : controller-1 injected 0:0 into controller-0 Mgmnt history (entry 11)
2020-08-04T16:05:00.200 [93621.963293] controller-1 hbsAgent hbs hbsCluster.cpp ( 841) hbs_cluster_inject : Warn : controller-1 injected 0:0 into controller-0 Clstr history (entry 11)
2020-08-04T16:05:00.200 [93621.963294] controller-1 hbsAgent hbs hbsCluster.cpp ( 631) hbs_cluster_send : Info : cluster state notification Reason: missing peer controller view
2020-08-04T16:05:00.200 Cluster Vault : C1 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.200 Cluster Vault : C1 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.200 Cluster Vault : C0 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [0:0] [0:0]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.200 Cluster Vault : C0 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [0:0] [0:0]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.305 [93621.963295] controller-1 hbsAgent hbs hbsCluster.cpp ( 631) hbs_cluster_send : Info : cluster state notification Reason: receiving peer controller view
2020-08-04T16:05:00.305 Cluster Vault : C1 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.305 Cluster Vault : C1 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.305 Cluster Vault : C0 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.305 Cluster Vault : C0 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.399 [93621.963296] controller-1 hbsAgent hbs hbsCluster.cpp ( 631) hbs_cluster_send : Info : cluster state notification Reason: query
2020-08-04T16:05:00.399 Cluster Vault : C1 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.399 Cluster Vault : C1 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.399 Cluster Vault : C0 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.399 Cluster Vault : C0 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.516 [93621.963297] controller-1 hbsAgent hbs hbsCluster.cpp ( 841) hbs_cluster_inject : Warn : controller-1 injected 0:0 into controller-0 Mgmnt history (entry 11)
2020-08-04T16:05:00.516 [93621.963298] controller-1 hbsAgent hbs hbsCluster.cpp ( 841) hbs_cluster_inject : Warn : controller-1 injected 0:0 into controller-0 Clstr history (entry 11)
2020-08-04T16:05:00.516 [93621.963299] controller-1 hbsAgent hbs hbsCluster.cpp ( 631) hbs_cluster_send : Info : cluster state notification Reason: missing peer controller view
2020-08-04T16:05:00.516 Cluster Vault : C1 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.516 Cluster Vault : C1 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.516 Cluster Vault : C0 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [0:0]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-08-04T16:05:00.516 Cluster Vault : C0 Clstr [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [0:0]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
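
For context on the "Cluster Vault" lines above: each [e:r] entry appears to record the enabled vs. responding host counts for one heartbeat period, kept in a short per-controller history, and hbs_cluster_inject writes a 0:0 placeholder when the peer controller's view is missing. The sketch below is a hypothetical illustration of that bookkeeping, not the real hbsCluster.cpp code:

    // Hypothetical sketch of a per-controller heartbeat history ring.
    // Names and the depth of 20 (matching the vault lines above) are
    // illustrative; see hbsCluster.cpp for the real structures.
    #define HISTORY_DEPTH 20

    struct ClusterEntry
    {
        unsigned short hosts_enabled;
        unsigned short hosts_responding;
    };

    struct ControllerHistory
    {
        ClusterEntry entry[HISTORY_DEPTH];
        int oldest; // index the next sample overwrites

        // Normal sample: record this period's enabled/responding counts.
        void add(unsigned short enabled, unsigned short responding)
        {
            entry[oldest].hosts_enabled    = enabled;
            entry[oldest].hosts_responding = responding;
            oldest = (oldest + 1) % HISTORY_DEPTH;
        }

        // Missing peer controller view: inject a 0:0 entry, matching the
        // hbs_cluster_inject warnings in the log above.
        void inject_missing(void)
        {
            add(0, 0);
        }
    };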

Test Activity
-------------
Developer Testing

Workaround
----------
Swact controllers or restart mtcAgent

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil)
tags: added: stx.metal
Ghada Khalil (gkhalil) wrote:

stx.5.0 / medium priority - issue related to h/w monitoring/recovery

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0
Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote: Fix merged to metal (master)

Reviewed: https://review.opendev.org/750805
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=2210c71216416ac3cfd46d2dc82d0114ee2ef7e1
Submitter: Zuul
Branch: master

commit 2210c71216416ac3cfd46d2dc82d0114ee2ef7e1
Author: Eric MacDonald <email address hidden>
Date: Sun Sep 6 07:58:29 2020 -0400

    Fix Mtce Heartbeat period recovery on MNFA Exit

    When Multi-Node Failure Avoidance (MNFA) occurs,
    maintenance commands the Heartbeat Agent to slow
    down by a factor of 4.

    The rate recovery following a MNFA is not occurring.

    Update https://review.opendev.org/#/c/701057 made
    a condition check change that introduced this issue
    by requiring mnfa_timeout to be non-zero before an
    attempt is made to recover heartbeat period following
    MNFA recovery.

    This update switches that condition check to use more
    specific mnfa_backoff state tracker and because MNFA
    is a global maintenance mode feature rather than a
    node specific feature, moves the recovery check code
    from the node level fsm into a mnfa_recovery_handler
    called in the main select loop.

    Test Plan:

    PASS: Verify MNFA handling/recovery with mnfa_timeout!=0
                 that expires.
    PASS: Verify MNFA handling/recovery when mnfa_timeout!=0
                 but before the timeout expires.
    PASS: Verify MNFA handling/recovery when mnfa_timeout=0
    PASS: Verify MNFA backoff rate recovery over mtcAgent
                 process restart.
    PASS: Verify MNFA backoff rate is sent to hbsAgent if
                 hbsAgent restarts while MNFA is active.

    Change-Id: I8da5a000ab503692c7cfa620233ed8aa772c50f8
    Closes-Bug: #1893212
    Signed-off-by: Eric MacDonald <email address hidden>
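
The shape of the fix, per the commit message above, is roughly as follows. This is a hedged sketch: mnfa_recovery_handler is the name given in the commit, while the flags and the helper are hypothetical illustration, not the actual mtcNodeMnfa.cpp code:

    // Sketch of the fix described in the commit message.
    static bool mnfa_active  = false; // true while MNFA is in effect
    static bool mnfa_backoff = false; // true while the 4x backoff is applied
    static int  mnfa_timeout = 0;     // optional MNFA timeout; may be 0

    // Hypothetical helper: tell each hbsAgent to resume the base period.
    static void send_heartbeat_period_restore(void) { /* ... */ }

    // Before the fix, recovery lived in the per-node FSM and was gated on
    // mnfa_timeout being non-zero, so with mnfa_timeout=0 the backed-off
    // period was never restored.
    //
    // After the fix, recovery keys off the dedicated mnfa_backoff state
    // tracker and runs from the main select loop, since MNFA is a global
    // (not node-specific) maintenance mode feature.
    void mnfa_recovery_handler(void)
    {
        if (( mnfa_active == false ) && ( mnfa_backoff == true ))
        {
            send_heartbeat_period_restore();
            mnfa_backoff = false;
        }
    }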

Changed in starlingx:
status: In Progress → Fix Released