Controller failed to take activity

Bug #1830774 reported by Nimalini Rasa
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Eric MacDonald

Bug Description

Brief Description
-----------------
After rebooting active controller (controller-0), controller-1 did not take activity. When the controller-0 came out of reboot, it became the active controller.

Severity
--------
Major

Steps to Reproduce
------------------
Install system and issue reboot on active controller

Expected Behavior
------------------
Activity switch to standby controller (controller-1)

Actual Behavior
----------------
No swact

Reproducibility
---------------
seen once

System Configuration
--------------------
2+10

Branch/Pull Time/Commit
-----------------------
Private build:2019-05-23

Last Pass
---------
not known
Timestamp/Logs
--------------
2019-05-28T14:55:45.000 controller-0 -sh: info HISTORY: PID=276470 UID=1875 sudo reboot

See services going down on controller-0 in sm-customer.log:
| 2019-05-28T14:55:48.015 | 560 | service-scn | docker-distribution | enabled-active | disabled | process (pid=102746) failed
| 2019-05-28T14:55:48.052 | 561 | service-scn | docker-distribution | disabled | enabling | enabled-active state requested
| 2019-05-28T14:55:48.055 | 562 | service-scn | docker-distribution | enabling | enabled-active | enable success
| 2019-05-28T14:55:48.114 | 563 | service-scn | etcd | enabled-active | disabled | process (pid=102871) failed
| 2019-05-28T14:55:48.186 | 564 | service-scn | registry-token-server | enabled-active | disabled | process (pid=101767) failed
| 2019-05-28T14:55:48.228 | 565 | node-scn | controller-0 | unlocked-enabled | unlocked-disabled | node not ready, node unhealthy set

But on controller-1, it fails to take activity:
| 2019-05-28T14:55:51.046 | 189 | service-domain-scn | controller | backup | leader | leader change
| 2019-05-28T14:55:51.050 | 190 | node-scn | controller-1 | unlocked-enabled | unlocked-disabled | node not ready
| 2019-05-28T14:55:51.050 | 191 | interface-scn | cluster-host-interface | enabled | disabled | node disabled
| 2019-05-28T14:55:51.050 | 192 | interface-scn | oam-interface | enabled | disabled | node disabled
| 2019-05-28T14:55:51.050 | 193 | interface-scn | management-interface | enabled | disabled | node disabled

After controller-0 booted and came up it took activity and it shows both controllers as enabled-available:
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |

Test Activity
-------------
Platform Testing

Revision history for this message
Nimalini Rasa (nrasa) wrote :
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Nimalini Rasa (nrasa) wrote :
Revision history for this message
Nimalini Rasa (nrasa) wrote :
Revision history for this message
Nimalini Rasa (nrasa) wrote :
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Looks like an early startup race condition where hbsAgent has not accumulated enough cluster history to qualify for an SM notification before an event occurs that SM needs cluster info to help it make a decision.

I'll take a the Jira and see if there is any tightening up I can do to make even partial info available earlier.

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; medium priority given that the issue appears to be intermittent

description: updated
tags: added: stx.metal
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2.0
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

I have identified that the heartbeat service did not treat the 'missing peer view' as a cluster event which delayed SM notification of cluster state change.

I've made changes to the hbs cluster event reporting to include the above as an event and to send SM a notification immediately upon detection of any cluster state change.

Update is being tested.

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/663813

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/663813
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=3c86ec443f9098b4a41c03ef6295d6af376d35a5
Submitter: Zuul
Branch: master

commit 3c86ec443f9098b4a41c03ef6295d6af376d35a5
Author: Eric MacDonald <email address hidden>
Date: Thu Jun 6 21:12:54 2019 -0400

    Make heartbeat cluster state change notifications immediate

    Investigation of the reported issue lead to the discovery
    that the heartbeat service was not sending a cluster state
    change notification after detecting that the peer controller
    was no longer providing its cluster state view.

    This update makes the following two cluster state change
    notification improvements to ensure that SM is notified
    immeditely following all cluster state changes.

    1. send SM current cluster state view notification as soon
       as the state change is detected rather than waiting for
       the end of the heartbeat period.

    2. consider the loss of the peer controller cluster view as
       a cluster state change that, like other state changes,
       warrants an immeduiate notification.

    Test Plan:

    PASS: Inspect heartbeat service logs for expected behavior.
    PASS: Reboot the active controller 10+ times and verify SM
          switches activity to the inservice peer controller in
          a timely manner.

    Change-Id: Iebf466a6d75e2f56fae1fdbdfc148127db139c99
    Closes-Bug: 1830774
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ming Lei (mlei) wrote :

Verified on build: 20190814T053000Z

tags: removed: stx.retestneeded
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.