StarlingX

Controller failed to take activity

Bug #1830774 reported by Nimalini Rasa on 2019-05-28

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	Medium	Eric MacDonald

Bug Description

Brief Description
-----------------
After rebooting active controller (controller-0), controller-1 did not take activity. When the controller-0 came out of reboot, it became the active controller.

Severity
--------
Major

Steps to Reproduce
------------------
Install system and issue reboot on active controller

Expected Behavior
------------------
Activity switch to standby controller (controller-1)

Actual Behavior
----------------
No swact

Reproducibility
---------------
seen once

System Configuration
--------------------
2+10

Branch/Pull Time/Commit
-----------------------
Private build:2019-05-23

Last Pass
---------
not known
Timestamp/Logs
--------------
2019-05-28T14:55:45.000 controller-0 -sh: info HISTORY: PID=276470 UID=1875 sudo reboot

Test Activity
-------------
Platform Testing

See original description

Tags:

Revision history for this message

Nimalini Rasa (nrasa) wrote on 2019-05-28:

controller-0 log Edit (43.0 MiB, application/x-tar)

Numan Waheed (nwaheed) on 2019-05-28

tags:

added: stx.retestneeded

Revision history for this message

Nimalini Rasa (nrasa) wrote on 2019-05-28:

controller-1 log Edit (39.7 MiB, application/x-tar)

Revision history for this message

Nimalini Rasa (nrasa) wrote on 2019-05-29:

controller-0_20190529.164737.tar Edit (36.8 MiB, application/x-tar)

Revision history for this message

Nimalini Rasa (nrasa) wrote on 2019-05-29:

controller-1_20190529.164929.tar Edit (34.3 MiB, application/x-tar)

Revision history for this message

Eric MacDonald (rocksolidmtce) wrote on 2019-05-29:

Looks like an early startup race condition where hbsAgent has not accumulated enough cluster history to qualify for an SM notification before an event occurs that SM needs cluster info to help it make a decision.

I'll take a the Jira and see if there is any tightening up I can do to make even partial info available earlier.

Changed in starlingx:
assignee:	nobody → Eric MacDonald (rocksolidmtce)

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-05-29:

Marking as release gating; medium priority given that the issue appears to be intermittent

description:	updated
tags:	added: stx.metal
Changed in starlingx:
importance:	Undecided → Medium
status:	New → Triaged
tags:	added: stx.2.0

Revision history for this message

Eric MacDonald (rocksolidmtce) wrote on 2019-06-06:

I have identified that the heartbeat service did not treat the 'missing peer view' as a cluster event which delayed SM notification of cluster state change.

I've made changes to the hbs cluster event reporting to include the above as an event and to send SM a notification immediately upon detection of any cluster state change.

Update is being tested.

Changed in starlingx:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-06-07: Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/663813

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-06-12: Fix merged to metal (master)

Reviewed: https://review.opendev.org/663813
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=3c86ec443f9098b4a41c03ef6295d6af376d35a5
Submitter: Zuul
Branch: master

commit 3c86ec443f9098b4a41c03ef6295d6af376d35a5
Author: Eric MacDonald <email address hidden>
Date: Thu Jun 6 21:12:54 2019 -0400

Make heartbeat cluster state change notifications immediate

    Investigation of the reported issue lead to the discovery
    that the heartbeat service was not sending a cluster state
    change notification after detecting that the peer controller
    was no longer providing its cluster state view.

    This update makes the following two cluster state change
    notification improvements to ensure that SM is notified
    immeditely following all cluster state changes.

    1. send SM current cluster state view notification as soon
       as the state change is detected rather than waiting for
       the end of the heartbeat period.

    2. consider the loss of the peer controller cluster view as
       a cluster state change that, like other state changes,
       warrants an immeduiate notification.

Test Plan:

    PASS: Inspect heartbeat service logs for expected behavior.
    PASS: Reboot the active controller 10+ times and verify SM
          switches activity to the inservice peer controller in
          a timely manner.

    Change-Id: Iebf466a6d75e2f56fae1fdbdfc148127db139c99
    Closes-Bug: 1830774
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

Revision history for this message

Ming Lei (mlei) wrote on 2019-08-15:

#10

Verified on build: 20190814T053000Z

tags:

removed: stx.retestneeded

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.