Controller failed to take activity
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Eric MacDonald |
Bug Description
Brief Description
-----------------
After rebooting the active controller (controller-0), controller-1 did not take activity. When controller-0 came out of reboot, it became the active controller again.
Severity
--------
Major
Steps to Reproduce
------------------
Install system and issue reboot on active controller
Expected Behavior
------------------
Activity switch to standby controller (controller-1)
Actual Behavior
----------------
No swact
Reproducibility
---------------
seen once
System Configuration
-------
2+10
Branch/Pull Time/Commit
-------
Private build:2019-05-23
Last Pass
---------
not known
Timestamp/Logs
--------------
2019-05-
See services going down on controller-0 in sm-customer.log:
| 2019-05-
| 2019-05-
| 2019-05-
| 2019-05-
| 2019-05-
| 2019-05-
But on controller-1, it fails to take activity:
| 2019-05-
| 2019-05-
| 2019-05-
| 2019-05-
| 2019-05-
After controller-0 booted and came up it took activity and it shows both controllers as enabled-available:
[wrsroot@
+----+-
| id | hostname | personality | administrative | operational | availability |
+----+-
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
Test Activity
-------------
Platform Testing
tags: added: stx.retestneeded
Looks like an early startup race condition: hbsAgent has not yet accumulated enough cluster history to qualify for an SM notification when an event occurs for which SM needs cluster info to make a decision.
I'll take the Jira and see if there is any tightening up I can do to make even partial info available earlier.
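To illustrate the race described above, here is a minimal Python sketch. It is not the hbsAgent or SM code. The class name, the `required_samples` threshold, and the `allow_partial` flag are all hypothetical and chosen only for illustration. The sketch assumes a reporter that stays silent until its history window is full, and shows how flagging a report as partial would let a consumer see something earlier.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class ClusterHistory:
    """Hypothetical rolling window of heartbeat results for one peer."""

    required_samples: int = 5  # assumed threshold, not taken from the source
    samples: Deque[bool] = field(default_factory=deque)

    def record(self, peer_responded: bool) -> None:
        """Store one heartbeat result, keeping only the newest entries."""
        self.samples.append(peer_responded)
        while len(self.samples) > self.required_samples:
            self.samples.popleft()

    def report(self, allow_partial: bool) -> Optional[dict]:
        """Return a cluster report, or None if there is too little history."""
        complete = len(self.samples) >= self.required_samples
        if not complete and not (allow_partial and self.samples):
            # Strict mode: the consumer gets nothing during early startup.
            return None
        return {
            "complete": complete,
            "samples": len(self.samples),
            "peer_reachable": all(self.samples),
        }


def demo() -> tuple:
    """Simulate an event arriving after only two heartbeat periods."""
    history = ClusterHistory(required_samples=5)
    for _ in range(2):
        history.record(True)
    strict = history.report(allow_partial=False)
    partial = history.report(allow_partial=True)
    return strict, partial


if __name__ == "__main__":
    print(demo())
```

In strict mode the early event finds no report at all, which corresponds to the window in which the decision maker has no cluster info. With partial reports allowed, the consumer receives a report marked `"complete": False` and can decide how much weight to give it.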