Controller failover is in "deadlock" state after switch restart
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Bin Qian |
Bug Description
Brief Description
-----------------
After a switch restart, the controllers are unable to establish an active-standby relationship. When the switch restarted, both controllers experienced heartbeat loss on both the mgmt and cluster-host interfaces. Before the failover decision was made, these interfaces went down, but the peer's update had not yet been received, so each controller believed its peer was healthier than itself.
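The deadlock described above can be illustrated with a small sketch. This is a simplified, hypothetical model and not the actual service management (SM) code: the `HealthView` class, the `decide` function, and the "count of healthy interfaces" comparison are all assumptions made for illustration. It only shows how two nodes that each compare fresh local state against a stale view of the peer can both decide to defer.

```python
from dataclasses import dataclass


@dataclass
class HealthView:
    """Hypothetical health summary: how many of the monitored
    interfaces (mgmt, cluster-host) a controller considers healthy."""
    healthy_interfaces: int


def decide(own: HealthView, last_known_peer: HealthView) -> str:
    """Hypothetical failover rule: defer to the peer if it looks healthier.

    'last_known_peer' is whatever this controller last heard from its peer,
    which may be out of date if heartbeats have been lost.
    """
    if last_known_peer.healthy_interfaces > own.healthy_interfaces:
        return "defer-to-peer"
    return "go-active"


# Last update each controller received from its peer, before the
# switch restarted: both interfaces were still up.
stale_peer_view = HealthView(healthy_interfaces=2)

# Local state after the switch restart: both interfaces are down.
own_view_now = HealthView(healthy_interfaces=0)

# Both controllers run the same rule on the same kind of inputs.
decision_controller_0 = decide(own_view_now, stale_peer_view)
decision_controller_1 = decide(own_view_now, stale_peer_view)

print(decision_controller_0, decision_controller_1)
```

In this model both controllers defer to a peer that is in fact no healthier than they are, so neither takes the active role, which matches the standalone state reported under Actual Behavior.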
Severity
--------
Critical
Steps to Reproduce
------------------
While the system is up with all nodes unlocked-
Expected Behavior
------------------
Communications are restored after switch restart. All nodes are unlocked-
Actual Behavior
----------------
Access to floating IP is lost. Each controller is in standalone state. Can only ssh into each controller using node IP. Worker nodes are inaccessible.
Reproducibility
---------------
Only tried restarting the switch once, in an attempt to resolve the periodic state transition of the bond slaves from ACTIVE to BACKUP that might have caused Multinode Failure Avoidance in the lab.
System Configuration
-------
IPv6 Standard system
Branch/Pull Time/Commit
-------
master BUILD_ID=
Last Pass
---------
N/A
Timestamp/Logs
--------------
See sm.log attached
Test Activity
-------------
System Test
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
tags: added: stx.ha
description: updated
Changed in starlingx:
status: Triaged → In Progress
Marking as stx.3.0 - system doesn't recover after a fault scenario