Controller failover is in "deadlock" state after switch restart

Bug #1845393 reported by Tee Ngo
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Bin Qian

Bug Description

Brief Description
-----------------
After a switch restart, the controllers are unable to establish an active-standby relationship. When the switch was restarted, both controllers experienced heartbeat loss on both the mgmt and cluster-host interfaces. These interfaces went down before the fail-over decision was made, but the peer's update had not yet been received, so each controller believed its peer was healthier.
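
The race described above can be illustrated with a minimal sketch. This is not the SM implementation: the `NodeView` type, the `decide_role` rule, the "lead"/"yield" role names, and the name-based tie-break are all hypothetical, chosen only to show how a stale peer report can make both controllers defer to each other.

```python
from dataclasses import dataclass


@dataclass
class NodeView:
    """Health as seen from one controller (hypothetical model, not SM code)."""
    name: str
    peer_name: str
    local_ifaces_up: int  # interfaces this node currently sees as up
    peer_ifaces_up: int   # last count reported by the peer (may be stale)


def decide_role(view: NodeView) -> str:
    """Defer to the peer if it looks healthier; break ties by node name."""
    if view.peer_ifaces_up > view.local_ifaces_up:
        return "yield"
    if view.peer_ifaces_up < view.local_ifaces_up:
        return "lead"
    # Assumed tie-break: the lexicographically smaller name leads.
    return "lead" if view.name < view.peer_name else "yield"


def simulate_switch_restart(stale_peer_info: bool) -> tuple:
    """Both nodes lose mgmt and cluster-host (2 -> 0 interfaces up)."""
    # With stale info, each node still holds the peer's pre-restart report.
    peer_up = 2 if stale_peer_info else 0
    c0 = NodeView("controller-0", "controller-1", 0, peer_up)
    c1 = NodeView("controller-1", "controller-0", 0, peer_up)
    return decide_role(c0), decide_role(c1)
```

With `stale_peer_info=True` both nodes defer and neither takes the lead, which is the deadlock shape reported here; with up-to-date peer information the assumed tie-break lets exactly one node lead.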

Severity
--------
Critical

Steps to Reproduce
------------------
While the system is up with all nodes unlocked-enabled-available, restart the switch.

Expected Behavior
-----------------
Communications are restored after switch restart. All nodes are unlocked-enabled-available. Horizon is accessible via floating IP.

Actual Behavior
---------------
Access to the floating IP is lost. Each controller is in standalone state and can only be reached over ssh using its node IP. Worker nodes are inaccessible.

Reproducibility
---------------
Only tried restarting the switch once, in an attempt to resolve the periodic state transition of the bond slaves from ACTIVE to BACKUP, which might have caused Multinode Failure Avoidance in the lab.

System Configuration
--------------------
IPv6 Standard system

Branch/Pull Time/Commit
-----------------------
master BUILD_ID="2019-09-17_20-00-00"

Last Pass
---------
N/A

Timestamp/Logs
--------------
See sm.log attached

Test Activity
-------------
System Test

Tags: stx.3.0 stx.ha
Tee Ngo (teewrs) wrote :
description: updated
Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
Ghada Khalil (gkhalil)
tags: added: stx.ha
description: updated
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 - system doesn't recover after a fault scenario

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.3.0
Bin Qian (bqian20)
Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/686222

OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/686222
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=fc0828238ff7b670c1d03f97a7a814f431d756a1
Submitter: Zuul
Branch: master

commit fc0828238ff7b670c1d03f97a7a814f431d756a1
Author: Bin Qian <email address hidden>
Date: Wed Oct 2 13:03:09 2019 -0400

    Bug1845393 remove interface recovering state

    In the case of a switch recycle, the connected nic will go down and up
    but the communication will restore after the switch is up and running.
    This could take a few seconds (much longer than anticipated).

    This holds off the i/f state update to the peer.

    Also remove the batching interface failover state change. This is already
    handled in the failover fsm fail_pending state.

    Change-Id: Ia810927dbbc4b3821f7915e6a42bceeac43d9e46
    Closes-Bug: 1845393
    Signed-off-by: Bin Qian <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released