StarlingX

Controller failover is in "deadlock" state after switch restart

Bug #1845393 reported by Tee Ngo on 2019-09-25

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
StarlingX	Fix Released	Medium	Bin Qian

Bug Description

Brief Description
-----------------
After a switch restart, the controllers are unable to establish an active-standby relationship. When the switch was restarted, both controllers experienced heartbeat loss on both the mgmt and cluster-host interfaces. These interfaces went down before the failover decision was made, but the peer update had not been received, so each controller believed its peer was healthier.
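
The stale-peer-view condition described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the SM failover logic: the `ControllerView` class, the `wants_active` rule, and the interface counts are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ControllerView:
    """What one controller believes at decision time (hypothetical model)."""
    name: str
    own_healthy_interfaces: int   # measured locally, always current
    peer_healthy_interfaces: int  # last value received from the peer, may be stale


def wants_active(view: ControllerView) -> bool:
    # Assumed rule: claim the active role only if this node looks
    # at least as healthy as its peer.
    return view.own_healthy_interfaces >= view.peer_healthy_interfaces


# Before the switch restart each node reported 2 healthy interfaces
# (mgmt and cluster-host). After the restart each node sees its own
# interfaces down (0) but still holds the peer's last report (2),
# because the peer update was never received.
c0 = ControllerView("controller-0", own_healthy_interfaces=0, peer_healthy_interfaces=2)
c1 = ControllerView("controller-1", own_healthy_interfaces=0, peer_healthy_interfaces=2)

decisions = {view.name: wants_active(view) for view in (c0, c1)}
print(decisions)
```

In this toy model neither controller claims the active role, since each defers to a peer it wrongly believes is healthier.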

Severity
--------
Critical

Steps to Reproduce
------------------
While the system is up with all nodes unlocked-enabled-available, restart the switch.

Expected Behavior
------------------
Communications are restored after switch restart. All nodes are unlocked-enabled-available. Horizon is accessible via floating IP.

Actual Behavior
----------------
Access to floating IP is lost. Each controller is in standalone state. Can only ssh into each controller using node IP. Worker nodes are inaccessible.

Reproducibility
---------------
Tried restarting the switch only once, in an attempt to resolve the periodic state transition of the bond slaves from ACTIVE to BACKUP, which might have caused Multinode Failure Avoidance in the lab.

System Configuration
--------------------
IPv6 Standard system

Branch/Pull Time/Commit
-----------------------
master BUILD_ID="2019-09-17_20-00-00"

Last Pass
---------
N/A

Timestamp/Logs
--------------
See sm.log attached

Test Activity
-------------
System Test

Tags:

Tee Ngo (teewrs) wrote on 2019-09-25:

failover_issue.tgz (2.6 MiB, application/x-tar)

description:	updated

Frank Miller (sensfan22) on 2019-09-26

Changed in starlingx:
assignee:	nobody → Bin Qian (bqian20)

Ghada Khalil (gkhalil) on 2019-09-26

tags:	added: stx.ha
description:	updated

Ghada Khalil (gkhalil) wrote on 2019-09-27:

Marking as stx.3.0 - system doesn't recover after a fault scenario

Changed in starlingx:
importance:	Undecided → Medium
status:	New → Triaged
tags:	added: stx.3.0

Bin Qian (bqian20) on 2019-09-30

Changed in starlingx:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2019-10-02: Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/686222

OpenStack Infra (hudson-openstack) wrote on 2019-10-08: Fix merged to ha (master)

Reviewed: https://review.opendev.org/686222
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=fc0828238ff7b670c1d03f97a7a814f431d756a1
Submitter: Zuul
Branch: master

commit fc0828238ff7b670c1d03f97a7a814f431d756a1
Author: Bin Qian <email address hidden>
Date: Wed Oct 2 13:03:09 2019 -0400

    Bug1845393 remove interface recovering state

    In the case of a switch recycle, the connected nic will go down and up
    but the communication will restore after the switch is up and running.
    This could take a few seconds (much longer than anticipated).

    This holds off the i/f state update to the peer.

    Also remove the batching interface failover state change. This is already
    handled in the failover fsm fail_pending state.

    Change-Id: Ia810927dbbc4b3821f7915e6a42bceeac43d9e46
    Closes-Bug: 1845393
    Signed-off-by: Bin Qian <email address hidden>