After controller swact incorrect host state for system host-list on split brain test scenario

Bug #1813976 reported by Anujeyan Manokeran
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

As part of HA improvement testing, a split-brain test scenario was executed by blocking traffic between the active controller (controller-0) and both controller-1 and compute-0 using iptables commands. Soon afterwards there was a swact from controller-0 to controller-1, as expected, because controller-1 was the healthier controller and could still see compute-0. After the swact, system host-list on the new active controller (controller-1) showed incorrect data: controller-0 online, controller-1 failed, and compute-0 failed. This appears to be an issue in maintenance: controller-1 still seems to think controller-0 is active, since it is receiving messages from controller-0 but is unable to send messages to it.

[wrsroot@controller-1 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | disabled    | online       |
| 2  | controller-1 | controller  | unlocked       | enabled     | failed       |
| 3  | storage-0    | storage     | unlocked       | enabled     | available    |
| 4  | storage-1    | storage     | unlocked       | enabled     | available    |
| 5  | compute-0    | worker      | unlocked       | enabled     | failed       |
| 6  | compute-1    | worker      | unlocked       | enabled     | available    |
| 7  | compute-2    | worker      | unlocked       | enabled     | available    |
| 8  | compute-3    | worker      | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+
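
When repeating this test, it can help to check the host-list output automatically rather than by eye. The following is a minimal sketch, assuming the table layout shown in this report; the function names parse_host_list and not_available are hypothetical helpers and not part of any StarlingX tool.

```python
def parse_host_list(text):
    """Parse the ASCII table printed by `system host-list` into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        # Skip borders, shell prompts, and blank lines; keep only "| ... |" rows.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def not_available(rows):
    """Return hostnames whose availability is anything other than 'available'."""
    return [r["hostname"] for r in rows if r["availability"] != "available"]


# A few rows copied from the output in this report.
SAMPLE = """\
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | disabled | online |
| 2 | controller-1 | controller | unlocked | enabled | failed |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | failed |
+----+--------------+-------------+----------------+-------------+--------------+
"""

if __name__ == "__main__":
    print(not_available(parse_host_list(SAMPLE)))
```

Run against the sample rows, this lists controller-0, controller-1 and compute-0, which are the three hosts the description calls out.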

Services on controller-1 are all enabled-active, while those on controller-0 are all disabled.

====================================================================
         SYSTEM: yow-cgcs-wildcat-113_121
====================================================================

controller-0:~$ sudo sm-dump
Password:

-Service_Groups------------------------------------------------------------------------
oam-services disabled disabled
controller-services disabled disabled
cloud-services disabled disabled
patching-services disabled disabled
directory-services disabled disabled
web-services disabled disabled
storage-services disabled disabled
storage-monitoring-services disabled disabled
vim-services disabled disabled
---------------------------------------------------------------------------------------

controller-1:~$ sudo sm-dump
Password:

-Service_Groups------------------------------------------------------------------------
oam-services active active
controller-services active active
cloud-services active active
patching-services active active
directory-services active active
web-services active active
storage-services active active
storage-monitoring-services active active
vim-services active active
---------------------------------------------------------------------------------------

Severity
--------
Major

Steps to Reproduce
------------------
1. Execute the command below on the active controller (controller-0) to block traffic from the standby controller (controller-1) and compute-0.
sudo iptables -I INPUT 1 -s 192.168.223.57 -j DROP && sudo iptables -I INPUT 1 -s 192.168.222.156 -j DROP && \
sudo iptables -I INPUT 1 -s 192.168.222.4 -j DROP && sudo iptables -I INPUT 1 -s 192.168.223.3 -j DROP && sudo iptables -I INPUT 1 -s 128.224.150.57 -j DROP

2. After the above command, there is a swact to controller-1, but the host-list output is unstable and incorrect, as per the description.
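
Step 1 leaves the DROP rules in place, so they need to be removed again once the test is done. The sketch below only builds the command lines and does not run anything; it reuses the addresses from step 1 and assumes the rules are removed by rule specification with iptables -D. The helper names are hypothetical.

```python
import shlex

# Source addresses blocked in step 1 of this report.
BLOCKED_SOURCES = [
    "192.168.223.57",
    "192.168.222.156",
    "192.168.222.4",
    "192.168.223.3",
    "128.224.150.57",
]


def block_command(source):
    """Insert a DROP rule at the top of the INPUT chain, as in step 1."""
    return ["sudo", "iptables", "-I", "INPUT", "1", "-s", source, "-j", "DROP"]


def unblock_command(source):
    """Delete the matching DROP rule by its rule specification."""
    return ["sudo", "iptables", "-D", "INPUT", "-s", source, "-j", "DROP"]


def render(commands):
    """Join several argv lists into one '&&'-chained shell line."""
    return " && ".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    print(render(block_command(s) for s in BLOCKED_SOURCES))
    print(render(unblock_command(s) for s in BLOCKED_SOURCES))
```

Printing the commands rather than executing them keeps the sketch safe to run on any machine; the output can be pasted onto controller-0 when needed.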

Expected Behavior
------------------
A stable host-list showing correct information.

Actual Behavior
----------------
As per description

Reproducibility
---------------
Yes. Reproduced with controller-0 as the active controller.

System Configuration
--------------------
regular system

Branch/Pull Time/Commit
-----------------------
StarlingX_Upstream_build release branch build as of 019-01-16_20-18-01

Timestamp/Logs
--------------
2019-01-28T16:29:08.185

Ghada Khalil (gkhalil) wrote :

Marking as release gating -- issue found during feature testing (HA Recovery Improvements)

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.05 stx.metal
description: updated
Ken Young (kenyis)
Changed in starlingx:
assignee: Eric MacDonald (rocksolidmtce) → Cindy Xie (xxie1)
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Changed in starlingx:
assignee: Cindy Xie (xxie1) → chen haochuan (martin1982)
Frank Miller (sensfan22) wrote :

Assigning to the stx-metal Core to triage this issue and determine root cause and how to fix.

Changed in starlingx:
assignee: chen haochuan (martin1982) → Eric MacDonald (rocksolidmtce)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/657682

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/657682
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=5c043f7ca94a1bf4d121c209d42687184ec58a18
Submitter: Zuul
Branch: master

commit 5c043f7ca94a1bf4d121c209d42687184ec58a18
Author: Eric MacDonald <email address hidden>
Date: Tue May 7 15:30:00 2019 -0400

    Make Mtce ignore heartbeat events from in-active controller.

    There is the potential for a race condition that can lead to
    mtce incorrectly failing hosts due to heartbeat failure event
    messages sourced from the in-active controller.

    During a split brain recovery action scenario there was a swact
    which left the hbsAgent on the new stand-by controller thinking
    it was still on the active controller.

    This specific split brain failure mode was one where the active
    and then (after swact) stand-by controller was failing heartbeat
    to its peer and other nodes in the system even though the new
    active controller saw heartbeat working fine.

    The problem being, the in-active controller detected and sent
    a heartbeat loss message to mtce before mtce was able to update
    the in-active controller's heartbeat activity status which would
    have gated the loss event send.

    This update adds an additional layer of protection by intentionally
    ignoring heartbeat events from the in-active controller that might
    slip through due to this activity state change race condition.

    Also fixed a flooding log in the hbsAgent for big systems.

    Change-Id: I825a801166b3e80cbf67945c7f587851f4e0d90b
    Closes-Bug: 1813976
    Signed-off-by: Eric MacDonald <email address hidden>
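
The protection described in the commit message can be illustrated with a small model. This is only a sketch of the idea, written in Python for brevity; it is not the actual mtce code, and every class, field and method name here (HeartbeatEvent, MaintenanceSketch, source_controller, and so on) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatEvent:
    source_controller: str  # controller whose heartbeat agent sent the event
    target_host: str        # host the event is about
    kind: str               # "loss" or "clear" in this sketch


class MaintenanceSketch:
    """Toy model of maintenance tracking which hosts have failed heartbeat."""

    def __init__(self, active_controller):
        self.active_controller = active_controller
        self.failed_hosts = set()

    def swact(self, new_active):
        """Record that activity moved to another controller."""
        self.active_controller = new_active

    def handle(self, event):
        """Return True if the event was acted on, False if it was ignored."""
        # Extra layer of protection: drop events sourced from the
        # in-active controller, even if its agent still believes it is active.
        if event.source_controller != self.active_controller:
            return False
        if event.kind == "loss":
            self.failed_hosts.add(event.target_host)
        elif event.kind == "clear":
            self.failed_hosts.discard(event.target_host)
        return True


if __name__ == "__main__":
    mtce = MaintenanceSketch(active_controller="controller-0")
    mtce.swact("controller-1")
    # Stale loss event from the old active controller slips through the race.
    stale = HeartbeatEvent("controller-0", "compute-0", "loss")
    print(mtce.handle(stale), sorted(mtce.failed_hosts))
```

In this model the stale loss event from controller-0 is ignored after the swact, so compute-0 is not marked failed, while the same event from controller-1 would still be acted on.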

Changed in starlingx:
status: In Progress → Fix Released
Anujeyan Manokeran (anujeyan) wrote :

Verified in load BUILD_ID="2019-05-30_16-58-16"

tags: removed: stx.retestneeded