No controller swact for management cable pull on active controller

Bug #1812019 reported by Anujeyan Manokeran
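+-----------+--------------+------------+-------------+-----------+
| Affects   | Status       | Importance | Assigned to | Milestone |
+-----------+--------------+------------+-------------+-----------+
| StarlingX | Fix Released | Medium     | Bin Qian    |           |
+-----------+--------------+------------+-------------+-----------+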

Bug Description

The management cable pull test on the active controller failed: the expected switch-over did not happen. controller-1 was the active controller, and when its management cable was pulled there was no swact to controller-0. As a result, the alarms and the host states reported by host-list were incorrect, as shown below.

$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | disabled | offline |
| 2 | controller-1 | controller | unlocked | enabled | degraded |
| 3 | storage-0 | storage | unlocked | disabled | failed |
| 4 | storage-1 | storage | unlocked | disabled | offline |
| 5 | compute-0 | worker | unlocked | disabled | offline |
| 6 | compute-1 | worker | unlocked | disabled | offline |
| 7 | compute-2 | worker | unlocked | disabled | offline |
| 8 | compute-3 | worker | unlocked | disabled | offline |
| 9 | compute-4 | worker | unlocked | disabled | offline |
+----+--------------+-------------+----------------+-------------+--------------+
[wrsroot@controller-1 ~(keystone_admin)]$

| 200.009 | compute-4 experienced a persistent critical 'Infrastructure Network' communication failure. | host=compute-4.network=Infrastructure | critical | 2019-01-15T16:14: |
| | | | | 56.955679 |
| | | | | |
| 200.009 | compute-3 experienced a persistent critical 'Infrastructure Network' communication failure. | host=compute-3.network=Infrastructure | critical | 2019-01-15T16:14: |
| | | | | 56.875668 |
| | | | | |
| 200.009 | compute-2 experienced a persistent critical 'Infrastructure Network' communication failure. | host=compute-2.network=Infrastructure | critical | 2019-01-15T16:14: |
| | | | | 56.794702 |
| | | | | |
| 200.009 | compute-1 experienced a persistent critical 'Infrastructure Network' communication failure. | host=compute-1.network=Infrastructure | critical | 2019-01-15T16:14: |
| | | | | 56.714703 |
| | | | | |
| 200.009 | compute-0 experienced a persistent critical 'Infrastructure Network' communication failure. | host=compute-0.network=Infrastructure | critical | 2019-01-15T16:14: |
| | | | | 56.634703 |
| | | | | |
| 200.009 | storage-1 experienced a persistent critical 'Infrastructure Network' communication failure. | host=storage-1.network=Infrastructure | critical | 2019-01-15T16:14: |
| | | | | 56.553738 |
| | | | | |
| 200.009 | storage-0 experienced a persistent critical 'Infrastructure Network' communication failure. | host=storage-0.network=Infrastructure | critical | 2019-01-15T16:14: |
| | | | | 56.473739 |
| | | | | |
| 200.009 | controller-0 experienced a persistent critical 'Infrastructure Network' communication | host=controller-0.network= | critical | 2019-01-15T16:14: |
| | failure. | Infrastructure | | 56.393724 |
| | | | | |
| 200.005 | compute-4 experienced a persistent critical 'Management Network' communication failure. | host=compute-4.network=Management | critical | 2019-01-15T16:14: |
| | | | | 56.313705 |
| | | | | |
| 200.005 | compute-3 experienced a persistent critical 'Management Network' communication failure. | host=compute-3.network=Management | critical | 2019-01-15T16:14: |
| | | | | 56.232714 |
| | | | | |
| 200.005 | compute-2 experienced a persistent critical 'Management Network' communication failure. | host=compute-2.network=Management | critical | 2019-01-15T16:14: |
| | | | | 56.152733 |
| | | | | |
| 200.005 | compute-1 experienced a persistent critical 'Management Network' communication failure. | host=compute-1.network=Management | critical | 2019-01-15T16:14: |
| | | | | 56.072713 |
| | | | | |
| 200.005 | compute-0 experienced a persistent critical 'Management Network' communication failure. | host=compute-0.network=Management | critical | 2019-01-15T16:14: |
| | | | | 55.992697 |
| | | | | |
| 200.005 | storage-1 experienced a persistent critical 'Management Network' communication failure. | host=storage-1.network=Management | critical | 2019-01-15T16:14: |
| | | | | 55.912708 |
| | | | | |
| 200.005 | storage-0 experienced a persistent critical 'Management Network' communication failure. | host=storage-0.network=Management | critical | 2019-01-15T16:14: |
| | | | | 55.832702 |
| | | | | |
| 200.005 | controller-0 experienced a persistent critical 'Management Network' communication failure. | host=controller-0.network=Management | critical | 2019-01-15T16:14: |
| | | | | 55.752681 |
| |

Severity
--------
Major

Steps to Reproduce
------------------
1. Pull the management cable on the active controller and check whether a controller swact occurs. In the test above, controller-1 was active and the cable was pulled from controller-1.

Expected Behavior
------------------
A switch-over occurs: controller-0 becomes active and controller-1 becomes standby.

Actual Behavior
----------------
As described in the bug description: no switch-over to controller-0 occurs.

Reproducibility
---------------
Yes, reproduced 100% of the time.

System Configuration
--------------------
storage system

Branch/Pull Time/Commit
-----------------------
StarlingX_Upstream_build release branch build as of 2019-01-04_20-18-00

Timestamp/Logs
--------------
2019-01-15T16:14:

Tags: stx.2.0 stx.ha
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
Ghada Khalil (gkhalil) wrote :

Marking as release gating until further investigation.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.05
tags: added: stx.ha
Bin Qian (bqian20)
Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-ha (master)

Fix proposed to branch: master
Review: https://review.openstack.org/633520

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-ha (master)

Reviewed: https://review.openstack.org/633520
Committed: https://git.openstack.org/cgit/openstack/stx-ha/commit/?id=0641b4a44e38ca3fbb004aeda96a8493ac7699a7
Submitter: Zuul
Branch: master

commit 0641b4a44e38ca3fbb004aeda96a8493ac7699a7
Author: Bin Qian <email address hidden>
Date: Mon Jan 28 08:42:08 2019 -0500

    Fix h/w subsystem duplicated initialization

    h/w subsystem is mistakenly initialized twice. It causes the
    interface operational state changed events not being passed to
    the listener. In the event an interface operational state changed,
    i.e, cable is pulled, the system could not react to it.

    Change-Id: I014d25befda536265c9c588a156ce411d01147cf
    Closes-Bug: 1812019
    Signed-off-by: Bin Qian <email address hidden>
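
The commit message above attributes the missed swact to a subsystem being initialized twice, after which interface state change events no longer reached the listener. The report does not show the actual stx-ha code, so the following is only a minimal Python sketch of one way a duplicated initialization could produce that symptom: a second, unguarded init resets the listener list, and later events are silently dropped. The class `HwSubsystem`, its methods, and the interface name `mgmt0` are all hypothetical names chosen for illustration.

```python
from typing import Callable, List

# Callback signature: (interface name, is_up)
Listener = Callable[[str, bool], None]


class HwSubsystem:
    """Toy stand-in for a hardware-monitoring subsystem (hypothetical)."""

    def __init__(self) -> None:
        self._initialized = False
        self._listeners: List[Listener] = []

    def initialize_unguarded(self) -> None:
        # Problematic pattern: every call resets state, so listeners
        # registered after the first call are lost on the second call.
        self._listeners = []
        self._initialized = True

    def initialize(self) -> None:
        # Guarded pattern: a repeated call is a no-op.
        if self._initialized:
            return
        self._listeners = []
        self._initialized = True

    def register(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def interface_state_changed(self, name: str, is_up: bool) -> None:
        # Forward the event to every registered listener.
        for listener in self._listeners:
            listener(name, is_up)


def simulate(guarded: bool) -> List[str]:
    """Return the events a listener sees when init is called twice."""
    seen: List[str] = []
    hw = HwSubsystem()
    init = hw.initialize if guarded else hw.initialize_unguarded
    init()
    hw.register(lambda name, up: seen.append(f"{name}:{'up' if up else 'down'}"))
    init()  # accidental second initialization
    hw.interface_state_changed("mgmt0", False)  # simulated cable pull
    return seen
```

In this sketch, `simulate(guarded=False)` returns an empty list because the listener was discarded by the second init, while `simulate(guarded=True)` returns the single cable-pull event. Whether the real fix removed the extra call or added a guard is not stated in this report.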

Changed in starlingx:
status: In Progress → Fix Released
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05