Standby controller rebooted when the OAM ports of the active controller were disconnected

Bug #2037579 reported by Eliud Kyale
Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   Medium       Eliud Kyale

Bug Description

Brief Description
-----------------

AIO-DX with controller-0 (active) and controller-1 (standby). Disconnecting the OAM ports of controller-0 causes controller-1 to reboot continuously. After the OAM ports are reconnected, controller-1 stops rebooting.

Severity
--------
Major

Steps to Reproduce
------------------
1) Tester disconnects the OAM ports of the active controller, controller-0 (cable pull)

Expected Behavior
------------------

The standby controller is not rebooted. Only an alarm is raised.

Actual Behavior
----------------

controller-1 rebooted

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
AIO-DX system

Branch/Pull Time/Commit
-----------------------
Branch and the time when code was pulled or git commit or cengn load info

Last Pass
---------
2022-12-01 Regression test - Pull management and OAM Cable

Timestamp/Logs
--------------

2023-09-11T13:02:57.000 controller-0 sm: debug time[257181.025] log<965> INFO: sm[103532]: sm_failover.c(298): Interface vlan11 lose heartbeat.
2023-09-11T13:02:57.000 controller-0 sm: debug time[257181.025] log<165> INFO: sm_alarm[103556]: sm_alarm_thread.c(653): Raising alarm (communication-failure) for node (controller-0) domain () entity (oam).
2023-09-11T13:02:57.000 controller-0 sm: debug time[257181.026] log<166> INFO: sm_alarm[103556]: sm_alarm_thread.c(1113): Raised alarm (communication-failure) for node (controller-0) domain () entity (oam), fm_uuid=1742860c-bb8b-48dd-ade4-9e11d5905fd6.

Test Activity
-------------

Evaluation testing

Workaround
----------
N/A

Tags: stx.9.0 stx.ha
Eliud Kyale (ekyale)
Changed in starlingx:
assignee: nobody → Eliud Kyale (ekyale)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/896694

OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/896694
Committed: https://opendev.org/starlingx/ha/commit/efe4a7a3706ff6381e0bdd8fbc9c2150d44cba20
Submitter: "Zuul (22348)"
Branch: master

commit efe4a7a3706ff6381e0bdd8fbc9c2150d44cba20
Author: Kyale, Eliud <email address hidden>
Date: Wed Sep 27 13:45:14 2023 -0400

    IF_STATE_MASK fix for SM_FAILOVER_HEARTBEAT_ALIVE

    Change SM_FAILOVER_IF_STATE_MASK from 0xF to 0x3F. The 0xF mask was
    clearing the HEARTBEAT ALIVE flag:
    SM_FAILOVER_HEARTBEAT_ALIVE = (0x1 << 4), // 16

    This change restores previous system behavior. Tester performs a
    cable pull on the oam ports. The expected behavior is an alarm
    being raised. Instead the standby controller ended up getting rebooted.

    oam interface testing was simulated by bringing the ip link down for 1
    second.

    For example:

    sudo ip link set <oam> down; sleep 1 ; sudo ip link set <oam> up

    -----------------
    Before change
    -----------------
    - Heartbeat loss on oam interface resulted in standby controller reboot

    -----------------
    After change:
    -----------------

    - Heartbeat loss on oam interface resulted in alarm raised
    - Logs indicate the health score of controller-1 drops by 1 point

    Test plan:

    PASS - AIO-SX: iso install

    PASS - AIO-DX: iso install
                   drop oam interface on standby
                   verify standby controller-1 is not rebooted
                   by active controller-0
                   restore oam interface

    PASS - AIO-DX: system host-swact . swact back and forth

    Closes-Bug: 2037579

    Change-Id: I4f1ffc1169d4df090f71377e5aa8247e1cd17fc3
    Signed-off-by: Kyale, Eliud <email address hidden>
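
The mask arithmetic described in the commit message can be checked in isolation. The following is a minimal sketch, not the actual sm_failover.c code: only the value of SM_FAILOVER_HEARTBEAT_ALIVE is taken from the commit message, while the macro names IF_STATE_MASK_0XF and IF_STATE_MASK_0X3F and the helper functions are hypothetical, introduced for illustration. It assumes the mask is applied with a bitwise AND to an interface-state word.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag value quoted in the commit message: bit 4, decimal 16. */
#define SM_FAILOVER_HEARTBEAT_ALIVE (0x1u << 4)

/* Hypothetical names for the two mask values named in the commit message. */
#define IF_STATE_MASK_0XF  0x0Fu  /* keeps bits 0-3 only */
#define IF_STATE_MASK_0X3F 0x3Fu  /* keeps bits 0-5 */

/* Returns true if the heartbeat-alive bit is still set after masking. */
static bool heartbeat_alive_survives(unsigned int if_state, unsigned int mask)
{
    return ((if_state & mask) & SM_FAILOVER_HEARTBEAT_ALIVE) != 0;
}

/* Self-check on a state word that has only the heartbeat-alive bit set. */
static void check_masks(void)
{
    unsigned int if_state = SM_FAILOVER_HEARTBEAT_ALIVE;

    /* 0x10 & 0x0F == 0x00: the flag is cleared by the narrower mask. */
    assert(!heartbeat_alive_survives(if_state, IF_STATE_MASK_0XF));

    /* 0x10 & 0x3F == 0x10: the flag is preserved by the wider mask. */
    assert(heartbeat_alive_survives(if_state, IF_STATE_MASK_0X3F));
}
```

Calling check_masks() from a test program passes both assertions, which is consistent with the commit's statement that the mask was clearing the heartbeat-alive flag.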

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
summary: - The standby controller rebooted when the OAM ports of the active
- controller were disconnected
+ Standby controller rebooted when the OAM ports of the active controller
+ were disconnected
Ghada Khalil (gkhalil)
tags: added: stx.9.0