SM logs state change of interfaces not being managed

Bug #1910770 reported by Bin Qian
Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   Low          Bin Qian

Bug Description

The following line is printed repeatedly in the SM logs:

2020-12-21T15:10:27.000 controller-0 sm: debug time[238541.420] log<2029> INFO: sm[100498]: sm_failover.c(1151): Interface cali86cbf9a2468 state changed to 1

Expected behavior
SM should skip logging state changes of interfaces that are not monitored by SM.
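
The expected behavior can be sketched as a filter on the interface name before logging. This is a minimal illustration in Python, not the actual change made in sm_failover.c; the function names, the `monitored` set, and the interface names `enp0s3` and `enp0s8` are hypothetical.

```python
import logging

logger = logging.getLogger("sm_sketch")


def should_log_state_change(if_name, monitored):
    """Return True only for interfaces that SM is configured to monitor."""
    return if_name in monitored


def on_interface_state_change(if_name, state, monitored):
    """Log a state change for monitored interfaces only.

    Returns True if the change was logged, False if it was skipped.
    """
    if not should_log_state_change(if_name, monitored):
        return False
    logger.info("Interface %s state changed to %d", if_name, state)
    return True


# Hypothetical set of monitored interfaces (names are assumptions).
monitored = {"enp0s3", "enp0s8"}

# The interface from the reported log line is not in the set, so it is skipped.
logged_cali = on_interface_state_change("cali86cbf9a2468", 1, monitored)
# A monitored interface is still logged.
logged_mgmt = on_interface_state_change("enp0s3", 1, monitored)
```

With this kind of check in place, state changes on interfaces outside the monitored set would no longer produce log entries, while changes on monitored interfaces would still be recorded.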

Ghada Khalil (gkhalil) wrote :

Minor / low priority - log noise; would be nice to fix, but will not gate the next stx release

Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
tags: added: stx.ha
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
Bin Qian (bqian20) wrote :
Changed in starlingx:
status: Triaged → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ha/+/792251

OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/792251
Committed: https://opendev.org/starlingx/ha/commit/85bab5d2b394114feabe524504339a55eb8904e0
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9f70df63fd0d83bf0f94d1b9ac2f98516d5971c8
Author: Bin Qian <email address hidden>
Date: Fri May 7 16:36:23 2021 -0400

    Fix no swact for failure of critical services

    This fix ensures that the service failure count is kept across a
    successful audit.

    When the service enabled audit completes successfully, SM resets the
    service failure state. However, it should not reset the service
    fail-count. The fail-count should only be reset after the grace period.

    Closes-Bug: 1893669
    Change-Id: I6996fe3f1c08c38da6f26243aee2b95b083069f0
    Signed-off-by: Bin Qian <email address hidden>
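
The rule described in this commit message can be sketched as follows. This is an illustration only, not the SM implementation: the `ServiceState` class, its fields, the helper functions, and the 300-second grace period are all hypothetical.

```python
from dataclasses import dataclass

GRACE_PERIOD_SECS = 300.0  # hypothetical value, not taken from SM


@dataclass
class ServiceState:
    failed: bool = False       # current failure state
    fail_count: int = 0        # failures counted toward the critical threshold
    last_failure: float = 0.0  # timestamp of the most recent failure


def record_failure(svc, now):
    """Mark the service failed and count the failure."""
    svc.failed = True
    svc.fail_count += 1
    svc.last_failure = now


def audit_succeeded(svc, now):
    """A successful audit clears the failure state but keeps the count.

    The count is reset only once the grace period since the last
    failure has elapsed.
    """
    svc.failed = False
    if now - svc.last_failure >= GRACE_PERIOD_SECS:
        svc.fail_count = 0


svc = ServiceState()
record_failure(svc, now=0.0)
audit_succeeded(svc, now=10.0)   # within the grace period
count_within_grace = svc.fail_count

record_failure(svc, now=20.0)
audit_succeeded(svc, now=400.0)  # grace period has elapsed
count_after_grace = svc.fail_count
```

Keeping the count across successful audits means that repeated failures inside the grace period accumulate, which is what allows a critical service to eventually trigger a swact.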

commit 0b99b594f83b7c626cc0c4f7dc970ce373a7b748
Author: Bin Qian <email address hidden>
Date: Tue May 4 11:33:43 2021 -0400

    Fix AIO-DX failover issues

    This fixes unexpected AIO failover behaviors:
    1. The active controller reboots itself when the standby controller
       reboots or loses power.
    2. The standby controller becomes degraded after the active controller
       reboots or loses power.

    Closes-bug: 1927133
    Change-Id: If3c9f6251f689a89cd206c672092ba296f00bd6b
    Signed-off-by: Bin Qian <email address hidden>

commit cb5fa9510f3ebda66f9850ac697e542bf041ce8c
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 27 09:43:00 2021 -0400

    Remove hbsAgent restart in failover failure recovery handling

    A forced reboot of the active controller in an AIO DC system
    puts SM into a failover failure recovery loop that prevents
    maintenance from detecting the heartbeat failure of the just-
    rebooted controller.

    The SM's failover failure recovery handling algorithm includes
    a self (sm process) restart preceded by a restart of the
    hbsAgent, both added by the following update last year.

    update: Add unhealthy state recovery audit to service management (sm)
    review: https://review.opendev.org/c/starlingx/ha/+/735219

    The self restart of SM was and is required in this case. However,
    the restart of the hbsAgent was only included as a safety measure,
    at the time, to ensure SM received updated cluster state info. The
    hbsAgent restart was only added at that time with the longer term
    intention to have it removed once the hbsAgent cluster state change
    notification improvement was implemented. That change is now
    implemented and merged by the following update.

    update: Mtce heartbeat cluster state change notification improvement
    review: https://review.opendev.org/c/starlingx/metal/+/769936

    Testing of the fix for the following issue in an AIO DC system
    resulted in the takeover controller not detecting a heartbeat loss
    of the just rebooted standby controller.

    title: Force active controller reboot results in a second reboot
    issue: https://bugs.launchpad.net/starlingx/+bug/1922584

    The hbsAgent is not able to detect the heartbeat loss of the just-
    booted controller because SM keeps re...

tags: added: in-f-centos8