StarlingX

SM does not recover following full inter-controller comm loss

Bug #1883004 reported by Eric MacDonald on 2020-06-10

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	High	Eric MacDonald

Bug Description

Brief Description
-----------------
SM monitors connectivity and health of the peer controller over the OAM, management and (if provisioned) cluster host networks.

Normal communication loss over individual links is handled automatically and correctly. If, however, SM sees ALL its links to its peer go CARRIER DOWN simultaneously, there is a race-to-failure condition that leads to BOTH controllers declaring themselves UNHEALTHY and BOTH shutting down.

SM doesn't currently recover automatically from an UNHEALTHY shutdown once its monitored links recover. This leaves the system in an un-managed state until manual action is taken to reboot one of the controllers.

Severity
--------
Critical: system is not managed and requires manual intervention to recover.

Steps to Reproduce
------------------
Power off the switch that provides link connectivity between controllers.
The outage must include all monitored interfaces, and the links must all go CARRIER DOWN virtually simultaneously.

Expected Behavior
------------------
SM should recover once inter-controller connectivity recovers.

Actual Behavior
----------------
SM on both controllers remains in shutdown state until one of the controllers is rebooted manually.

Reproducibility
---------------
Difficult to reproduce because it depends on timing.
When the timing is precise (see above), the issue is reproducible 100% of the time.

System Configuration
--------------------
Any duplex system type

Branch/Pull Time/Commit
-----------------------
StarlingX master branch as of the date of this issue creation.

Last Pass
---------
Not currently tested.
The lack of recovery was the expected behaviour but is felt to be unacceptable.
This issue was created to request and track an improvement update.

Timestamp/Logs
--------------
2020-04-17T12:02:30.000 controller-0 sm: debug time[58369.189] log<1470> INFO: sm[97337]: sm_node_utils.c(458): Node enable: blocked. node unhealthy file /var/run/.sm_node_unhealthy found
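
The log line above shows SM refusing to enable the node while a flag file is present. A minimal Python sketch of that kind of gate follows, assuming the check amounts to testing whether the file exists. The path is copied from the log; `node_enable_allowed` is a hypothetical name and this is not SM's actual C implementation.

```python
import os
import tempfile

# Path copied from the log line above.
UNHEALTHY_FLAG = "/var/run/.sm_node_unhealthy"


def node_enable_allowed(flag_path: str = UNHEALTHY_FLAG) -> bool:
    """Hypothetical gate: node enable is blocked while the flag file exists."""
    return not os.path.exists(flag_path)


# Exercise the gate against a throwaway file instead of the real /var/run path.
with tempfile.TemporaryDirectory() as tmp:
    flag = os.path.join(tmp, ".sm_node_unhealthy")
    allowed_before = node_enable_allowed(flag)   # no flag file yet
    open(flag, "w").close()                      # node declares itself unhealthy
    allowed_while_flagged = node_enable_allowed(flag)
    os.remove(flag)                              # flag cleared, e.g. by a reboot
    allowed_after = node_enable_allowed(flag)
```

Because nothing removes the flag when the links come back, a gate of this shape would stay closed until a reboot, which matches the behaviour reported here.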

Test Activity
-------------
Evaluation

Workaround
----------
Reboot one or both controllers

Tags:

Eric MacDonald (rocksolidmtce) on 2020-06-10

Changed in starlingx:
assignee:	nobody → Eric MacDonald (rocksolidmtce)

OpenStack Infra (hudson-openstack) wrote on 2020-06-11: Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/735219

Changed in starlingx:
status:	New → In Progress

Ghada Khalil (gkhalil) on 2020-06-12

Changed in starlingx:
importance:	Undecided → High
tags:	added: stx.4.0 stx.metal

OpenStack Infra (hudson-openstack) wrote on 2020-06-21: Fix merged to ha (master)

Reviewed: https://review.opendev.org/735219
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=630a777cbb894501cb019c917c1be8288e7a7c36
Submitter: Zuul
Branch: master

commit 630a777cbb894501cb019c917c1be8288e7a7c36
Author: Eric MacDonald <email address hidden>
Date: Thu Jun 11 15:32:47 2020 -0400

Add unhealthy state recovery audit to service management (sm)

    Service Management (SM) monitors connectivity and health of
    its peer controller over the OAM, Mgmt and (if provisioned)
    Cluster-Host networks.

    If SM sees all the links to its peer go 'carrier down' virtually
    simultaneously, it is possible that both controllers might
    simultaneously declare themselves unhealthy and both go
    disabled; i.e. shutdown all services with no automatic recovery.

    This update adds an 'Unhealthy State Recovery Audit' to SM which
    forces a self restart when all of its monitored links recover
    for cases where both controllers go unhealthy-shutdown or both
    controllers remain active in split-brain.

Test Plan:

    PASS: Verify AIO SX install
    PASS: Verify Standard system install and unhealthy state recovery
    PASS: Verify single link failure end to end behavior
    PASS: Verify 2 of 3 link failure end to end behavior
    PASS: Verify all link failure end to end behavior
    PASS: Verify SM and Mtce heartbeat recovery over unhealthy state recovery
    PASS: Verify swact back and forth following a recovery
    PASS: Verify process restart as part of unhealthy state recovery
    PASS: Verify AIO DX install and unhealthy state recovery

    Change-Id: Ie906eaf04bec607328b7e0af09b37fa0558e3bbe
    Closes-Bug: 1883004
    Signed-off-by: Eric MacDonald <email address hidden>
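
The audit described in this commit message can be sketched roughly as below. This is an illustration only, assuming the audit reduces to "node is in a failed state and every monitored link is up again"; the state names, link names, and the `restart` hook are hypothetical and are not taken from the SM source.

```python
from typing import Callable, Dict

# Hypothetical state names for this sketch; SM's real state machine differs.
FAILED_STATES = {"unhealthy-shutdown", "split-brain"}


def recovery_due(state: str, links_up: Dict[str, bool]) -> bool:
    """True when the node is in a failed state and every monitored link is up."""
    return state in FAILED_STATES and bool(links_up) and all(links_up.values())


def run_audit(state: str, links_up: Dict[str, bool],
              restart: Callable[[], None]) -> bool:
    """One audit pass: invoke the restart hook if recovery is due."""
    if recovery_due(state, links_up):
        restart()
        return True
    return False


restarts = []


def fake_restart() -> None:
    # Stand-in for the SM self restart; only records that it was requested.
    restarts.append("sm")


all_down = {"oam": False, "mgmt": False, "cluster-host": False}
partial = {"oam": True, "mgmt": True, "cluster-host": False}
all_up = {"oam": True, "mgmt": True, "cluster-host": True}

ran_all_down = run_audit("unhealthy-shutdown", all_down, fake_restart)
ran_partial = run_audit("unhealthy-shutdown", partial, fake_restart)
ran_healthy = run_audit("healthy", all_up, fake_restart)
ran_recovered = run_audit("unhealthy-shutdown", all_up, fake_restart)
```

Only the last call requests a restart: the node is in a failed state and all three links are back.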

Changed in starlingx:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2021-05-19: Fix proposed to ha (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ha/+/792251

OpenStack Infra (hudson-openstack) wrote on 2021-06-01: Fix merged to ha (f/centos8)

Reviewed:  https://review.opendev.org/c/starlingx/ha/+/792251
Committed: https://opendev.org/starlingx/ha/commit/85bab5d2b394114feabe524504339a55eb8904e0
Submitter: "Zuul (22348)"
Branch:    f/centos8

commit 9f70df63fd0d83bf0f94d1b9ac2f98516d5971c8
Author: Bin Qian <bin.qian@windriver.com>
Date:   Fri May 7 16:36:23 2021 -0400

Fix no swact for failure of critical services
    
    This fix is to ensure keeping service failure counting over successful
    audit.
    
    When service enabled audit successfully completes, SM reset the service
    failure state. However it should not reset the service fail-count.
    The fail-count should only be reset after the grace period.
    
    Closes-Bug: 1893669
    Change-Id: I6996fe3f1c08c38da6f26243aee2b95b083069f0
    Signed-off-by: Bin Qian <bin.qian@windriver.com>
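
The distinction this commit draws between the failure state and the fail-count can be illustrated with a small sketch. It assumes the grace period is measured from the most recent failure, which the commit message does not state; the class and method names are hypothetical.

```python
from typing import Optional


class ServiceFailTracker:
    """Hypothetical tracker separating failure state from the fail-count."""

    def __init__(self, grace_period: float) -> None:
        self.grace_period = grace_period
        self.failed = False
        self.fail_count = 0
        self.last_failure: Optional[float] = None

    def on_failure(self, now: float) -> None:
        self.failed = True
        self.fail_count += 1
        self.last_failure = now

    def on_audit_success(self) -> None:
        # A successful audit clears the failure state only, never the count.
        self.failed = False

    def expire(self, now: float) -> None:
        # Assumption: the count resets once a full grace period has passed
        # since the most recent failure.
        if self.last_failure is not None and now - self.last_failure >= self.grace_period:
            self.fail_count = 0
            self.last_failure = None


tracker = ServiceFailTracker(grace_period=10.0)
tracker.on_failure(now=0.0)
tracker.on_audit_success()
tracker.expire(now=5.0)            # still inside the grace period
count_inside_grace = tracker.fail_count
tracker.on_failure(now=6.0)        # second failure accumulates
count_after_second = tracker.fail_count
tracker.on_audit_success()
tracker.expire(now=20.0)           # grace period elapsed since last failure
count_after_grace = tracker.fail_count
```

Keeping the count across successful audits is what lets repeated failures of a critical service accumulate toward a swact.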

commit 0b99b594f83b7c626cc0c4f7dc970ce373a7b748
Author: Bin Qian <bin.qian@windriver.com>
Date:   Tue May 4 11:33:43 2021 -0400

Fix AIO-DX failover issues
    
    This fix is to fix AIO unexpected failover behaviors.
    1. active controller reboots itself when standby controller
       reboot/lost power
    2. standby controller becomes degraded after active controller
       reboot/lost power
    
    Closes-bug: 1927133
    Change-Id: If3c9f6251f689a89cd206c672092ba296f00bd6b
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit cb5fa9510f3ebda66f9850ac697e542bf041ce8c
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Tue Apr 27 09:43:00 2021 -0400

Remove hbsAgent restart in failover failure recovery handling
    
    A forced reboot of the active controller in an AIO DC system
    puts SM into a failover failure recovery loop that prevents
    maintenance from detecting the heartbeat failure of the just-
    rebooted controller.
    
    The SM's failover failure recovery handling algorithm includes
    a self (sm process) restart preceded by a restart of the
    hbsAgent, both added by the following update last year.
    
    update: Add unhealthy state recovery audit to service management (sm)
    review: https://review.opendev.org/c/starlingx/ha/+/735219
    
    The self restart of SM was and is required in this case. However,
    the restart of the hbsAgent was only included as a safety measure,
    at the time, to ensure SM received updated cluster state info. The
    hbsAgent restart was only added at that time with the longer term
    intention to have it removed once the hbsAgent cluster state change
    notification improvement was implemented. That change is now
    implemented and merged by the following update.
    
    update: Mtce heartbeat cluster state change notification improvement
    review: https://review.opendev.org/c/starlingx/metal/+/769936
    
    Testing of the fix for the following issue in an AIO DC system
    resulted in the takeover controller not detecting a heartbeat loss
    of the just rebooted standby controller.
    
    title: Force active controller reboot results in a second reboot
    issue: https://bugs.launchpad.net/starlingx/+bug/1922584
    
    The hbsAgent is not able to detect the heartbeat loss of the just-
    booted controller because SM keeps restarting it before it reaches
    the heartbeat loss state.
    
    With the cluster notification improvement update now implemented
    and merged it's time to remove the hbsAgent restart from SM's
    failover failure recovery algorithm.
    
    Test Plan:
    
    PASS: Active controller force reboot handling in AIO DC, DX and
          standard systems.
    PASS: Standby controller force reboot handling in AIO DC, DX and
          standard systems
    
    Partial-Bug: 1922584
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
    Change-Id: I26aa5ed9e0faec7294816269dbaa49cbb4696f66

commit 05a01c2100de3108d0a8ac757f0939d5c61fedcb
Author: Bin Qian <bin.qian@windriver.com>
Date:   Wed Mar 17 10:46:37 2021 -0400

Fix SQLite3 concurrent access issue
    
    SQLite3 does not support concurrent access with multiple connections
    that have writeable access. Currently SM opens database connections
    with full access, which causes concurrent issue.
    
    This fix includes:
    1. open readonly connection whenever the write permission is not needed
    2. remove code that open connections that are not being used
    3. remove reattempt and loggings from previous partial fix
    
    Now all writable connections are opened and used in main thread, this
    can ensure no more concurrent issue.
    
    Closes-Bug: 1915894
    Signed-off-by: Bin Qian <bin.qian@windriver.com>
    Change-Id: I200647a3733ac899b0b7498abd52992c7a87bd32
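
The first item in the list above, opening read-only connections wherever write access is not needed, can be illustrated with Python's standard `sqlite3` module and SQLite URI filenames. This is a sketch of the general technique, not SM's C code; the `open_db` helper and the table name are made up for the example.

```python
import os
import sqlite3
import tempfile


def open_db(path: str, writable: bool = False) -> sqlite3.Connection:
    """Open a SQLite database read-only unless write access is requested."""
    mode = "rw" if writable else "ro"
    return sqlite3.connect(f"file:{path}?mode={mode}", uri=True)


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "example.db")

    # A single writable connection creates and populates the database.
    writer = sqlite3.connect(db_path)
    writer.execute("CREATE TABLE services (name TEXT)")
    writer.execute("INSERT INTO services VALUES ('example')")
    writer.commit()

    # Readers use mode=ro and therefore cannot modify the database.
    reader = open_db(db_path)
    rows = reader.execute("SELECT name FROM services").fetchall()
    try:
        reader.execute("INSERT INTO services VALUES ('other')")
        write_rejected = False
    except sqlite3.OperationalError:
        write_rejected = True

    reader.close()
    writer.close()
```

With readers unable to write, keeping all writable connections on one thread, as the commit describes, becomes straightforward to enforce.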

commit 7ca56fec9f2829953f934bad519a7eea0a27f3f2
Author: Bin Qian <bin.qian@windriver.com>
Date:   Thu Mar 4 15:07:24 2021 -0500

Limit the troubleshooting log
    
    Stop the troubleshooting log once the execution passes the
    checkpoint.
    
    Change-Id: I4e1d7710d5216f7b5a908f56e72d5f95c35a6586
    Partial-bug: 1915894
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit 10ff42ae1135b5cdd9df0a13cd0d18bfe8d655fe
Author: Bin Qian <bin.qian@windriver.com>
Date:   Tue Mar 2 16:05:14 2021 -0500

Fix incorrect include causing build failure
    
    Previous commit:
    https://opendev.org/starlingx/ha/commit/f39ca95924a0a44dc287c1a560fa9f6f52cdea51
    added an incorrect #include which cause build failure.
    
    Closes-bug: 1917527
    
    Change-Id: I5d93d77fb0b14446e21a1ba160ffd0848533e970
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit f39ca95924a0a44dc287c1a560fa9f6f52cdea51
Author: Bin Qian <bin.qian@windriver.com>
Date:   Tue Dec 15 15:55:25 2020 -0500

Add reattempt and collect more data for SM init failure
    
    Multiple report to AIO-SX that SM failed its intialization due to
    a SQL failure. The issue had not been reproduced in DEV environment.
    This change adds logging, reattempt and collect SM troubleshooting
    data when SM fails in such situation.
    For potential recovery before pmon start actively monitoring SM,
    setting systemd restart=on-failure. Also set RestartSec=10 seconds
    in order to give pmon enough time to catch the failure and restart
    SM.
    
    Partial-bug: 1915894
    Change-Id: I5899e401742510158cd9c59a664b1dc329bb1075
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit 4f1f13dbf75c7d0df1e6383043b3fa8636d54b2d
Author: Chris Friesen <chris.friesen@windriver.com>
Date:   Fri Feb 12 17:42:40 2021 -0600

Add support for dcmanager-audit-worker service
    
    We're moving the bulk of the dcmanager subcloud audits to separate
    worker processes, so we need to add a service for the main worker
    processes (which will then spawn additional workers).
    
    In order to ensure that audits can be processed as soon as
    dcmanager-audit starts up, we make enabling it dependent on
    dcmanager-audit-worker being already running.
    
    Story: 2007267
    Task: 41870
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
    Depends-On: https://review.opendev.org/c/starlingx/distcloud/+/769216
    Change-Id: I162c00a3e8dba07f1912171e9371c29e5fd9a689

commit aaab51c1230a9194ec91886fe817cfb765d39bf5
Author: Teresa Ho <teresa.ho@windriver.com>
Date:   Thu Feb 11 13:23:55 2021 -0500

Create device-image-fs SM service
    
    Added a device-image-fs SM service to manage the device image
    repository filesystem.
    
    Tests performed on the following systems:
    AIO-DX, AIO-DX plus compute, Standard 2+1
    DC with AIO-DX plus subcloud
    DC with Standard subcloud
    
    Story: 2007875
    Task: 41880
    Depends-On: https://review.opendev.org/c/starlingx/ansible-playbooks/+/776488
    
    Change-Id: I068c26c524357176e4b526c405785768044c379c
    Signed-off-by: Teresa Ho <teresa.ho@windriver.com>

commit df3a96d8072f21fbe37b2206679b0f0afeef27bf
Author: Bin Qian <bin.qian@windriver.com>
Date:   Fri Jan 8 10:18:13 2021 -0500

Skip logging state change of I/F not managed by SM
    
    Skip logging state changes of interfaces that are not being
    monitored by SM. This is to reduce the noise in the sm.log.
    
    Closes-Bug: 1910770
    Change-Id: I6e3d78255dc41c03f10af2fd5d778e2398ea8816
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit ae51b607366e0f27cc5f7256542105f55f9dfe32
Author: Martin, Chen <haochuan.z.chen@intel.com>
Date:   Sun Nov 29 22:11:07 2020 +0800

Add service rook-mon-exit for duplex mode, host-swact case.
    
    For duplex, when make host swact, sm should firstly remove ceph-mon
    and ceph-osd pod, which open the /var/lib/ceph/mon-a folder.
    Remove these pods on active controller and make drbd set to
    secondary to swact to the other controller
    
    Story: 2005527
    Task: 41328
    
    Change-Id: I7cb7af3b3a56afcff71087d7f3b4f09a384c8dc2
    Signed-off-by: Martin, Chen <haochuan.z.chen@intel.com>

commit 2d0fc9b6118c7c9dd69290aa133c223ad557e5ae
Author: Bin Qian <bin.qian@windriver.com>
Date:   Thu Sep 17 12:36:31 2020 -0400

Detect peer SM failure
    
    This change is to detect SM failure/stall.
    
    1. SM sends alive pulse to hbsAgent, (hbsAgent sends SM failed state
       along with hbs cluster info)
    2. When SM lost heartbeat from peer, SM detects if peer has failed
       from hbs cluster info.
    3. On standby controller, if peer SM is stalled, it will take over
       to become active after signaling mtce to fail peer node.
    
    TCs passed:
       When SM detects peer failure from hbs cluster info, and signals
       mtce to fail peer node.
    
    Depends-on: https://review.opendev.org/#/c/751558
    
    Partial-bug: 1895350
    Change-Id: Id51e9adb4ef30bf806159366e6fdf115e743fe97
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit fa0d235555abb4e7fb4719ed70bfefac7213be72
Author: Takamasa Takenaka <takamasa.takenaka@windriver.com>
Date:   Tue Jan 12 17:05:46 2021 -0300

Remove database entries related to host-based snmp
    
    According to host-based SNMP removal, remove data entry related
    snmp.
    
    Story: 2008132
    Task: 41573
    Signed-off-by: Takamasa Takenaka <takamasa.takenaka@windriver.com>
    Depends-On: https://review.opendev.org/765381
    Change-Id: I533d9286f1a384be9d3ea245dff68812a14a4cd3

commit ebdc59e3e1fa81e35a1f8a6306dc96a7e31cae0e
Author: Bin Qian <bin.qian@windriver.com>
Date:   Wed Dec 9 17:38:28 2020 -0500

add disable dependency for drbd and fs services
    
    A DRBD service needs dependency on disable to its related fs
    service, for example: drbd-rabbit -> rabbit-fs, so that
    SM disables drbd-rabbit after rabbit-fs is disabled.
    This is important especially in AIO-SX in which drbd services
    are disabled when host is being unlocked.
    
    The following disable dependency are added in this change:
    drbd-pg -> pg-fs
    drbd-rabbit -> rabbit-fs
    drbd-platform -> platform-fs
    drbd-extension -> extension-fs
    drbd-cinder -> cinder-lvm
    drbd-dc-vault -> dc-vault-fs
    drbd-dockerdistribution -> dockerdistribution-fs
    drbd-cephmon -> cephmon-fs
    
    Test host lock/unlock in AIO-SX and standard DX environments.
    
    Closes-bug: 1907490
    Change-Id: I631f718add2c1d2756a36f3770a4d48f02904f1a
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit e52a67bfa0f181eccf60eaa704d3c2c3b1c83b32
Author: Bin Qian <bin.qian@windriver.com>
Date:   Thu Dec 31 11:48:09 2020 -0500

Avoid sending UDP packets to ::1
    
    In AIO-SX, the peer IP address is not configured (blank),
    which is translated into '::'. When the '::' is used as
    dest address, it is translated to loopback '::1'.
    
    SM should skip sending packets to destination '::'.
    
    Closes-Bug: 1909769
    Change-Id: Id8a9a00adce6573bcccd60b1b2112b6ee8b2f8a3
    Signed-off-by: Bin Qian <bin.qian@windriver.com>
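
A guard of the kind this commit describes might look like the sketch below, which uses Python's standard `ipaddress` module to detect the unspecified address. The `should_send` helper is hypothetical, and treating a blank peer address as "do not send" is an assumption of this example.

```python
import ipaddress


def should_send(dest: str) -> bool:
    """Return False for a blank or unspecified ('::' or '0.0.0.0') destination."""
    if not dest:
        return False
    return not ipaddress.ip_address(dest).is_unspecified


skip_blank = should_send("")
skip_unspecified_v6 = should_send("::")
skip_unspecified_v4 = should_send("0.0.0.0")
send_peer = should_send("fd00::2")
```

Checking before the send avoids relying on how the network stack treats an unspecified destination.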

commit 6ab82889af6bdf4232045a473f4762c5c0401252
Author: albailey <Al.Bailey@windriver.com>
Date:   Thu Dec 17 13:18:43 2020 -0600

Fix zuul jobs broken due to pip upversion
    
    The install_command for docs, newnote and api-ref
    needed to be overridden to not use upper constraints.
    
    The bandit requirement needed to be made python3 only.
    
    The bandit scan was failing, so it is now updated to
    allow individual bandit failures to be suppressed in tox.ini
    
    Need to include a py file change in order for bandit to be
    triggered by zuul.
    
    Partial-Bug: #1907678
    Signed-off-by: albailey <Al.Bailey@windriver.com>
    Change-Id: Ic73d0ea590ab1b7857f7275fa9c71828b0d343ee

commit df739b210e3074d48adddf0d54b5b024cd7419dc
Author: Don Penney <don.penney@windriver.com>
Date:   Thu Dec 17 13:27:02 2020 -0500

Add auto-version for remaining stx/ha packages
    
    Update remaining StarlingX packages with hardcoded TIS_PATCH_VER to
    use PKG_GITREVCOUNT where possible, with offsets as needed to ensure
    the version is incremented above the hardcoded version.
    
    Story: 2008455
    Task: 41447
    Signed-off-by: Don Penney <don.penney@windriver.com>
    Change-Id: Idf5ef476192cdf4923d6c903f1a15e03cfe9d03f

commit e8af161b16c1f75e2f1bab7c257aaa66caae7fd1
Author: Bin Qian <bin.qian@windriver.com>
Date:   Wed Oct 7 10:24:45 2020 -0400

Skip verifying h/w info for Not-In-Use interface
    
    When a domain interface state is changed, hardware information is
    verified to ensure the interface is OK to enter into the new state.
    
    However, when an interface is entering into Not-In-Use state, it should
    be always OK no matter what the h/w interface state is. Especially when
    the interface is back on lo, in which case getting hardware information
    will fail. This prevents moving interface to Not-In-Use state.
    
    This change skip verifying h/w state if state of an interface is changed
    to Not-In-Use.
    
    This fix will also skip checking h/w information for lo interface and
    always returns enabled = true.
    
    Closes-Bug: 1898629
    
    Signed-off-by: Bin Qian <bin.qian@windriver.com>
    Change-Id: I709708bce622f52bf84fc3fec749f204cfeee533

commit 57225bb34ae5380c95dddd0e556847f7a17e3d61
Author: albailey <Al.Bailey@windriver.com>
Date:   Wed Sep 16 13:01:03 2020 -0500

Use newer flake8 to run on ubuntu-focal Zuul machines
    
    flake8 2.5.5  fails on ubuntu-focal zuul machines running python3.8
    with the following error:
    AttributeError: 'FlakesChecker' object has no attribute 'CONSTANT'
    
    The update removes the version constraint to use newer flake8.
    
    The linters can be run in python3.
    Pylint cannot be run in python3 because mysql-python is not
    compatable, so a new zuul job for pylint is now added.
    
    The flake8 errors that the newer version raises are all suppressed,
    and some of them should be addressed by someone with familiarity in
    this repo.
    
    Change-Id: Ida6447728d4175173c02130cb04a6013e4f966f9
    Partial-Bug: 1895054
    Signed-off-by: albailey <Al.Bailey@windriver.com>

commit de04f2386039b7a393ff319405bb00dce5348001
Author: Jessica Castelino <jessica.castelino@windriver.com>
Date:   Wed Aug 19 16:01:58 2020 -0400

Move dcmanager orchestration to a separate process
    
    The DC manager orchestration is being removed from the
    dcmanager-manager process and it is running in
    dcmanager-orchestrator process.
    
    This update adds associated sm config for the new process.
    
    Change-Id: I7cc0869a123713d85b8167bd1f8a4481b8da0902
    Story: 2007267
    Task: 40715
    Depends-On: https://review.opendev.org/#/c/748452/
    Signed-off-by: Jessica Castelino <jessica.castelino@windriver.com>

commit 000df04ce109fe0e97721d9b4f3c842de754020d
Author: Bin Qian <bin.qian@windriver.com>
Date:   Mon Jun 22 10:10:55 2020 -0400

Move cert-mon service to controller-services
    
    Move the cert-mon service to controller-services service group, so
    to make it more generic for other platform certificate monitoring
    features.
    
    Story: 2007347
    Task: 40119
    
    Change-Id: Ib82a579dd2f1d0dcf97e90eed44fb095ee9ab6ca
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit 18deafa5c3c5ffe94434c4d4b63232210440d8ef
Author: Bin Qian <bin.qian@windriver.com>
Date:   Thu Jun 18 22:08:37 2020 -0400

Add new cert-mon service to sm db -- not provisioned
    
    Add new critical service cert-mon to under SM manage,
    in controller-services group.
    The new service will monitor admin endpoint service renewal in
    cert-manager and apply new certificates to controller nodes in a
    DC setup.
    
    Tests:
    Provision DC system controllers. Swact. all successful.
    
    Added and provisioned dummy service process, swact, all successful.
    
    Change-Id: Ic545fafc88be4acb4e5e0ea3e4449ade57dcef8c
    Story: 2007347
    Task: 40119
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit 630a777cbb894501cb019c917c1be8288e7a7c36
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Thu Jun 11 15:32:47 2020 -0400

Add unhealthy state recovery audit to service management (sm)
    
    Service Management (SM) monitors connectivity and health of
    its peer controller over the OAM, Mgmt and (if provisioned)
    Cluster-Host networks.
    
    If SM sees all the links to its peer go 'carrier down' virtually
    simultaneously, it is possible that both controllers might
    simultaneously declare themselves unhealthy and both go
    disabled; i.e. shutdown all services with no automatic recovery.
    
    This update adds an 'Unhealthy State Recovery Audit' to SM which
    forces a self restart when all of its monitored links recover
    for cases where both controllers go unhealthy-shutdown or both
    controllers remain active in split-brain.
    
    Test Plan:
    
    PASS: Verify AIO SX install
    PASS: Verify Standard system install and unhealthy state recovery
    PASS: Verify single link failure end to end behavior
    PASS: Verify 2 of 3 link failure end to end behavior
    PASS: Verify all link failure end to end behavior
    PASS: Verify SM and Mtce heartbeat recovery over unhealthy state recovery
    PASS: Verify swact back and forth following a recovery
    PASS: Verify process restart as part of unhealthy state recovery
    PASS: Verify AIO DX install and unhealthy state recovery
    
    Change-Id: Ie906eaf04bec607328b7e0af09b37fa0558e3bbe
    Closes-Bug: 1883004
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 3b68098be42b856471fa1cc39f359d6649585df4
Author: Sharath Kumar K <sharath.kumar@intel.com>
Date:   Mon May 4 08:01:38 2020 +0200

Tox and Zuul job for the  python code scan in starlingx/ha
    
    Setting up the bandit tool for the scanning of HIGH severity issues
    in the python codes under Starlingx/ha folder.
    Expecting this merge will enable zuul job for CI/CD of bandit scan.
    
    Configuration files:
    1. tox.ini for adding bandit environment and command.
    2. test-requirements.txt for adding bandit version.
    3. .zuul.yaml file for adding bandit job and configuring under
       check job to run code scan every time before code commit.
    
    Test:
    Run tox -e bandit command inside the fault folder to validate the
    bandit scan and result.
    
    Please note:
    Changes will be implemented in batches and this is Batch3 change.
    
    Story: 2007541
    Task: 39621
    Depends-On: https://review.opendev.org/#/c/721294/
    
    Change-Id: I01f81d7c52c12432965106f9603e4db600381971
    Signed-off-by: Sharath Kumar K <sharath.kumar@intel.com>

commit 58d1e5b3bfe4b08d6a7274d1d96449554058ff22
Author: Andreas Jaeger <aj@suse.com>
Date:   Thu Jun 4 14:25:29 2020 +0200

Switch to newer openstackdocstheme and reno versions
    
    Switch to openstackdocstheme 2.2.1 and reno 3.1.0 versions. Using
    these versions will allow especially:
    * Linking from HTML to PDF document
    * Allow parallel building of documents
    * Fix some rendering problems
    
    Update Sphinx version as well.
    
    Disable openstackdocs_auto_name to use 'project' variable as name.
    
    Change pygments_style to 'native' since old theme version always used
    'native' and the theme now respects the setting and using 'sphinx' can
    lead to some strange rendering.
    
    openstackdocstheme renames some variables, so follow the renames
    before the next release removes them. A couple of variables are also
    not needed anymore, remove them.
    
    See also
    http://lists.openstack.org/pipermail/openstack-discuss/2020-May/014971.html
    
    Change-Id: Iab15b05918b73ce9ba2ff0b479fdb8a0631fad42

tags:	added: in-f-centos8

OpenStack Infra (hudson-openstack) wrote on 2021-07-06: Change abandoned on ha (f/centos8)

Change abandoned by "Bart Wensley <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ha/+/765061
Reason: This patch has been idle for more than six months. I am abandoning it to keep the review queue sane. If you are still interested in working on this patch, please unabandon it and upload a new patchset.
