Bug #1922584 “Force active controller reboot results in a second...” : Bugs : StarlingX

Ghada Khalil (gkhalil) wrote on 2021-04-17:

#1

screening: marking as low priority given the issue is intermittent and the system recovers after the 2nd reboot. Would be nice to fix in stx master for the next release, but will not gate stx.5.0

Changed in starlingx:
importance:	Undecided → Low
assignee:	nobody → Eric MacDonald (rocksolidmtce)
status:	New → Triaged
tags:	added: stx.metal

OpenStack Infra (hudson-openstack) on 2021-04-27

Changed in starlingx:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2021-04-28: Fix proposed to metal (master)

#2

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/788495

OpenStack Infra (hudson-openstack) wrote on 2021-05-03: Fix merged to metal (master)

#3

Reviewed:  https://review.opendev.org/c/starlingx/metal/+/788495
Committed: https://opendev.org/starlingx/metal/commit/48978d804d6f22130d0bd8bd17f361441024bc6c
Submitter: "Zuul (22348)"
Branch:    master

commit 48978d804d6f22130d0bd8bd17f361441024bc6c
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Apr 28 09:39:19 2021 -0400

Improved maintenance handling of spontaneous active controller reboot
    
    Performing a forced reboot of the active controller sometimes
    results in a second reboot of that controller. The cause of the
    second reboot was due to its reported uptime in the first mtcAlive
    message, following the reboot, as greater than 10 minutes.
    
    Maintenance has a long standing graceful recovery threshold of
    10 minutes. Meaning that if a host loses heartbeat and enters
    Graceful Recovery, if the uptime value extracted from the first
    mtcAlive message following the recovery of that host exceeds 10
    minutes, then maintenance interprets that the host did not reboot.
    If a host goes absent for longer than this threshold then for
    reasons not limited to security, maintenance declares the host
    as 'failed' and force re-enables it through a reboot.
    
    With the introduction of containers and addition of new features
    over the last few releases, boot times on some servers are
    approaching the 10 minute threshold and in this case exceeded
    the threshold.
    
    The primary fix in this update is to increase this long standing
    threshold to 15 minutes to account for evolution of the product.
    
    During the debug of this issue a few other related undesirable
    behaviors related to Graceful Recovery were observed with the
    following additional changes implemented.
    
     - Remove hbsAgent process restart in ha service management
       failover failure recovery handling. This change is in the
       ha git with a loose dependency placed on this update.
       Reason: https://review.opendev.org/c/starlingx/ha/+/788299
    
     - Prevent the hbsAgent from sending heartbeat clear events
       to maintenance in response to a heartbeat stop command.
       Reason: Maintenance receiving these clear events while in
               Graceful Recovery causes it to pop out of graceful
               recovery only to re-enter as a retry and therefore
               needlessly consumes one (of a max of 5) retry count.
    
     - Prevent successful Graceful Recovery until all heartbeat
       monitored networks recover.
       Reason: If heartbeat of one network, say cluster, recovers but
               another (management) does not, then it's possible the
               max Graceful Recovery Retries could be reached quite
               quickly, while one network recovered but the other
               may not have, causing maintenance to fail the host and
               force a full enable with reboot.
    
     - Extend the wait for the hbsClient ready event in the graceful
       recovery handler timeout from 1 minute to worker config timeout.
       Reason: To give the worker config time to complete before force
               starting the recovery handler's heartbeat soak.
    
     - Add Graceful Recovery Wait state recovery over process restart.
       Reason: Avoid double reboot of Gracefully Recovering host over
               SM service bounce.
    
     - Add requirement for a valid out-of-band mtce flags value before
       declaring configuration error in the subfunction enable handler.
       Reason: rebooting the active controller can sometimes result in
               a falsely reported configuration error due to the
               subfunction enable handler interpreting a zero value as
               a configuration error.
    
     - Add uptime to all Graceful Recovery 'Connectivity Recovered' logs.
       Reason: To assist log analysis and issue debug
    
    Test Plan:
    
    PASS: Verify handling active controller reboot
                 cases: AIO DC, AIO DX, Standard, and Storage
    PASS: Verify Graceful Recovery Wait behavior
                 cases: with and without timeout, with and without bmc
                 cases: uptime > 15 mins and 10 < uptime < 15 mins
    PASS: Verify Graceful Recovery continuation over mtcAgent restart
                 cases: peer controller, compute, MNFA 4 computes
    PASS: Verify AIO DX and DC active controller reboot to standby
                 takeover that is up for less than 15 minutes.
    
    Regression:
    
    PASS: Verify MNFA feature ; 4 computes in 8 node Storage system
    PASS: Verify cluster network only heartbeat loss handling
                 cases: worker and standby controller in all systems.
    PASS: Verify Dead Office Recovery (DOR)
                 cases: AIO DC, AIO DX, Standard, Storage
    PASS: Verify system installations
                 cases: AIO SX/DC/DX and 8 node Storage system
    PASS: Verify heartbeat and graceful recovery of both 'standby
                 controller' and worker nodes in AIO Plus.
    
    PASS: Verify logging and no coredumps over all of testing
    PASS: Verify no missing or stuck alarms over all of testing
    
    Change-Id: I3d16d8627b7e838faf931a3c2039a6babf2a79ef
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
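
The uptime check described in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the mtcAgent code: `RecoveryAction`, `classify_recovery`, `kOldThreshold` and `kNewThreshold` are hypothetical names, and only the 10 and 15 minute values come from the commit message.

```cpp
#include <chrono>

// Hypothetical names throughout; only the 10 and 15 minute values
// are taken from the commit message.
using std::chrono::minutes;
using std::chrono::seconds;

constexpr seconds kOldThreshold = minutes(10);
constexpr seconds kNewThreshold = minutes(15);

enum class RecoveryAction {
    ContinueGracefulRecovery, // uptime within threshold: host is taken to have rebooted
    FailAndForceReenable      // uptime beyond threshold: host is taken not to have rebooted
};

// Classify a host from the uptime reported in its first mtcAlive
// message after heartbeat loss.
constexpr RecoveryAction classify_recovery(seconds uptime, seconds threshold)
{
    return (uptime > threshold) ? RecoveryAction::FailAndForceReenable
                                : RecoveryAction::ContinueGracefulRecovery;
}
```

Under this sketch, a host that reports 11 minutes of uptime is failed against `kOldThreshold` but continues Graceful Recovery against `kNewThreshold`, which corresponds to the "10 < uptime < 15 mins" case in the test plan.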

Changed in starlingx:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2021-05-04: Fix merged to ha (master)

#4

Reviewed:  https://review.opendev.org/c/starlingx/ha/+/788299
Committed: https://opendev.org/starlingx/ha/commit/cb5fa9510f3ebda66f9850ac697e542bf041ce8c
Submitter: "Zuul (22348)"
Branch:    master

commit cb5fa9510f3ebda66f9850ac697e542bf041ce8c
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Tue Apr 27 09:43:00 2021 -0400

Remove hbsAgent restart in failover failure recovery handling
    
    A forced reboot of the active controller in an AIO DC system
    puts SM into a failover failure recovery loop that prevents
    maintenance from detecting the heartbeat failure of the just-
    rebooted controller.
    
    The SM's failover failure recovery handling algorithm includes
    a self (sm process) restart preceded by a restart of the
    hbsAgent, both added by the following update last year.
    
    update: Add unhealthy state recovery audit to service management (sm)
    review: https://review.opendev.org/c/starlingx/ha/+/735219
    
    The self restart of SM was and is required in this case. However,
    the restart of the hbsAgent was only included as a safety measure,
    at the time, to ensure SM received updated cluster state info. The
    hbsAgent restart was only added at that time with the longer term
    intention to have it removed once the hbsAgent cluster state change
    notification improvement was implemented. That change is now
    implemented and merged by the following update.
    
    update: Mtce heartbeat cluster state change notification improvement
    review: https://review.opendev.org/c/starlingx/metal/+/769936
    
    Testing of the fix for the following issue in an AIO DC system
    resulted in the takeover controller not detecting a heartbeat loss
    of the just rebooted standby controller.
    
    title: Force active controller reboot results in a second reboot
    issue: https://bugs.launchpad.net/starlingx/+bug/1922584
    
    The hbsAgent is not able to detect the heartbeat loss of the just-
    booted controller because SM keeps restarting it before it reaches
    the heartbeat loss state.
    
    With the cluster notification improvement update now implemented
    and merged it's time to remove the hbsAgent restart from SM's
    failover failure recovery algorithm.
    
    Test Plan:
    
    PASS: Active controller force reboot handling in AIO DC, DX and
          standard systems.
    PASS: Standby controller force reboot handling in AIO DC, DX and
          standard systems
    
    Partial-Bug: 1922584
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
    Change-Id: I26aa5ed9e0faec7294816269dbaa49cbb4696f66

Eric MacDonald (rocksolidmtce) wrote on 2021-05-05:

#5

Re-opened. The above update prevents maintenance from enabling heartbeat of self by the peer controller.
A fix will be available shortly.

Changed in starlingx:
status:	Fix Released → In Progress

OpenStack Infra (hudson-openstack) wrote on 2021-05-05: Fix proposed to metal (master)

#6

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/789958

OpenStack Infra (hudson-openstack) wrote on 2021-05-06: Fix merged to metal (master)

#7

Reviewed: https://review.opendev.org/c/starlingx/metal/+/789958
Committed: https://opendev.org/starlingx/metal/commit/ce7529964932a9fd1cc10ce18dbe11e89ee02223
Submitter: "Zuul (22348)"
Branch: master

commit ce7529964932a9fd1cc10ce18dbe11e89ee02223
Author: Eric MacDonald <email address hidden>
Date: Wed May 5 19:05:55 2021 -0400

Fix enabling heartbeat of self from the peer controller

    This issue only occurs over an hbsAgent process restart
    where the ready event response does not include the
    heartbeat start of the peer controller.

    This update reverts a small code change that was
    introduced by the following update.

    https://review.opendev.org/c/starlingx/metal/+/788495

    Remove the my_hostname gate introduced at line 1267 of
    mtcCtrlMsg.cpp because it prevents enabling heartbeat
    of self by the peer controller.

    Change-Id: Id72c35f25e2a5231a8a8363a35a81e042f00085e
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2021-05-19: Fix proposed to metal (f/centos8)

#8

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/metal/+/792250

OpenStack Infra (hudson-openstack) wrote on 2021-05-19: Fix proposed to ha (f/centos8)

#9

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ha/+/792251

OpenStack Infra (hudson-openstack) wrote on 2021-05-27: Fix merged to metal (f/centos8)

#10

Reviewed:  https://review.opendev.org/c/starlingx/metal/+/792250
Committed: https://opendev.org/starlingx/metal/commit/6c2905e665ceeebfa7717c9cbccc1c277d10966b
Submitter: "Zuul (22348)"
Branch:    f/centos8

commit 5942a56ec6f0b265ca6d1c8c800fe84c4a22860f
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Thu May 13 15:57:43 2021 +0000

Revert "Align partitions created by kickstarters"
    
    This reverts commit 0e89acc83c616741952a068a3ff07ba91440eff8.
    
    Reason for revert: Review should have been abandoned rather than merged.
    
    Change-Id: I95f1e151183f122d93b834ab2a785736e5a8ef12
    Closes-Bug: 1928341

commit c7c341b198e79bb98f443c7c07f671c6387075af
Author: Don Penney <don.penney@windriver.com>
Date:   Fri May 7 08:56:06 2021 -0400

Add /pxeboot/grubx64.efi symlink for UEFI pxeboot
    
    UEFI pxeboot with shim.efi looks for the grubx64.efi in the tftpboot
    root directory. This update creates a symlink to the
    /pxeboot/EFI/grubx64.efi file in /pxeboot.
    
    Change-Id: Iabf8ec89d0af6e6b1a62e20159ecdfa16729444e
    Partial-Bug: 1927730
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit ce7529964932a9fd1cc10ce18dbe11e89ee02223
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed May 5 19:05:55 2021 -0400

Fix enabling heartbeat of self from the peer controller
    
    This issue only occurs over an hbsAgent process restart
    where the ready event response does not include the
    heartbeat start of the peer controller.
    
    This update reverts a small code change that was
    introduced by the following update.
    
    https://review.opendev.org/c/starlingx/metal/+/788495
    
    Remove the my_hostname gate introduced at line 1267 of
    mtcCtrlMsg.cpp because it prevents enabling heartbeat
    of self by the peer controller.
    
    Change-Id: Id72c35f25e2a5231a8a8363a35a81e042f00085e
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 48978d804d6f22130d0bd8bd17f361441024bc6c
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Apr 28 09:39:19 2021 -0400

Improved maintenance handling of spontaneous active controller reboot
    
    Performing a forced reboot of the active controller sometimes
    results in a second reboot of that controller. The cause of the
    second reboot was due to its reported uptime in the first mtcAlive
    message, following the reboot, as greater than 10 minutes.
    
    Maintenance has a long standing graceful recovery threshold of
    10 minutes. Meaning that if a host loses heartbeat and enters
    Graceful Recovery, if the uptime value extracted from the first
    mtcAlive message following the recovery of that host exceeds 10
    minutes, then maintenance interprets that the host did not reboot.
    If a host goes absent for longer than this threshold then for
    reasons not limited to security, maintenance declares the host
    as 'failed' and force re-enables it through a reboot.
    
    With the introduction of containers and addition of new features
    over the last few releases, boot times on some servers are
    approaching the 10 minute threshold and in this case exceeded
    the threshold.
    
    The primary fix in this update is to increase this long standing
    threshold to 15 minutes to account for evolution of the product.
    
    During the debug of this issue a few other related undesirable
    behaviors related to Graceful Recovery were observed with the
    following additional changes implemented.
    
     - Remove hbsAgent process restart in ha service management
       failover failure recovery handling. This change is in the
       ha git with a loose dependency placed on this update.
       Reason: https://review.opendev.org/c/starlingx/ha/+/788299
    
     - Prevent the hbsAgent from sending heartbeat clear events
       to maintenance in response to a heartbeat stop command.
       Reason: Maintenance receiving these clear events while in
               Graceful Recovery causes it to pop out of graceful
               recovery only to re-enter as a retry and therefore
               needlessly consumes one (of a max of 5) retry count.
    
     - Prevent successful Graceful Recovery until all heartbeat
       monitored networks recover.
       Reason: If heartbeat of one network, say cluster, recovers but
               another (management) does not, then it's possible the
               max Graceful Recovery Retries could be reached quite
               quickly, while one network recovered but the other
               may not have, causing maintenance to fail the host and
               force a full enable with reboot.
    
     - Extend the wait for the hbsClient ready event in the graceful
       recovery handler timeout from 1 minute to worker config timeout.
       Reason: To give the worker config time to complete before force
               starting the recovery handler's heartbeat soak.
    
     - Add Graceful Recovery Wait state recovery over process restart.
       Reason: Avoid double reboot of Gracefully Recovering host over
               SM service bounce.
    
     - Add requirement for a valid out-of-band mtce flags value before
       declaring configuration error in the subfunction enable handler.
       Reason: rebooting the active controller can sometimes result in
               a falsely reported configuration error due to the
               subfunction enable handler interpreting a zero value as
               a configuration error.
    
     - Add uptime to all Graceful Recovery 'Connectivity Recovered' logs.
       Reason: To assist log analysis and issue debug
    
    Test Plan:
    
    PASS: Verify handling active controller reboot
                 cases: AIO DC, AIO DX, Standard, and Storage
    PASS: Verify Graceful Recovery Wait behavior
                 cases: with and without timeout, with and without bmc
                 cases: uptime > 15 mins and 10 < uptime < 15 mins
    PASS: Verify Graceful Recovery continuation over mtcAgent restart
                 cases: peer controller, compute, MNFA 4 computes
    PASS: Verify AIO DX and DC active controller reboot to standby
                 takeover that is up for less than 15 minutes.
    
    Regression:
    
    PASS: Verify MNFA feature ; 4 computes in 8 node Storage system
    PASS: Verify cluster network only heartbeat loss handling
                 cases: worker and standby controller in all systems.
    PASS: Verify Dead Office Recovery (DOR)
                 cases: AIO DC, AIO DX, Standard, Storage
    PASS: Verify system installations
                 cases: AIO SX/DC/DX and 8 node Storage system
    PASS: Verify heartbeat and graceful recovery of both 'standby
                 controller' and worker nodes in AIO Plus.
    
    PASS: Verify logging and no coredumps over all of testing
    PASS: Verify no missing or stuck alarms over all of testing
    
    Change-Id: I3d16d8627b7e838faf931a3c2039a6babf2a79ef
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
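
The "all heartbeat monitored networks recover" and retry-count items in the commit message above could be modelled along these lines. This is a minimal sketch, not the maintenance implementation: `NetworkHealth`, `all_networks_recovered`, `retries_exhausted` and `kMaxRecoveryRetries` are hypothetical names, and treating an empty network set as "not recovered" is an assumption made for the example. Only the retry limit of 5 comes from the commit message.

```cpp
#include <map>
#include <string>

// Hypothetical model: network name -> heartbeat currently healthy.
using NetworkHealth = std::map<std::string, bool>;

constexpr int kMaxRecoveryRetries = 5; // "max of 5" per the commit message

// Graceful Recovery may only succeed once every monitored network
// has recovered; a partial recovery keeps the host waiting.
bool all_networks_recovered(const NetworkHealth& networks)
{
    if (networks.empty()) {
        return false; // assumption: nothing monitored proves nothing
    }
    for (const auto& entry : networks) {
        if (!entry.second) {
            return false;
        }
    }
    return true;
}

// True once the retry budget has been used up.
bool retries_exhausted(int retries_used)
{
    return retries_used >= kMaxRecoveryRetries;
}
```

With the cluster network healthy and the management network still down, `all_networks_recovered` returns false, so the host stays in recovery instead of completing and then re-entering as a retry.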

commit 7539d36c3f01a338acfa449204c6034dc43f45df
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Apr 21 10:12:30 2021 -0400

Prevent mtcClient from sending to uninitialized socket in AIO SX
    
    The mtcClient will perform a socket reinit if it detects a socket
    failure. The mtcClient also avoids setting up its controller-1
    cluster network socket for the AIO SX system type ; because there
    is no controller-1 provisioned.
    
    Most AIO SX systems have the management/cluster networks set to
    the 'loopback' interface. However, when an AIO SX system is setup
    with its management and cluster networks on physical interfaces,
    with or without vlan, the mtcAlive send message utility will try
    to send to the uninitialized controller-1 cluster socket. This
    leads to a socket error that triggers a socket reinitialization
    loop which causes log flooding.
    
    This update adds a check to the mtcAlive send utility to avoid
    sending mtcAlive to controller-1 for AIO SX system type where
    there is no controller-1 provisioned; no send,no error,no flood.
    
    Since this update needed to add a system type check, this update
    also implemented a system type definition rename from CPE to AIO.
    Other related definitions and comments were also changed to make
    the code base more understandable and maintainable
    
    Test Plan:
    
    PASS: Verify AIO SX with mgmnt/clstr on physical (failure mode)
    PASS: Verify AIO SX Install with mgmnt/clstr on 'lo'
    PASS: Verify AIO SX Lock msg and ack over mgmnt and clstr
    PASS: Verify AIO SX locked-disabled-online state
    PASS: Verify mtcClient clstr socket error detect/auto-recovery (fit)
    PASS: Verify mtcClient mgmnt socket error detect/auto-recovery (fit)
    
    Regression:
    
    PASS: Verify AIO SX Lock and Unlock (lazy reboot)
    PASS: Verify AIO DX and DC install with pv regression and sanity
    PASS: Verify Standard system install with pv regression and sanity
    
    Change-Id: I658d33a677febda6c0e3fcb1d7c18e5b76cb3762
    Closes-Bug: 1897334
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
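
The send-side check described in the commit message above might look something like the following. It is a sketch under assumptions: `SystemType` and `should_send_mtc_alive` are hypothetical names and do not reflect the actual mtcClient definitions.

```cpp
#include <string>

// Hypothetical types; the real mtcClient definitions may differ.
enum class SystemType { Standard, AioDuplex, AioSimplex };

// Decide whether mtcAlive should be sent to the named controller.
// On AIO SX no controller-1 is provisioned, so its cluster socket is
// never set up and sending to it would only raise a socket error.
bool should_send_mtc_alive(SystemType system, const std::string& controller)
{
    if (system == SystemType::AioSimplex && controller == "controller-1") {
        return false; // no send, no error, no log flood
    }
    return true;
}
```

Skipping the send up front, rather than handling the socket error afterwards, is what avoids the reinitialization loop the commit message describes.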

commit 3c1e9d960198c044e382eb7d47b3bb70cbf6ba70
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Tue Apr 6 10:29:09 2021 -0400

Modify mtce daemon log rotation config files
    
    This update makes the following setting changes to the
    maintenance log rotation configuration files
    
     - add 'create' with permissions to each tuple
     - add 'delaycompress'
     - group together log files with similar settings
     - move global settings to local settings
     - remove 'copytruncate' global setting
     - remove the 'nodateext' global and local setting
    
    Test Plan:
    
    PASS: Verify log rotation for all mtc log files
    PASS: Verify no log loss over rotation
    PASS: Verify log rotation file naming convention
    PASS: Verify delaycompress on all mtce log files
    PASS: Verify log permissions after rotate are 0640
    
    Regression:
    
    PASS: Verify AIO system install
    PASS: Verify Standard system install
    PASS: Verify full and dated collect
    
    Change-Id: I623030fa2c1ce4e8085e654ae3fb782c7e520924
    Partial-Bug: 1918979
    Depends-On: https://review.opendev.org/c/starlingx/config-files/+/784943
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 99a871c7d9dd04b3bd2ce149dd43bf058d805f03
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Mon Jun 15 13:45:23 2020 -0400

Restrict isolcpu_plugin to nodes with worker function
    
    The isolcpu_plugin process is intended to run on worker nodes only.
    This update excludes its rpm parcel from standard controller and
    storage nodes.
    
    Depends-On: https://review.opendev.org/c/starlingx/integ/+/783730
    Story: 2008760
    Task: 42189
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
    Change-Id: Iec61638b49692622e128d8388bc3aa78c922ac3a

commit 031818e55bc255b59e486ebf6faadf4b784c93fe
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Fri Mar 26 13:05:51 2021 -0400

Add in-service test to clear stale config failure alarm
    
    A configuration failure alarm can get stuck asserted if
    that node experiences an uncontrolled reboot that recovers
    without a configuration failure.
    
    This update adds an in-service test that audits host health
    while there is a configuration failure alarm raised and
    clear that alarm if the failure condition goes away. This
    could be a result of an in-service manifest that runs and
    corrects the configuration or if the node reboots and comes
    back up in a healthy (properly configured) state.
    
    Fixed bug that was clearing config alarm severity state
    when a heartbeat clear event is received.
    
    This update also goes a step further and introduces an
    alarms state audit that detects and corrects maintenance
    alarm state mismatches.
    
    Test Plan:
    
    PASS: Verify the add handler loads config alarm state
    PASS: Verify in-service test clears stale config alarm
    PASS: Verify in-service test acts on new config failure
          ... degrade - active controller
          ... fail    - other hosts
    PASS: Verify audit fixes mtce alarm state mismatches
    PASS: Verify audit handles fm not running case
    PASS: Verify audit handling behavior with valid alarm cases
    PASS: Verify locked alarm management over process restart
    PASS: Verify audit only logs active alarms list changes
    PASS: Verify audit runs for both locked/unlocked nodes
    PASS: Verify update as a patch
    
    Regression:
    
    PASS: Verify enable sequence config failure handling
    PASS: ... active controller     - recoverable degrade
    PASS: ... other nodes           - threshold fail
    PASS: ... auto recovery disable - config failure
    PASS: Verify mtcAgent process logging
    PASS: Verify heartbeat handling and alarming
    PASS: Verify Standard system install
    PASS: Verify AIO system install
    
    Change-Id: If9957229810435e9faeb08374f2b5fbcb5b0f826
    Closes-Bug: 1918195
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 5c83453fdf8775e5d776a02a2b5c06810d84cb55
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Tue Mar 16 17:03:49 2021 -0400

Fix Graceful Recovery handling while in Graceful Recovery handling
    
    The current Graceful Recovery handler is not properly handling
    back-to-back Multi Node Failure Avoidance (MNFA) events.
    
    There are two phases to MNFA
    
     phase 1: waiting for number of failed nodes to fall below
              mnfa_threshold as each affected node's heartbeat
              is recovered.
     phase 2: then a Graceful Recovery Wait period which is an
              11 second heartbeat soak to verify that a stable
              heartbeat is regained before declaring the MNFA
              event complete.
    
    The Graceful Recovery Wait status has been seen to be left
    uncleared (stuck) on one or more of the affected nodes if
    phase 2 of MNFA is interrupted by another MNFA event, also
    known as MNFA Nesting.
    
    Although this stuck status is not service affecting it does leave
    one or more nodes' host.task field, as observed under host-show,
    with "Graceful Recovery Wait" rather than empty.
    
    This update makes Multi Node Failure Avoidance (MNFA) handling
    changes to ensure that, upon MNFA exit, the recovery handler
    is properly restarted if MNFA Nesting occurs.
    
    Two additional Graceful Recovery phase issues were identified
    and fixed by this update.
    
     1. Cut Graceful recovery handling in half
    
        - Found and removed a redundant 11 second heartbeat soak
          at the very end of the recovery handler.
        - This cuts the graceful recovery handling time down from
          22 to 11 seconds thereby cutting potential for nesting
          in half.
    
     2. Increased supported Graceful Recovery nesting from 3 to 5
    
        - Found that some links bounce more than others so a nesting
          count of 3 can lead to an occasional single node failure.
        - This adds a bit more resiliency to MNFA handling of cases
          that exhibit more link messaging bounce.
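
    The nesting limit described above can be illustrated with a small
    Python sketch. This is not the mtcAgent code: the class, state
    names and the way a nest is counted (a heartbeat loss that
    interrupts the soak) are assumptions made for illustration. Only
    the 11 second soak and the limit of 5 come from this update.

```python
SOAK_SECS = 11   # single heartbeat soak, per this update
MAX_NESTS = 5    # raised from 3 to 5, per this update


class RecoveryHandler:
    """Hypothetical model of one host's graceful recovery state."""

    def __init__(self):
        self.state = "idle"
        self.nests = 0
        self.soak_started = 0

    def heartbeat_recovered(self, now):
        # (Re)start the soak unless the host was already escalated.
        if self.state != "full-enable":
            self.state = "graceful-recovery-wait"
            self.soak_started = now

    def heartbeat_lost(self):
        # A loss during the soak is counted as one more nest.
        if self.state == "graceful-recovery-wait":
            self.nests += 1
            if self.nests > MAX_NESTS:
                self.state = "full-enable"
            else:
                self.state = "recovering"

    def tick(self, now):
        # Soak completed without interruption: clear the wait status.
        if (self.state == "graceful-recovery-wait"
                and now - self.soak_started >= SOAK_SECS):
            self.state = "idle"
            self.nests = 0
```

    In this model, five interrupted soaks still end in a clean return
    to idle, while a sixth escalates to a full enable, which mirrors
    the two nesting cases in the test plan.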
    
    Test Plan: Verified 60+ MNFA occurrences across 4 different
               system types including AIO plus, Standard and Storage
    
    PASS: Verify Single Node Graceful Recovery Handling
    PASS: Verify Multi Node Graceful Recovery Handling
    PASS: Verify Single Node Graceful Recovery Nesting Handling
    PASS: Verify Multi Node Graceful Recovery Nesting Handling
    PASS: Verify MNFA of up to 5 nests can be gracefully recovered
    PASS: Verify MNFA of 6 nests leads to full enable of affected nodes
    PASS: Verify update as a patch
    PASS: Verify mtcAgent logging
    
    Regression:
    
    PASS: Verify standard system install
    PASS: Verify product verification maintenance regression (4 runs)
    PASS: Verify MNFA threshold increase and below threshold behavior
    PASS: Verify MNFA with reduced timeout behavior for
          ... nested case that does not timeout
          ... case that does not timeout
          ... case that does timeout
    
    Closes-Bug: 1892877
    Change-Id: I6b7d4478b5cae9521583af78e1370dadacd9536e
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 497a6f93f422bdaab0a5779d5345ba814d1ab3bc
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Tue Mar 16 13:45:18 2021 +0200

Fix reinstall of controller nodes
    
    At shutdown, systemd will try to remount everything read-only
    before attempting to unmount it. In the wipedisk script we
    are deleting the partitions without unmounting
    their corresponding filesystems. This leads to errors because
    systemd will try to remount filesystems
    whose partitions were deleted.
    
    To fix this we have to unmount the filesystems that are linked to the
    removed partitions.
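
    As an illustration only (wipedisk is a shell script and this is
    not its code), finding the mounted filesystems that sit on the
    partitions about to be deleted can be sketched in Python. The
    function name and the sample mount table are hypothetical.

```python
def mounts_on_partitions(mount_table: str, partitions: set) -> list:
    """Return mount points backed by the given partitions.

    mount_table is text in /proc/mounts format. Longer paths are
    returned first so nested mounts are handled before their parents.
    """
    targets = []
    for line in mount_table.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in partitions:
            targets.append(fields[1])
    return sorted(targets, key=len, reverse=True)


# Made-up mount table used only for this example.
SAMPLE = """\
/dev/sda1 /boot ext4 rw 0 0
/dev/sda2 / ext4 rw 0 0
/dev/sda3 /opt/platform ext4 rw 0 0
tmpfs /tmp tmpfs rw 0 0
"""

# Each returned path would be unmounted before its partition is removed.
to_unmount = mounts_on_partitions(SAMPLE, {"/dev/sda1", "/dev/sda3"})
```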
    
    Closes-Bug: 1919153
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
    Change-Id: I49a3c06ae6bce1324dd06f4fc63fb3e5cd4d28c1

commit 4f5bf78f55ec8b0983262ee351183b1edd8443ad
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Fri Mar 12 17:10:00 2021 -0500

Improve mtcAgent interrupted thread cleanup
    
    A BMC command send will be rejected if its thread
    is not in the IDLE state going into the call.
    
    This issue is seen to occur over a reprovisioning action
    while the bmc access alarmable condition exists.
    
    Maintenance will do retries. So the only visible side effect
    of this issue is a failure to provision to 'redfish' over a
    provisioning switch to 'dynamic' (learn mode). Instead
    ipmi is selected.
    
    The non-return to idle can occur when the bmc handler FSM
    is interrupted by a reprovisioning request while a bmc
    command is in flight.
    
    This update enhances the thread management module by
    introducing a thread consumption utility that is called
    by the bmc command send utility. If the send finds that
    its thread is not in the IDLE state it will either kill
    the thread if it is running or free a completed but-not-
    consumed thread result.
    
    Note: Maintenance only supports the execution of
    a single thread per host per process at one time.
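
    A minimal sketch of the consume-before-send idea, written in
    Python for brevity. The type and function names are hypothetical
    and the "kill" is only a label here; the actual thread management
    module is not shown in this commit message.

```python
from enum import Enum


class Stage(Enum):
    IDLE = 0
    RUNNING = 1
    DONE = 2


class HostThread:
    """Hypothetical per-host thread record (one thread per host)."""

    def __init__(self):
        self.stage = Stage.IDLE
        self.result = None
        self.command = None


def thread_consume(thread: HostThread) -> str:
    """Force the thread back to IDLE and report what was cleaned up."""
    if thread.stage is Stage.RUNNING:
        action = "killed"   # stand-in for killing the running thread
    elif thread.stage is Stage.DONE:
        action = "freed"    # drop a completed but unconsumed result
    else:
        action = "none"
    thread.result = None
    thread.stage = Stage.IDLE
    return action


def bmc_command_send(thread: HostThread, command: str) -> str:
    """Clean up any leftover thread state, then launch the command."""
    action = thread_consume(thread)
    thread.command = command
    thread.stage = Stage.RUNNING
    return action
```

    With this shape, a send that follows an interrupted command is no
    longer rejected; it first returns the thread to IDLE.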
    
    Test Plan:
    
    PASS: Verify BMC provisioning change from ipmi to dynamic
          while the ipmi provisioning was failing prior to
          re-provisioning. Verify the previous error is cleaned
          up and the reprovisioning request succeeds as expected.
    
    PASS: Verify thread 'execution timeout kill' cleanup handling.
    PASS: Verify thread 'complete but not consumed' cleanup handling.
    PASS: Verify logging during regression soaks
    
    Regression:
    
    PASS: Verify bmc protocol reprovisioning script soak
    PASS: Verify sensor monitoring following BMC reprovisioning
    PASS: Verify product verification mtce regression test suite
    
    Change-Id: Ie5e9e89ed2f8db6888c0fc7de03d494c75517178
    Closes-Bug: 1864906
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 4f7d82308f5f7c663223344873f8b392a1311d82
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Thu Mar 11 11:13:59 2021 -0500

Add NonRecoverable property to Hardware Monitor's Redfish
    
    This update adds 'NonRecoverable' sensor health property
    to the Hardware Monitor's Redfish platform management
    protocol support.
    
    Test Plan:
    
    PASS: Verify handling of Redfish NonRecoverable sensor
          ... using redfish
          ... switching between ipmi and redfish and back
    PASS: Verify sensor model relearn over change of bmc protocol
    
    Regression:
    
    PASS: Verify sensor model relearn by command
    PASS: Verify sensor suppression
    PASS: Verify sensor alarm and degrade management
          ... as sensor events come and go
          ... on sensor suppression and unsuppression
    PASS: Verify sensor monitoring regression test
    PASS: Verify update as a patch (apply/remove)
    
    Change-Id: I2770e63f4d44e269b4410f392707f3cd01e9a2cc
    Closes-Bug: 1918152
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 6cf5e848256c7612e2d5dc3c0a86ac7b76684b6e
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Feb 24 12:36:31 2021 -0500

Add alarmed process audit to Process Monitor
    
    A failure to query process monitor alarms from
    FM during process startup can lead to a stuck
    failed process alarm.
    
    Rather than hold up the process monitor startup
    sequence due to an unresponsive fault manager,
    this update introduces an in-service alarm audit
    that looks for asserted alarms and compares that
    readout to the process monitor's runtime view.
    
    A difference in view is considered a state mismatch
    that requires corrective action. The runtime state
    of the process monitor always takes precedence over
    what is found in the FM database.
    
    A mismatch is declared and corrective action is
    taken if:
    
     - FM has a process failure alarm that pmond does not
       Corrective Action: Clear alarm in FM database
    
     - FM has a process failure alarm with a severity
       that differs from the pmond runtime state.
       Corrective Action: Update severity in FM database
    
     - FM has a process failure alarm for a process
       that pmond does not recognize.
       Corrective Action: Clear alarm in FM database
    
    This update only runs the audit on process startup
    until first successful query.
    A future update may enable the audit in-service.
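
    The three mismatch rules listed above can be sketched as a small
    Python function. This is an illustration under assumptions, not
    the pmond code: the function name, the dictionary inputs and the
    action tuples are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def audit_alarms(fm_alarms: Dict[str, str],
                 pmond_failed: Dict[str, str],
                 known_processes: Set[str]) -> List[Tuple[str, str, str]]:
    """Return corrective actions as (action, process, severity).

    fm_alarms:       process -> severity of the alarm held by FM
    pmond_failed:    process -> severity of pmond's current failure
    known_processes: processes that pmond monitors
    The pmond runtime view always wins over the FM database.
    """
    actions = []
    for proc, fm_severity in sorted(fm_alarms.items()):
        if proc not in known_processes:
            # FM alarm for a process pmond does not recognize
            actions.append(("clear", proc, fm_severity))
        elif proc not in pmond_failed:
            # FM has an alarm that pmond does not
            actions.append(("clear", proc, fm_severity))
        elif pmond_failed[proc] != fm_severity:
            # Same alarm, different severity: pmond's severity wins
            actions.append(("update", proc, pmond_failed[proc]))
    return actions
```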
    
    Test Plan:
    
    PASS: Verify all mismatch case handling
    PASS: Verify handling of valid active alarm
    PASS: Verify handling severity mismatch ; unsupported
    PASS: Verify pmond failure handling regression soak
    PASS: Verify pmond process restart regression soak
    PASS: Verify alarm handling over pmond process restart
    PASS: Verify alarmed state audit period and logging
    PASS: Verify pmond process failure alarm remains ignored by pmond
    PASS: Verify handling of persistently failed process over pmond restart
    PASS: Verify audit handling while FM is not running
          - audit retries every 50 seconds until fm query is successful
    
    COND: Verify audit handling while FM is stopped/blocked/stalled
          - alarm query blocks till fm runs again or is killed
          - this is the reason the audit is not run in-service.
    
    Change-Id: I697faa804dc7979fbb8b6f6c63811a6dda8c3118
    Closes-Bug: 1892884
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit f34d51d3acf1ab45ae81e75ac620042f95d57b6f
Author: Babak Sarashki <babak.sarashki@windriver.com>
Date:   Fri Feb 26 17:50:35 2021 +0000

restrict kernel headers and devel package installation
    
    kernel change-id: Iafb3abe7 adds kernel headers and development
    packages to the default rootfs for pods needing to build drivers
    or other applications with kernel dependencies. This commit
    restricts installation of the above packages to worker and AIO.
    
    Story: 2008434
    Task: 41941
    
    Signed-off-by: Babak Sarashki <babak.sarashki@windriver.com>
    Change-Id: I5bb4e93a60a98dcd52be07c0baa6cb76517b30a8

commit 32fbc7e5aa8ad6e771598456961a760a875aa018
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Fri Feb 26 15:29:15 2021 +0200

Fix reinstall of worker nodes
    
    When the wipedisk code was updated, there were some
    changes that had to be used only on controllers
    but the code was doing the same thing on all the node types.
    
    In this review we add the proper branching of
    the code based on the node type.
    
    Closes-Bug: 1912623
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
    Change-Id: I91f68a7894da51a7d64602254a68cf7acbd4bcf2

commit 0a102143e9ee26485ef4b40b10bb8f32517ef5c2
Author: Angie Wang <angie.wang@windriver.com>
Date:   Wed Feb 24 17:15:54 2021 -0600

Fix mtce compiling issue with gcc8
    
    Remove superfluous 'const' to fix error:
      "type qualifiers ignored on cast result type
       [-Werror=ignored-qualifiers]"
    
    Update the usage of 'operator++' on type of 'bool'
    to fix error:
      "use of an operand of type 'bool' in 'operator++'
       is deprecated [-Werror=deprecated]"
    
    Change-Id: I0ce7b2d48f8365f1dcc23eb48e4c5148db817630
    Story: 2007506
    Task: 39279
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit 5619e3e8b626e1d592f8b99b455de97438910df5
Author: Angie Wang <angie.wang@windriver.com>
Date:   Tue Feb 23 18:19:26 2021 -0500

Increase cgts-vg size for dc-vault fs
    
    Increase the partition size for cgts-vg to include
    dc-vault fs(15G) on AIO.
    
    Tested installation of AIO-DX and AIO-DX DCSC
    
    Partial-bug: 1916797
    Change-Id: I00427820f710946275f99970ad9a7c1d8437955c
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit 95e5906a6b2b3e50cc04d661acf9821f657418f9
Author: Babak Sarashki <babak.sarashki@windriver.com>
Date:   Fri Feb 12 00:31:58 2021 +0000

Add ice kernel module filters
    
    This is in support of the new ice kernel module which is
    initially added to support Intel E810.
    
    Story: 2008436
    Task: 41821
    
    Signed-off-by: Babak Sarashki <babak.sarashki@windriver.com>
    Change-Id: Ic78988e3396cd2504c2d345bc4ca9fd99f2b53ac

commit c3c7ef80e2e165760f317a51c6c5ace600c49794
Author: Nicolas Alvarez <nicolas.alvarez@windriver.com>
Date:   Fri Jan 29 14:55:45 2021 -0300

Filter snmp rpm from non controller nodes
    
    Remove SNMP Host-Based entries
    Add SNMP Armada App entry
    
    Story: 2008132
    Task: 41715
    Depends-On: https://review.opendev.org/766088
    Depends-On: https://review.opendev.org/765381
    Depends-On: https://review.opendev.org/765875
    Signed-off-by: Nicolas Alvarez <nicolas.alvarez@windriver.com>
    Change-Id: I186a1eefb234d9e9e73df41c5e1df29c866c38bf

commit 2d5c5b04edf0d84f78a87e971cf1646e6efda00f
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Mon Jan 25 10:20:05 2021 -0500

Make mtcClient stop collectd before shutdown
    
    The collectd process has been seen to segfault
    in its internal network plugin during system
    shutdown.
    
    This update modifies the mtcClient to stop
    collectd when it is commanded to reboot the
    system.
    
    Change-Id: I681ff45a2afb1ae66d2a929a64027ea3ed75721e
    Partial-Bug: 1872979
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 9ab726b0eba645d5b8a60fbce306035bb6c13149
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Mon Sep 14 16:42:54 2020 -0400

Add support for peer controller reset via mtcClient
    
    This update adds the ability for SM to passively
    request the mtcClient to BMC reset its peer controller
    as a means to recover a severely loaded active controller.
    
    To do this, the mtcAgent is modified to keep the controllers'
    mtcClients updated with the BMC info of its peer.
    
    The mtcClient is modified to audit for the SM signal
    and, when it is asserted, issue a BMC reset of its peer
    controller using an ipmitool system call.
    
    The ability to command the peer mtcClient to 'sync'
    prior to the BMC reset is implemented but configured
    disabled for now.
    
    Change-Id: Ibe4c8aaa3a980cbe5f34c3e22f015698a6453c1a
    Partial-Bug: #1895350
    Co-Authored-By: Bin.Qian@windriver.com
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 5ab03b5222f223e93ee299ed91a70a2df95647c4
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Fri Jan 8 09:59:24 2021 -0500

Mtce heartbeat cluster state change notification improvement
    
    The current heartbeat cluster state change notification
    needs to be sent when heartbeat pulses begin to be missed
    rather than only after the host has reached the Heartbeat
    Loss threshold. This buys SM more time, almost a full
    second, and in doing so provides more accurate data for
    it to make its SM heartbeat failure handling decisions.
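
    A rough Python sketch of notifying on the first missed pulse
    rather than only at the loss threshold. The class, the callback
    and the event names are hypothetical; the hbsAgent's actual
    notification path is not described here.

```python
class HeartbeatMonitor:
    """Hypothetical per-host miss counter with early notification."""

    def __init__(self, loss_threshold, notify):
        self.loss_threshold = loss_threshold
        self.notify = notify    # callback taking an event name
        self.misses = 0

    def pulse_result(self, responded):
        if responded:
            self.misses = 0
            return
        self.misses += 1
        if self.misses == 1:
            # Early notice as soon as pulses begin to be missed
            self.notify("pulse-missing")
        if self.misses == self.loss_threshold:
            # Previous behavior: only this event was reported
            self.notify("heartbeat-loss")
```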
    
    This update also begins sending maintenance heartbeat
    cluster state change notifications just before the next
    multicast pulse request but after the cluster vault is
    updated from the last pulse period. This ensures that
    SM gets the most up-to-date cluster information.
    
    This update also changes the hbsAgent's service file
    to depend on the local hbsClient. By doing so, the
    hbsAgent shuts down earlier over a graceful reboot
    thereby preventing the hbsAgent from continuing to
    report healthy response to the inactive controller
    during active controller shutdown.
    
    This way the inactive SM sees the failed active
    controller when it queries the cluster in its
    fail-pending state resulting in an inactive SM
    take-over rather than stand-down.
    
    Additional hbsAgent service file changes were made to
    prevent systemd from auto recovering a failed hbsAgent
    process, as it is monitored and managed by pmond, and
    to fix the ExecStop command line.
    
    Test Plan:
    
    PASS: Verify active controller graceful reboot.
          Standby controller takes over rather than shutdown
          - 30 of 30 iterations
    PASS: Verify active controller forced reboot
    PASS: Verify enabled standby controller graceful reboot
    PASS: Verify Standard System install
    PASS: Verify AIO DX system install
    
    Regression:
    
    PASS: Verify SM Uncontrolled Swact if active
          controller Mgmnt link drops.
    PASS: Verify handling of downed cluster interface in
          - AIO DX (fail) and Standard (degrade) system
    PASS: Verify no coredumps
    PASS: Verify update as a patch
    
    Change-Id: I6869631e091eb28a3cbb6f15d9a8ccd939c54410
    Closes-Bug: 1906556
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit f00de2a3114cbd906e18daf908a276c80fe032cb
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Tue Dec 22 17:03:55 2020 -0500

Add controller-0 to Mtce Heartbeat Service in AIO SX
    
    All system types with the exception of AIO SX
    add controller-0 to the heartbeat service.
    
    There is no enabled heartbeating in AIO SX so
    controller-0 was never added. However, without
    being added the alarms the hbsAgent raises are
    not cleared over a process startup.
    
    The local hbsClient was designed to monitor
    pmon, effectively monitor the process monitor,
    and report to the hbsAgent its ongoing health
    state. This way if pmond stops functioning
    maintenance is able to alarm that condition.
    
    However, because in AIO SX controller-0 is never
    added to the heartbeat service the current method
    of looping over the internal heartbeat service
    inventory clearing all the hbsAgent owned alarms
    for each host over a process restart is bypassed.
    
    So, if pmond is failing, the hbsAgent has raised an alarm
    against it, and the hbsAgent is then restarted at the same
    time that the 'pmond' process recovers, the pmond alarm
    gets stuck asserted.
    
    This update adds controller-0 to the heartbeat
    service inventory list for all system types so
    the hbsAgent managed alarms are cleared over a
    process restart regardless of the system type.
    
    Additionally, the following logging improvements
    were made:
    
     - add the network name to the heartbeat start log.
     - avoid heartbeat stop log when already stopped.
    
    Test Plan:
    
    PASS: Verify pmond alarm clears over hbsAgent process
          restart in AIO SX, AIO DX, Standard and Storage
          Systems.
    
    Regression:
    
    PASS: Verify Storage System Install and heartbeat
    PASS: Verify Standard System install and heartbeat
    PASS: Verify AIO DX install and heartbeat
    PASS: Verify AIO SX install and heartbeat
    PASS: Verify heartbeat logs and failure handling
    PEND: Verify update as a patch
    
    Change-Id: I9afd92a0b54296ef1f87ce7d912510649ae7560c
    Closes-Bug: 1904918
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 821f2840cc77250d55b6e3281936ebb92ae73f0c
Author: Don Penney <don.penney@windriver.com>
Date:   Thu Dec 17 13:26:24 2020 -0500

Add auto-version for remaining stx/metal packages
    
    Update remaining StarlingX packages with hardcoded TIS_PATCH_VER to
    use PKG_GITREVCOUNT where possible, with offsets as needed to ensure
    the version is incremented above the hardcoded version.
    
    Change-Id: I9fa1ceea76fa13ead2fed325e96a0be3028aa01e
    Story: 2008455
    Task: 41448
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit 484d662cb748747aea4c5137c340cc7ac316d21c
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Dec 16 21:16:48 2020 -0500

Fix hbsAgent log flooding when SM heartbeat fails persistently
    
    If the SM part of this update is missing or the SM heartbeat
    is missing for a long period of time the hbsAgent produces
    5 logs every 10 seconds reporting the missing SM heartbeat.
    
    This is a follow-up update to its parent update
    https://review.opendev.org/c/starlingx/metal/+/751558
    
    This update throttles the warning log and corresponding
    cluster dump when SM heartbeat is persistently missing.
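
    Log throttling of this kind can be sketched as follows. This is a
    generic Python illustration, not the hbsAgent code; the class name
    and the choice to log the first occurrence and then every Nth one
    are assumptions.

```python
class ThrottledLog:
    """Allow the first occurrence and then every Nth occurrence."""

    def __init__(self, every_nth):
        self.every_nth = every_nth
        self.count = 0

    def should_log(self):
        self.count += 1
        return self.count == 1 or self.count % self.every_nth == 0

    def reset(self):
        # Called when the condition clears, e.g. SM heartbeat resumes.
        self.count = 0
```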
    
    PASS: Verify hbsAgent service and log behavior when SM
          heartbeat is persistently missing.
    
    Change-Id: Ib379ed5d37b5349ca170b5661a930b6a71c2bed1
    Partial-Fix: 1895350
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 7f7ba86d4f2bc2c5e9ea30e29ff37d83e7fab2a2
Author: Martin, Chen <haochuan.z.chen@intel.com>
Date:   Mon Jun 22 16:00:52 2020 +0800

Add rook provisioned osd check in kickstart for restore case
    
    After rook is deployed, an osd disk such as /dev/sdx or /dev/nvmex
    is provisioned as a pv in a volume group whose name is prefixed
    with "ceph". When a user restores the system, kickstart checks
    whether each disk is osd provisioned and wipes it if not. Add
    rook provisioned osd disks to the do-not-wipe list to enable
    rook restore.
    
    Story: 2005527
    Task: 39076
    
    Change-Id: Id0a5718dcdd1d9230ab1be4a33bc4af5cb356e14
    Signed-off-by: Martin, Chen <haochuan.z.chen@intel.com>

commit 0e89acc83c616741952a068a3ff07ba91440eff8
Author: Daniel Safta <daniel.safta@windriver.com>
Date:   Thu Aug 27 11:15:17 2020 +0000

Align partitions created by kickstarters
    
    Partitions on some disks may be created unaligned.
    
    The cause is that partitions are created between specific
    intervals expressed in MBs. The kernel exposes a per-disk
    variable that provides the offset needed to align each
    partition (/sys/block/<disk>/alignment_offset).
    
    For finer-grained control, we transform MB units into
    logical sector units and use the alignment_offset variable
    to properly align the partitions.
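
    The unit conversion can be sketched in Python (the kickstart
    scripts themselves are shell). The function names are
    hypothetical, MB is treated as MiB, and adding the offset to the
    start sector is an assumption about how the value is applied. The
    sysfs alignment_offset value is reported in bytes.

```python
MIB = 1024 * 1024


def read_sysfs_int(disk: str, attr: str) -> int:
    """Read an integer attribute such as 'alignment_offset' or
    'queue/logical_block_size' from /sys/block/<disk>/."""
    with open(f"/sys/block/{disk}/{attr}") as f:
        return int(f.read().strip())


def aligned_start_sector(start_mib: int, logical_block_size: int,
                         alignment_offset: int) -> int:
    """Convert a start position in MiB to a logical sector number and
    shift it by the disk's alignment offset (given in bytes)."""
    sector = (start_mib * MIB) // logical_block_size
    return sector + alignment_offset // logical_block_size
```

    For a disk with 512-byte logical sectors and a zero offset, a
    1 MiB start maps to sector 2048.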
    
    Change-Id: I971c232fe0969eac14b85c5796908f0c85e23dbf
    Closes-bug: 1883975
    Signed-off-by: Daniel Safta <daniel.safta@windriver.com>

tags:

added: in-f-centos8

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-06-01: Fix merged to ha (f/centos8)

#11

Reviewed:  https://review.opendev.org/c/starlingx/ha/+/792251
Committed: https://opendev.org/starlingx/ha/commit/85bab5d2b394114feabe524504339a55eb8904e0
Submitter: "Zuul (22348)"
Branch:    f/centos8

commit 9f70df63fd0d83bf0f94d1b9ac2f98516d5971c8
Author: Bin Qian <bin.qian@windriver.com>
Date:   Fri May 7 16:36:23 2021 -0400

Fix no swact for failure of critical services
    
    This fix ensures that the service failure count is kept across
    a successful audit.
    
    When the service enabled audit completes successfully, SM resets the
    service failure state. However, it should not reset the service fail-count.
    The fail-count should only be reset after the grace period.
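
    The intended separation of failure state and fail-count can be
    sketched in Python. This is not SM code: the class, its fields
    and the threshold check are hypothetical, and time is passed in
    as plain seconds to keep the example deterministic.

```python
class ServiceFailureTracker:
    """Hypothetical model: audit success clears the failure state,
    but the fail-count is only cleared after the grace period."""

    def __init__(self, grace_period_s, max_failures):
        self.grace_period_s = grace_period_s
        self.max_failures = max_failures
        self.failed = False
        self.fail_count = 0
        self.last_failure = None

    def on_failure(self, now):
        self.failed = True
        self.fail_count += 1
        self.last_failure = now

    def on_audit_success(self, now):
        self.failed = False   # the state is reset on a good audit
        if (self.last_failure is not None
                and now - self.last_failure >= self.grace_period_s):
            self.fail_count = 0   # the count is reset only after grace

    def swact_needed(self):
        return self.fail_count >= self.max_failures
```

    Keeping the count across a good audit is what lets repeated
    failures of a critical service accumulate and trigger a swact.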
    
    Closes-Bug: 1893669
    Change-Id: I6996fe3f1c08c38da6f26243aee2b95b083069f0
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit 0b99b594f83b7c626cc0c4f7dc970ce373a7b748
Author: Bin Qian <bin.qian@windriver.com>
Date:   Tue May 4 11:33:43 2021 -0400

Fix AIO-DX failover issues
    
    This fixes unexpected AIO failover behaviors:
    1. active controller reboots itself when standby controller
       reboots or loses power
    2. standby controller becomes degraded after active controller
       reboots or loses power
    
    Closes-bug: 1927133
    Change-Id: If3c9f6251f689a89cd206c672092ba296f00bd6b
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit cb5fa9510f3ebda66f9850ac697e542bf041ce8c
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Tue Apr 27 09:43:00 2021 -0400

Remove hbsAgent restart in failover failure recovery handling
    
    A forced reboot of the active controller in an AIO DC system
    puts SM into a failover failure recovery loop that prevents
    maintenance from detecting the heartbeat failure of the just-
    rebooted controller.
    
    The SM's failover failure recovery handling algorithm includes
    a self (sm process) restart preceded by a restart of the
    hbsAgent, both added by the following update last year.
    
    update: Add unhealthy state recovery audit to service management (sm)
    review: https://review.opendev.org/c/starlingx/ha/+/735219
    
    The self restart of SM was and is required in this case. However,
    the restart of the hbsAgent was only included as a safety measure,
    at the time, to ensure SM received updated cluster state info. The
    hbsAgent restart was only added at that time with the longer term
    intention to have it removed once the hbsAgent cluster state change
    notification improvement was implemented. That change is now
    implemented and merged by the following update.
    
    update: Mtce heartbeat cluster state change notification improvement
    review: https://review.opendev.org/c/starlingx/metal/+/769936
    
    Testing of the fix for the following issue in an AIO DC system
    resulted in the takeover controller not detecting a heartbeat loss
    of the just rebooted standby controller.
    
    title: Force active controller reboot results in a second reboot
    issue: https://bugs.launchpad.net/starlingx/+bug/1922584
    
    The hbsAgent is not able to detect the heartbeat loss of the just-
    booted controller because SM keeps restarting it before it reaches
    the heartbeat loss state.
    
    With the cluster notification improvement update now implemented
    and merged it's time to remove the hbsAgent restart from SM's
    failover failure recovery algorithm.
    
    Test Plan:
    
    PASS: Active controller force reboot handling in AIO DC, DX and
          standard systems.
    PASS: Standby controller force reboot handling in AIO DC, DX and
          standard systems
    
    Partial-Bug: 1922584
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
    Change-Id: I26aa5ed9e0faec7294816269dbaa49cbb4696f66

commit 05a01c2100de3108d0a8ac757f0939d5c61fedcb
Author: Bin Qian <bin.qian@windriver.com>
Date:   Wed Mar 17 10:46:37 2021 -0400

Fix SQLite3 concurrent access issue
    
    SQLite3 does not support concurrent access with multiple connections
    that have writeable access. Currently SM opens database connections
    with full access, which causes concurrent access issues.
    
    This fix includes:
    1. open a readonly connection whenever write permission is not needed
    2. remove code that opens connections that are not being used
    3. remove the reattempt and logging from the previous partial fix
    
    All writable connections are now opened and used in the main thread,
    which avoids further concurrent access issues.
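
    The read-only connection idea can be shown with Python's sqlite3
    module. SM is not written in Python, so this only illustrates the
    technique; the helper names, the table and the file name are made
    up for the example.

```python
import os
import sqlite3
import tempfile


def open_db(path, writable=False):
    """Open a connection that is read-only unless writes are needed."""
    mode = "rwc" if writable else "ro"
    return sqlite3.connect(f"file:{path}?mode={mode}", uri=True)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.db")

        # Single writable connection, as used by the main thread.
        writer = open_db(path, writable=True)
        writer.execute("CREATE TABLE services (name TEXT, state TEXT)")
        writer.execute("INSERT INTO services VALUES ('svc-a', 'enabled')")
        writer.commit()

        # Other users only need to read, so they open read-only.
        reader = open_db(path)
        rows = reader.execute("SELECT name, state FROM services").fetchall()
        try:
            reader.execute("DELETE FROM services")
            write_rejected = False
        except sqlite3.OperationalError:
            write_rejected = True   # read-only connection cannot write

        reader.close()
        writer.close()
        return rows, write_rejected
```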
    
    Closes-Bug: 1915894
    Signed-off-by: Bin Qian <bin.qian@windriver.com>
    Change-Id: I200647a3733ac899b0b7498abd52992c7a87bd32

commit 7ca56fec9f2829953f934bad519a7eea0a27f3f2
Author: Bin Qian <bin.qian@windriver.com>
Date:   Thu Mar 4 15:07:24 2021 -0500

Limit the troubleshooting log
    
    Stop the troubleshooting log once the execution passes the
    checkpoint.
    
    Change-Id: I4e1d7710d5216f7b5a908f56e72d5f95c35a6586
    Partial-bug: 1915894
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit 10ff42ae1135b5cdd9df0a13cd0d18bfe8d655fe
Author: Bin Qian <bin.qian@windriver.com>
Date:   Tue Mar 2 16:05:14 2021 -0500

Fix incorrect include causing build failure
    
    Previous commit:
    https://opendev.org/starlingx/ha/commit/f39ca95924a0a44dc287c1a560fa9f6f52cdea51
    added an incorrect #include which causes a build failure.
    
    Closes-bug: 1917527
    
    Change-Id: I5d93d77fb0b14446e21a1ba160ffd0848533e970
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit f39ca95924a0a44dc287c1a560fa9f6f52cdea51
Author: Bin Qian <bin.qian@windriver.com>
Date:   Tue Dec 15 15:55:25 2020 -0500

Add reattempt and collect more data for SM init failure
    
    There are multiple reports on AIO-SX that SM failed its
    initialization due to a SQL failure. The issue has not been
    reproduced in the DEV environment. This change adds logging and a
    reattempt, and collects SM troubleshooting data when SM fails in
    this situation.
    For potential recovery before pmon starts actively monitoring SM,
    set systemd Restart=on-failure. Also set RestartSec=10 (seconds)
    in order to give pmon enough time to catch the failure and restart
    SM.
    
    Partial-bug: 1915894
    Change-Id: I5899e401742510158cd9c59a664b1dc329bb1075
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit 4f1f13dbf75c7d0df1e6383043b3fa8636d54b2d
Author: Chris Friesen <chris.friesen@windriver.com>
Date:   Fri Feb 12 17:42:40 2021 -0600

Add support for dcmanager-audit-worker service
    
    We're moving the bulk of the dcmanager subcloud audits to separate
    worker processes, so we need to add a service for the main worker
    processes (which will then spawn additional workers).
    
    In order to ensure that audits can be processed as soon as
    dcmanager-audit starts up, we make enabling it dependent on
    dcmanager-audit-worker being already running.
    
    Story: 2007267
    Task: 41870
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
    Depends-On: https://review.opendev.org/c/starlingx/distcloud/+/769216
    Change-Id: I162c00a3e8dba07f1912171e9371c29e5fd9a689

commit aaab51c1230a9194ec91886fe817cfb765d39bf5
Author: Teresa Ho <teresa.ho@windriver.com>
Date:   Thu Feb 11 13:23:55 2021 -0500

Create device-image-fs SM service
    
    Added a device-image-fs SM service to manage the device image
    repository filesystem.
    
    Tests performed on the following systems:
    AIO-DX, AIO-DX plus compute, Standard 2+1
    DC with AIO-DX plus subcloud
    DC with Standard subcloud
    
    Story: 2007875
    Task: 41880
    Depends-On: https://review.opendev.org/c/starlingx/ansible-playbooks/+/776488
    
    Change-Id: I068c26c524357176e4b526c405785768044c379c
    Signed-off-by: Teresa Ho <teresa.ho@windriver.com>

commit df3a96d8072f21fbe37b2206679b0f0afeef27bf
Author: Bin Qian <bin.qian@windriver.com>
Date:   Fri Jan 8 10:18:13 2021 -0500

Skip logging state change of I/F not managed by SM
    
    Skip logging state changes of interfaces that are not being
    monitored by SM. This is to reduce the noise in the sm.log.
    
    Closes-Bug: 1910770
    Change-Id: I6e3d78255dc41c03f10af2fd5d778e2398ea8816
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit ae51b607366e0f27cc5f7256542105f55f9dfe32
Author: Martin, Chen <haochuan.z.chen@intel.com>
Date:   Sun Nov 29 22:11:07 2020 +0800

Add service rook-mon-exit for duplex mode, host-swact case.
    
    For duplex, when performing a host swact, SM should first remove the
    ceph-mon and ceph-osd pods, which hold the /var/lib/ceph/mon-a folder
    open. Remove these pods on the active controller and set drbd to
    secondary in order to swact to the other controller.
    
    Story: 2005527
    Task: 41328
    
    Change-Id: I7cb7af3b3a56afcff71087d7f3b4f09a384c8dc2
    Signed-off-by: Martin, Chen <haochuan.z.chen@intel.com>

commit 2d0fc9b6118c7c9dd69290aa133c223ad557e5ae
Author: Bin Qian <bin.qian@windriver.com>
Date:   Thu Sep 17 12:36:31 2020 -0400

Detect peer SM failure
    
    This change is to detect SM failure/stall.
    
    1. SM sends an alive pulse to hbsAgent (hbsAgent sends the SM failed
       state along with the hbs cluster info).
    2. When SM loses heartbeat from its peer, SM determines whether the
       peer has failed from the hbs cluster info.
    3. On the standby controller, if the peer SM is stalled, it will take
       over to become active after signaling mtce to fail the peer node.
    
    TCs passed:
       SM detects peer failure from hbs cluster info and signals
       mtce to fail the peer node.
    
    Depends-on: https://review.opendev.org/#/c/751558
    
    Partial-bug: 1895350
    Change-Id: Id51e9adb4ef30bf806159366e6fdf115e743fe97
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit fa0d235555abb4e7fb4719ed70bfefac7213be72
Author: Takamasa Takenaka <takamasa.takenaka@windriver.com>
Date:   Tue Jan 12 17:05:46 2021 -0300

Remove database entries related to host-based snmp
    
    As part of the host-based SNMP removal, remove the database entries
    related to SNMP.
    
    Story: 2008132
    Task: 41573
    Signed-off-by: Takamasa Takenaka <takamasa.takenaka@windriver.com>
    Depends-On: https://review.opendev.org/765381
    Change-Id: I533d9286f1a384be9d3ea245dff68812a14a4cd3

commit ebdc59e3e1fa81e35a1f8a6306dc96a7e31cae0e
Author: Bin Qian <bin.qian@windriver.com>
Date:   Wed Dec 9 17:38:28 2020 -0500

add disable dependency for drbd and fs services
    
    A DRBD service needs a disable dependency on its related fs
    service, for example: drbd-rabbit -> rabbit-fs, so that
    SM disables drbd-rabbit after rabbit-fs is disabled.
    This is especially important in AIO-SX, in which drbd services
    are disabled when the host is being unlocked.
    
    The following disable dependencies are added in this change:
    drbd-pg -> pg-fs
    drbd-rabbit -> rabbit-fs
    drbd-platform -> platform-fs
    drbd-extension -> extension-fs
    drbd-cinder -> cinder-lvm
    drbd-dc-vault -> dc-vault-fs
    drbd-dockerdistribution -> dockerdistribution-fs
    drbd-cephmon -> cephmon-fs
    
    Test host lock/unlock in AIO-SX and standard DX environments.
    
    Closes-bug: 1907490
    Change-Id: I631f718add2c1d2756a36f3770a4d48f02904f1a
    Signed-off-by: Bin Qian <bin.qian@windriver.com>
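
The effect of the disable dependencies listed in the commit above can be sketched in Python. This is only an illustration of dependency-ordered disabling, not SM's implementation; the function name is hypothetical, and the sketch assumes each service has at most one prerequisite and that there are no dependency cycles.

```python
# Disable dependencies from the commit message: each DRBD service (key)
# may only be disabled after its related service (value) is disabled.
DISABLE_DEPENDS_ON = {
    "drbd-pg": "pg-fs",
    "drbd-rabbit": "rabbit-fs",
    "drbd-platform": "platform-fs",
    "drbd-extension": "extension-fs",
    "drbd-cinder": "cinder-lvm",
    "drbd-dc-vault": "dc-vault-fs",
    "drbd-dockerdistribution": "dockerdistribution-fs",
    "drbd-cephmon": "cephmon-fs",
}


def disable_order(depends_on):
    """Return a disable order in which every prerequisite comes first.

    Assumes the dependency map contains no cycles.
    """
    order = []

    def visit(service):
        if service in order:
            return
        prerequisite = depends_on.get(service)
        if prerequisite is not None:
            visit(prerequisite)  # disable the related service first
        order.append(service)

    for service in depends_on:
        visit(service)
    return order
```

With this map, disable_order(DISABLE_DEPENDS_ON) places rabbit-fs ahead of drbd-rabbit, and likewise for every other pair.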

commit e52a67bfa0f181eccf60eaa704d3c2c3b1c83b32
Author: Bin Qian <bin.qian@windriver.com>
Date:   Thu Dec 31 11:48:09 2020 -0500

Avoid sending UDP packets to ::1
    
    In AIO-SX, the peer IP address is not configured (blank),
    which is translated into '::'. When the '::' is used as
    dest address, it is translated to loopback '::1'.
    
    SM should skip sending packets to destination '::'.
    
    Closes-Bug: 1909769
    Change-Id: Id8a9a00adce6573bcccd60b1b2112b6ee8b2f8a3
    Signed-off-by: Bin Qian <bin.qian@windriver.com>
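
The check described in the commit above can be sketched in Python using the standard ipaddress module. The function name is hypothetical and this is not SM's C code; it only shows how a blank or unspecified destination ('::') can be detected before sending.

```python
import ipaddress


def should_send(dest):
    """Return False when dest is blank or the unspecified address."""
    if not dest:
        # Peer address not configured, as on AIO-SX.
        return False
    # '::' (and IPv4 '0.0.0.0') are the unspecified addresses.
    return not ipaddress.ip_address(dest).is_unspecified
```

A caller would skip the send when should_send() returns False, so nothing is addressed to '::' and nothing ends up on the loopback address.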

commit 6ab82889af6bdf4232045a473f4762c5c0401252
Author: albailey <Al.Bailey@windriver.com>
Date:   Thu Dec 17 13:18:43 2020 -0600

Fix zuul jobs broken due to pip upversion
    
    The install_command for docs, newnote and api-ref
    needed to be overridden to not use upper constraints.
    
    The bandit requirement needed to be made python3 only.
    
    The bandit scan was failing, so it is now updated to
    allow individual bandit failures to be suppressed in tox.ini
    
    Need to include a py file change in order for bandit to be
    triggered by zuul.
    
    Partial-Bug: #1907678
    Signed-off-by: albailey <Al.Bailey@windriver.com>
    Change-Id: Ic73d0ea590ab1b7857f7275fa9c71828b0d343ee

commit df739b210e3074d48adddf0d54b5b024cd7419dc
Author: Don Penney <don.penney@windriver.com>
Date:   Thu Dec 17 13:27:02 2020 -0500

Add auto-version for remaining stx/ha packages
    
    Update remaining StarlingX packages with hardcoded TIS_PATCH_VER to
    use PKG_GITREVCOUNT where possible, with offsets as needed to ensure
    the version is incremented above the hardcoded version.
    
    Story: 2008455
    Task: 41447
    Signed-off-by: Don Penney <don.penney@windriver.com>
    Change-Id: Idf5ef476192cdf4923d6c903f1a15e03cfe9d03f

commit e8af161b16c1f75e2f1bab7c257aaa66caae7fd1
Author: Bin Qian <bin.qian@windriver.com>
Date:   Wed Oct 7 10:24:45 2020 -0400

Skip verifying h/w info for Not-In-Use interface
    
    When a domain interface state is changed, hardware information is
    verified to ensure the interface is OK to enter into the new state.
    
    However, when an interface is entering the Not-In-Use state, it should
    always be OK no matter what the h/w interface state is. This matters
    especially when the interface is back on lo, in which case getting
    hardware information will fail, which prevents moving the interface
    to the Not-In-Use state.
    
    This change skips verifying the h/w state if the state of an interface
    is changed to Not-In-Use.
    
    This fix also skips checking h/w information for the lo interface and
    always returns enabled = true.
    
    Closes-Bug: 1898629
    
    Signed-off-by: Bin Qian <bin.qian@windriver.com>
    Change-Id: I709708bce622f52bf84fc3fec749f204cfeee533
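
The guard described in the commit above can be sketched in Python. The state constant, function name and hw_enabled callback are hypothetical stand-ins, not SM's actual identifiers; the sketch only shows the order of the checks.

```python
NOT_IN_USE = "not-in-use"  # hypothetical state name


def state_change_allowed(if_name, new_state, hw_enabled):
    """Decide whether an interface may enter new_state.

    hw_enabled is a callable that queries hardware information and
    returns True when the interface is enabled; it may fail for lo.
    """
    if new_state == NOT_IN_USE:
        # Entering Not-In-Use is always OK; skip the hardware check.
        return True
    if if_name == "lo":
        # Hardware information is not available for lo; treat as enabled.
        return True
    return hw_enabled(if_name)
```

The hardware query is only reached for real interfaces entering a state other than Not-In-Use.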

commit 57225bb34ae5380c95dddd0e556847f7a17e3d61
Author: albailey <Al.Bailey@windriver.com>
Date:   Wed Sep 16 13:01:03 2020 -0500

Use newer flake8 to run on ubuntu-focal Zuul machines
    
    flake8 2.5.5 fails on ubuntu-focal zuul machines running python3.8
    with the following error:
    AttributeError: 'FlakesChecker' object has no attribute 'CONSTANT'
    
    The update removes the version constraint to use newer flake8.
    
    The linters can be run in python3.
    Pylint cannot be run in python3 because mysql-python is not
    compatible, so a new zuul job for pylint is now added.
    
    The flake8 errors that the newer version raises are all suppressed,
    and some of them should be addressed by someone with familiarity in
    this repo.
    
    Change-Id: Ida6447728d4175173c02130cb04a6013e4f966f9
    Partial-Bug: 1895054
    Signed-off-by: albailey <Al.Bailey@windriver.com>

commit de04f2386039b7a393ff319405bb00dce5348001
Author: Jessica Castelino <jessica.castelino@windriver.com>
Date:   Wed Aug 19 16:01:58 2020 -0400

Move dcmanager orchestration to a separate process
    
    The DC manager orchestration is being removed from the
    dcmanager-manager process and now runs in the
    dcmanager-orchestrator process.
    
    This update adds associated sm config for the new process.
    
    Change-Id: I7cc0869a123713d85b8167bd1f8a4481b8da0902
    Story: 2007267
    Task: 40715
    Depends-On: https://review.opendev.org/#/c/748452/
    Signed-off-by: Jessica Castelino <jessica.castelino@windriver.com>

commit 000df04ce109fe0e97721d9b4f3c842de754020d
Author: Bin Qian <bin.qian@windriver.com>
Date:   Mon Jun 22 10:10:55 2020 -0400

Move cert-mon service to controller-services
    
    Move the cert-mon service to the controller-services service group
    to make it more generic for other platform certificate monitoring
    features.
    
    Story: 2007347
    Task: 40119
    
    Change-Id: Ib82a579dd2f1d0dcf97e90eed44fb095ee9ab6ca
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit 18deafa5c3c5ffe94434c4d4b63232210440d8ef
Author: Bin Qian <bin.qian@windriver.com>
Date:   Thu Jun 18 22:08:37 2020 -0400

Add new cert-mon service to sm db -- not provisioned
    
    Add the new critical service cert-mon under SM management,
    in the controller-services group.
    The new service will monitor admin endpoint service renewal in
    cert-manager and apply new certificates to controller nodes in a
    DC setup.
    
    Tests:
    Provisioned DC system controllers and performed a swact; all successful.
    
    Added and provisioned dummy service process, swact, all successful.
    
    Change-Id: Ic545fafc88be4acb4e5e0ea3e4449ade57dcef8c
    Story: 2007347
    Task: 40119
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit 630a777cbb894501cb019c917c1be8288e7a7c36
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Thu Jun 11 15:32:47 2020 -0400

Add unhealthy state recovery audit to service management (sm)
    
    Service Management (SM) monitors connectivity and health of
    its peer controller over the OAM, Mgmt and (if provisioned)
    Cluster-Host networks.
    
    If SM sees all the links to its peer go 'carrier down' virtually
    simultaneously, it is possible that both controllers might
    simultaneously declare themselves unhealthy and both go
    disabled; i.e. shutdown all services with no automatic recovery.
    
    This update adds an 'Unhealthy State Recovery Audit' to SM which
    forces a self restart when all of its monitored links recover
    for cases where both controllers go unhealthy-shutdown or both
    controllers remain active in split-brain.
    
    Test Plan:
    
    PASS: Verify AIO SX install
    PASS: Verify Standard system install and unhealthy state recovery
    PASS: Verify single link failure end to end behavior
    PASS: Verify 2 of 3 link failure end to end behavior
    PASS: Verify all link failure end to end behavior
    PASS: Verify SM and Mtce heartbeat recovery over unhealthy state recovery
    PASS: Verify swact back and forth following a recovery
    PASS: Verify process restart as part of unhealthy state recovery
    PASS: Verify AIO DX install and unhealthy state recovery
    
    Change-Id: Ie906eaf04bec607328b7e0af09b37fa0558e3bbe
    Closes-Bug: 1883004
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 3b68098be42b856471fa1cc39f359d6649585df4
Author: Sharath Kumar K <sharath.kumar@intel.com>
Date:   Mon May 4 08:01:38 2020 +0200

Tox and Zuul job for the python code scan in starlingx/ha
    
    Set up the bandit tool to scan for HIGH severity issues
    in the Python code under the starlingx/ha folder.
    This merge is expected to enable a Zuul job for CI/CD bandit scans.
    
    Configuration files:
    1. tox.ini for adding bandit environment and command.
    2. test-requirements.txt for adding bandit version.
    3. .zuul.yaml file for adding bandit job and configuring under
       check job to run code scan every time before code commit.
    
    Test:
    Run tox -e bandit command inside the fault folder to validate the
    bandit scan and result.
    
    Please note:
    Changes will be implemented in batches and this is Batch3 change.
    
    Story: 2007541
    Task: 39621
    Depends-On: https://review.opendev.org/#/c/721294/
    
    Change-Id: I01f81d7c52c12432965106f9603e4db600381971
    Signed-off-by: Sharath Kumar K <sharath.kumar@intel.com>

commit 58d1e5b3bfe4b08d6a7274d1d96449554058ff22
Author: Andreas Jaeger <aj@suse.com>
Date:   Thu Jun 4 14:25:29 2020 +0200

Switch to newer openstackdocstheme and reno versions
    
    Switch to openstackdocstheme 2.2.1 and reno 3.1.0 versions. Using
    these versions will allow especially:
    * Linking from HTML to PDF document
    * Allow parallel building of documents
    * Fix some rendering problems
    
    Update Sphinx version as well.
    
    Disable openstackdocs_auto_name to use 'project' variable as name.
    
    Change pygments_style to 'native' since old theme version always used
    'native' and the theme now respects the setting and using 'sphinx' can
    lead to some strange rendering.
    
    openstackdocstheme renames some variables, so follow the renames
    before the next release removes them. A couple of variables are also
    not needed anymore, remove them.
    
    See also
    http://lists.openstack.org/pipermail/openstack-discuss/2020-May/014971.html
    
    Change-Id: Iab15b05918b73ce9ba2ff0b479fdb8a0631fad42
