StarlingX

Standby controller reboots if active controller gracefully reboots

Bug #1906556 reported by Eric MacDonald on 2020-12-02

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	Medium	Eric MacDonald

Bug Description

SM fails the standby controller while the active controller is on its way down from a spontaneous graceful reboot.

Although gracefully rebooting the active controller is not something that is supported, the fact that the standby controller is also taken down by that event is very undesirable.

Issue does not happen on a forced reboot (with --force option) of the active controller.

This is because the timing of the graceful process shutdown causes SM to experience a heartbeat failure with its peer before the maintenance heartbeat cluster information has provided the data SM needs to know that it must be the survivor in this case.

Suggest implementing a change in maintenance to make its heartbeat cluster state change notifications more timely.

Severity
--------
Minor: System recovers after unsupported spontaneous graceful reboot of the active controller.

Steps to Reproduce
------------------
On a duplex system, run 'sudo reboot' on the active controller

Expected Behavior
------------------
SM on the standby controller takes over activity

Actual Behavior
----------------
SM on the standby controller fails itself and gets rebooted by maintenance

Reproducibility
---------------
Highly reproducible

System Configuration
--------------------
Duplex system

Branch/Pull Time/Commit
-----------------------
starlingx/master at time this issue was created.
Actually, long standing behavior.

Last Pass
---------
Unknown

Timestamp/Logs
--------------
from /var/log/mtcAgent.log

2020-08-31T14:39:06.476 [3576821.01162] controller-0 mtcAgent hbs nodeClass.cpp (4061) set_mtce_flags :Error : controller-1 reported unhealthy by SM (Mgmnt)

from /var/log/sm.log

2020-08-31T14:39:03.000 controller-1 sm: debug time[766.694] log<441> INFO: sm[88025]: sm_failover_ss.c(352): Loss of heartbeat ALL
2020-08-31T14:39:03.000 controller-1 sm: debug time[766.694] log<442> INFO: sm[88025]: sm_failover_ss.c(478): host reaches 11 nodes, peer reaches 11 nodes, peer will be survivor

Test Activity
-------------
[Feature Testing, Regression Testing]

Workaround
----------
Don't gracefully reboot the active controller

Tags:

Eric MacDonald (rocksolidmtce) on 2020-12-02

Changed in starlingx:
assignee:	nobody → Eric MacDonald (rocksolidmtce)

Ghada Khalil (gkhalil) on 2020-12-04

tags:

added: stx.metal

Ghada Khalil (gkhalil) on 2020-12-05

Changed in starlingx:
importance:	Undecided → Critical
importance:	Critical → Low
status:	New → Triaged
tags:	added: stx.5.0
Changed in starlingx:
importance:	Low → Medium

Eric MacDonald (rocksolidmtce) wrote on 2021-01-19:

Fixed by:
review: https://review.opendev.org/c/starlingx/metal/+/769936
commit: https://opendev.org/starlingx/metal/commit/7a3adb2cdce217e1eaaf5e0d9669dc1190f62763

Changed in starlingx:
status:	Triaged → Fix Released

Ghada Khalil (gkhalil) on 2021-01-21

summary:

- Standby controller reboots if active controller spontaneously gracefully
- reboots
+ Standby controller reboots if active controller gracefully reboots

OpenStack Infra (hudson-openstack) wrote on 2021-05-19: Fix proposed to metal (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/metal/+/792250

OpenStack Infra (hudson-openstack) wrote on 2021-05-27: Fix merged to metal (f/centos8)

Reviewed:  https://review.opendev.org/c/starlingx/metal/+/792250
Committed: https://opendev.org/starlingx/metal/commit/6c2905e665ceeebfa7717c9cbccc1c277d10966b
Submitter: "Zuul (22348)"
Branch:    f/centos8

commit 5942a56ec6f0b265ca6d1c8c800fe84c4a22860f
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Thu May 13 15:57:43 2021 +0000

Revert "Align partitions created by kickstarters"
    
    This reverts commit 0e89acc83c616741952a068a3ff07ba91440eff8.
    
    Reason for revert: Review should have been abandoned rather than merged.
    
    Change-Id: I95f1e151183f122d93b834ab2a785736e5a8ef12
    Closes-Bug: 1928341

commit c7c341b198e79bb98f443c7c07f671c6387075af
Author: Don Penney <don.penney@windriver.com>
Date:   Fri May 7 08:56:06 2021 -0400

Add /pxeboot/grubx64.efi symlink for UEFI pxeboot
    
    UEFI pxeboot with shim.efi looks for the grubx64.efi in the tftpboot
    root directory. This update creates a symlink to the
    /pxeboot/EFI/grubx64.efi file in /pxeboot.
    
    Change-Id: Iabf8ec89d0af6e6b1a62e20159ecdfa16729444e
    Partial-Bug: 1927730
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit ce7529964932a9fd1cc10ce18dbe11e89ee02223
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed May 5 19:05:55 2021 -0400

Fix enabling heartbeat of self from the peer controller
    
    This issue only occurs over an hbsAgent process restart
    where the ready event response does not include the
    heartbeat start of the peer controller.
    
    This update reverts a small code change that was
    introduced by the following update.
    
    https://review.opendev.org/c/starlingx/metal/+/788495
    
    Remove the my_hostname gate introduced at line 1267 of
    mtcCtrlMsg.cpp because it prevents enabling heartbeat
    of self by the peer controller.
    
    Change-Id: Id72c35f25e2a5231a8a8363a35a81e042f00085e
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 48978d804d6f22130d0bd8bd17f361441024bc6c
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Apr 28 09:39:19 2021 -0400

Improved maintenance handling of spontaneous active controller reboot
    
    Performing a forced reboot of the active controller sometimes
    results in a second reboot of that controller. The cause of the
    second reboot was due to its reported uptime in the first mtcAlive
    message, following the reboot, as greater than 10 minutes.
    
    Maintenance has a long standing graceful recovery threshold of
    10 minutes. Meaning that if a host loses heartbeat and enters
    Graceful Recovery, if the uptime value extracted from the first
    mtcAlive message following the recovery of that host exceeds 10
    minutes, then maintenance interprets that the host did not reboot.
    If a host goes absent for longer than this threshold then for
    reasons not limited to security, maintenance declares the host
    as 'failed' and force re-enables it through a reboot.
    
    With the introduction of containers and addition of new features
    over the last few releases, boot times on some servers are
    approaching the 10 minute threshold and in this case exceeded
    the threshold.
    
    The primary fix in this update is to increase this long standing
    threshold to 15 minutes to account for evolution of the product.
    
    During the debug of this issue a few other related undesirable
    behaviors related to Graceful Recovery were observed with the
    following additional changes implemented.
    
     - Remove hbsAgent process restart in ha service management
       failover failure recovery handling. This change is in the
       ha git with a loose dependency placed on this update.
       Reason: https://review.opendev.org/c/starlingx/ha/+/788299
    
     - Prevent the hbsAgent from sending heartbeat clear events
       to maintenance in response to a heartbeat stop command.
       Reason: Maintenance receiving these clear events while in
               Graceful Recovery causes it to pop out of graceful
               recovery only to re-enter as a retry and therefore
               needlessly consumes one (of a max of 5) retry count.
    
     - Prevent successful Graceful Recovery until all heartbeat
       monitored networks recover.
       Reason: If heartbeat of one network, say cluster recovers but
               another (management) does not then it's possible the
               max Graceful Recovery Retries could be reached quite
               quickly, while one network recovered but the other
               may not have, causing maintenance to fail the host and
               force a full enable with reboot.
    
     - Extend the wait for the hbsClient ready event in the graceful
       recovery handler timeout from 1 minute to worker config timeout.
       Reason: To give the worker config time to complete before force
               starting the recovery handler's heartbeat soak.
    
     - Add Graceful Recovery Wait state recovery over process restart.
       Reason: Avoid double reboot of Gracefully Recovering host over
               SM service bounce.
    
     - Add requirement for a valid out-of-band mtce flags value before
       declaring configuration error in the subfunction enable handler.
       Reason: rebooting the active controller can sometimes result in
               a falsely reported configuration error due to the
               subfunction enable handler interpreting a zero value as
               a configuration error.
    
     - Add uptime to all Graceful Recovery 'Connectivity Recovered' logs.
       Reason: To assist log analysis and issue debug
    
    Test Plan:
    
    PASS: Verify handling active controller reboot
                 cases: AIO DC, AIO DX, Standard, and Storage
    PASS: Verify Graceful Recovery Wait behavior
                 cases: with and without timeout, with and without bmc
                 cases: uptime > 15 mins and 10 < uptime < 15 mins
    PASS: Verify Graceful Recovery continuation over mtcAgent restart
                 cases: peer controller, compute, MNFA 4 computes
    PASS: Verify AIO DX and DC active controller reboot to standby
                 takeover that up for less than 15 minutes.
    
    Regression:
    
    PASS: Verify MNFA feature ; 4 computes in 8 node Storage system
    PASS: Verify cluster network only heartbeat loss handling
                 cases: worker and standby controller in all systems.
    PASS: Verify Dead Office Recovery (DOR)
                 cases: AIO DC, AIO DX, Standard, Storage
    PASS: Verify system installations
                 cases: AIO SX/DC/DX and 8 node Storage system
    PASS: Verify heartbeat and graceful recovery of both 'standby
                 controller' and worker nodes in AIO Plus.
    
    PASS: Verify logging and no coredumps over all of testing
    PASS: Verify no missing or stuck alarms over all of testing
    
    Change-Id: I3d16d8627b7e838faf931a3c2039a6babf2a79ef
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
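A minimal sketch of the uptime check described in the commit message above. The function name `recovery_action` and the returned action strings are hypothetical and do not come from the metal repository. Only the 10 and 15 minute thresholds are taken from the commit message.

```python
# Hypothetical names throughout; only the threshold values come from the commit message.
GRACEFUL_RECOVERY_THRESHOLD_SECS = 15 * 60  # raised from 10 minutes by this update


def recovery_action(first_mtcalive_uptime_secs,
                    threshold_secs=GRACEFUL_RECOVERY_THRESHOLD_SECS):
    """Classify a host that has just recovered from a heartbeat loss.

    An uptime above the threshold is read as "the host did not reboot",
    so the host is declared failed and force re-enabled through a reboot.
    Otherwise the host is allowed to recover gracefully.
    """
    if first_mtcalive_uptime_secs > threshold_secs:
        return "fail-and-reenable"
    return "graceful-recovery"
```

Under these assumptions, a server that takes 11 minutes to boot would be rebooted a second time with the old 10 minute threshold (`threshold_secs=600`) but recovers gracefully with the new default.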

commit 7539d36c3f01a338acfa449204c6034dc43f45df
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Apr 21 10:12:30 2021 -0400

Prevent mtcClient from sending to uninitialized socket in AIO SX
    
    The mtcClient will perform a socket reinit if it detects a socket
    failure. The mtcClient also avoids setting up its controller-1
    cluster network socket for the AIO SX system type ; because there
    is no controller-1 provisioned.
    
    Most AIO SX systems have the management/cluster networks set to
    the 'loopback' interface. However, when an AIO SX system is setup
    with its management and cluster networks on physical interfaces,
    with or without vlan, the mtcAlive send message utility will try
    to send to the uninitialized controller-1 cluster socket. This
    leads to a socket error that triggers a socket reinitialization
    loop which causes log flooding.
    
    This update adds a check to the mtcAlive send utility to avoid
    sending mtcAlive to controller-1 for AIO SX system type where
    there is no controller-1 provisioned; no send,no error,no flood.
    
    Since this update needed to add a system type check, this update
    also implemented a system type definition rename from CPE to AIO.
    Other related definitions and comments were also changed to make
    the code base more understandable and maintainable
    
    Test Plan:
    
    PASS: Verify AIO SX with mgmnt/clstr on physical (failure mode)
    PASS: Verify AIO SX Install with mgmnt/clstr on 'lo'
    PASS: Verify AIO SX Lock msg and ack over mgmnt and clstr
    PASS: Verify AIO SX locked-disabled-online state
    PASS: Verify mtcClient clstr socket error detect/auto-recovery (fit)
    PASS: Verify mtcClient mgmnt socket error detect/auto-recovery (fit)
    
    Regression:
    
    PASS: Verify AIO SX Lock and Unlock (lazy reboot)
    PASS: Verify AIO DX and DC install with pv regression and sanity
    PASS: Verify Standard system install with pv regression and sanity
    
    Change-Id: I658d33a677febda6c0e3fcb1d7c18e5b76cb3762
    Closes-Bug: 1897334
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
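The system type check described in the commit message above could look roughly like the following sketch. The function `mtcalive_targets` and the system type strings are assumptions made for illustration. The real mtcClient is C++ and its actual names are not shown in this report.

```python
from typing import List


def mtcalive_targets(system_type: str) -> List[str]:
    """Return the controllers that mtcAlive should be sent to.

    Hypothetical helper: on AIO SX there is no controller-1 provisioned,
    so its cluster socket is never set up and must not be used.
    """
    targets = ["controller-0"]  # assumed to always be a valid target
    if system_type != "AIO-SX":
        targets.append("controller-1")
    return targets
```

Skipping the send, rather than attempting it and handling the error, is what avoids the socket reinitialization loop and the resulting log flood.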

commit 3c1e9d960198c044e382eb7d47b3bb70cbf6ba70
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Tue Apr 6 10:29:09 2021 -0400

Modify mtce daemon log rotation config files
    
    This update makes the following setting changes to the
    maintenance log rotation configuration files
    
     - add 'create' with permissions to each tuple
     - add 'delaycompress'
     - group together log files with similar settings
     - move global settings to local settings
     - remove 'copytruncate' global setting
     - remove the 'nodateext' global and local setting
    
    Test Plan:
    
    PASS: Verify log rotation for all mtc log files
    PASS: Verify no log loss over rotation
    PASS: Verify log rotation file naming convention
    PASS: Verify delaycompress on all mtce log files
    PASS: Verify log permissions after rotate are 0640
    
    Regression:
    
    PASS: Verify AIO system install
    PASS: Verify Standard system install
    PASS: Verify full and dated collect
    
    Change-Id: I623030fa2c1ce4e8085e654ae3fb782c7e520924
    Partial-Bug: 1918979
    Depends-On: https://review.opendev.org/c/starlingx/config-files/+/784943
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 99a871c7d9dd04b3bd2ce149dd43bf058d805f03
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Mon Jun 15 13:45:23 2020 -0400

Restrict isolcpu_plugin to nodes with worker function
    
    The isolcpu_plugin process is intended to run on worker nodes only.
    This update excludes its rpm parcel from standard controller and
    storage nodes.
    
    Depends-On: https://review.opendev.org/c/starlingx/integ/+/783730
    Story: 2008760
    Task: 42189
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
    Change-Id: Iec61638b49692622e128d8388bc3aa78c922ac3a

commit 031818e55bc255b59e486ebf6faadf4b784c93fe
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Fri Mar 26 13:05:51 2021 -0400

Add in-service test to clear stale config failure alarm
    
    A configuration failure alarm can get stuck asserted if
    that node experiences an uncontrolled reboot that recovers
    without a configuration failure.
    
    This update adds an in-service test that audits host health
    while there is a configuration failure alarm raised and
    clear that alarm if the failure condition goes away. This
    could be a result of an in-service manifest that runs and
    corrects the configuration or if the node reboots and comes
    back up in a healthy (properly configured) state.
    
    Fixed bug that was clearing config alarm severity state
    when a heartbeat clear event is received.
    
    This update also goes a step further and introduces an
    alarms state audit that detects and corrects maintenance
    alarm state mismatches.
    
    Test Plan:
    
    PASS: Verify the add handler loads config alarm state
    PASS: Verify in-service test clears stale config alarm
    PASS: Verify in-service test acts on new config failure
          ... degrade - active controller
          ... fail    - other hosts
    PASS: Verify audit fixes mtce alarm state mismatches
    PASS: Verify audit handles fm not running case
    PASS: Verify audit handling behavior with valid alarm cases
    PASS: Verify locked alarm management over process restart
    PASS: Verify audit only logs active alarms list changes
    PASS: Verify audit runs for both locked/unlocked nodes
    PASS: Verify update as a patch
    
    Regression:
    
    PASS: Verify enable sequence config failure handling
    PASS: ... active controller     - recoverable degrade
    PASS: ... other nodes           - threshold fail
    PASS: ... auto recovery disable - config failure
    PASS: Verify mtcAgent process logging
    PASS: Verify heartbeat handling and alarming
    PASS: Verify Standard system install
    PASS: Verify AIO system install
    
    Change-Id: If9957229810435e9faeb08374f2b5fbcb5b0f826
    Closes-Bug: 1918195
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 5c83453fdf8775e5d776a02a2b5c06810d84cb55
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Tue Mar 16 17:03:49 2021 -0400

Fix Graceful Recovery handling while in Graceful Recovery handling
    
    The current Graceful Recovery handler is not properly handling
    back-to-back Multi Node Failure Avoidance (MNFA) events.
    
    There are two phases to MNFA
    
     phase 1: waiting for number of failed nodes to fall below
              mnfa_threshold as each affected node's heartbeat
              is recovered.
     phase 2: then a Graceful Recovery Wait period which is an
              11 second heartbeat soak to verify that a stable
              heartbeat is regained before declaring the MNFA
              event complete.
    
    The Graceful Recovery Wait status of one or more affected nodes
    has been seen to be left uncleared (stuck) on one or more of the
    affected nodes if phase 2 of MNFA is interrupted by another MNFA
    event ; aka MNFA Nesting.
    
    Although this stuck status is not service affecting it does leave
    one or more nodes' host.task field, as observed under host-show,
    with "Graceful Recovery Wait" rather than empty.
    
    This update makes Multi Node Failure Avoidance (MNFA) handling
    changes to ensure that, upon MNFA exit, the recovery handler
    is properly restarted if MNFA Nesting occurs.
    
    Two additional Graceful Recovery phase issues were identified
    and fixed by this update.
    
     1. Cut Graceful recovery handling in half
    
        - Found and removed a redundant 11 second heartbeat soak
          at the very end of the recovery handler.
        - This cuts the graceful recovery handling time down from
          22 to 11 seconds thereby cutting potential for nesting
          in half.
    
     2. Increased supported Graceful Recovery nesting from 3 to 5
    
        - Found that some links bounce more than others so a nesting
          count of 3 can lead to an occasional single node failure.
        - This adds a bit more resiliency to MNFA handling of cases
          that exhibit more link messaging bounce.
    
    Test Plan: Verified 60+ MNFA occurrences across 4 different
               system types including AIO plus, Standard and Storage
    
    PASS: Verify Single Node Graceful Recovery Handling
    PASS: Verify Multi Node Graceful Recovery Handling
    PASS: Verify Single Node Graceful Recovery Nesting Handling
    PASS: Verify Multi Node Graceful Recovery Nesting Handling
    PASS: Verify MNFA of up to 5 nests can be gracefully recovered
    PASS: Verify MNFA of 6 nests lead to full enable of affected nodes
    PASS: Verify update as a patch
    PASS: Verify mtcAgent logging
    
    Regression:
    
    PASS: Verify standard system install
    PASS: Verify product verification maintenance regression (4 runs)
    PASS: Verify MNFA threshold increase and below threshold behavior
    PASS: Verify MNFA with reduced timeout behavior for
          ... nested case that does not timeout
          ... case that does not timeout
          ... case that does timeout
    
    Closes Bug: 1892877
    Change-Id: I6b7d4478b5cae9521583af78e1370dadacd9536e
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
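The nesting limit and soak time from the commit message above can be illustrated with a short sketch. The constant and function names are hypothetical. The values 5, 3, 11 and 22 are the ones quoted in the commit message.

```python
# Hypothetical names; the numeric values are quoted from the commit message.
MAX_RECOVERY_NESTS = 5        # raised from 3 by this update
HEARTBEAT_SOAK_SECS = 11      # one soak per recovery after the redundant soak was removed


def nested_recovery_action(nest_count, max_nests=MAX_RECOVERY_NESTS):
    """Decide what happens when Graceful Recovery is interrupted again.

    Up to max_nests interruptions restart the recovery handler.
    Beyond that the host goes through a full enable.
    """
    if nest_count > max_nests:
        return "full-enable"
    return "restart-recovery-handler"


def recovery_handling_secs(soak_count):
    """Total heartbeat soak time for a given number of soaks."""
    return soak_count * HEARTBEAT_SOAK_SECS
```

With two soaks the handler needed 22 seconds. Dropping the redundant soak brings that to 11 seconds, which halves the window in which another MNFA event can interrupt recovery.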

commit 497a6f93f422bdaab0a5779d5345ba814d1ab3bc
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Tue Mar 16 13:45:18 2021 +0200

Fix reinstall of controller nodes
    
    At shutdown, systemd will try to remount everything read-only
    before attempting to unmount it. In the wipedisk script we
    are deleting the partitions without unmounting
    their corresponding filesystems. This leads to errors because
    systemd will try to remount filesystems
    whose partitions were deleted.
    
    To fix this we have to unmount the filesystems that are linked to the
    removed partitions.
    
    Closes-Bug: 1919153
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
    Change-Id: I49a3c06ae6bce1324dd06f4fc63fb3e5cd4d28c1

commit 4f5bf78f55ec8b0983262ee351183b1edd8443ad
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Fri Mar 12 17:10:00 2021 -0500

Improve mtcAgent interrupted thread cleanup
    
    A BMC command send will be rejected if its thread
    is not in the IDLE state going into the call.
    
    This issue is seen to occur over a reprovisioning action
    while the bmc access alarmable condition exists.
    
    Maintenance will do retries. So the only visible side effect
    of this issue is a failure to provision to 'redfish' over a
    provisioning switch to 'dynamic' (learn mode). Instead
    ipmi is selected.
    
    The non-return to idle can occur when the bmc handler FSM
    is interrupted by a reprovisioning request while a bmc
    command is in flight.
    
    This update enhances the thread management module by
    introducing a thread consumption utility that is called
    by the bmc command send utility. If the send finds that
    its thread is not in the IDLE state it will either kill
    the thread if it is running or free a completed but-not-
    consumed thread result.
    
    Note: Maintenance only supports the execution of
    a single thread per host per process at one time.
    
    Test Plan:
    
    PASS: Verify BMC provisioning change from ipmi to dynamic
          while the ipmi provisioning was failing prior to
          re-provisioning. Verify the previous error is cleaned
          up and the reprovisioning request succeeds as expected.
    
    PASS: Verify thread 'execution timeout kill' cleanup handling.
    PASS: Verify thread 'complete but not consumed' cleanup handling.
    PASS: Verify logging during regression soaks
    
    Regression:
    
    PASS: Verify bmc protocol reprovisioning script soak
    PASS: Verify sensor monitoring following BMC reprovisioning
    PASS: Verify product verification mtce regression test suite
    
    Change-Id: Ie5e9e89ed2f8db6888c0fc7de03d494c75517178
    Closes-Bug: 1864906
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
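A sketch of the thread consumption step described in the commit message above, under stated assumptions. The `Stage` enum, the `HostThread` class and both function names are invented for illustration. The real thread management module is C++ and its identifiers are not given in this report.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    IDLE = "idle"
    RUNNING = "running"
    DONE = "done"  # completed, but the result has not been consumed


@dataclass
class HostThread:
    """Hypothetical stand-in for the single per-host worker thread."""
    stage: Stage = Stage.IDLE
    result: Optional[str] = None


def consume_thread(thread: HostThread) -> str:
    """Return the thread to IDLE and report which cleanup was needed."""
    if thread.stage is Stage.RUNNING:
        action = "killed"          # interrupted while a command was in flight
    elif thread.stage is Stage.DONE:
        action = "freed-result"    # finished, but nobody collected the result
    else:
        action = "none"
    thread.stage = Stage.IDLE
    thread.result = None
    return action


def send_bmc_command(thread: HostThread, command: str) -> bool:
    """Clean up any leftover thread state, then start the new command."""
    consume_thread(thread)
    thread.stage = Stage.RUNNING
    return True
```

Calling the cleanup from inside the send path means a reprovisioning request can no longer be rejected because of state left behind by an interrupted command.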

commit 4f7d82308f5f7c663223344873f8b392a1311d82
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Thu Mar 11 11:13:59 2021 -0500

Add NonRecoverable property to Hardware Monitor's Redfish
    
    This update adds 'NonRecoverable' sensor health property
    to the Hardware Monitor's Redfish platform management
    protocol support.
    
    Test Plan:
    
    PASS: Verify handling of Redfish NonRecoverable sensor
          ... using redfish
          ... switching between ipmi and redfish and back
    PASS: Verify sensor model relearn over change of bmc protocol
    
    Regression:
    
    PASS: Verify sensor model relearn by command
    PASS: Verify sensor suppression
    PASS: Verify sensor alarm and degrade management
          ... as sensor events come and go
          ... on sensor suppression and unsuppression
    PASS: Verify sensor monitoring regression test
    PASS: Verify update as a patch (apply/remove)
    
    Change-Id: I2770e63f4d44e269b4410f392707f3cd01e9a2cc
    Closes-Bug: 1918152
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 6cf5e848256c7612e2d5dc3c0a86ac7b76684b6e
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Feb 24 12:36:31 2021 -0500

Add alarmed process audit to Process Monitor
    
    A failure to query process monitor alarms from
    FM during process startup can lead to a stuck
    failed process alarm.
    
    Rather than hold up the process monitor startup
    sequence due to an unresponsive fault manager,
    this update introduces an in-service alarm audit
    that looks for asserted alarms and compares that
    readout to the process monitor's runtime view.
    
    A difference in view is considered a state mismatch
    that requires corrective action. The runtime state
    of the process monitor always takes precedence over
    what is found in the FM database.
    
    A mismatch is declared and corrective action is
    taken if:
    
     - FM has a process failure alarm that pmond does not
       Corrective Action: Clear alarm in FM database
    
     - FM has a process failure alarm with a severity
       that differs from the pmond runtime state.
       Corrective Action: Update severity in FM database
    
     - FM has a process failure alarm for a process
       that pmond does not recognize.
       Corrective Action: Clear alarm in FM database
    
    This update only runs the audit on process startup
    until first successful query.
    A future update may enable the audit in-service.
    
    Test Plan:
    
    PASS: Verify all mismatch case handling
    PASS: Verify handling of valid active alarm
    PASS: Verify handling severity mismatch ; unsupported
    PASS: Verify pmond failure handling regression soak
    PASS: Verify pmond process restart regression soak
    PASS: Verify alarm handling over pmond process restart
    PASS: Verify alarmed state audit period and logging
    PASS: Verify pmond process failure alarm remains ignored by pmond
    PASS: Verify handling of persistently failed process over pmond restart
    PASS: Verify audit handling while FM is not running
          - audit retries every 50 seconds until fm query is successful
    
    COND: Verify audit handling while FM is stopped/blocked/stalled
          - alarm query blocks till fm runs again or is killed
          - this is the reason the audit is not run in-service.
    
    Change-Id: I697faa804dc7979fbb8b6f6c63811a6dda8c3118
    Closes-Bug: 1892884
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
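The three mismatch cases listed in the commit message above can be sketched as a small audit function. The function name, the argument shapes and the action tuples are assumptions for illustration only. They are not the pmond or FM APIs.

```python
def audit_process_alarms(fm_alarms, pmond_failed, known_processes):
    """Compare FM's process failure alarms against pmond's runtime view.

    fm_alarms:        {process name: severity} as read from FM (hypothetical shape)
    pmond_failed:     {process name: severity} for processes pmond considers failed
    known_processes:  set of process names pmond monitors

    Returns corrective actions. The pmond runtime state always wins.
    """
    actions = []
    for process, fm_severity in fm_alarms.items():
        if process not in known_processes:
            # FM has an alarm for a process pmond does not recognize
            actions.append(("clear", process))
        elif process not in pmond_failed:
            # FM has an alarm that pmond does not
            actions.append(("clear", process))
        elif pmond_failed[process] != fm_severity:
            # Alarm exists in both views but the severity differs
            actions.append(("update", process, pmond_failed[process]))
    return actions
```

An alarm that matches in both views produces no action, which corresponds to the "valid active alarm" case in the test plan.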

commit f34d51d3acf1ab45ae81e75ac620042f95d57b6f
Author: Babak Sarashki <babak.sarashki@windriver.com>
Date:   Fri Feb 26 17:50:35 2021 +0000

restrict kernel headers and devel package installation
    
    kernel change-id: Iafb3abe7 adds kernel headers and development
    packages to the default rootfs for pods needing to build drivers
    or other applications with kernel dependencies. This commit
    restricts installation of the above packages to worker and AIO.
    
    Story: 2008434
    Task: 41941
    
    Signed-off-by: Babak Sarashki <babak.sarashki@windriver.com>
    Change-Id: I5bb4e93a60a98dcd52be07c0baa6cb76517b30a8

commit 32fbc7e5aa8ad6e771598456961a760a875aa018
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Fri Feb 26 15:29:15 2021 +0200

Fix reinstall of worker nodes
    
    When the wipedisk code was updated, there were some
    changes that had to be used only on controllers
    but the code was doing the same thing on all the node types.
    
    In this review we add the proper branching of
    the code based on the node type.
    
    Closes-Bug: 1912623
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
    Change-Id: I91f68a7894da51a7d64602254a68cf7acbd4bcf2

commit 0a102143e9ee26485ef4b40b10bb8f32517ef5c2
Author: Angie Wang <angie.wang@windriver.com>
Date:   Wed Feb 24 17:15:54 2021 -0600

Fix mtce compiling issue with gcc8
    
    Remove superfluous 'const' to fix error:
      "type qualifiers ignored on cast result type
       [-Werror=ignored-qualifiers]"
    
    Update the usage of 'operator++' on type of 'bool'
    to fix error:
      "use of an operand of type 'bool' in 'operator++'
       is deprecated [-Werror=deprecated]"
    
    Change-Id: I0ce7b2d48f8365f1dcc23eb48e4c5148db817630
    Story: 2007506
    Task: 39279
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit 5619e3e8b626e1d592f8b99b455de97438910df5
Author: Angie Wang <angie.wang@windriver.com>
Date:   Tue Feb 23 18:19:26 2021 -0500

Increase cgts-vg size for dc-vault fs
    
    Increase the partition size for cgts-vg to include
    dc-vault fs(15G) on AIO.
    
    Tested installation of AIO-DX and AIO-DX DCSC
    
    Partial-bug: 1916797
    Change-Id: I00427820f710946275f99970ad9a7c1d8437955c
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit 95e5906a6b2b3e50cc04d661acf9821f657418f9
Author: Babak Sarashki <babak.sarashki@windriver.com>
Date:   Fri Feb 12 00:31:58 2021 +0000

Add ice kernel module filters
    
    This is in support of the new ice kernel module which is
    initially added to support Intel E810.
    
    Story: 2008436
    Task: 41821
    
    Signed-off-by: Babak Sarashki <babak.sarashki@windriver.com>
    Change-Id: Ic78988e3396cd2504c2d345bc4ca9fd99f2b53ac

commit c3c7ef80e2e165760f317a51c6c5ace600c49794
Author: Nicolas Alvarez <nicolas.alvarez@windriver.com>
Date:   Fri Jan 29 14:55:45 2021 -0300

Filter snmp rpm from non controller nodes
    
    Remove SNMP Host-Based entries
    Add SNMP Armada App entry
    
    Story: 2008132
    Task: 41715
    Depends-On: https://review.opendev.org/766088
    Depends-On: https://review.opendev.org/765381
    Depends-On: https://review.opendev.org/765875
    Signed-off-by: Nicolas Alvarez <nicolas.alvarez@windriver.com>
    Change-Id: I186a1eefb234d9e9e73df41c5e1df29c866c38bf

commit 2d5c5b04edf0d84f78a87e971cf1646e6efda00f
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Mon Jan 25 10:20:05 2021 -0500

Make mtcClient stop collectd before shutdown
    
    The collectd process has been seen to segfault
    in its internal network plugin during system
    shutdown.
    
    This update modifies the mtcClient to stop
    collectd when it is commanded to reboot the
    system.
    
    Change-Id: I681ff45a2afb1ae66d2a929a64027ea3ed75721e
    Partial-Bug: 1872979
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 9ab726b0eba645d5b8a60fbce306035bb6c13149
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Mon Sep 14 16:42:54 2020 -0400

Add support for peer controller reset via mtcClient
    
    This update adds the ability for SM to passively
    request the mtcClient to BMC reset its peer controller
    as a means to recover a severely loaded active controller.
    
    To do this the mtcAgent is modified to keep the controllers'
    mtcClients updated with the BMC info of their peers.
    
    The mtcClient is modified to audit for the SM signal
    and, when it is asserted, issue a BMC reset of its peer
    controller using an ipmitool system call.
    
    The ability to command the peer mtcClient to 'sync'
    prior to the BMC reset is implemented but configured
    disabled for now.
    
    Change-Id: Ibe4c8aaa3a980cbe5f34c3e22f015698a6453c1a
    Partial-Bug: #1895350
    Co-Authored-By: Bin.Qian@windriver.com
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 5ab03b5222f223e93ee299ed91a70a2df95647c4
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Fri Jan 8 09:59:24 2021 -0500

Mtce heartbeat cluster state change notification improvement
    
    The current heartbeat cluster state change notification
    needs to be sent when heartbeat pulses begin to be missed
    rather than only after the host has reached the Heartbeat
    Loss threshold. This buys SM more time, almost a full
    second, and in doing so provides more accurate data for
    it to make its SM heartbeat failure handling decisions.
    
    This update also begins sending maintenance heartbeat
    cluster state change notifications just before the next
    multicast pulse request but after the cluster vault is
    updated from the last pulse period. This ensures that
    SM gets the most up-to-date cluster information.
    
    This update also changes the hbsAgent's service file
    to depend on the local hbsClient. By doing so, the
    hbsAgent shuts down earlier over a graceful reboot
    thereby preventing the hbsAgent from continuing to
    report healthy response to the inactive controller
    during active controller shutdown.
    
    This way the inactive SM sees the failed active
    controller when it queries the cluster in its
    fail-pending state resulting in an inactive SM
    take-over rather than stand-down.
    
    Additional hbsAgent service file changes prevent
    systemd from auto-recovering a failed hbsAgent process,
    as it is monitored and managed by pmond, and fix the
    ExecStop command line.
    
    Test Plan:
    
    PASS: Verify active controller graceful reboot.
          Standby controller takes over rather than shutdown
          - 30 of 30 iterations
    PASS: Verify active controller forced reboot
    PASS: Verify enabled standby controller graceful reboot
    PASS: Verify Standard System install
    PASS: Verify AIO DX system install
    
    Regression:
    
    PASS: Verify SM Uncontrolled Swact if active
          controller Mgmnt link drops.
    PASS: Verify handling of downed cluster interface in
          - AIO DX (fail) and Standard (degrade) system
    PASS: Verify no coredumps
    PASS: Verify update as a patch
    
    Change-Id: I6869631e091eb28a3cbb6f15d9a8ccd939c54410
    Closes-Bug: 1906556
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
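
Illustration of the notification timing described in the commit above. This is a minimal Python sketch, not the hbsAgent code: the class name, the event strings and the `notify` callback are all hypothetical. It only shows the idea of sending a state change notification on the first missed pulse rather than only once the Heartbeat Loss threshold is reached.

```python
class PeerHeartbeat:
    """Tracks missed heartbeat pulses for one peer (hypothetical model)."""

    def __init__(self, loss_threshold, notify):
        self.loss_threshold = loss_threshold  # misses that count as Heartbeat Loss
        self.notify = notify                  # callback taking an event string
        self.misses = 0

    def pulse_period(self, responded):
        """Call once per pulse period with whether the peer responded."""
        if responded:
            was_missing = self.misses > 0
            self.misses = 0
            if was_missing:
                self.notify("recovered")
            return
        self.misses += 1
        if self.misses == 1:
            # Early notification: pulses have begun to be missed.
            self.notify("missing")
        if self.misses == self.loss_threshold:
            # Original behavior: notify once the loss threshold is reached.
            self.notify("loss")
```

With a threshold of 3, a listener sees "missing" two pulse periods before it sees "loss". That early notice is the kind of extra time the commit message says SM gains.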

commit f00de2a3114cbd906e18daf908a276c80fe032cb
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Tue Dec 22 17:03:55 2020 -0500

Add controller-0 to Mtce Heartbeat Service in AIO SX
    
    All system types with the exception of AIO SX
    add controller-0 to the heartbeat service.
    
    There is no enabled heartbeating in AIO SX so
    controller-0 was never added. However, without
    being added, the alarms the hbsAgent raises are
    not cleared over a process startup.
    
    The local hbsClient was designed to monitor
    pmond, effectively monitoring the process monitor,
    and report its ongoing health state to the
    hbsAgent. This way if pmond stops functioning
    maintenance is able to alarm that condition.
    
    However, because in AIO SX controller-0 is never
    added to the heartbeat service the current method
    of looping over the internal heartbeat service
    inventory clearing all the hbsAgent owned alarms
    for each host over a process restart is bypassed.
    
    So, in the failure mode where pmond is failing, the
    hbsAgent has raised an alarm against it, and an
    hbsAgent restart then coincides with 'pmond' process
    recovery, the pmond alarm gets stuck asserted.
    
    This update adds controller-0 to the heartbeat
    service inventory list for all system types so
    the hbsAgent managed alarms are cleared over a
    process restart regardless of the system type.
    
    Additionally, the following logging improvements
    were made:
    
     - add the network name to the heartbeat start log.
     - avoid heartbeat stop log when already stopped.
    
    Test Plan:
    
    PASS: Verify pmond alarm clears over hbsAgent process
          restart in AIO SX, AIO DX, Standard and Storage
          Systems.
    
    Regression:
    
    PASS: Verify Storage System Install and heartbeat
    PASS: Verify Standard System install and heartbeat
    PASS: Verify AIO DX install and heartbeat
    PASS: Verify AIO SX install and heartbeat
    PASS: Verify heartbeat logs and failure handling
    PEND: Verify update as a patch
    
    Change-Id: I9afd92a0b54296ef1f87ce7d912510649ae7560c
    Closes-Bug: 1904918
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>

commit 821f2840cc77250d55b6e3281936ebb92ae73f0c
Author: Don Penney <don.penney@windriver.com>
Date:   Thu Dec 17 13:26:24 2020 -0500

Add auto-version for remaining stx/metal packages
    
    Update remaining StarlingX packages with hardcoded TIS_PATCH_VER to
    use PKG_GITREVCOUNT where possible, with offsets as needed to ensure
    the version is incremented above the hardcoded version.
    
    Change-Id: I9fa1ceea76fa13ead2fed325e96a0be3028aa01e
    Story: 2008455
    Task: 41448
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit 484d662cb748747aea4c5137c340cc7ac316d21c
Author: Eric MacDonald <eric.macdonald@windriver.com>
Date:   Wed Dec 16 21:16:48 2020 -0500

Fix hbsAgent log flooding when SM heartbeat fails persistently
    
    If the SM part of this update is missing or the SM heartbeat
    is missing for a long period of time the hbsAgent produces
    5 logs every 10 seconds reporting the missing SM heartbeat.
    
    This is a follow-up update to its parent update
    https://review.opendev.org/c/starlingx/metal/+/751558
    
    This update throttles the warning log and corresponding
    cluster dump when SM heartbeat is persistently missing.
    
    PASS: Verify hbsAgent service and log behavior when SM
          heartbeat is persistently missing.
    
    Change-Id: Ib379ed5d37b5349ca170b5661a930b6a71c2bed1
    Partial-Fix: 1895350
    Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
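
Illustration of the log throttling idea in the commit above. This is a minimal Python sketch under assumed behavior, not the hbsAgent implementation: the class name and the "first occurrence, then every Nth" policy are hypothetical, chosen only to show how a persistently repeating warning can be kept from flooding a log.

```python
class ThrottledLog:
    """Allows the first occurrence of an event and then every Nth one."""

    def __init__(self, every_nth):
        self.every_nth = every_nth
        self.count = 0

    def should_log(self):
        """Return True if this occurrence should be written to the log."""
        self.count += 1
        return self.count == 1 or self.count % self.every_nth == 0

    def reset(self):
        """Call when the condition clears so the next failure logs at once."""
        self.count = 0
```

A caller would wrap the warning log and the cluster dump in `if throttle.should_log():` and call `reset()` when the SM heartbeat returns.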

commit 7f7ba86d4f2bc2c5e9ea30e29ff37d83e7fab2a2
Author: Martin, Chen <haochuan.z.chen@intel.com>
Date:   Mon Jun 22 16:00:52 2020 +0800

Add rook provisioned osd check in kickstart for restore case
    
    After rook is deployed, an OSD disk such as /dev/sdx or /dev/nvmex
    is provisioned as a PV in a volume group whose name is prefixed
    with "ceph". When the user restores the system, kickstart checks
    whether each disk is OSD provisioned and wipes it if not. Add rook
    provisioned OSD disks to the do-not-wipe list to enable rook restore.
    
    Story: 2005527
    Task: 39076
    
    Change-Id: Id0a5718dcdd1d9230ab1be4a33bc4af5cb356e14
    Signed-off-by: Martin, Chen <haochuan.z.chen@intel.com>
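
Illustration of the wipe-list check described in the commit above. This is a minimal Python sketch, not the kickstart script: the function name and the disk-to-volume-group mapping are hypothetical. The only detail taken from the commit message is that rook provisioned OSD disks sit in a volume group whose name starts with "ceph".

```python
def select_disks_to_wipe(disk_vgs, keep_prefix="ceph"):
    """Return the disks that are not in a volume group named with keep_prefix.

    disk_vgs maps a disk path to its volume group name, or None when the
    disk is not a physical volume in any group.
    """
    return [
        disk
        for disk, vg in disk_vgs.items()
        if not (vg and vg.startswith(keep_prefix))
    ]
```

How the volume group name is looked up for each disk is left out because the commit message does not describe it.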

commit 0e89acc83c616741952a068a3ff07ba91440eff8
Author: Daniel Safta <daniel.safta@windriver.com>
Date:   Thu Aug 27 11:15:17 2020 +0000

Align partitions created by kickstarters
    
    Partitions on some disks may be created unaligned.
    
    The cause is that partitions are created between specific
    boundaries expressed in MB. The kernel exposes a per-disk
    variable that provides an offset for aligning each
    partition (/sys/block/<disk>/alignment_offset).
    
    For finer-grained control, we convert MB units into
    logical sector units and use the alignment_offset variable
    to properly align the partitions.
    
    Change-Id: I971c232fe0969eac14b85c5796908f0c85e23dbf
    Closes-bug: 1883975
    Signed-off-by: Daniel Safta <daniel.safta@windriver.com>
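
Illustration of the unit conversion described in the commit above. This is a minimal Python sketch, not the kickstart code: the function names and the 512-byte default logical sector size are assumptions. It also assumes the offset is simply added to the MiB-derived start, which may differ from what the actual change does. The value in /sys/block/<disk>/alignment_offset is reported in bytes.

```python
MIB = 1024 * 1024


def mib_to_sectors(mib, logical_sector_size=512):
    """Convert a size or position in MiB to a count of logical sectors."""
    return mib * MIB // logical_sector_size


def aligned_start_sector(start_mib, alignment_offset_bytes=0,
                         logical_sector_size=512):
    """Shift a MiB-based partition start by the disk's alignment offset.

    alignment_offset_bytes is the value read from
    /sys/block/<disk>/alignment_offset (0 on a naturally aligned disk).
    """
    start = mib_to_sectors(start_mib, logical_sector_size)
    return start + alignment_offset_bytes // logical_sector_size
```

For a disk that reports an offset of 0 the result is the plain MiB boundary, so behavior on naturally aligned disks is unchanged.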

tags:

added: in-f-centos8