Comment 3 for bug 1922584

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/788495
Committed: https://opendev.org/starlingx/metal/commit/48978d804d6f22130d0bd8bd17f361441024bc6c
Submitter: "Zuul (22348)"
Branch: master

commit 48978d804d6f22130d0bd8bd17f361441024bc6c
Author: Eric MacDonald <email address hidden>
Date: Wed Apr 28 09:39:19 2021 -0400

    Improved maintenance handling of spontaneous active controller reboot

    Performing a forced reboot of the active controller sometimes
    results in a second reboot of that controller. The second reboot
    occurred because the uptime reported in the first mtcAlive
    message following the reboot was greater than 10 minutes.

    Maintenance has a long-standing Graceful Recovery threshold of
    10 minutes. If a host loses heartbeat and enters Graceful
    Recovery, and the uptime value extracted from the first mtcAlive
    message following that host's recovery exceeds 10 minutes, then
    maintenance interprets that the host did not reboot. When a host
    goes absent for longer than this threshold then, for reasons not
    limited to security, maintenance declares the host 'failed' and
    force re-enables it through a reboot.

    With the introduction of containers and the addition of new
    features over the last few releases, boot times on some servers
    are approaching the 10 minute threshold and, in this case,
    exceeded it.

    The primary fix in this update is to increase this long-standing
    threshold to 15 minutes to account for the evolution of the
    product.
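
    For illustration only, the following minimal C++ sketch shows the
    uptime check described above with the raised threshold. The
    constant and function names are assumptions made for this sketch,
    not the actual mtcAgent identifiers.

        #include <ctime>

        /* Illustrative threshold, raised from 10 to 15 minutes. */
        static const time_t RECOVERY_UPTIME_THRESHOLD_SECS = 15 * 60;

        /* Hypothetical helper: decide whether the first mtcAlive
         * message received after a heartbeat loss indicates that the
         * host actually rebooted. */
        bool host_rebooted(time_t reported_uptime_secs)
        {
            /* An uptime below the threshold implies the host reset;
             * above it, maintenance assumes the host never rebooted
             * and forces a full re-enable with reboot. */
            return (reported_uptime_secs < RECOVERY_UPTIME_THRESHOLD_SECS);
        }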

    While debugging this issue, a few other undesirable behaviors
    related to Graceful Recovery were observed, which the following
    additional changes address.

     - Remove hbsAgent process restart in ha service management
       failover failure recovery handling. This change is in the
       ha git with a loose dependency placed on this update.
       Reason: https://review.opendev.org/c/starlingx/ha/+/788299

     - Prevent the hbsAgent from sending heartbeat clear events
       to maintenance in response to a heartbeat stop command.
       Reason: Maintenance receiving these clear events while in
               Graceful Recovery causes it to pop out of Graceful
               Recovery only to re-enter as a retry, needlessly
               consuming one of the maximum of 5 retries.

     - Prevent successful Graceful Recovery until all heartbeat
       monitored networks recover (a minimal sketch of this gating
       follows this list).
       Reason: If heartbeat recovers on one network, say the cluster
               network, but not on another (management), then it is
               possible for the maximum Graceful Recovery retries to
               be reached quite quickly, causing maintenance to fail
               the host and force a full enable with reboot.

     - Extend the graceful recovery handler's wait for the hbsClient
       ready event from a 1 minute timeout to the worker config
       timeout.
       Reason: To give the worker config time to complete before force
               starting the recovery handler's heartbeat soak.

     - Add Graceful Recovery Wait state recovery over process restart.
       Reason: Avoid a double reboot of a Gracefully Recovering host
               over an SM service bounce.

     - Add a requirement for a valid out-of-band mtce flags value
       before declaring a configuration error in the subfunction
       enable handler.
       Reason: Rebooting the active controller can sometimes result
               in a falsely reported configuration error due to the
               subfunction enable handler interpreting a zero value
               as a configuration error.

     - Add uptime to all Graceful Recovery 'Connectivity Recovered' logs.
       Reason: To assist log analysis and issue debugging.
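
    As referenced above, the following minimal C++ sketch illustrates
    gating Graceful Recovery completion on all heartbeat monitored
    networks having recovered. The structure and function names are
    assumptions made for this sketch, not the actual mtcAgent code.

        /* Hypothetical per-network recovery state; the real mtcAgent
         * state tracking differs. */
        struct hbs_recovery_state
        {
            bool mgmnt_recovered;  /* management heartbeat restored  */
            bool clstr_monitored;  /* cluster heartbeat is monitored */
            bool clstr_recovered;  /* cluster heartbeat restored     */
        };

        /* Only declare Graceful Recovery successful once every
         * monitored network has regained heartbeat; otherwise keep
         * waiting rather than consuming recovery retries. */
        bool graceful_recovery_complete(const hbs_recovery_state &s)
        {
            if (!s.mgmnt_recovered)
                return false;
            if (s.clstr_monitored && !s.clstr_recovered)
                return false;
            return true;
        }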

    Test Plan:

    PASS: Verify handling of active controller reboot
                 cases: AIO DC, AIO DX, Standard, and Storage
    PASS: Verify Graceful Recovery Wait behavior
                 cases: with and without timeout, with and without bmc
                 cases: uptime > 15 mins and 10 < uptime < 15 mins
    PASS: Verify Graceful Recovery continuation over mtcAgent restart
                 cases: peer controller, compute, MNFA 4 computes
    PASS: Verify AIO DX and DC active controller reboot with standby
                 controller takeover when the standby has been up for
                 less than 15 minutes.

    Regression:

    PASS: Verify MNFA feature ; 4 computes in 8 node Storage system
    PASS: Verify cluster network only heartbeat loss handling
                 cases: worker and standby controller in all systems.
    PASS: Verify Dead Office Recovery (DOR)
                 cases: AIO DC, AIO DX, Standard, Storage
    PASS: Verify system installations
                 cases: AIO SX/DC/DX and 8 node Storage system
    PASS: Verify heartbeat and graceful recovery of both 'standby
                 controller' and worker nodes in AIO Plus.

    PASS: Verify logging and no coredumps over all of testing
    PASS: Verify no missing or stuck alarms over all of testing

    Change-Id: I3d16d8627b7e838faf931a3c2039a6babf2a79ef
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <email address hidden>