Force active controller reboot results in a second reboot

Bug #1922584 reported by Eric MacDonald
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Low         Eric MacDonald

Bug Description

A forced reboot of the active controller while there is an available standby controller results in an unexpected second reboot of that force rebooted controller.

Severity
--------
Minor: Server is already out of service; it just stays out of service for another 10 minutes

Steps to Reproduce
------------------
Execute "sudo reboot -f" on the active controller of an AIO system while there is an available standby controller

Expected Behavior
------------------
Rebooted active controller should not be rebooted a second time by maintenance.

Actual Behavior
----------------
Rebooted active controller is sometimes rebooted a second time by maintenance.

Reproducibility
---------------
Intermittent: Based on server performance/boot time. More likely with longer boot times.

System Configuration
--------------------
AIO DX

Branch/Pull Time/Commit
-----------------------
April, 2021

Last Pass
---------
Unknown

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Regression Testing

Workaround
----------
None.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: marking as low priority given the issue is intermittent and the system recovers after the 2nd reboot. Would be nice to fix in stx master for the next release, but will not gate stx.5.0

Changed in starlingx:
importance: Undecided → Low
assignee: nobody → Eric MacDonald (rocksolidmtce)
status: New → Triaged
tags: added: stx.metal
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/788495

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/788495
Committed: https://opendev.org/starlingx/metal/commit/48978d804d6f22130d0bd8bd17f361441024bc6c
Submitter: "Zuul (22348)"
Branch: master

commit 48978d804d6f22130d0bd8bd17f361441024bc6c
Author: Eric MacDonald <email address hidden>
Date: Wed Apr 28 09:39:19 2021 -0400

    Improved maintenance handling of spontaneous active controller reboot

    Performing a forced reboot of the active controller sometimes
    results in a second reboot of that controller. The second reboot
    occurred because the uptime reported in the first mtcAlive
    message following the reboot was greater than 10 minutes.

    Maintenance has a long-standing graceful recovery threshold of
    10 minutes. If a host loses heartbeat and enters Graceful
    Recovery, and the uptime value extracted from the first mtcAlive
    message following the recovery of that host exceeds 10 minutes,
    then maintenance concludes that the host did not reboot.
    If a host goes absent for longer than this threshold then, for
    reasons not limited to security, maintenance declares the host
    as 'failed' and force re-enables it through a reboot.

    With the introduction of containers and addition of new features
    over the last few releases, boot times on some servers are
    approaching the 10 minute threshold and in this case exceeded
    the threshold.

    The primary fix in this update is to increase this long standing
    threshold to 15 minutes to account for evolution of the product.

    While debugging this issue, a few other undesirable behaviors
    related to Graceful Recovery were observed. The following
    additional changes address them.

     - Remove hbsAgent process restart in ha service management
       failover failure recovery handling. This change is in the
       ha git with a loose dependency placed on this update.
       Reason: https://review.opendev.org/c/starlingx/ha/+/788299

     - Prevent the hbsAgent from sending heartbeat clear events
       to maintenance in response to a heartbeat stop command.
       Reason: Maintenance receiving these clear events while in
               Graceful Recovery causes it to pop out of graceful
               recovery only to re-enter as a retry, which
               needlessly consumes one of a maximum of 5 retries.

     - Prevent successful Graceful Recovery until all heartbeat
       monitored networks recover.
       Reason: If heartbeat of one network, say cluster, recovers but
               another (management) does not, then it's possible the
               max Graceful Recovery Retries could be reached quite
               quickly, causing maintenance to fail the host and
               force a full enable with reboot.

     - Extend the wait for the hbsClient ready event in the graceful
       recovery handler timeout from 1 minute to worker config timeout.
       Reason: To give the worker config time to complete before force
               starting the recovery handler's heartbe...
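
The threshold check and the all-networks condition described in the commit message above can be illustrated with a small sketch. This is not the maintenance code itself; the function names, the constant names and the dictionary of per-network heartbeat states are hypothetical and chosen only for illustration. The 10 and 15 minute values come from the commit message.

```python
# Hypothetical names; only the 10 and 15 minute values come from the commit message.
OLD_THRESHOLD_SECS = 10 * 60
NEW_THRESHOLD_SECS = 15 * 60


def needs_forced_reenable(first_mtcalive_uptime_secs, threshold_secs):
    """Return True when the uptime in the first mtcAlive message exceeds the
    threshold, i.e. the host is treated as not having rebooted and is
    therefore failed and force re-enabled through a reboot."""
    return first_mtcalive_uptime_secs > threshold_secs


def all_networks_recovered(heartbeat_ok_by_network):
    """Return True only when every heartbeat-monitored network has recovered.
    An empty mapping is treated as 'not recovered' (an assumption)."""
    return bool(heartbeat_ok_by_network) and all(heartbeat_ok_by_network.values())


# A server that takes 11 minutes before its first mtcAlive message is sent.
slow_boot_uptime = 11 * 60
print(needs_forced_reenable(slow_boot_uptime, OLD_THRESHOLD_SECS))  # second reboot
print(needs_forced_reenable(slow_boot_uptime, NEW_THRESHOLD_SECS))  # no second reboot
print(all_networks_recovered({"cluster": True, "management": False}))
```

With the assumed 11 minute boot time, the old threshold classifies the host as not rebooted, which is the condition that leads to the unexpected second reboot, while the new threshold does not.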


Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/788299
Committed: https://opendev.org/starlingx/ha/commit/cb5fa9510f3ebda66f9850ac697e542bf041ce8c
Submitter: "Zuul (22348)"
Branch: master

commit cb5fa9510f3ebda66f9850ac697e542bf041ce8c
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 27 09:43:00 2021 -0400

    Remove hbsAgent restart in failover failure recovery handling

    A forced reboot of the active controller in an AIO DC system
    puts SM into a failover failure recovery loop that prevents
    maintenance from detecting the heartbeat failure of the just-
    rebooted controller.

    The SM's failover failure recovery handling algorithm includes
    a self (sm process) restart preceded by a restart of the
    hbsAgent, both added by the following update last year.

    update: Add unhealthy state recovery audit to service management (sm)
    review: https://review.opendev.org/c/starlingx/ha/+/735219

    The self restart of SM was and is required in this case. However,
    the restart of the hbsAgent was only included as a safety measure,
    at the time, to ensure SM received updated cluster state info. The
    hbsAgent restart was only added at that time with the longer term
    intention to have it removed once the hbsAgent cluster state change
    notification improvement was implemented. That change is now
    implemented and merged by the following update.

    update: Mtce heartbeat cluster state change notification improvement
    review: https://review.opendev.org/c/starlingx/metal/+/769936

    Testing of the fix for the following issue in an AIO DC system
    resulted in the takeover controller not detecting a heartbeat loss
    of the just rebooted standby controller.

    title: Force active controller reboot results in a second reboot
    issue: https://bugs.launchpad.net/starlingx/+bug/1922584

    The hbsAgent is not able to detect the heartbeat loss of the just-
    booted controller because SM keeps restarting it before it reaches
    the heartbeat loss state.

    With the cluster notification improvement update now implemented
    and merged it's time to remove the hbsAgent restart from SM's
    failover failure recovery algorithm.

    Test Plan:

    PASS: Active controller force reboot handling in AIO DC, DX and
          standard systems.
    PASS: Standby controller force reboot handling in AIO DC, DX and
          standard systems

    Partial-Bug: 1922584
    Signed-off-by: Eric MacDonald <email address hidden>
    Change-Id: I26aa5ed9e0faec7294816269dbaa49cbb4696f66
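
The effect described in the commit message above, where repeated restarts keep the hbsAgent from ever reaching the heartbeat loss state, can be shown with a toy simulation. The counting model, the function name and all of the numbers are assumptions made for illustration; the source states only that SM kept restarting the hbsAgent before it reached the heartbeat loss state.

```python
def heartbeat_loss_detected(total_ticks, misses_to_declare_loss, restart_every=None):
    """Toy model: the peer is down, so every tick is one missed heartbeat.
    Loss is declared once enough consecutive misses accumulate. A restart of
    the monitoring process discards the accumulated miss count."""
    missed = 0
    for tick in range(1, total_ticks + 1):
        missed += 1
        if missed >= misses_to_declare_loss:
            return True
        if restart_every is not None and tick % restart_every == 0:
            missed = 0  # restart: the monitor starts counting from scratch
    return False


# Restarted every 5 ticks, the monitor never accumulates the 10 misses it needs.
print(heartbeat_loss_detected(100, 10, restart_every=5))
# Without restarts, loss is declared as soon as 10 misses have accumulated.
print(heartbeat_loss_detected(100, 10))
```

In this model, detection fails whenever the restart interval is shorter than the number of misses needed to declare loss, which is why removing the restart lets detection complete.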

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Re-opened. The above update prevents maintenance from enabling heartbeat of self by the peer controller.
A fix will be available shortly.

Changed in starlingx:
status: Fix Released → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/789958

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/789958
Committed: https://opendev.org/starlingx/metal/commit/ce7529964932a9fd1cc10ce18dbe11e89ee02223
Submitter: "Zuul (22348)"
Branch: master

commit ce7529964932a9fd1cc10ce18dbe11e89ee02223
Author: Eric MacDonald <email address hidden>
Date: Wed May 5 19:05:55 2021 -0400

    Fix enabling heartbeat of self from the peer controller

    This issue only occurs over an hbsAgent process restart
    where the ready event response does not include the
    heartbeat start of the peer controller.

    This update reverts a small code change that was
    introduced by the following update.

    https://review.opendev.org/c/starlingx/metal/+/788495

    Remove the my_hostname gate introduced at line 1267 of
    mtcCtrlMsg.cpp because it prevents enabling heartbeat
    of self by the peer controller.

    Change-Id: Id72c35f25e2a5231a8a8363a35a81e042f00085e
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/metal/+/792250

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ha/+/792251

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/792250
Committed: https://opendev.org/starlingx/metal/commit/6c2905e665ceeebfa7717c9cbccc1c277d10966b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 5942a56ec6f0b265ca6d1c8c800fe84c4a22860f
Author: Eric MacDonald <email address hidden>
Date: Thu May 13 15:57:43 2021 +0000

    Revert "Align partitions created by kickstarters"

    This reverts commit 0e89acc83c616741952a068a3ff07ba91440eff8.

    Reason for revert: Review should have been abandoned rather than merged.

    Change-Id: I95f1e151183f122d93b834ab2a785736e5a8ef12
    Closes-Bug: 1928341

commit c7c341b198e79bb98f443c7c07f671c6387075af
Author: Don Penney <email address hidden>
Date: Fri May 7 08:56:06 2021 -0400

    Add /pxeboot/grubx64.efi symlink for UEFI pxeboot

    UEFI pxeboot with shim.efi looks for the grubx64.efi in the tftpboot
    root directory. This update creates a symlink to the
    /pxeboot/EFI/grubx64.efi file in /pxeboot.

    Change-Id: Iabf8ec89d0af6e6b1a62e20159ecdfa16729444e
    Partial-Bug: 1927730
    Signed-off-by: Don Penney <email address hidden>

commit ce7529964932a9fd1cc10ce18dbe11e89ee02223
Author: Eric MacDonald <email address hidden>
Date: Wed May 5 19:05:55 2021 -0400

    Fix enabling heartbeat of self from the peer controller

    This issue only occurs over an hbsAgent process restart
    where the ready event response does not include the
    heartbeat start of the peer controller.

    This update reverts a small code change that was
    introduced by the following update.

    https://review.opendev.org/c/starlingx/metal/+/788495

    Remove the my_hostname gate introduced at line 1267 of
    mtcCtrlMsg.cpp because it prevents enabling heartbeat
    of self by the peer controller.

    Change-Id: Id72c35f25e2a5231a8a8363a35a81e042f00085e
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <email address hidden>

commit 48978d804d6f22130d0bd8bd17f361441024bc6c
Author: Eric MacDonald <email address hidden>
Date: Wed Apr 28 09:39:19 2021 -0400

    Improved maintenance handling of spontaneous active controller reboot

    Performing a forced reboot of the active controller sometimes
    results in a second reboot of that controller. The second reboot
    occurred because the uptime reported in the first mtcAlive
    message following the reboot was greater than 10 minutes.

    Maintenance has a long-standing graceful recovery threshold of
    10 minutes. If a host loses heartbeat and enters Graceful
    Recovery, and the uptime value extracted from the first mtcAlive
    message following the recovery of that host exceeds 10 minutes,
    then maintenance concludes that the host did not reboot.
    If a host goes absent for longer than this threshold then, for
    reasons not limited to security, maintenance declares the host
    as 'failed' and force re-enables it through a reboot.

    With the introduction of containers and addition of new features
    over the last few releases, boot times on some servers are
    approaching the 10 minute threshold an...

tags: added: in-f-centos8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/792251
Committed: https://opendev.org/starlingx/ha/commit/85bab5d2b394114feabe524504339a55eb8904e0
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9f70df63fd0d83bf0f94d1b9ac2f98516d5971c8
Author: Bin Qian <email address hidden>
Date: Fri May 7 16:36:23 2021 -0400

    Fix no swact for failure of critical services

    This fix is to ensure keeping service failure counting over successful
    audit.

    When service enabled audit successfully completes, SM reset the service
    failure state. However it should not reset the service fail-count.
    The fail-count should only be reset after the grace period.

    Closes-Bug: 1893669
    Change-Id: I6996fe3f1c08c38da6f26243aee2b95b083069f0
    Signed-off-by: Bin Qian <email address hidden>

commit 0b99b594f83b7c626cc0c4f7dc970ce373a7b748
Author: Bin Qian <email address hidden>
Date: Tue May 4 11:33:43 2021 -0400

    Fix AIO-DX failover issues

    This fix is to fix AIO unexpected failover behaviors.
    1. active controller reboots itself when standby controller
       reboot/lost power
    2. standby controller becomes degraded after active controller
       reboot/lost power

    Closes-bug: 1927133
    Change-Id: If3c9f6251f689a89cd206c672092ba296f00bd6b
    Signed-off-by: Bin Qian <email address hidden>

commit cb5fa9510f3ebda66f9850ac697e542bf041ce8c
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 27 09:43:00 2021 -0400

    Remove hbsAgent restart in failover failure recovery handling

    A forced reboot of the active controller in an AIO DC system
    puts SM into a failover failure recovery loop that prevents
    maintenance from detecting the heartbeat failure of the just-
    rebooted controller.

    The SM's failover failure recovery handling algorithm includes
    a self (sm process) restart preceded by a restart of the
    hbsAgent, both added by the following update last year.

    update: Add unhealthy state recovery audit to service management (sm)
    review: https://review.opendev.org/c/starlingx/ha/+/735219

    The self restart of SM was and is required in this case. However,
    the restart of the hbsAgent was only included as a safety measure,
    at the time, to ensure SM received updated cluster state info. The
    hbsAgent restart was only added at that time with the longer term
    intention to have it removed once the hbsAgent cluster state change
    notification improvement was implemented. That change is now
    implemented and merged by the following update.

    update: Mtce heartbeat cluster state change notification improvement
    review: https://review.opendev.org/c/starlingx/metal/+/769936

    Testing of the fix for the following issue in an AIO DC system
    resulted in the takeover controller not detecting a heartbeat loss
    of the just rebooted standby controller.

    title: Force active controller reboot results in a second reboot
    issue: https://bugs.launchpad.net/starlingx/+bug/1922584

    The hbsAgent is not able to detect the heartbeat loss of the just-
    booted controller because SM keeps re...
