StarlingX

mtcAgent sometimes fails AIO standby controller over DOR

Bug #1928095 reported by Eric MacDonald on 2021-05-11

This bug affects 1 person

Affects      Status         Importance   Assigned to       Milestone
StarlingX    Fix Released   Low          Eric MacDonald

Bug Description

If the mtcAgent starts up on a host whose uptime is less than 15 minutes, it treats the startup as a DOR (Dead Office Recovery) scenario. If this happens on an AIO system and the peer controller takes longer to reach the configured state, maintenance fails the peer controller on the first failure of the subfunction's 'start host services' request.
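
For illustration only, the DOR decision described above can be sketched as a simple uptime threshold check. This is not the mtcAgent code (which is not shown in this report); the function and constant names are hypothetical, and only the 15 minute threshold comes from the description.

```python
# Hypothetical sketch of the DOR check described in this report.
# The names below are assumptions; only the 15 minute value is from the bug text.
DOR_UPTIME_THRESHOLD_SECS = 15 * 60


def is_dor_startup(uptime_secs: float,
                   threshold_secs: float = DOR_UPTIME_THRESHOLD_SECS) -> bool:
    """Return True when a process start should be treated as a DOR.

    The report says uptime "less than 15 minutes", so the comparison is strict.
    """
    return uptime_secs < threshold_secs


def read_uptime_secs(path: str = "/proc/uptime") -> float:
    """Read host uptime in seconds from the first field of /proc/uptime (Linux)."""
    with open(path) as f:
        return float(f.read().split()[0])
```

On a Linux host, `is_dor_startup(read_uptime_secs())` would report whether the host is still inside the window. A standby controller that was just locked and unlocked falls inside it, which is why the reproduction steps rely on a low uptime.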

Severity:

Minor: already booting (out-of-service) standby controller takes longer
to recover due to a second reboot.

Steps to Reproduce

Lock and unlock the standby controller so it starts with a low uptime.

As soon as it is enabled, force reboot the active controller so that SM
swacts to the standby, which has a low uptime.

Expected Behavior

Maintenance should Gracefully Recover the force rebooted controller.

Actual Behavior

Maintenance fails the force rebooted standby controller and re-enables
it via another reboot which always succeeds.

Reproducibility:

Is 100% reproducible in the above 'already faulted' scenario

System Configuration:

AIO Duplex

Branch/Pull Time/Commit:

Any pull in May, 2021 and before.

Last Pass:

Unknown

Test Activity:

Developer Testing

Work Around:

None

Tags:

OpenStack Infra (hudson-openstack) wrote on 2021-05-11: Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/790792

Changed in starlingx:
status:	New → In Progress

Ghada Khalil (gkhalil) on 2021-05-14

Changed in starlingx:
importance:	Undecided → Low
tags:	added: stx.metal
Changed in starlingx:
assignee:	nobody → Eric MacDonald (rocksolidmtce)

OpenStack Infra (hudson-openstack) wrote on 2021-05-31: Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/790792
Committed: https://opendev.org/starlingx/metal/commit/ba6c61584d62f4584dab6c8187f5973211e7e720
Submitter: "Zuul (22348)"
Branch: master

commit ba6c61584d62f4584dab6c8187f5973211e7e720
Author: Eric MacDonald <email address hidden>
Date: Tue May 11 12:15:42 2021 -0400

    Refactor background in-service start host services handling

    The maintenance add_handler fsm loads inventory and recovers
    host state over a process restart. If the active controller's
    uptime is less than 15 minutes the restart event is treated as
    a Dead Office Recovery (DOR) and is more forgiving to host
    recovery by scheduling the 'start host services' as a
    background operation so as to not hold up the add operation.

    The current implementation of the background handling of
    'start host services' does not handle the AIO subfunction
    case properly in DOR mode and is difficult to follow, and
    therefore difficult to fix and maintain. This mishandling
    leads to maintenance incorrectly failing the node with a
    subfunction configuration error over the DOR case.

    This update refactors the background handling of 'start host
    services' to fix the issue and improve its clarity and
    maintainability.

    Test Cases:

    PASS: Verify AIO DX DOR handling
    PASS: Verify AIO DX active controller reboot handling
          - standby with uptime < 15 min and > 15 min
    PASS: Verify AIO DX standby controller reboot handling
    PASS: Verify subfunction configuration error handling

    Regression:

    PASS: Verify start host services wait/retry handling.
    PASS: Verify start host services failure handling.
    PASS: Verify DOR of Standard system
    PASS: Verify DOR of AIO Plus system
    PASS: Verify AIO System Install
    PASS: Verify Standard System Install
    PASS: Verify AIO Plus System Install

    Change-Id: Ia4683672e3a2852b5b4837167b2dcd2a1e4e6d57
    Closes-Bug: 1928095
    Signed-off-by: Eric MacDonald <email address hidden>
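
As a rough illustration of the behavior the commit message describes, the sketch below models how a 'start host services' result might be handled differently in DOR mode. It is an assumption-laden model written for this report, not the refactored maintenance code: the `Action`, `StartServicesResult`, and `next_action` names, the retry budget, and the exact conditions are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ENABLE = "enable"  # host services started; host can proceed to enabled
    RETRY = "retry"    # keep waiting in the background and try again
    FAIL = "fail"      # fail the host, which leads to another reboot


@dataclass
class StartServicesResult:
    ok: bool                      # did the start host services request succeed?
    subfunction_configured: bool  # has the AIO subfunction reached 'configured'?


def next_action(result: StartServicesResult, dor_mode: bool,
                retries_left: int) -> Action:
    """Pick the next step after a start host services attempt (hypothetical)."""
    if result.ok:
        return Action.ENABLE
    # In DOR mode a peer that is still configuring gets more time
    # instead of being failed on its first unsuccessful attempt.
    if dor_mode and not result.subfunction_configured and retries_left > 0:
        return Action.RETRY
    return Action.FAIL
```

Under this model, the faulty behavior reported here corresponds to returning `FAIL` on the first unsuccessful attempt even though the peer's subfunction had not yet reached the configured state.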

Changed in starlingx:
status:	In Progress → Fix Released
