mtcAgent sometimes fails AIO standby controller over DOR
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Low | Eric MacDonald |
Bug Description
If the mtcAgent starts up on a host whose uptime is less than 15 minutes, it treats the startup as a DOR (Dead Office Recovery) scenario. If this happens on an AIO system and the peer controller takes longer to reach the configured state, maintenance fails the peer controller on the first failure of the subfunction's start host services request.
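The uptime check described above can be pictured with a minimal Python sketch. This is not the mtcAgent code (which is not shown in this report); the function and constant names are hypothetical, and only the 15 minute threshold comes from the description. Reading uptime from the first field of `/proc/uptime` is an assumption about how a Linux host would obtain the value.

```python
# Illustrative only: names below are hypothetical, not mtcAgent identifiers.
DOR_UPTIME_THRESHOLD_SECS = 15 * 60  # 15 minutes, per the bug description


def read_uptime_secs(path="/proc/uptime"):
    """Return host uptime in seconds (first field of Linux /proc/uptime)."""
    with open(path) as f:
        return float(f.read().split()[0])


def is_dor_startup(uptime_secs, threshold_secs=DOR_UPTIME_THRESHOLD_SECS):
    """Return True when a startup at this uptime would count as DOR."""
    return uptime_secs < threshold_secs
```

On a Linux host, `is_dor_startup(read_uptime_secs())` would report whether a process starting now falls inside the window. A standby controller that was just unlocked and then becomes active through a swact still has a low uptime, which is how the scenario in this report ends up being treated as DOR.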
Severity:
Minor: already booting (out-of-service) standby controller takes longer
to recover due to a second reboot.
Steps to Reproduce
Lock and unlock the standby controller so it starts with a low uptime.
As soon as it is enabled, force reboot the active controller so that SM
swacts to the standby with low uptime.
Expected Behavior
Maintenance should Gracefully Recover the force rebooted controller.
Actual Behavior
Maintenance fails the force rebooted standby controller and re-enables
it via another reboot which always succeeds.
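One way to picture the gap between the actual and expected behavior is a toy model of the enable handling, sketched here in Python. It is not the maintenance code and not necessarily how the fix works; `recover_subfunction` and `max_attempts` are hypothetical names introduced only for illustration. Each entry in `attempt_results` stands for the outcome of one start host services request.

```python
# Toy model only: hypothetical names, not the actual maintenance logic.
def recover_subfunction(attempt_results, max_attempts):
    """Walk request outcomes (True means start host services succeeded).

    Returns "enabled" on the first success, or "failed" once
    max_attempts failures are seen or the outcomes run out.
    """
    failures = 0
    for ok in attempt_results:
        if ok:
            return "enabled"
        failures += 1
        if failures >= max_attempts:
            return "failed"
    return "failed"


# Peer is not yet configured on the first request, ready on the second.
slow_peer = [False, True]
fail_fast = recover_subfunction(slow_peer, max_attempts=1)
tolerant = recover_subfunction(slow_peer, max_attempts=3)
```

With `max_attempts=1` the slow peer is failed on its first unsuccessful request, which matches the behavior reported here; a more tolerant limit lets the same peer reach the enabled state without a second reboot.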
Reproducibility:
Is 100% reproducible in the above 'already faulted' scenario
System Configuration:
AIO Duplex
Branch/Pull Time/Commit:
Any pull in May, 2021 and before.
Last Pass:
Unknown
Test Activity:
Developer Testing
Work Around:
None
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/790792