Mtce http connection failures should delay between retries
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Eric MacDonald |
Bug Description
Brief Description:
Maintenance (mtce) interfaces with sysinv, sm and the vim using http requests. Request timeouts have an inherent delay between retries. However, command failures and outright connection failures do not, so those requests are retried immediately.
This has only become obvious in mtce's communication with the vim, where a process startup timing change appears to leave the vim not yet ready to handle commands when mtcAgent starts sending them following a platform services group startup by sm.
Rather than chase the subtle timing change, this Jira requests that mtce http retry handling be improved by adding a proper retry wait state to mtce's http command handling work queue fsm.
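The requested behaviour can be illustrated with a small sketch. This is not mtce's code (mtce itself is not written in Python); it is a minimal model of a work queue item whose state machine enters a retry wait state after a command failure or connection failure instead of resending immediately. All names here (`Stage`, `HttpWorkItem`, `make_flaky_sender`, `max_retries`, `retry_delay_secs`) are hypothetical and chosen only for illustration.

```python
import enum


class Stage(enum.Enum):
    SEND = "send"              # issue the request on the next step
    RETRY_WAIT = "retry_wait"  # hold off until the retry delay expires
    DONE = "done"              # request succeeded
    FAILED = "failed"          # retries exhausted


class HttpWorkItem:
    """One queued request, advanced by repeated calls to step(now)."""

    def __init__(self, send, max_retries=3, retry_delay_secs=2.0):
        # send: callable returning True on success, False on a command
        # failure; it may raise ConnectionError if the peer is not up yet.
        self.send = send
        self.max_retries = max_retries
        self.retry_delay_secs = retry_delay_secs
        self.stage = Stage.SEND
        self.retries = 0
        self.retry_at = 0.0

    def step(self, now):
        """Advance the state machine by one transition and return the new stage."""
        if self.stage is Stage.SEND:
            try:
                ok = self.send()
            except ConnectionError:
                ok = False
            if ok:
                self.stage = Stage.DONE
            elif self.retries >= self.max_retries:
                self.stage = Stage.FAILED
            else:
                # Schedule the retry rather than resending right away.
                self.retries += 1
                self.retry_at = now + self.retry_delay_secs
                self.stage = Stage.RETRY_WAIT
        elif self.stage is Stage.RETRY_WAIT:
            if now >= self.retry_at:
                self.stage = Stage.SEND
        return self.stage


def make_flaky_sender(failures):
    """Hypothetical peer that refuses the first `failures` connection attempts."""
    state = {"left": failures, "calls": 0}

    def send():
        state["calls"] += 1
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("peer not ready")
        return True

    return send, state
```

With a peer that comes up a few seconds late, such as `make_flaky_sender(2)`, stepping the item at `now = 0.0, 1.0, 2.0, ...` shows it sitting in `RETRY_WAIT` between attempts and reaching `DONE` on the third send, rather than burning all retries before the peer is ready.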
Severity:
Major: A recent timing change has exposed this weakness. It can cause a node to fail over a swact.
Steps to Reproduce:
swact
Expected Behavior:
mtcAgent handles VIM starting up a few seconds late.
Actual Behavior:
mtcAgent can fail a host state update to the VIM over a swact.
Reproducibility:
Intermittent - rare to see a swact failure.
System Configuration:
DX
Load info (eg: 2022-03-
any 24.09 master branch
Workaround
None
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
importance: Undecided → Medium
tags: added: stx.9.0 stx.metal
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/908218