Mtce http connection failures should delay between retries

Bug #2047958 reported by Eric MacDonald
This bug affects 1 person
Affects     Status         Importance   Assigned to      Milestone
StarlingX   Fix Released   Medium       Eric MacDonald

Bug Description

Brief Description:

Maintenance (mtce) interfaces with sysinv, sm and the vim using http requests. Request timeouts have an inherent delay between retries. However, command failures and outright connection failures do not, so those retries are issued with no wait between them.

This has only become obvious in mtce's communication with the vim. There appears to be a process startup timing change that leaves the vim not yet ready to handle commands when mtcAgent starts sending them, following a platform services group startup by sm.

Rather than go after the subtle timing change, this report requests that mtce http retry handling be improved by adding a proper retry wait state to the FSM of mtce's http command handling work queue.
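
The requested retry wait state can be illustrated with a minimal sketch. This is not the mtce code; the class, state and parameter names (HttpWorkItem, State, send, retry_wait_secs) are hypothetical, and the 10 second default only mirrors the value mentioned in the fix. It assumes a non-blocking FSM that is stepped from a main loop, and it applies the same wait after any failed attempt, whatever the cause.

```python
import enum
import time


class State(enum.Enum):
    EXECUTE = "execute"        # send (or resend) the request
    RETRY_WAIT = "retry_wait"  # hold off before the next attempt
    DONE = "done"
    FAILED = "failed"


class HttpWorkItem:
    """Hypothetical work-queue entry with a retry wait state."""

    def __init__(self, send, max_retries=3, retry_wait_secs=10.0,
                 clock=time.monotonic):
        self.send = send                  # callable: True on success
        self.max_retries = max_retries    # retry count is not changed by the wait
        self.retry_wait_secs = retry_wait_secs
        self.clock = clock                # injectable so tests need not sleep
        self.retries = 0
        self.wait_until = 0.0
        self.state = State.EXECUTE

    def step(self):
        """Advance the FSM by one non-blocking step and return the new state."""
        if self.state is State.EXECUTE:
            try:
                ok = self.send()
            except OSError:
                # Covers connection failures such as ConnectionRefusedError.
                ok = False
            if ok:
                self.state = State.DONE
            elif self.retries >= self.max_retries:
                self.state = State.FAILED
            else:
                # Any failure gets the same minimum wait before the retry.
                self.retries += 1
                self.wait_until = self.clock() + self.retry_wait_secs
                self.state = State.RETRY_WAIT
        elif self.state is State.RETRY_WAIT:
            if self.clock() >= self.wait_until:
                self.state = State.EXECUTE
        return self.state
```

With this shape, a peer that comes up a few seconds late is retried after the wait expires instead of having all retries consumed immediately.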

Severity:

Major: A recent timing change has exposed this weakness. It can cause a node to fail during a swact.

Steps to Reproduce:

swact

Expected Behavior:

mtcAgent handles VIM starting up a few seconds late.

Actual Behavior:

mtcAgent can fail a host state update to the VIM over a swact.

Reproducibility:

Intermittent - rare to see a swact failure.

System Configuration:

DX

Load info (eg: 2022-03-10_20-00-07):

any 24.09 master branch

Workaround

None

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/908218

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/908218
Committed: https://opendev.org/starlingx/metal/commit/191c0aa6a8618b5c3530fabc2733da985eb3acc3
Submitter: "Zuul (22348)"
Branch: master

commit 191c0aa6a8618b5c3530fabc2733da985eb3acc3
Author: Eric MacDonald <email address hidden>
Date: Wed Feb 7 02:09:46 2024 +0000

    Add a wait time between http request retries

    Maintenance interfaces with sysinv, sm and the vim using http requests.
    Request timeout's have an implicit delay between retries. However,
    command failures or outright connection failures don't.

    This has only become obvious in mtce's communication with the vim
    where there appears to be a process startup timing change that leads
    to the 'vim' not being ready to handle commands before mtcAgent
    startup starts sending them after a platform services group startup
    by sm.

    This update adds a 10 second http retry wait as a configuration option
    to mtc.conf. The mtcAgent loads this value at startup and uses it
    in a new HTTP__RETRY_WAIT state of http request work FSM.

    The number of retries remains unchanged. This update is only forcing
    a minimum wait time between retries, regardless of cause.

    Failure path testing was done using Fault Insertion Testing (FIT).

    Test Plan:

    PASS: Verify the reported issue is resolved by this update.
    PASS: Verify http retry config value load on process startup.
    PASS: Verify updated value is used over a process -sighup.
    PASS: Verify default value if new mtc.conf config value is not found.
    PASS: Verify http connection failure http retry handling.
    PASS: Verify http request timeout failure retry handling.
    PASS: Verify http request operation failure retry handling.

    Regression:

    PASS: Build and install ISO - Standard and AIO DX.
    PASS: Verify http failures do not fail a lock operation.
    PASS: Verify host unlock fails if its http done queue shows failures.
    PASS: Verify host swact.
    PASS: Verify handling of random and persistent http errors involving
          the need for retries.

    Closes-Bug: 2047958
    Change-Id: Icc758b0782be2a4f2882efd56f5de1a8dddea490
    Signed-off-by: Eric MacDonald <email address hidden>
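
The config handling described in the commit message (load the retry wait at startup, fall back to a default when the option is missing) could look roughly like the sketch below. The commit does not give the option name or the section it lives in, so "http_retry_wait" and "agent" are hypothetical, as is the assumption that mtc.conf can be parsed as an INI-style file. Only the 10 second default comes from the commit message.

```python
import configparser

# Default taken from the commit message: a 10 second http retry wait.
DEFAULT_HTTP_RETRY_WAIT_SECS = 10


def load_http_retry_wait(conf_text, section="agent", option="http_retry_wait"):
    """Return the retry wait in seconds, or the default if it is not set.

    The section and option names are placeholders, not the real mtc.conf keys.
    """
    parser = configparser.ConfigParser(inline_comment_prefixes=(";", "#"))
    parser.read_string(conf_text)
    # fallback applies when either the section or the option is absent.
    return parser.getint(section, option,
                         fallback=DEFAULT_HTTP_RETRY_WAIT_SECS)
```

Calling the same function again from a SIGHUP handler would be one way to pick up an updated value without a restart. A value that is present but not an integer raises ValueError in this sketch rather than silently falling back to the default.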

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
importance: Undecided → Medium
tags: added: stx.9.0 stx.metal