StarlingX

Mtce http connection failures should delay between retries

Bug #2047958 reported by Eric MacDonald on 2024-01-03

This bug affects 1 person

Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      Eric MacDonald

Bug Description

Brief Description:

Maintenance (mtce) interfaces with sysinv, sm and the vim using http requests. Request timeouts have an inherent delay between retries. However, command failures and outright connection failures are retried with no delay.

This has only become obvious in mtce's communication with the vim. There appears to be a process startup timing change: after sm starts the platform services group, the vim is not yet ready to handle commands when mtcAgent starts sending them.

Rather than chase the subtle timing change, this Jira requests that mtce's http retry handling be improved by adding a proper retry wait state to the fsm of mtce's http command handling work queue.
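
For illustration only, the requested behavior can be sketched as a small state machine. This is not mtce code (mtce itself is not written in Python); the names `State`, `Request`, `step`, `max_retries` and `retry_wait_secs` are hypothetical. The point of the sketch is that every failed attempt, whatever its cause, passes through a wait state before the next send, while the retry count is enforced as before.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    SEND = auto()        # issue (or re-issue) the request
    RETRY_WAIT = auto()  # hold off before the next attempt
    DONE = auto()        # request succeeded
    FAIL = auto()        # retries exhausted


@dataclass
class Request:
    # All field names and defaults here are hypothetical.
    max_retries: int = 3
    retry_wait_secs: float = 10.0
    state: State = State.SEND
    retries: int = 0
    wait_until: float = 0.0


def step(req, now, send):
    """Advance the request by one FSM step and return the new state.

    `now` is the current time in seconds. `send` is a callable that
    returns True on success and False on any failure (timeout,
    command failure or connection failure).
    """
    if req.state is State.SEND:
        if send():
            req.state = State.DONE
        elif req.retries >= req.max_retries:
            req.state = State.FAIL
        else:
            # Same wait for every failure cause.
            req.retries += 1
            req.wait_until = now + req.retry_wait_secs
            req.state = State.RETRY_WAIT
    elif req.state is State.RETRY_WAIT:
        if now >= req.wait_until:
            req.state = State.SEND
    return req.state
```

A caller would invoke `step()` from its main loop with a monotonic clock value, so a peer that comes up a few seconds late is retried after the wait instead of exhausting all retries immediately.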

Severity:

Major: A recent timing change has exposed this weakness. It can cause a node to fail over a swact.

Steps to Reproduce:

swact

Expected Behavior:

mtcAgent handles VIM starting up a few seconds late.

Actual Behavior:

mtcAgent can fail a host state update to the VIM over a swact.

Reproducibility:

Intermittent - rare to see a swact failure.

System Configuration:

Load info (eg: 2022-03-10_20-00-07):

any 24.09 master branch

Workaround

None

Tags:

OpenStack Infra (hudson-openstack) wrote on 2024-02-07: Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/908218

Changed in starlingx:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2024-02-13: Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/908218
Committed: https://opendev.org/starlingx/metal/commit/191c0aa6a8618b5c3530fabc2733da985eb3acc3
Submitter: "Zuul (22348)"
Branch: master

commit 191c0aa6a8618b5c3530fabc2733da985eb3acc3
Author: Eric MacDonald <email address hidden>
Date: Wed Feb 7 02:09:46 2024 +0000

Add a wait time between http request retries

    Maintenance interfaces with sysinv, sm and the vim using http requests.
    Request timeout's have an implicit delay between retries. However,
    command failures or outright connection failures don't.

    This has only become obvious in mtce's communication with the vim
    where there appears to be a process startup timing change that leads
    to the 'vim' not being ready to handle commands before mtcAgent
    startup starts sending them after a platform services group startup
    by sm.

    This update adds a 10 second http retry wait as a configuration option
    to mtc.conf. The mtcAgent loads this value at startup and uses it
    in a new HTTP__RETRY_WAIT state of http request work FSM.

    The number of retries remains unchanged. This update is only forcing
    a minimum wait time between retries, regardless of cause.

    Failure path testing was done using Fault Insertion Testing (FIT).

    Test Plan:

    PASS: Verify the reported issue is resolved by this update.
    PASS: Verify http retry config value load on process startup.
    PASS: Verify updated value is used over a process -sighup.
    PASS: Verify default value if new mtc.conf config value is not found.
    PASS: Verify http connection failure http retry handling.
    PASS: Verify http request timeout failure retry handling.
    PASS: Verify http request operation failure retry handling.

    Regression:

    PASS: Build and install ISO - Standard and AIO DX.
    PASS: Verify http failures do not fail a lock operation.
    PASS: Verify host unlock fails if its http done queue shows failures.
    PASS: Verify host swact.
    PASS: Verify handling of random and persistent http errors involving
          the need for retries.

    Closes-Bug: 2047958
    Change-Id: Icc758b0782be2a4f2882efd56f5de1a8dddea490
    Signed-off-by: Eric MacDonald <email address hidden>
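
As a rough sketch of the config handling covered by the test plan (load at startup, fall back to a default when the value is not found), something like the following would do. It assumes an INI-style file; the section name `agent`, the key name `http_retry_wait` and the helper `load_http_retry_wait` are hypothetical and not taken from the commit, and using 10 seconds as the fallback is also an assumption based on the wait time the commit describes.

```python
import configparser

# Assumed fallback; the commit describes a 10 second retry wait.
DEFAULT_HTTP_RETRY_WAIT_SECS = 10


def load_http_retry_wait(conf_text, section="agent", key="http_retry_wait"):
    """Return the retry wait in seconds from INI-style config text.

    The section and key names are hypothetical. A missing section,
    a missing key, a non-integer value or a negative value all yield
    the default.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    try:
        value = parser.getint(
            section, key, fallback=DEFAULT_HTTP_RETRY_WAIT_SECS
        )
    except ValueError:
        # Value present but not an integer.
        return DEFAULT_HTTP_RETRY_WAIT_SECS
    return value if value >= 0 else DEFAULT_HTTP_RETRY_WAIT_SECS
```

Calling the same function again from a SIGHUP handler would pick up an edited value without a process restart, which is the behavior the "-sighup" test case checks.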

Changed in starlingx:
status:	In Progress → Fix Released

Ghada Khalil (gkhalil) on 2024-02-14

Changed in starlingx:
assignee:	nobody → Eric MacDonald (rocksolidmtce)
importance:	Undecided → Medium
tags:	added: stx.9.0 stx.metal
