Collect tool times out collecting from large fully loaded stressed system

Bug #2004666 reported by Eric MacDonald
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Low         Eric MacDonald

Bug Description

Brief Description
-----------------
The 'containerization_api' and 'memory' collect plugins take over 20 minutes to execute their commands when the system is running a pod deployment stress test. This exceeds collect's default timeout of 1000 secs (about 16.7 mins).

Severity
--------
Minor: This does not affect in-service operation, but it does prevent getting a collect bundle that includes the active controller on large systems under stress.

Steps to Reproduce
------------------
Run 'collect -a' while deploying 1500 pods, 30 pods per node in a 2+2+50 standard system

Expected Behavior
-----------------
Collect completes without timing out.

Actual Behavior
---------------
Collect times out.

Reproducibility
---------------
100%

System Configuration
--------------------
2+2+50 Standard System

Branch/Pull Time/Commit
-----------------------
Any

Last Pass
---------
Never passed under this type of stress loading.

Timestamp/Logs
--------------
Error: operation timeout ; failed to collect from controller-0 [target] (reason:10)

Test Activity
-------------
Stress testing

Workaround
----------
Reduce pod loading and retry

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.tools
OpenStack Infra (hudson-openstack) wrote : Fix proposed to utilities (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/utilities/+/873473

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to utilities (master)

Reviewed: https://review.opendev.org/c/starlingx/utilities/+/873473
Committed: https://opendev.org/starlingx/utilities/commit/0b079b4804d635fb80a8eafeeb0a8b61ab486951
Submitter: "Zuul (22348)"
Branch: master

commit 0b079b4804d635fb80a8eafeeb0a8b61ab486951
Author: Eric MacDonald <email address hidden>
Date: Sun Feb 12 15:15:43 2023 -0500

    Add --timeout option to collect tool

    This update adds a new --timeout command line option to the collect
    tool so that users can extend collect's global timeout.

    Prior to this update the collect tool had a fixed 1000 second
    or 16.6 minute timeout. Collect of hosts in large busy systems can
    take an unpredictably long time. Sometimes longer than 1000 seconds.
    This can be particularly true when collecting from the active
    controller deploying and managing lots of pods across many hosts.

    This new timeout option allows the user to specify a specific timeout
    in minutes, between 10 and 120, while defaulting to 20 minutes.
    The default or user specified global timeout is passed to subclouds
    for subcloud collect as well.

    Test Plan:

    PASS: Verify new --timeout or -t options at command line arg level
    PASS: Verify --timeout <minutes> parse; error, in and out of bounds
    PASS: Verify timeout option is described in collect help
    PASS: Verify 110 minute collect with --timeout 120
    PASS: Verify 45 minute collect times out with --timeout 40
    PASS: Verify 2 minute collect with --timeout 10
    PASS: Verify default timeout is 20 minutes
    PASS: Verify default or specified timeout is displayed
    PASS: Verify default or specified timeout is shared with the subcloud
    PASS: Verify timeout error handling.
    PASS: Verify collect error handling behavior if --timeout or -t is
          specified but the number of minutes is missing.

    Regression:

    PASS: Verify collect system and subcloud handling
    PASS: Verify system and subcloud dated collects ; verified content
    PASS: Verify collect with a variety of options

    Closes-Bug: 2004666
    Signed-off-by: Eric MacDonald <email address hidden>
    Change-Id: Ib68b78f7c810f43fc8d13cbf291ac00f08c3c4f4
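
The timeout rules described in the commit message above (a value in minutes, accepted from 10 to 120, defaulting to 20) can be sketched as follows. This is only an illustration of the bounds check, not the collect tool's actual code; the function name `parse_timeout`, the constant names, and the choice to return seconds are all assumptions made for the example.

```python
# Hypothetical sketch of the --timeout handling described in the commit
# message. Names and structure are assumptions, not the collect tool's code.
TIMEOUT_MIN_MINS = 10   # lower bound from the commit message
TIMEOUT_MAX_MINS = 120  # upper bound from the commit message
TIMEOUT_DEF_MINS = 20   # default from the commit message


def parse_timeout(args):
    """Return the global timeout in seconds for a list of command line args."""
    timeout_mins = TIMEOUT_DEF_MINS
    i = 0
    while i < len(args):
        if args[i] in ("-t", "--timeout"):
            # The option must be followed by a number of minutes.
            if i + 1 >= len(args):
                raise ValueError(f"{args[i]} requires a number of minutes")
            try:
                timeout_mins = int(args[i + 1])
            except ValueError:
                raise ValueError(
                    f"timeout must be a whole number of minutes: {args[i + 1]!r}"
                ) from None
            # Both bounds are treated as valid values.
            if not TIMEOUT_MIN_MINS <= timeout_mins <= TIMEOUT_MAX_MINS:
                raise ValueError(
                    f"timeout must be between {TIMEOUT_MIN_MINS} and "
                    f"{TIMEOUT_MAX_MINS} minutes, got {timeout_mins}"
                )
            i += 2
        else:
            # Any other option is ignored by this sketch.
            i += 1
    return timeout_mins * 60


print(parse_timeout([]))                       # 1200 (20 minute default)
print(parse_timeout(["-a", "--timeout", "40"]))  # 2400
```

On the command line this corresponds to an invocation such as `collect -a --timeout 40`. The bounds are treated as inclusive here because the test plan exercises both `--timeout 10` and `--timeout 120`.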

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.9.0
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)