DOC: Add how to collect from remote hosts that exceed the default host collect timeout

Bug #2068506 reported by Elisamara Aoki Gonçalves
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Elisamara Aoki Gonçalves
Milestone:

Bug Description

Brief Description
-----------------

The current "Collect Caveats and Usage" documentation needs to explain how to collect from a remote host that exceeds the default host collect timeout.

https://docs.starlingx.io/fault-mgmt/kubernetes/troubleshooting-log-collection.html#collect-tool-caveats-and-usage

The customer support organization told me that the current documentation can cause confusion in this 'host collect timeout' case, as in the following example:

sysadmin@controller-0:~$ collect -l controller-1 --inline --timeout 60
[sudo] password for sysadmin:
collect bundle timeout set to 60 minutes
collecting data from 1 host(s): controller-1
collecting controller-1_20240404.174805 ... Error: operation timeout ; failed to collect from controller-1 [host] (reason:15)

Support tried the newly introduced --timeout option, expecting it to solve this problem. However, --timeout specifies a global, system-level collect timeout rather than a host-level timeout.

The link to that update is below; note the first paragraph of its description:

This update adds a new --timeout command line option to the collect
tool so that users can extend collect's global timeout.
Update: Add --timeout option to collect tool

https://opendev.org/starlingx/utilities/commit/0b079b4804d635fb80a8eafeeb0a8b61ab486951

That update did not change the 10-minute default timeout for collecting from a remote host.

There was a subsequent update that consolidated collect's default timeout settings into a file stored in the read/write /etc filesystem.

Update: Increase collect ssh, scp and sudo expect operation timeouts

https://opendev.org/starlingx/utilities/commit/29fb1c44353b1301868095030603f09642bf438f

/etc/collect/collect_timeouts

# default timeouts for collect ; in seconds
declare -i SCP_TIMEOUT_DEFAULT=600
declare -i SSH_TIMEOUT_DEFAULT=60
declare -i SUDO_TIMEOUT_DEFAULT=60
declare -i COLLECT_HOST_TIMEOUT_DEFAULT=600
declare -i CREATE_TARBALL_TIMEOUT_DEFAULT=200

declare -i TIMEOUT_MIN_MINS=10
declare -i TIMEOUT_MAX_MINS=120
declare -i TIMEOUT_DEF_MINS=20
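
The per-host value above is in seconds, while the --timeout option is given in minutes, so a quick conversion can help when comparing the two. The following is only an illustrative sketch: it works on a scratch copy holding the host timeout line quoted above, and the names `timeouts_file` and `host_timeout_mins` are made up for this example. On a real system the file to read would be /etc/collect/collect_timeouts.

```shell
#!/bin/bash
# Scratch copy holding the per-host value quoted in this report (illustrative only).
timeouts_file="$(mktemp)"
cat > "$timeouts_file" <<'EOF'
# default timeouts for collect ; in seconds
declare -i COLLECT_HOST_TIMEOUT_DEFAULT=600
EOF

# The file is plain bash, so sourcing it defines the variable.
source "$timeouts_file"

# Convert the per-host timeout from seconds to minutes.
host_timeout_mins=$(( COLLECT_HOST_TIMEOUT_DEFAULT / 60 ))
echo "per-host collect timeout: ${host_timeout_mins} minutes"

rm -f "$timeouts_file"
```

With the default of 600 seconds this prints a 10-minute per-host timeout, which matches the remote host timeout described above.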

The existing customer documentation should be updated with the following clarifications and options:

A. Clarify that the --timeout <mins> option extends the overall timeout of the collect operation. It does not apply to or affect the default per-remote-host collect timeout.

B. Point out that the timeout for collecting from the local host (the host that collect is run from) does adopt the global timeout.

C. There are two ways to deal with this case:

1. Run collect with an extended --timeout locally on the host that is experiencing the timeout. That way the global timeout applies.
2. Optionally, modify the COLLECT_HOST_TIMEOUT_DEFAULT value in the /etc/collect/collect_timeouts file. This requires sudo, and no processes need to be restarted after the change. All subsequent collects will adopt the new values in that file.
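
The two options could look roughly like the sketch below. Option 1 is shown only as a comment because it needs a StarlingX host, and it assumes that collect run without a host list gathers from the local host. Option 2 is demonstrated on a scratch copy of the file rather than the real one. The 1800-second value, the use of GNU sed, and the names `timeouts_file`, `new_timeout_secs`, and `updated_line` are assumptions for illustration, not recommended settings.

```shell
#!/bin/bash
# Option 1 (run on the host that timed out, e.g. controller-1):
#   collect --timeout 60
# Assumption: collect with no host list gathers from the local host only.

# Option 2, demonstrated on a scratch copy rather than the real file.
timeouts_file="$(mktemp)"
echo 'declare -i COLLECT_HOST_TIMEOUT_DEFAULT=600' > "$timeouts_file"

new_timeout_secs=1800   # hypothetical value: 30 minutes

# Rewrite only the COLLECT_HOST_TIMEOUT_DEFAULT assignment (GNU sed in-place edit).
sed -i "s/^\(declare -i COLLECT_HOST_TIMEOUT_DEFAULT=\).*/\1${new_timeout_secs}/" "$timeouts_file"

# Show the resulting line to confirm the change.
updated_line="$(grep COLLECT_HOST_TIMEOUT_DEFAULT "$timeouts_file")"
echo "$updated_line"

rm -f "$timeouts_file"
```

On a real system the same sed expression would be run with sudo against /etc/collect/collect_timeouts. Per the description above, no restart is needed and the next collect picks up the new value.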

Severity
--------

<Minor: System/Feature is usable with minor issue>

Changed in starlingx:
assignee: nobody → Elisamara Aoki Gonçalves (egoncalv)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to docs (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/docs/+/921395

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to docs (master)

Reviewed: https://review.opendev.org/c/starlingx/docs/+/921395
Committed: https://opendev.org/starlingx/docs/commit/1acc9a534dd76bd4f45e96aa74070fb6dcd4d7af
Submitter: "Zuul (22348)"
Branch: master

commit 1acc9a534dd76bd4f45e96aa74070fb6dcd4d7af
Author: Elisamara Aoki Goncalves <email address hidden>
Date: Wed Jun 5 15:46:02 2024 +0000

    Add how to collect from remote hosts that exceed the default host collect timeout

    Closes-bug: 2068506

    Change-Id: I8bd8e383391f29d8b5120a46b6bfcc5bd8cb45ce
    Signed-off-by: Elisamara Aoki Goncalves <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low