Collect tool times out due to long running lsof call

Bug #1906537 reported by Tee Ngo
This bug affects 2 people
Affects     Status         Importance   Assigned to    Milestone
StarlingX   Fix Released   Medium       Pablo Bovina

Bug Description

Brief Description
-----------------
The current use of lsof in the collect tool is not well suited to StarlingX and can cause a timeout failure, especially on a node with many containers/apps and limited resources (e.g. a subcloud running in a small VM).

The following code in /usr/local/sbin/collect_host, a script called by collect, currently does not provide useful output:
    # mounted hugepages
    delimiter ${LOGFILE} "lsof | grep /mnt/huge"
    lsof | awk '($3 !~ /^[0-9]+$/ && /\/mnt\/huge/) || NR==1 {print $0;}' >> ${LOGFILE} 2>>${COLLECT_ERROR_LOG}

Furthermore, running lsof without at least the -lw options, which disable UID-to-username lookup and warning messages, can cause the collect tool to time out.
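
As a minimal sketch of how a slow call could be kept from stalling the whole collection, the command can be run under a time limit. This assumes GNU coreutils `timeout` is available on the node; `run_bounded` and the 60-second limit in the usage comment are hypothetical and not part of collect_host.

```shell
#!/bin/bash
# Hypothetical helper (not part of collect_host): run a command under a
# time limit so that one slow call cannot stall everything after it.
run_bounded() {
    local limit="$1"
    shift
    local rc=0
    # coreutils timeout exits with status 124 when the limit is reached
    timeout "${limit}" "$@" || rc=$?
    if [ "${rc}" -eq 124 ]; then
        echo "command timed out after ${limit}s: $*" >&2
    fi
    return "${rc}"
}

# Possible usage (assumed limit of 60 seconds):
#   run_bounded 60 lsof -lw
```

A caller can check for exit status 124 to log the timeout and carry on with the remaining collection steps.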

Severity
--------
Major

Steps to Reproduce
------------------
In a virtual AIOSX subcloud, deploy a few apps
Issue collect command

Expected Behavior
------------------
Collect completes successfully

Actual Behavior
----------------
Collect timed out due to a long-running lsof call.
A manual run of the lsof command completed but took a long time in this environment. A sample lsof output with a non-platform app produced 73399 entries, 9200 of which belong to UID 1000 and 10482 of which belong to UID 33.

UIDs 1000 and 33 are UIDs created in the containers.

For instance, of the 10,482 entries for UID 33:
    3952 related to /usr/lib
    1680 related to pipe
    805 related to socket
    768 related to eventpoll
    520 related to /dev/null or /dev/zero
    252 related to inotify

Sample entries:
    nginx 95332 33 mem REG 252,4 33863948 /usr/local/lib/lua/librestychash.so
    nginx 95332 33 mem REG 252,4 52918672 /usr/local/lib/lua/5.1/rex_pcre.so
    nginx 95332 33 1w FIFO 0,9 0t0 193152 pipe
    nginx 95332 33 2w FIFO 0,9 0t0 193153 pipe
    nginx-ing 94002 2058734 33 61u unix 0xffff9465dc432000 0t0 10348560 socket
    nginx-ing 94002 2058734 33 62u unix 0xffff9465d324b000 0t0 10349650 socket
    nginx 95332 33 7u a_inode 0,10 0 6226 [eventpoll]
    nginx 95332 33 15u a_inode 0,10 0 6226 [eventpoll]
    nginx-ing 4056541 4058477 33 0u CHR 1,3 0t0 9399025 /dev/null
    nginx 4058467 33 0r CHR 1,3 0t0 9399025 /dev/null
    nginx 4094479 4094534 33 DEL REG 0,4 9408161 /dev/zero
    nginx 4094479 4094534 33 DEL REG 0,4 9408160 /dev/zero
    nginx-ing 4056541 4058477 33 10r a_inode 0,10 0 6226 inotify
    nginx-ing 4056541 4058477 33 14r a_inode 0,10 0 6226 inotify
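
A per-UID breakdown like the one above could be produced with something along these lines. This is only a sketch: it assumes lsof's -F field output, in which each process set carries a `u<UID>` line and each open file starts with an `f<fd>` line, and the function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count open-file entries per UID from lsof field
# output (-F) read on stdin. Prints "<count> <uid>", largest count first.
count_open_files_by_uid() {
    awk '
        /^u/ { uid = substr($0, 2) }   # remember the UID of the current process set
        /^f/ { count[uid]++ }          # one "f" line per open file descriptor
        END  { for (u in count) print count[u], u }
    ' | sort -rn
}

# Possible usage on a live system:
#   lsof -lw -F uf | count_open_files_by_uid
```

Field output avoids the ambiguity of the default columnar format, where the third column is a TID on task lines and the user on process lines.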

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIOSX virtual subcloud

Branch/Pull Time/Commit
-----------------------
Nov. 30th nightly load

Last Pass
---------
Collect was successfully used in a virtual subcloud in the past to collect logs as a result of failed bootstrap/deployment.

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Developer Testing

Workaround
----------
Comment out the lsof call in the collect_host script so that the collect tool can complete.

Tee Ngo (teewrs) wrote :

Timing info:

Without any options, lsof elapsed time in the virtual subcloud is 37m47s
With -lw options, lsof elapsed time in the virtual subcloud is 7m2s
With -lwX options, lsof elapsed time in the virtual subcloud is 2s
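
For anyone wanting to reproduce this comparison, a minimal sketch follows. Per the lsof man page, on Linux -X skips reporting of open TCP, UDP and UDPLITE files, which is presumably where most of the remaining time was going here. The helper names `elapsed_seconds` and `compare_lsof_variants` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the whole seconds of wall-clock time a
# command takes, discarding its output and ignoring its exit status.
elapsed_seconds() {
    local start=${SECONDS}
    "$@" > /dev/null 2>&1 || true
    echo $(( SECONDS - start ))
}

# Hypothetical helper: time the three lsof variants discussed above.
compare_lsof_variants() {
    local opts
    for opts in "" "-lw" "-lwX"; do
        # ${opts} is left unquoted so that the empty variant adds no argument
        echo "lsof ${opts}: $(elapsed_seconds lsof ${opts})s"
    done
}

# Possible usage on the affected node:
#   compare_lsof_variants
```

Note that the unoptioned variant took over half an hour in the environment reported above, so the comparison is best run outside of a collect session.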

Ghada Khalil (gkhalil)
tags: added: stx.tools
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium - small change with a large improvement when running collect on large systems

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Gustavo Dobro (mgdobro)
tags: added: stx.5.0
Pablo Bovina (pbovina)
Changed in starlingx:
status: Triaged → Fix Committed
Pablo Bovina (pbovina) wrote :
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Fix Committed → Fix Released
Changed in starlingx:
assignee: Gustavo Dobro (mgdobro) → Enzo Candotti (ecandotti)
assignee: Enzo Candotti (ecandotti) → nobody
Ghada Khalil (gkhalil) wrote :

Assigning to the developer who fixed the issue: Pablo Bovina

Changed in starlingx:
assignee: nobody → Pablo Bovina (pbovina)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to utilities (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/utilities/+/792213

OpenStack Infra (hudson-openstack) wrote : Fix merged to utilities (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/utilities/+/792213
Committed: https://opendev.org/starlingx/utilities/commit/c4d042615e6fe8944a4628fa1a29e86e012a9bf5
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 557cada006fd5a3bd81ad5af387c37657801f8c5
Author: Fernando Theirs <email address hidden>
Date: Thu May 13 16:21:47 2021 -0300

    Collect is missing etcdctl output

    When the collect tool is run, it does not include the contents
    of the etcd database. Fixes have been made for this to dump the
    contents in "etcd_database.dump" file.

    Verify if etcd access is secured. In that case, certificates
    will be used.

    Closes-Bug: 1911935

    Signed-off-by: Fernando Theirs <email address hidden>
    Change-Id: Idbc60edffa978a7a6bead939a4eb54f4abae29a6

commit 6045b1b8a0d8ed6a94d06cdfc994bf1a5fa9dbb5
Author: Jim Gauld <email address hidden>
Date: Thu May 6 11:58:34 2021 -0400

    Provide utility script is-rootdisk-device.sh

    This provides a utility script to determine which disk contains the root
    filesystem. This can also be used as a helper function for io-scheduler
    udev rules that require specific configuration for root disk.

    Example usage:
    /usr/local/bin/is-rootdisk-device.sh
    ROOTDISK_DEVICE=sda

    /usr/local/bin/is-rootdisk-device.sh /dev/sda
    ROOTDISK_DEVICE=sda

    /usr/local/bin/is-rootdisk-device.sh /dev/sdb
    (i.e., no output)

    Partial-Bug: 1927515
    Signed-off-by: Jim Gauld <email address hidden>
    Change-Id: Ib0d4a161a407b08d294c5ff9aa0b7590961e18c9

commit 88a678f142cfe86c58b6405aae6babbc08de0e8f
Author: Chen, Haochuan Z <email address hidden>
Date: Fri Mar 26 09:09:41 2021 +0800

    Add packages to stx-ceph-manager image

    This update installs ceph-mgr, ceph-mon, ceph-osd packages as part
    of stx-ceph-manager image.

    Partial-Bug: 1920882

    Change-Id: I4afde8b1476e14453fac8561f1edde7360b8ee96
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 09b3542fcc6cc0300a9cae0d302225e6977780f3
Author: Scott Little <email address hidden>
Date: Thu Mar 25 11:49:49 2021 -0400

    Set SW_VERSION 21.05

    Prep for the StarlingX 5.0 release.
    SW_VERSION, also known as PLATFORM_RELEASE, uses YY.MM format.

    Story: 2008055
    Task: 42115
    Signed-off-by: Scott Little <email address hidden>
    Change-Id: If7c91a2b523358269ae4850961cf4189ffcd7a75

commit ae4cefd0e2a0001476782c31e1003810da2b4838
Author: Chris Friesen <email address hidden>
Date: Thu Mar 4 18:04:12 2021 -0500

    add dcmanager-audit-worker to patch restart script

    Need to add the new process to the patch restart script.

    Story: 2007267
    Task: 41999
    Signed-off-by: Chris Friesen <email address hidden>
    Change-Id: If5faa806bd0d52ddbf1343b064959f4207cf975a

commit 27fce5a52321f3014fa8ae9181d344bc774289da
Author: Enzo Candotti <email address hidden>
Date: Mon Feb 1 12:47:38 2021 -0300

    Add resource CPU and memory info in collect

    This adds commands to collect more data to debug
    resource allocations and...

tags: added: in-f-centos8