Collect tool times out due to long-running lsof call
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Medium | Pablo Bovina | | |
Bug Description
Brief Description
-----------------
The current use of lsof in the collect tool is not ideal for StarlingX and can lead to timeout failures, especially on a node with many containers/apps and limited resources (e.g. a subcloud running in a small VM).
The following code in /usr/local/
# mounted hugepages
delimiter ${LOGFILE} "lsof | grep /mnt/huge"
lsof | awk '($3 !~ /^[0-9]+$/ && /\/mnt\/huge/) || NR==1 {print $0;}' >> ${LOGFILE} 2>>${COLLECT_
Furthermore, running lsof without at least the -lw options, which disable the UID-to-username lookup and warning messages, can cause the collect tool to time out.
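The options mentioned above could be applied along these lines. This is a minimal sketch, not the actual collect_host change: `filter_hugepage_rows` is a hypothetical helper name. It matches on the path only because, as far as I can tell, with -l the USER column is always numeric, so the `$3` test in the current awk filter would reject every row except the header. A side effect is that per-thread rows are no longer dropped.

```shell
# Hypothetical helper: keep the header row plus any row that mentions /mnt/huge.
filter_hugepage_rows() {
    awk 'NR == 1 || /\/mnt\/huge/'
}

# Assumed invocation on a host with lsof installed:
#   -l  report numeric UIDs instead of looking up user names
#   -w  suppress warning messages
# lsof -lw | filter_hugepage_rows
```

Keeping the filter as a function that reads stdin makes it possible to check it against a saved lsof output without rerunning lsof on a slow node.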
Severity
--------
Major
Steps to Reproduce
------------------
In a virtual AIOSX subcloud, deploy a few apps
Issue the collect command
Expected Behavior
------------------
Collect completes successfully
Actual Behavior
----------------
Collect timed out due to a long-running lsof call.
A manual run of the lsof command completed, but took a long time in this environment. A sample lsof output with a non-platform app contained 73,399 entries, of which 9,200 belong to UID 1000 and 10,482 belong to UID 33.
UIDs 1000 and 33 are UIDs created in the containers.
Of the 10,482 entries for UID 33, for instance:
3952 related to /usr/lib
1680 related to pipe
805 related to socket
768 related to eventpoll
520 related to /dev/null or /dev/zero
252 related to inotify
nginx 95332 33 mem REG 252,4 33863948 /usr/local/
nginx 95332 33 mem REG 252,4 52918672 /usr/local/
nginx 95332 33 1w FIFO 0,9 0t0 193152 pipe
nginx 95332 33 2w FIFO 0,9 0t0 193153 pipe
nginx-ing 94002 2058734 33 61u unix 0xffff9465dc432000 0t0 10348560 socket
nginx-ing 94002 2058734 33 62u unix 0xffff9465d324b000 0t0 10349650 socket
nginx 95332 33 7u a_inode 0,10 0 6226 [eventpoll]
nginx 95332 33 15u a_inode 0,10 0 6226 [eventpoll]
nginx-ing 4056541 4058477 33 0u CHR 1,3 0t0 9399025 /dev/null
nginx 4058467 33 0r CHR 1,3 0t0 9399025 /dev/null
nginx 4094479 4094534 33 DEL REG 0,4 9408161 /dev/zero
nginx 4094479 4094534 33 DEL REG 0,4 9408160 /dev/zero
nginx-ing 4056541 4058477 33 10r a_inode 0,10 0 6226 inotify
nginx-ing 4056541 4058477 33 14r a_inode 0,10 0 6226 inotify
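A breakdown like the one above can be approximated from a saved lsof output with a small awk tally. This is only a sketch: `tally_lsof_names` is a hypothetical helper, the category labels are my own, and it classifies rows by their last whitespace-separated field, which is an approximation because lsof NAME values can contain spaces.

```shell
# Hypothetical helper: count lsof rows by the kind of object named in the last field.
tally_lsof_names() {
    awk '
        {
            name = $NF
            if (name ~ /^\/usr\/lib/)          key = "/usr/lib"
            else if (name == "pipe")           key = "pipe"
            else if (name == "socket")         key = "socket"
            else if (name == "[eventpoll]")    key = "eventpoll"
            else if (name == "inotify")        key = "inotify"
            else if (name == "/dev/null" || name == "/dev/zero") key = "dev-null-or-zero"
            else                               key = "other"
            count[key]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -k2
}

# Assumed usage, with lsof.txt being a previously saved lsof output:
# tally_lsof_names < lsof.txt
```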
Reproducibility
---------------
Reproducible
System Configuration
-------
AIOSX virtual subcloud
Branch/Pull Time/Commit
-------
Nov. 30th nightly load
Last Pass
---------
Collect was successfully used in a virtual subcloud in the past to collect logs as a result of failed bootstrap/
Timestamp/Logs
--------------
N/A
Test Activity
-------------
Developer Testing
Workaround
----------
Comment out the lsof call in the collect_host script so that the collect tool can complete.
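A less drastic variant of this workaround would be to bound the lsof call with a time limit so that a slow run cannot stall the whole collect. This is an untested suggestion, not something done in this report: `run_with_limit` is a hypothetical wrapper, the 60-second limit is an arbitrary placeholder, and the sketch assumes GNU coreutils `timeout` is available on the host.

```shell
# Hypothetical wrapper: run a command under a time limit using GNU coreutils timeout.
# timeout exits with status 124 when the limit is reached.
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
}

# Assumed usage on a host with lsof installed:
# run_with_limit 60 lsof -lw > /dev/null || echo "lsof failed or timed out" >&2
```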
tags: added: stx.tools
Changed in starlingx:
status: Triaged → Fix Committed
Changed in starlingx:
status: Fix Committed → Fix Released
Changed in starlingx:
assignee: Gustavo Dobro (mgdobro) → Enzo Candotti (ecandotti)
assignee: Enzo Candotti (ecandotti) → nobody
Timing info:
Without any options, lsof elapsed time in the virtual subcloud is 37m47s
With -lw options, lsof elapsed time in the virtual subcloud is 7m2s
With -lwX options, lsof elapsed time in the virtual subcloud is 2s
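Timings like these can be reproduced with a small helper. This is a sketch under assumptions: `elapsed_seconds` is a hypothetical name and it reports whole seconds only. As I read the lsof man page, on Linux -X skips reporting of TCP/UDP files, so the -lwX output is not a full replacement for the other two.

```shell
# Hypothetical helper: run a command, discard its output, print elapsed whole seconds.
elapsed_seconds() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1 || true
    end=$(date +%s)
    echo $((end - start))
}

# Assumed usage on a host with lsof installed:
# echo "no options: $(elapsed_seconds lsof)s"
# echo "-lw:        $(elapsed_seconds lsof -lw)s"
# echo "-lwX:       $(elapsed_seconds lsof -lwX)s"
```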