AIOSX: Host lock failed due to nfv-vim process hanging
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Bart Wensley |
Bug Description
Brief Description
-----------------
Host lock failed as nfv-vim was hanging and thus did not process the request from sysinv.
Severity
--------
Major
Steps to Reproduce
------------------
Bring up an AIOSX
Apply stx-openstack
Provision quotas, images, networks, etc...
Launch some VMs
Delete and relaunch VMs. This process was repeated several times using different images
Lock host in order to allocate more disk to nova-local
Expected Behavior
------------------
Host is locked successfully
Actual Behavior
----------------
Host lock failed because the notification message from sysinv to vim was neither acknowledged nor processed.
Analysis by Bart Wensley and Chris Friesen:
Based on the nfv-vim log, the process crashed days earlier while trying to access its database:
2020-01-
Traceback (most recent call last):
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
File "/usr/lib64/
OperationalError: (raised as a result of Query-invoked autoflush; consider using a session.
When vim crashes, it is supposed to shut down its threads and exit.
The code is in nfv_vim/vim.py and looks like this:
finally:
    # Allow up to 10 seconds for the process to shut down. If the
    # process_finalize hangs, we will do a hard exit.
One possible explanation is that an open call hung because the process had run out of file descriptors, so the failsafe mechanism was never activated.
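To illustrate the failsafe idea, here is a minimal sketch of a shutdown deadline that is armed before any cleanup step that might block. It is not the nfv-vim code: finalize_with_deadline and its parameters are hypothetical names, and the 10-second default is taken only from the comment quoted above.

```python
import os
import threading


def finalize_with_deadline(finalize, timeout_secs=10.0, hard_exit=os._exit):
    """Run finalize(); call hard_exit(1) if it has not returned in time.

    Hypothetical helper, not part of nfv-vim. The timer is started before
    finalize() runs, so a hang inside the cleanup cannot prevent the hard
    exit from being scheduled.
    """
    timer = threading.Timer(timeout_secs, hard_exit, args=(1,))
    timer.daemon = True  # do not keep the process alive just for the timer
    timer.start()
    try:
        finalize()
    finally:
        # Cleanup returned (or raised) before the deadline: disarm the timer.
        timer.cancel()
```

The hard_exit parameter defaults to os._exit, which ends the process without running further cleanup handlers; passing a different callable makes the helper testable.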
The output of lsof (see attached) shows that most of the open files look something like this
nfv-vim 632016 root DEL REG 0,18 107932251 /dev/shm/sem.SVb1Kq
/dev/shm/sem.* entries correspond to named POSIX semaphores opened via the sem_open() library call. It seems likely that a code path is not calling sem_close() (and maybe sem_unlink()) when it should be.
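One way to quantify the suspected leak is to count such entries in saved lsof output. The following is a minimal sketch, assuming the whitespace-separated column layout of the sample line above (command first, FD in the fourth column, name last); count_deleted_semaphores is a hypothetical helper, not an existing tool.

```python
def count_deleted_semaphores(lsof_text, command="nfv-vim"):
    """Count DEL entries under /dev/shm/sem.* for one command.

    Assumes lsof lines shaped like the sample in this report:
    COMMAND PID USER FD TYPE DEVICE NODE NAME
    """
    count = 0
    for line in lsof_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # header, blank, or truncated line
        if fields[0] != command or fields[3] != "DEL":
            continue
        if fields[-1].startswith("/dev/shm/sem."):
            count += 1
    return count


sample = "nfv-vim 632016 root DEL REG 0,18 107932251 /dev/shm/sem.SVb1Kq"
print(count_deleted_semaphores(sample))
```

Running this against lsof snapshots taken at intervals would show whether the number of semaphore entries keeps growing.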
Reproducibility
---------------
Seen once
System Configuration
-------
IPv4 StarlingX with OpenStack
Branch/Pull Time/Commit
-------
Jan. 21st master
Last Pass
---------
Not sure
Timestamp/Logs
--------------
See nfv-vim.log, lsof_output, sysinv.log attached
Test Activity
-------------
Evaluation
Workaround
----------
kill nfv-vim process
stx.4.0 / medium priority - serious issue, but only seen once to date.