Comment 0 for bug 1862049

Revision history for this message
Tee Ngo (teewrs) wrote :

Brief Description
-----------------
Host lock failed as nfv-vim was hanging and thus did not process the request from sysinv.

Severity
--------
Major

Steps to Reproduce
------------------
Bring up an AIOSX
Apply stx-openstack
Provision quotas, images, networks, etc...
Launch some VMs
Delete and relaunch VMs. This process was repeated several times using different images
Lock host in order to allocate more disk to nova-local

Expected Behavior
------------------
Host was successfully locked

Actual Behavior
----------------
Host lock failed as notification message from sysinv to vim was not acknowledged and processed.

Based on nfv-vim log, the process crashed days ago while trying to its database

2020-01-28T00:49:41.253 controller-0 VIM_Thread[632016] ERROR (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (sqlite3.OperationalError) unable to open database file [SQL: u'UPDATE hosts_v6 SET uptime=?, elapsed_time_in_state=? WHERE hosts_v6.uuid = ?'] [parameters: (283085, 277980, u'6765bfd1-c298-4b05-8ed9-9902d4c80f70')]
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_vim/vim.py", line 167, in process_main
  File "/usr/lib64/python2.7/site-packages/nfv_common/selobj/_selobj_module.py", line 105, in selobj_dispatch
  File "/usr/lib64/python2.7/site-packages/nfv_common/tasks/_task_scheduler.py", line 332, in task_work_complete
  File "/usr/lib64/python2.7/site-packages/nfv_common/tasks/_task.py", line 193, in task_work_complete
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_infrastructure_api.py", line 495, in get_hosts
  File "/usr/lib64/python2.7/site-packages/nfv_vim/audits/_vim_nfvi_audits.py", line 148, in _audit_nfvi_hosts_callback
  File "/usr/lib64/python2.7/site-packages/nfv_vim/objects/_host.py", line 721, in nfvi_host_update
  File "/usr/lib64/python2.7/site-packages/nfv_vim/objects/_host.py", line 863, in _persist
  File "/usr/lib64/python2.7/site-packages/nfv_vim/database/_database_infrastructure_module.py", line 68, in database_host_add
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2755, in first
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2547, in __getitem__
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2854, in __iter__
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1407, in _autoflush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1397, in _autoflush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2171, in flush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2291, in _flush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2255, in _flush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 389, in execute
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 548, in execute
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 177, in save_obj
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 737, in _emit_update_statements
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 945, in execute
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1402, in _handle_dbapi_exception
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 470, in do_execute
OperationalError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (sqlite3.OperationalError) unable to open database file [SQL: u'UPDATE hosts_v6 SET uptime=?, elapsed_time_in_state=? WHERE hosts_v6.uuid = ?'] [parameters: (283085, 277980, u'6765bfd1-c298-4b05-8ed9-9902d4c80f70')]

When vim crashes it is supposed to shutdown its threads and exit.
The code is in nfv_vim/vim.py and looks like this:

    finally:
        open(PROCESS_NOT_RUNNING_FILE, 'w').close()
        # Allow up to 10 seconds for the process to shut down. If the
        # process_finalize hangs, we will do a hard exit.
        signal.signal(signal.SIGALRM, _force_exit)
        signal.alarm(10)
        process_finalize()

What could happen was the open hung due to lack of file descriptors so the failsafe mechanism was never activated.

The output of lsof (see attached) shows that most of the open files look something like this
nfv-vim 632016 root DEL REG 0,18 107932251 /dev/shm/sem.SVb1Kq

/dev/shm/sem.* would correspond to a named POSIX semaphore opened via the sem_open() library call. It seems likely that a code path is not calling sem_close() (and maybe sem_unlink())when it should be.

Reproducibility
---------------
Seen once

System Configuration
--------------------
IPv4 StarlingX with OpenStack

Branch/Pull Time/Commit
-----------------------
Jan. 21st master

Last Pass
---------
Not sure

Timestamp/Logs
--------------
See nfv-vim.log, lsof_output, sysinv.log attached

Test Activity
-------------
Evaluation

 Workaround
 ----------
 kill nfv-vim process