AIOSX: Host lock failed due to nfv-vim process hanging

Bug #1862049 reported by Tee Ngo
18
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Bart Wensley

Bug Description

Brief Description
-----------------
Host lock failed as nfv-vim was hanging and thus did not process the request from sysinv.

Severity
--------
Major

Steps to Reproduce
------------------
Bring up an AIOSX
Apply stx-openstack
Provision quotas, images, networks, etc...
Launch some VMs
Delete and relaunch VMs. This process was repeated several times using different images
Lock host in order to allocate more disk to nova-local

Expected Behavior
------------------
Host was successfully locked

Actual Behavior
----------------
Host lock failed as notification message from sysinv to vim was not acknowledged and processed.

Analysis by Bart Wensley and Chris Friesen:
Based on nfv-vim log, the process crashed days ago while trying to its database

2020-01-28T00:49:41.253 controller-0 VIM_Thread[632016] ERROR (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (sqlite3.OperationalError) unable to open database file [SQL: u'UPDATE hosts_v6 SET uptime=?, elapsed_time_in_state=? WHERE hosts_v6.uuid = ?'] [parameters: (283085, 277980, u'6765bfd1-c298-4b05-8ed9-9902d4c80f70')]
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_vim/vim.py", line 167, in process_main
  File "/usr/lib64/python2.7/site-packages/nfv_common/selobj/_selobj_module.py", line 105, in selobj_dispatch
  File "/usr/lib64/python2.7/site-packages/nfv_common/tasks/_task_scheduler.py", line 332, in task_work_complete
  File "/usr/lib64/python2.7/site-packages/nfv_common/tasks/_task.py", line 193, in task_work_complete
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_infrastructure_api.py", line 495, in get_hosts
  File "/usr/lib64/python2.7/site-packages/nfv_vim/audits/_vim_nfvi_audits.py", line 148, in _audit_nfvi_hosts_callback
  File "/usr/lib64/python2.7/site-packages/nfv_vim/objects/_host.py", line 721, in nfvi_host_update
  File "/usr/lib64/python2.7/site-packages/nfv_vim/objects/_host.py", line 863, in _persist
  File "/usr/lib64/python2.7/site-packages/nfv_vim/database/_database_infrastructure_module.py", line 68, in database_host_add
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2755, in first
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2547, in __getitem__
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2854, in __iter__
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1407, in _autoflush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1397, in _autoflush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2171, in flush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2291, in _flush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2255, in _flush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 389, in execute
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 548, in execute
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 177, in save_obj
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 737, in _emit_update_statements
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 945, in execute
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1402, in _handle_dbapi_exception
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 470, in do_execute
OperationalError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (sqlite3.OperationalError) unable to open database file [SQL: u'UPDATE hosts_v6 SET uptime=?, elapsed_time_in_state=? WHERE hosts_v6.uuid = ?'] [parameters: (283085, 277980, u'6765bfd1-c298-4b05-8ed9-9902d4c80f70')]

When vim crashes it is supposed to shutdown its threads and exit.
The code is in nfv_vim/vim.py and looks like this:

    finally:
        open(PROCESS_NOT_RUNNING_FILE, 'w').close()
        # Allow up to 10 seconds for the process to shut down. If the
        # process_finalize hangs, we will do a hard exit.
        signal.signal(signal.SIGALRM, _force_exit)
        signal.alarm(10)
        process_finalize()

What could happen was the open hung due to lack of file descriptors so the failsafe mechanism was never activated.

The output of lsof (see attached) shows that most of the open files look something like this
nfv-vim 632016 root DEL REG 0,18 107932251 /dev/shm/sem.SVb1Kq

/dev/shm/sem.* would correspond to a named POSIX semaphore opened via the sem_open() library call. It seems likely that a code path is not calling sem_close() (and maybe sem_unlink())when it should be.

Reproducibility
---------------
Seen once

System Configuration
--------------------
IPv4 StarlingX with OpenStack

Branch/Pull Time/Commit
-----------------------
Jan. 21st master

Last Pass
---------
Not sure

Timestamp/Logs
--------------
See nfv-vim.log, lsof_output, sysinv.log attached

Test Activity
-------------
Evaluation

 Workaround
 ----------
 kill nfv-vim process

Revision history for this message
Tee Ngo (teewrs) wrote :
Tee Ngo (teewrs)
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - serious issue, but only seen once to date.

tags: added: stx.4.0 stx.nfv
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Bart Wensley (bartwensley)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (master)

Fix proposed to branch: master
Review: https://review.opendev.org/731302

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/731302
Committed: https://git.openstack.org/cgit/starlingx/nfv/commit/?id=fca019fbe949d5d5a5499c22fea3720d7ee23149
Submitter: Zuul
Branch: master

commit fca019fbe949d5d5a5499c22fea3720d7ee23149
Author: Bart Wensley <email address hidden>
Date: Wed May 27 13:50:03 2020 -0500

    Allow VIM to shutdown when all FDs are used

    When the VIM attempts to shut down because it runs out of FDs,
    the shutdown code attempts to create a file, before it activates
    a SIGALRM timer that triggers fail safe shutdown. When all FDs
    have been consumed, the shutdown code hangs while creating the
    file and never shuts down.

    This is fixed by changing the shutdown code to first start the
    failsafe SIGALRM timer. That way the process is guaranteed to
    be terminated even if it hangs while attempting to create the
    file.

    Change-Id: I7184fe1333d2641227c193f7c8e2e7b1bebcfed8
    Partial-Bug: 1862049
    Signed-off-by: Bart Wensley <email address hidden>

Revision history for this message
Bart Wensley (bartwensley) wrote :

The fix I pushed (https://review.opendev.org/731302) allows the VIM to automatically recover when it runs out of FDs. But the VIM shouldn’t be leaking FDs in the first place. It appears to me that the VIM leaks file descriptors when it does not get a response to messages sent from its plugins. From further investigation, it looks like the FDs are being leaked when the VIM shuts down a worker thread - some example logs:

2020-05-27T18:43:29.485 controller-0 VIM_Thread[2249636] INFO _task_worker_pool.py.73 Timeout worker BlockStorage-Worker-0
2020-05-27T18:43:30.490 controller-0 VIM_Thread[2249636] ERROR _task.py.200 Task(get_volumes) work (get_volumes) timed out, id=16520.
2020-05-27T18:43:30.492 controller-0 VIM_Thread[2249636] ERROR _vim_nfvi_audits.py.635 Audit-Volumes callback, not completed, responses={'completed': False, 'reason': '', 'page-request-id': '3e598409-1207-47b3-9dc2-3d8eccb2d1cb'}.
2020-05-27T18:45:16.438 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _timer_scheduler.py.57 Not scheduling on time, elapsed=127499 ms.
2020-05-27T18:45:16.439 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.218 Thread BlockStorage-Worker-0: not scheduling on time
2020-05-27T18:45:16.439 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.218 Thread BlockStorage-Worker-0: not scheduling on time
2020-05-27T18:45:16.439 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.232 Thread BlockStorage-Worker-0: shutting down.
2020-05-27T18:45:16.440 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.236 Thread BlockStorage-Worker-0: shutdown.

This requires another fix in the VIM, so I am leaving this LP open to investigate why the FDs are being leaked.

Revision history for this message
Bart Wensley (bartwensley) wrote :

I have figured out why the VIM is leaking FDs. The problem happens as follows:
- The VIM has worker processes that are used to communicate with other processes through their REST APIs (e.g. sysinv, nova, cinder). The VIM does not specify a timeout when sending REST API requests.
- The VIM does have a timeout for how long a worker process takes to process a request, which can vary depending on the request.
- If the worker process sends a REST API request and does not get a response in time (e.g. because a message is lost or the target process is down), the VIM terminates the worker process. This is being done with a call to Process.terminate in the python multiprocessing library. The docs for this library (https://docs.python.org/2/library/multiprocessing.html#multiprocessing-programming) clearly indicate that Process.terminate should not be used for a process that uses any shared resources (e.g. pipes). In this case, the worker processes are using shared resources (pipes for one) and these resources are not freed, leading to the FD leak.

The solution here is likely to ensure that a timeout is set when sending REST API requests (call to urllib.request.urlopen from _rest_api_request in nfvi_plugins/openstack/rest_api.py). This timeout must be less than the worker timeout. Unfortunately, this is easier said than done, because the worker timeout is configurable (through the /etc/nfv/nfv_plugins/nfvi_plugins/config.ini file and the code sending these REST API requests has no way to know what that timeout is.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (master)

Fix proposed to branch: master
Review: https://review.opendev.org/733246

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/733246
Committed: https://git.openstack.org/cgit/starlingx/nfv/commit/?id=ccd59a07116d676b645831047df3d0b77db4a0cc
Submitter: Zuul
Branch: master

commit ccd59a07116d676b645831047df3d0b77db4a0cc
Author: Bart Wensley <email address hidden>
Date: Wed Jun 3 11:37:56 2020 -0500

    Handle REST API timeouts gracefully in the VIM

    The VIM is leaking FDs. The problem happens as follows:
    - The VIM has worker processes that are used to communicate with
      other processes through their REST APIs (e.g. sysinv, nova,
      cinder). The VIM does not specify a timeout when sending REST API
      requests.
    - The VIM does have a timeout for how long a worker process takes to
      process a request, which can vary depending on the request.
    - If the worker process sends a REST API request and does not get a
      response in time (e.g. because a message is lost or the target
      process is down), the VIM terminates the worker process. This is
      being done with a call to Process.terminate in the python
      multiprocessing library. The docs for this library clearly indicate
      that Process.terminate should not be used for a process that uses
      any shared resources (e.g. pipes). In this case, the worker
      processes are using shared resources (pipes for one) and these
      resources are not freed, leading to the FD leak.

    The solution is to ensure that a timeout is set when sending REST API
    requests. This timeout must be less than the worker timeout to ensure
    that the workers do not timeout (and leak FDs) except in the rarest
    of cases.

    Change-Id: Iccff914e86224be96689738cdcc536a4d5acb861
    Closes-Bug: 1862049
    Signed-off-by: Bart Wensley <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.