StarlingX

Bug #1886037
Comment #0

Bart Wensley (bartwensley) wrote on 2020-07-02:

Brief Description
-----------------
There have been two bugs where the dcmanager subcloud audit process hung (bug 1884560 and bug 1870413). The root cause was determined to be that the REST API calls made by the audit did not use timeouts, so a lost message could result in the process hanging forever. Timeouts for the REST API calls used by the audit were added as a fix for bug 1870413.

This LP is being opened to fix the remainder of the REST API calls in the distributed cloud code. Some notes:
- The DC orchestration code is also vulnerable to this issue as it uses the sysinv/patching/vim clients with no request timeouts.
- Every REST API request we send needs to be protected with a timeout.
- This will also require changes to the nfv_client itself (in the VIM) because it currently doesn’t provide any way for users to specify a timeout.
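
To illustrate the failure mode described above, here is a minimal sketch using only the Python standard library. It is not the sysinv/patching/vim client code; the helper names (start_silent_server, get_status) are hypothetical. The local server accepts a connection and never replies, which approximates a lost response. Without a timeout, the request would block indefinitely; with one, the caller gets control back.

```python
import socket
import threading
import urllib.error
import urllib.request


def start_silent_server():
    """Listen on a free local port; accept connections but never reply.

    Hypothetical stand-in for a peer whose response is lost.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(5)
    held = []  # keep accepted sockets referenced so they stay open

    def accept_loop():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, srv.getsockname()[1]


def get_status(url, timeout_s):
    """Return the HTTP status code, or None if no answer arrives in time."""
    # Ignore proxy settings from the environment so the request goes
    # directly to the target host.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout_s) as resp:
            return resp.status
    except socket.timeout:
        # Raised when the connection is up but the response never arrives.
        return None
    except urllib.error.URLError as exc:
        # A timeout while connecting or sending is wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return None
        raise


if __name__ == "__main__":
    srv, port = start_silent_server()
    # The server never answers, so this returns None once the timeout expires.
    print(get_status(f"http://127.0.0.1:{port}/", timeout_s=0.5))
    srv.close()
```

Note that the timeout here applies to each blocking socket operation rather than to the request as a whole. A real client would also need to decide whether to retry or to report the failure to the caller.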

Severity
--------
Major: processes can hang, and the only recourse is to restart them

Steps to Reproduce
------------------
This is a race condition - it can be difficult to reproduce. It is usually triggered when a subcloud is powered down.

Expected Behavior
-----------------
All REST API calls from the distributed cloud processes should time out - even in the case of message loss.

Actual Behavior
---------------
See above

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
See bug 1870413

Last Pass
---------
N/A

Timestamp/Logs
--------------
See bug 1870413

Test Activity
-------------
See bug 1870413

Workaround
----------
A controller swact restarts the affected processes.