Distributed Cloud: Processes can hang due to missing REST API timeouts
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Yuxing |
Bug Description
Brief Description
-----------------
There have been two bugs where the dcmanager subcloud audit process hung (bug 1884560 and bug 1870413). The root cause was determined to be that the REST API calls made by the audit did not use timeouts, so a lost message could result in the process hanging forever. Timeouts for the REST API calls used by the audit were added as a fix for bug 1870413.
This LP is being opened to fix the remainder of the REST API calls in the distributed cloud code. Some notes:
- The DC orchestration code is also vulnerable to this issue as it uses the sysinv/patching/vim clients with no request timeouts.
- Every REST API request we send needs to be protected with a timeout.
- This will also require changes to the nfv_client itself (in the VIM) because it currently doesn’t provide any way for users to specify a timeout.
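To illustrate the notes above, here is a minimal sketch of a REST call that cannot block forever. It uses only the Python standard library; `rest_get`, `RestApiTimeout`, and `DEFAULT_TIMEOUT_S` are hypothetical names chosen for this example, not identifiers from the distributed cloud code or the sysinv/patching/vim clients, and the 10-second default is an assumption.

```python
import http.client
import socket

# Hypothetical default; the bug report does not state what value the fix uses.
DEFAULT_TIMEOUT_S = 10.0


class RestApiTimeout(Exception):
    """Raised when a REST call gets no response within the timeout."""


def rest_get(host, port, path, timeout=DEFAULT_TIMEOUT_S):
    """Issue a GET request that cannot block forever.

    The timeout applies to connecting and to each blocking socket
    operation, so a lost response surfaces as an exception instead
    of a hang.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    except socket.timeout as exc:
        raise RestApiTimeout(
            "no response from %s:%s%s within %ss" % (host, port, path, timeout)
        ) from exc
    finally:
        # Always release the socket, whether the call succeeded or not.
        conn.close()
```

Without the `timeout` argument, `http.client.HTTPConnection` falls back to the global default socket timeout, which is unset (blocking) unless the process changes it. That is the situation this bug describes: a peer that never answers leaves the caller waiting indefinitely.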
Severity
--------
Major: processes can hang with the only recourse being to restart them
Steps to Reproduce
------------------
This is a race condition - it can be difficult to reproduce. It is usually triggered when a subcloud is powered down.
Expected Behavior
------------------
All REST API calls from the distributed cloud processes should time out - even in the case of message loss.
Actual Behavior
----------------
See above
Reproducibility
---------------
Intermittent
System Configuration
-------
Distributed Cloud
Branch/Pull Time/Commit
-------
See bug 1870413
Last Pass
---------
N/A
Timestamp/Logs
--------------
See bug 1870413
Test Activity
-------------
See bug 1870413
Workaround
----------
A controller swact will restart the affected processes
tags: | added: stx.distcloud |
description: | updated |
Changed in starlingx: | |
assignee: | nobody → John Kung (john-kung) |
Changed in starlingx: | |
assignee: | John Kung (john-kung) → Yuxing (yuxing) |
Reviewed: https://review.opendev.org/740779
Committed: https://git.openstack.org/cgit/starlingx/nfv/commit/?id=b39bc8c0349e68373d30360c1d99ad02c7b262aa
Submitter: Zuul
Branch: master
commit b39bc8c0349e68373d30360c1d99ad02c7b262aa
Author: Bart Wensley <email address hidden>
Date: Mon Jul 13 14:01:27 2020 -0500
Add REST API timeouts to nfv-client
The VIM's nfv-client sends REST API messages to the VIM, but it
currently does not set a timeout for these calls. In the case
where a message is lost, the nfv-client would block forever,
causing the calling process to hang.
Fixing this by adding timeouts to the nfv-client to ensure that
all REST API messages it sends will timeout even in the case of
lost messages.
Change-Id: Iba06924fb7bd14a1ee3362b4fd19aa4114dc34cd
Partial-Bug: 1886037
Signed-off-by: Bart Wensley <email address hidden>
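
For readers unfamiliar with the kind of change the commit message describes, the following is a rough sketch of a client that lets callers specify a timeout. It is not the actual nfv-client code: `TimedRestClient`, `get_json`, and the 30-second default are all hypothetical, and the real change may be structured quite differently.

```python
import http.client
import json


class TimedRestClient:
    """Hypothetical REST client; not the real nfv-client API."""

    def __init__(self, host, port, timeout=30.0):
        self.host = host
        self.port = port
        # Default applied to every request unless the caller overrides it.
        self.timeout = timeout

    def get_json(self, path, timeout=None):
        """GET `path` and decode a JSON body.

        Raises socket.timeout (an alias of TimeoutError on Python 3.10+)
        if the server does not respond in time.
        """
        effective = self.timeout if timeout is None else timeout
        conn = http.client.HTTPConnection(
            self.host, self.port, timeout=effective
        )
        try:
            conn.request("GET", path, headers={"Accept": "application/json"})
            resp = conn.getresponse()
            return resp.status, json.loads(resp.read().decode("utf-8"))
        finally:
            conn.close()
```

Keeping a default on the client while allowing a per-call override means existing callers are protected without modification, and callers with known slow operations can still ask for a longer limit.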