Distributed Cloud - Unable to unmanage a subcloud that has gone offline
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | High | Bart Wensley |
Bug Description
Brief Description
-----------------
Occasionally a subcloud that has been shut down cannot be unmanaged and deleted.
Severity
--------
Critical: with this bug, the affected subcloud cannot be deleted.
Steps to Reproduce
------------------
Shut down the subcloud.
Run the dcmanager subcloud unmanage <subcloud-name> command to unmanage the subcloud.
Expected Behavior
------------------
The offline subcloud can be unmanaged and deleted.
Actual Behavior
----------------
[root@controller-0 ~(keystone_admin)]# dcmanager subcloud unmanage 8
Unable to update subcloud
ERROR (app) Unable to unmanage subcloud 8
The issue is that dcmanager attempts to notify dcorch that the subcloud is offline, but the RPC times out after 60 seconds:
2020-01-27 15:14:11.982 1860648 INFO dcmanager.
2020-01-27 15:14:11.982 1860648 INFO dcmanager.
2020-01-27 15:15:11.981 1860944 ERROR dcmanager.
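The failure mode in the log above, where the caller gives up on a synchronous call while the callee is still blocked, can be reproduced in miniature with the Python standard library. This is a hedged sketch, not dcmanager or dcorch code: `slow_subcloud_update` and `call_with_timeout` are hypothetical names, oslo.messaging is not involved, and the durations are scaled down from 60 seconds and several minutes to fractions of a second.

```python
import concurrent.futures
import time

RPC_TIMEOUT = 0.2     # stands in for the 60-second dcmanager -> dcorch RPC timeout
HANDLER_DELAY = 1.0   # stands in for the multi-minute keystone connection attempt


def slow_subcloud_update():
    """Hypothetical dcorch-side handler that blocks on an unreachable subcloud."""
    time.sleep(HANDLER_DELAY)
    return "updated"


def call_with_timeout(func, timeout):
    """Run func in a worker thread and wait at most `timeout` seconds for its result."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # The caller gives up, but the handler keeps running in the background,
        # much as dcorch keeps trying to reach keystone after dcmanager has failed.
        return "timed out"
    finally:
        executor.shutdown(wait=False)


if __name__ == "__main__":
    print(call_with_timeout(slow_subcloud_update, RPC_TIMEOUT))
```

The point of the sketch is that a caller-side timeout does not cancel the callee: the unmanage request fails even though the slow work it triggered is still in progress.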
The problem was introduced into dcorch by the fernet key syncing code. When the subcloud is unmanaged, dcorch attempts to reset the fernet keys in the subcloud (see update_
2020-01-27 15:14:12.000 3352435 INFO dcorch.
2020-01-27 15:14:12.004 3352435 INFO dcorch.
However, because the subcloud is unreachable, the attempt to connect to its keystone eventually fails with a timeout, but only after more than four minutes:
2020-01-27 15:18:26.769 3352435 INFO dcorch.
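The gap between the dcorch log entries quantifies the mismatch. The check below uses only the timestamps quoted above and the 60-second RPC timeout from the report; treating the 15:14:12.004 entry as the start of the keystone connection attempt is an assumption, since the log messages themselves are truncated.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
RPC_TIMEOUT_S = 60  # dcmanager -> dcorch RPC timeout, per the report

# Timestamps copied from the dcorch log lines above (start assumed, failure as logged).
attempt_started = datetime.strptime("2020-01-27 15:14:12.004", FMT)
attempt_failed = datetime.strptime("2020-01-27 15:18:26.769", FMT)

elapsed = (attempt_failed - attempt_started).total_seconds()
print(f"keystone connection attempt took {elapsed:.3f} s")
print(f"exceeds the RPC timeout by {elapsed - RPC_TIMEOUT_S:.3f} s")
```

This gives 254.765 seconds (4 minutes 14.765 seconds), roughly four times the RPC timeout, so dcmanager has long since given up by the time dcorch's keystone call returns.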
This probably needs to be fixed by configuring the timeout here so that the connection times out much more quickly. With the recent fix Barton Wensley made (https:/
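A shorter connection timeout only avoids the RPC failure if the worst-case blocking time, including any retries, stays under the 60-second RPC timeout. The following is a minimal sketch of that budget check under stated assumptions: `worst_case_seconds` and `fits_rpc_budget` are hypothetical helpers, and the attempt, backoff, and margin parameters model a generic retry loop, not dcorch's actual keystone client configuration.

```python
RPC_TIMEOUT_S = 60.0  # dcmanager -> dcorch RPC timeout from the report


def worst_case_seconds(connect_timeout_s, attempts, backoff_s=0.0):
    """Upper bound on how long a failing connection can block the handler.

    Assumes (hypothetically) that every attempt waits the full connect timeout
    and that consecutive attempts are separated by a fixed backoff.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return attempts * connect_timeout_s + (attempts - 1) * backoff_s


def fits_rpc_budget(connect_timeout_s, attempts, backoff_s=0.0, margin_s=10.0):
    """True if the worst case finishes at least `margin_s` seconds before the RPC times out."""
    worst = worst_case_seconds(connect_timeout_s, attempts, backoff_s)
    return worst + margin_s <= RPC_TIMEOUT_S
```

For example, a 10-second connect timeout with three attempts and a 2-second backoff blocks for at most 34 seconds and fits the budget, while three 30-second attempts (90 seconds) would still outlast the RPC.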
Reproducibility
---------------
Intermittent
System Configuration
--------------------
IPv6 Distributed Cloud
Branch/Pull Time/Commit
-----------------------
Jan. 21st master
Last Pass
---------
This is an intermittent issue.
Timestamp/Logs
--------------
Test Activity
-------------
Evaluation
Workaround
----------
Setting the http_connect_
Changed in starlingx:
  importance: Undecided → High
  tags: added: stx.4.0 stx.distcloud
Changed in starlingx:
  status: New → Triaged
  assignee: nobody → Dariush Eslimi (deslimi)
Changed in starlingx:
  assignee: Dariush Eslimi (deslimi) → Bart Wensley (bartwensley)
Fix proposed to branch: master (review.opendev.org/707258)
Review: https:/