Distributed Cloud - Subcloud manage often times out due to keystone related sync
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | High | Bart Wensley | |
Bug Description
Brief Description
-----------------
DCManager <-> DCOrch interactions don't work reliably due to Keystone related sync
Severity
--------
Critical, as the subcloud manage command often fails
Steps to Reproduce
------------------
Add some subclouds
Once they are online, run command "dcmanager subcloud manage <subcloud-name>" to register the newly added subclouds with the System Controller.
Expected Behavior
------------------
All newly added subclouds that are online can be managed after the command
Actual Behavior
----------------
Most subcloud manage commands failed
[Analysis, courtesy of Bart Wensley]
The addition of the fernet key sync and keystone sync has fundamentally broken the dcmanager <-> dcorch interactions. When a subcloud is managed, the dcmanager sends an update_
Originally, the update_
# Initial identity sync. It's synchronous so that identity <---- new comment
# get synced before fernet token keys are synced. This is <---- new comment
# necessary since we want to revoke all existing tokens on <---- new comment
# this subcloud after its services user IDs and project <---- new comment
# IDs are changed. Otherwise subcloud services will fail <---- new comment
# authentication since they keep on using their existing tokens <---- new comment
# issued before these IDs change, until these tokens expires. <---- new comment
try:
except Exception as ex:
This needs to be redesigned so that the update_
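To illustrate the failure mode described in this analysis, here is a minimal, self-contained sketch. It does not use dcmanager, dcorch, or oslo.messaging code; every name in it (call, blocking_handler, deferring_handler, slow_initial_sync) is hypothetical, and the durations are scaled-down stand-ins for the timeout and the roughly 90 second sync seen in the logs. It only shows the general pattern: a caller with a reply timeout fails when the handler does slow work before replying, and succeeds when the handler hands the slow work to a background thread. It is not meant to represent the actual fix.

```python
import concurrent.futures
import threading
import time

# Scaled-down, assumed stand-ins: the caller's reply timeout is shorter
# than the time the initial sync needs, as in the logs in this report.
CALL_TIMEOUT = 0.1
SYNC_DURATION = 0.5


def slow_initial_sync(done):
    """Hypothetical stand-in for the identity and fernet key sync."""
    time.sleep(SYNC_DURATION)
    done.set()


def blocking_handler(done):
    """Runs the sync before replying, so the caller waits for all of it."""
    slow_initial_sync(done)
    return "ok"


def deferring_handler(done):
    """Starts the sync in the background and replies immediately."""
    threading.Thread(target=slow_initial_sync, args=(done,)).start()
    return "ok"


def call(handler, done):
    """Invoke handler like a blocking call that has a reply timeout."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(handler, done)
    try:
        return future.result(timeout=CALL_TIMEOUT)
    except concurrent.futures.TimeoutError:
        return "timeout"
    finally:
        # Do not wait for the handler; a timed-out caller has moved on.
        pool.shutdown(wait=False)


blocking_done = threading.Event()
deferred_done = threading.Event()
blocking_result = call(blocking_handler, blocking_done)
deferred_result = call(deferring_handler, deferred_done)
print(blocking_result, deferred_result)
```

With these values the script prints "timeout ok". In both cases the sync itself still runs to completion; only the caller's view of the outcome differs.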
The key logs are as follows.
Subcloud 103 is set to managed and request received by dcmanager:
2020-01-27 23:28:37.478 1860648 INFO dcmanager.
2020-01-27 23:28:37.479 1860648 INFO dcmanager.
Note: the log above refers to the subcloud by its ID (194), while other logs use the subcloud name (subcloud103). This should also be fixed so that all logs refer to the subcloud name, which makes debugging easier.
The dcmanager sends an update_
2020-01-27 23:29:37.497 1860648 WARNING dcmanager.
Meanwhile the logs show the dcorch is trying to do a full sync and distribute the fernet keys for this subcloud:
2020-01-27 23:28:37.522 3141782 INFO dcorch.
This takes about 90 seconds to complete:
2020-01-27 23:30:11.367 3141782 INFO dcorch.
The reply to the update_
2020-01-27 23:30:12.043 1860648 INFO oslo_messaging.
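The timeline implied by the timestamps above can be checked with a short script. This is only arithmetic on the timestamps quoted in this report; the variable names are illustrative, and it assumes the WARNING at 23:29:37.497 marks the point where the dcmanager gave up waiting for the reply.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start, end):
    """Return the elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return round(delta.total_seconds(), 3)


# Timestamps copied from the log lines quoted in this report.
request_received = "2020-01-27 23:28:37.478"  # dcmanager receives the request
timeout_warning = "2020-01-27 23:29:37.497"   # dcmanager WARNING
sync_started = "2020-01-27 23:28:37.522"      # dcorch starts the sync
sync_finished = "2020-01-27 23:30:11.367"     # dcorch finishes the sync
reply_logged = "2020-01-27 23:30:12.043"      # reply logged by oslo_messaging

wait_before_warning = seconds_between(request_received, timeout_warning)
sync_duration = seconds_between(sync_started, sync_finished)
reply_delay = seconds_between(request_received, reply_logged)
reply_after_warning = seconds_between(timeout_warning, reply_logged)

print("warning after request:", wait_before_warning)
print("sync duration:", sync_duration)
print("reply after request:", reply_delay)
print("reply after warning:", reply_after_warning)
```

The warning appears about 60 seconds after the request, the sync takes about 94 seconds, and the reply is logged about 35 seconds after the warning, which is too late for the caller to use it.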
Reproducibility
---------------
80% reproducible
System Configuration
-------
IPv6 distributed cloud
Branch/Pull Time/Commit
-------
Jan. 21 master
Last Pass
---------
Not sure if this was
Timestamp/Logs
--------------
See logs above
Test Activity
-------------
Evaluation
Workaround
----------
None
Changed in starlingx:
importance: Undecided → High
Changed in starlingx:
assignee: Dariush Eslimi (deslimi) → Bart Wensley (bartwensley)
stx.4.0 / high priority - distributed cloud sync issues