Dcorch RPC server is not accessible timely after process restart or swact
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Medium | Tee Ngo | | |
Bug Description
Brief Description
-----------------
The dcorch-engine worker queues are not created and bound to the dcorch-engine exchange in a timely manner following a process restart caused by patching or a controller swact in a large DC system.
When this occurs, most dcmanager commands fail and subcloud status can be inaccurate because communication between dcmanager and dcorch is broken.
Severity
--------
Critical
Steps to Reproduce
------------------
Restart dcorch-engine, using either the sm-restart command or host-swact.
Perform any operation that results in messages being sent from dcmanager to dcorch, e.g. dcmanager subcloud manage/unmanage.
Expected Behavior
------------------
Dcorch worker queues are created and bound to the exchange and workers are ready to process RPC requests shortly after restart.
Actual Behavior
----------------
Worker message queues are not created and bound to the exchange for a long time, causing dcmanager commands to fail.
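One way to observe this condition is to list the broker bindings and check which queues are attached to the dcorch-engine exchange. The following is a minimal sketch, not part of the fix: it assumes the broker is RabbitMQ and that the text being parsed is the tab-separated output of `rabbitmqctl list_bindings` with its default columns (source_name, source_kind, destination_name, destination_kind, routing_key, arguments). The helper names `queues_bound_to` and `wait_until` and the queue names in the sample are hypothetical.

```python
import time
from typing import Callable, List


def queues_bound_to(bindings_output: str, exchange: str) -> List[str]:
    """Return the queues bound to `exchange` in `rabbitmqctl list_bindings` output."""
    queues = []
    for line in bindings_output.splitlines():
        fields = line.split("\t")
        # Assumed default columns: source_name, source_kind,
        # destination_name, destination_kind, routing_key, arguments.
        if len(fields) < 4:
            continue  # banner or blank line
        source, _source_kind, destination, destination_kind = fields[:4]
        if source == exchange and destination_kind == "queue":
            queues.append(destination)
    return queues


def wait_until(check: Callable[[], bool], timeout: float, interval: float = 1.0) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical sample output; real queue names will differ.
SAMPLE = (
    "Listing bindings for vhost /...\n"
    "\texchange\tdcorch-engine\tqueue\tdcorch-engine\t[]\n"
    "dcorch-engine\texchange\tdcorch-engine\tqueue\tdcorch-engine\t[]\n"
    "dcorch-engine\texchange\tdcorch-engine.controller-0\tqueue\tdcorch-engine.controller-0\t[]\n"
    "dcmanager\texchange\tdcmanager\tqueue\tdcmanager\t[]\n"
)

if __name__ == "__main__":
    print(queues_bound_to(SAMPLE, "dcorch-engine"))
```

In practice, `wait_until` could wrap a check that re-runs the listing and tests whether the expected worker queues appear, which gives a measured delay between restart and readiness instead of an impression of "a long time".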
Reproducibility
---------------
Reproducible in a large DC lab.
System Configuration
-------
Distributed Cloud
Branch/Pull Time/Commit
-------
Jan. 12th, 2021 master build
Last Pass
---------
Missing test case
Timestamp/Logs
--------------
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.
Alarms
Test Activity
-------------
Developer Testing
Workaround
----------
Wait until dcorch-
Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.distcloud
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/824793