Dcorch RPC server is not accessible in a timely manner after process restart or swact

Bug #1957954 reported by Tee Ngo
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Tee Ngo

Bug Description

Brief Description
-----------------
The dcorch-engine worker queues are not created and bound to the dcorch-engine exchange in a timely manner following a process restart caused by patching or a controller swact in a large DC system.

When this occurs, most dcmanager commands fail and the reported subcloud status can be inaccurate due to the broken communication between dcmanager and dcorch.
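
For context, dcorch's RPC interface is built on oslo.messaging, and the worker queues only get created and bound to the exchange when the RPC server is started. Below is a minimal sketch of that pattern, not the actual dcorch code; the topic, server, and endpoint names are illustrative assumptions:

    # Minimal sketch of an oslo.messaging RPC server; not the actual
    # dcorch code. Topic, server and endpoint names are illustrative.
    import oslo_messaging
    from oslo_config import cfg

    class EngineEndpoint(object):
        # Mirrors the method name seen in the traceback below.
        def update_subcloud_states(self, ctxt, subcloud_name,
                                   management_state, availability_status):
            pass

    transport = oslo_messaging.get_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='dcorch-engine',
                                   server='controller-0')
    server = oslo_messaging.get_rpc_server(transport, target,
                                           [EngineEndpoint()])
    # The topic and fanout queues (e.g. dcorch-engine_fanout_*) are only
    # created and bound to the exchange once start() is called.
    server.start()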

Severity
--------
Critical

Steps to Reproduce
------------------

1. Restart dcorch-engine, either using the sm-restart command or a host-swact.
2. Perform any operation that results in messages being sent from dcmanager to dcorch, e.g. dcmanager subcloud manage/unmanage (a sketch of the equivalent RPC call follows these steps).
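
For reference, step 2 can also be exercised programmatically by issuing the same RPC shown in the traceback below (dcorch/rpc/client.py, update_subcloud_states). The EngineClient name and the argument values here are assumptions for illustration:

    # Hedged reproduction sketch: issue the same RPC that a dcmanager
    # subcloud manage/unmanage ultimately triggers. EngineClient and the
    # argument values are assumptions based on the traceback.
    from oslo_context import context
    from dcorch.rpc import client as dcorch_rpc

    ctxt = context.RequestContext()
    engine_client = dcorch_rpc.EngineClient()
    # While the dcorch-engine queues are unbound, this call blocks and
    # eventually raises MessagingTimeout.
    engine_client.update_subcloud_states(ctxt, 'subcloud1',
                                         management_state='managed',
                                         availability_status='online')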

Expected Behavior
------------------
Dcorch worker queues are created and bound to the exchange shortly after restart, and the workers are ready to process RPC requests.

Actual Behavior
----------------
Worker message queues are not created and bound to the exchange for an extended period after restart. While the queues are unbound, RPC requests from dcmanager cannot reach a consumer, so calls time out (oslo.messaging's default rpc_response_timeout is 60 seconds) and dcmanager commands fail.

Reproducibility
---------------
Reproducible in a large DC lab.

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
Jan. 12th, 2022 master build

Last Pass
---------
Missing test case

Timestamp/Logs
--------------
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python2.7/site-packages/dcmanager/manager/subcloud_manager.py", line 1551, in _update_subcloud_state
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager context, subcloud_name, management_state, availability_status)
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python2.7/site-packages/dcorch/rpc/client.py", line 100, in update_subcloud_states
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager availability_status=availability_status))
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python2.7/site-packages/dcorch/rpc/client.py", line 49, in call
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager return client.call(ctxt, method, **kwargs)
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 465, in call
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager return self.prepare().call(ctxt, method, **kwargs)
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager retry=self.retry)
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 123, in _send
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager timeout=timeout, retry=retry)
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 566, in send
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager retry=retry)
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 555, in _send
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager result = self._waiter.wait(msg_id, timeout)
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 447, in wait
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager message = self.waiters.get(msg_id, timeout=timeout)
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 335, in get
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager 'to message ID %s' % msg_id)
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager MessagingTimeout: Timed out waiting for a reply to message ID 317d8f0fd8d643a1b3dc90777c3ac93e
2022-01-13 23:27:03.086 1379454 ERROR dcmanager.manager.subcloud_manager

Alarms
------

Test Activity
-------------
Developer Testing

Workaround
----------
Wait until dcorch-engine_fanout* queues appear in the output of rabbitmqctl list_queues | grep dcorch-engine before retrying dcmanager commands (e.g. subcloud manage/unmanage).
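
A small helper can automate that wait; the sketch below assumes rabbitmqctl is on the PATH and the caller has permission to run it:

    # Sketch of the workaround: poll the broker until a dcorch-engine
    # fanout queue exists before issuing dcmanager commands again.
    import subprocess
    import time

    def wait_for_dcorch_queues(timeout=300, interval=5):
        deadline = time.time() + timeout
        while time.time() < deadline:
            output = subprocess.check_output(['rabbitmqctl', 'list_queues'])
            if b'dcorch-engine_fanout' in output:
                return True
            time.sleep(interval)
        return False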

Tee Ngo (teewrs)
Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
OpenStack Infra (hudson-openstack) wrote: Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/824793

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote: Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/824793
Committed: https://opendev.org/starlingx/distcloud/commit/0b3af3a4e579fad7e36e1719efe6878172319535
Submitter: "Zuul (22348)"
Branch: master

commit 0b3af3a4e579fad7e36e1719efe6878172319535
Author: Tee Ngo <email address hidden>
Date: Fri Jan 14 14:42:01 2022 -0500

    Update dcorch-engine startup sequence

    Start dcorch RPC server first to ensure worker
    message queues are created and bound to the
    exchange promptly upon process startup/restart.

    Test Plan:
      1. In a large DC system that exhibits the issue
         described in LP1957954, apply the fix and
         restart the dcorch-engine process.
      2. Verify that dcorch engine workers are connected
         to rabbit as soon as dcorch-engine starts up.
      3. Perform subcloud unmanage/manage and verify
         that the command completes successfully.
      4. Perform a controller swact.
      5. Repeat steps 2 and 3.

    Closes-Bug: 1957954
    Change-Id: Id2bf9feb0d7f599d27bca800547a08ee310f6d4d
    Signed-off-by: Tee Ngo <email address hidden>
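
In outline, the commit above reorders service startup so the RPC server comes up first. A rough sketch of the idea follows; everything here except the ordering is an assumption rather than the actual diff:

    # Illustrative sketch of the fix's ordering, not the actual dcorch
    # change. Names other than start() are assumptions.
    import oslo_messaging
    from oslo_config import cfg

    class EngineService(object):
        def __init__(self):
            transport = oslo_messaging.get_transport(cfg.CONF)
            target = oslo_messaging.Target(topic='dcorch-engine',
                                           server='controller-0')
            self._rpc_server = oslo_messaging.get_rpc_server(
                transport, target, [])

        def start(self):
            # Fix: start the RPC server first so the worker queues are
            # created and bound to the exchange right away...
            self._rpc_server.start()
            # ...then perform the rest of the (slower) initialization.
            self._init_remaining_services()  # hypothetical slow startup work

        def _init_remaining_services(self):
            pass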

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.distcloud