Fix subcloud manage/unmanage issues caused by identity sync
Recently identity (keystone) sync functionality was added to the
dcorch. This changed the behaviour of the update_subcloud_states
RPC. The dcmanager expects this RPC to be handled quickly and
a reply sent almost immediately (timeout is 60s). Instead, the
dcorch is now performing an identity sync when handling this
RPC, which involves sending multiple messages to a subcloud and
waiting for replies. This causes the update_subcloud_states RPC
to time out sometimes (especially if a subcloud is unreachable)
and the dcmanager/dcorch states to get out of sync, with no
recovery mechanism in place.
To fix this, I have created a new initial sync manager in the
dcorch. When the dcorch handles the update_subcloud_states RPC,
it will now just update the subcloud to indicate that an initial
sync is required and then reply to the RPC immediately. The
initial sync manager will perform the initial sync in the
background (separate greenthreads) and enable the subcloud when
it has completed. I also enhanced the dcmanager subcloud audit
to periodically send a state update for each subcloud to the
dcorch, which will correct any state mismatches that might
occur.
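
The pattern described above can be sketched roughly as follows. This is
an illustration only, not the dcorch code: the class name, method names
other than update_subcloud_states, the state constants and the use of
plain threads (instead of greenthreads) are all assumptions made to keep
the example self-contained. Retrying a failed sync on the next state
update is also an assumption.

```python
import queue
import threading

# Hypothetical state names; the real dcorch constants may differ.
SYNC_REQUESTED = "requested"
SYNC_COMPLETED = "completed"
SYNC_FAILED = "failed"


class InitialSyncManager:
    """Toy stand-in for an initial sync manager (names are assumptions)."""

    def __init__(self, sync_fn):
        self._sync_fn = sync_fn        # slow per-subcloud sync callable
        self._pending = queue.Queue()  # subclouds waiting for initial sync
        self._lock = threading.Lock()
        self.sync_state = {}           # subcloud name -> sync state
        self.enabled = {}              # subcloud name -> bool
        threading.Thread(target=self._run, daemon=True).start()

    def update_subcloud_states(self, name, managed):
        """RPC handler: record the request and return without syncing."""
        with self._lock:
            if not managed:
                self.enabled[name] = False
                self.sync_state.pop(name, None)
                return True
            if self.sync_state.get(name) in (SYNC_REQUESTED, SYNC_COMPLETED):
                # Repeated update (e.g. from a periodic audit): nothing to do.
                return True
            self.sync_state[name] = SYNC_REQUESTED
            self.enabled[name] = False
        self._pending.put(name)
        return True

    def _run(self):
        """Background worker: perform the slow sync, then enable."""
        while True:
            name = self._pending.get()
            try:
                self._sync_fn(name)
                new_state = SYNC_COMPLETED
            except Exception:
                new_state = SYNC_FAILED
            with self._lock:
                # Skip the update if the subcloud was unmanaged meanwhile.
                if self.sync_state.get(name) == SYNC_REQUESTED:
                    self.sync_state[name] = new_state
                    self.enabled[name] = new_state == SYNC_COMPLETED
            self._pending.task_done()

    def wait_idle(self):
        """Block until all queued initial syncs have been processed."""
        self._pending.join()
```

Because the handler only records state and queues work, its reply time
does not depend on whether the subcloud is reachable. Making the handler
idempotent lets a periodic audit resend the same state without
triggering a second sync.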
Reviewed: https://review.opendev.org/707258
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=0389c7fbb1630988acd385140c9fc16835aae090
Submitter: Zuul
Branch: master
commit 0389c7fbb1630988acd385140c9fc16835aae090
Author: Bart Wensley <email address hidden>
Date: Tue Feb 11 15:21:09 2020 -0600
Change-Id: I70b98d432c3ed56b9532117f69f02d4a0cff5742
Closes-Bug: 1860999
Closes-Bug: 1861157
Signed-off-by: Bart Wensley <email address hidden>