Distributed Cloud - Subcloud manage often times out due to keystone related sync

Bug #1861157 reported by Tee Ngo
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bart Wensley

Bug Description

Brief Description
-----------------
DCManager <-> DCOrch interactions don't work reliably due to Keystone-related sync

Severity
--------
Critical, as the subcloud manage command often fails

Steps to Reproduce
------------------
Add some subclouds
Once they are online, run the command "dcmanager subcloud manage <subcloud-name>" to register the newly added subclouds with the System Controller.

Expected Behavior
------------------
All newly added subclouds that are online can be managed after the command completes.

Actual Behavior
----------------
Most subcloud manage commands fail.

[Analysis, courtesy of Bart Wensley]
The addition of the fernet key sync and keystone sync has fundamentally broken the dcmanager <-> dcorch interactions. When a subcloud is managed, the dcmanager sends an update_subcloud_states RPC to the dcorch. This RPC must be handled and replied to within 60 seconds.
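
For reference, the dcmanager side of this exchange is a blocking oslo.messaging call with a 60-second reply timeout. A minimal sketch of what that call looks like (the topic, method and argument names here are illustrative; the real dcmanager wraps this in an RPC API class):

    import oslo_messaging as messaging
    from oslo_config import cfg

    # Illustrative only: build an RPC client towards the dcorch engine with a
    # 60-second reply timeout.
    transport = messaging.get_rpc_transport(cfg.CONF)
    target = messaging.Target(topic='dcorch-engine', version='1.0')
    client = messaging.RPCClient(transport, target, timeout=60)

    def update_subcloud_states(ctxt, subcloud_name, management_state,
                               availability_status):
        # call() blocks until dcorch replies; if no reply arrives within the
        # timeout, oslo.messaging raises MessagingTimeout to the caller.
        return client.call(ctxt, 'update_subcloud_states',
                           subcloud_name=subcloud_name,
                           management_state=management_state,
                           availability_status=availability_status)

If the dcorch handler does slow, subcloud-facing work before returning, that 60-second window is easily exceeded.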

Originally, the update_subcloud_states RPC was handled quickly by dcorch and resulted in the sync threads being enabled, which would perform a sync. With the addition of the fernet key sync and the keystone sync, the dcorch now tries to do a full sync and distribute the fernet keys to the subcloud BEFORE replying to the dcmanager. Here is the code in update_subcloud_states (dcorch/engine/service.py) - the new code for the keystone data sync is marked below:

            # Initial identity sync. It's synchronous so that identity <---- new comment
            # get synced before fernet token keys are synced. This is <---- new comment
            # necessary since we want to revoke all existing tokens on <---- new comment
            # this subcloud after its services user IDs and project <---- new comment
            # IDs are changed. Otherwise subcloud services will fail <---- new comment
            # authentication since they keep on using their existing tokens <---- new comment
            # issued before these IDs change, until these tokens expires. <---- new comment
            try:
                self.gsm.initial_sync(ctxt, subcloud_name) <---- new code
                self.fkm.distribute_keys(ctxt, subcloud_name) <---- new code
                self.aam.enable_snmp(ctxt, subcloud_name)
                self.gsm.enable_subcloud(ctxt, subcloud_name)
            except Exception as ex:
                LOG.warning('Update subcloud state failed for %s: %s',
                            subcloud_name, six.text_type(ex))
                raise

This needs to be redesigned so that update_subcloud_states is essentially an asynchronous operation, as it was before. There must be zero communication with the subcloud while handling this RPC.
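
One way to make it asynchronous again is to reply immediately and push the subcloud-facing work into a greenthread. This is a rough sketch only, assuming dcorch runs under eventlet like the other services (the helper name _initial_sync_subcloud is made up; LOG and six are as in the snippet above). The merged fix below ultimately does something similar via a dedicated initial sync manager:

    import eventlet

    def update_subcloud_states(self, ctxt, subcloud_name, management_state,
                               availability_status):
        # Hand the slow, subcloud-facing work to a background greenthread and
        # return immediately so the RPC reply reaches dcmanager within the
        # 60-second timeout.
        eventlet.spawn_n(self._initial_sync_subcloud, ctxt, subcloud_name)

    def _initial_sync_subcloud(self, ctxt, subcloud_name):
        # All subcloud-facing work happens in this background greenthread.
        try:
            self.gsm.initial_sync(ctxt, subcloud_name)
            self.fkm.distribute_keys(ctxt, subcloud_name)
            self.aam.enable_snmp(ctxt, subcloud_name)
            self.gsm.enable_subcloud(ctxt, subcloud_name)
        except Exception as ex:
            LOG.warning('Initial sync failed for %s: %s',
                        subcloud_name, six.text_type(ex))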

The key logs are as follows.

Subcloud 103 is set to managed and the request is received by dcmanager:

2020-01-27 23:28:37.478 1860648 INFO dcmanager.manager.service [req-f9e187cb-c362-44e1-8aad-110d018a302a a981317518794800b4623fa6914d66bc - - default default] Handling update_subcloud request for: 194
2020-01-27 23:28:37.479 1860648 INFO dcmanager.manager.subcloud_manager [req-f9e187cb-c362-44e1-8aad-110d018a302a a981317518794800b4623fa6914d66bc - - default default] Updating subcloud 194.

Note - the above log refers to the subcloud id (194), while other logs contain the subcloud name (subcloud103). This should also be fixed so that all logs refer to the subcloud name, to make debugging easier.

The dcmanager sends an update_subcloud_states RPC to the dcorch, which times out after one minute:

2020-01-27 23:29:37.497 1860648 WARNING dcmanager.manager.subcloud_manager [req-f9e187cb-c362-44e1-8aad-110d018a302a a981317518794800b4623fa6914d66bc - - default default] Problem informing dcorch of subcloud state change, resume to original state, subcloud: subcloud103: MessagingTimeout: Timed out waiting for a reply to message ID 26c73fbdc2a442e8943bcba3b1773ee5

Meanwhile the logs show the dcorch is trying to do a full sync and distribute the fernet keys for this subcloud:

2020-01-27 23:28:37.522 3141782 INFO dcorch.engine.generic_sync_manager [req-f9e187cb-c362-44e1-8aad-110d018a302a a981317518794800b4623fa6914d66bc - - default default] Initial sync subcloud subcloud103

This takes about 90 seconds to complete:

2020-01-27 23:30:11.367 3141782 INFO dcorch.engine.generic_sync_manager [req-f9e187cb-c362-44e1-8aad-110d018a302a a981317518794800b4623fa6914d66bc - - default default] enabling subcloud subcloud103

The reply to the update_subcloud_states RPC is sent to the dcmanager, but arrives too late:

2020-01-27 23:30:12.043 1860648 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 26c73fbdc2a442e8943bcba3b1773ee5
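
The "resume to original state" text in the earlier warning corresponds to dcmanager rolling back its own record of the subcloud when the call fails. Roughly, and only as a sketch (dcorch_rpc_client, db_api and the variable names are placeholders, not the exact dcmanager code):

    from oslo_messaging import MessagingTimeout

    try:
        self.dcorch_rpc_client.update_subcloud_states(
            context, subcloud_name, management_state, availability_status)
    except MessagingTimeout:
        # No reply from dcorch within 60 seconds: restore the previous
        # management state in the dcmanager database so the operator can retry.
        LOG.warning('Problem informing dcorch of subcloud state change, '
                    'resume to original state, subcloud: %s', subcloud_name)
        db_api.subcloud_update(context, subcloud_id,
                               management_state=original_management_state)
        raise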

Reproducibility
---------------
80% reproducible

System Configuration
--------------------
IPv6 distributed cloud

Branch/Pull Time/Commit
-----------------------
Jan. 21 master

Last Pass
---------
Not sure whether this previously passed.

Timestamp/Logs
--------------
See logs above

Test Activity
-------------
Evaluation

Workaround
----------
None

Changed in starlingx:
importance: Undecided → High
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority - distributed cloud sync issues

tags: added: stx.distcloud
tags: added: stx.4.0
Changed in starlingx:
status: New → Triaged
assignee: nobody → Dariush Eslimi (deslimi)
Changed in starlingx:
assignee: Dariush Eslimi (deslimi) → Bart Wensley (bartwensley)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/707258

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/707258
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=0389c7fbb1630988acd385140c9fc16835aae090
Submitter: Zuul
Branch: master

commit 0389c7fbb1630988acd385140c9fc16835aae090
Author: Bart Wensley <email address hidden>
Date: Tue Feb 11 15:21:09 2020 -0600

    Fix subcloud manage/unmanage issues caused by identity sync

    Recently identity (keystone) sync functionality was added to the
    dcorch. This changed the behaviour of the update_subcloud_states
    RPC. The dcmanager expects this RPC to be handled quickly and
    a reply sent almost immediately (timeout is 60s). Instead, the
    dcorch is now performing an identity sync when handling this
    RPC, which involves sending multiple messages to a subcloud and
    waiting for replies. This causes the update_subcloud_states RPC
    to time out sometimes (especially if a subcloud is unreachable)
    and the dcmanager/dcorch states to get out of sync, with no
    recovery mechanism in place.

    To fix this, I have created a new initial sync manager in the
    dcorch. When the dcorch handles the update_subcloud_states RPC,
    it will now just update the subcloud to indicate that an initial
    sync is required and then reply to the RPC immediately. The
    initial sync manager will perform the initial sync in the
    background (separate greenthreads) and enable the subcloud when
    it has completed. I also enhanced the dcmanager subcloud audit
    to periodically send a state update for each subcloud to the
    dcorch, which will correct any state mismatches that might
    occur.

    Change-Id: I70b98d432c3ed56b9532117f69f02d4a0cff5742
    Closes-Bug: 1860999
    Closes-Bug: 1861157
    Signed-off-by: Bart Wensley <email address hidden>
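
In outline, the fix reshapes the dcorch side roughly as follows (a simplified sketch; the actual class, method and constant names in starlingx/distcloud may differ):

    import eventlet

    # dcorch RPC handler: no longer talks to the subcloud at all.
    def update_subcloud_states(self, ctxt, subcloud_name, management_state,
                               availability_status):
        # Flag the subcloud as needing an initial sync, then return so the
        # reply reaches dcmanager well inside the 60-second timeout.
        self.initial_sync_manager.init_subcloud_sync(ctxt, subcloud_name)

    # Initial sync manager: background greenthreads do the slow work.
    def _initial_sync_loop(self):
        while True:
            for name in self._subclouds_needing_initial_sync():
                eventlet.spawn_n(self._initial_sync_subcloud, name)
            eventlet.sleep(10)  # polling interval chosen arbitrarily here

On the dcmanager side, the subcloud audit periodically re-sends each subcloud's state to dcorch, so any mismatch left behind by a lost or failed update is eventually reconciled.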

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/716140

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/716140
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=04b49dd093ab850f4520cdb85638221120dd7568
Submitter: Zuul
Branch: f/centos8

commit 25c9d6ed3861f2d783404fcf84b186441ab9cd4d
Author: albailey <email address hidden>
Date: Wed Mar 25 15:43:32 2020 -0500

    Removing ddt from unit tests

    This cleanup should assist in transitioning to
    stestr and fixtures, as well as py3 support.

    The ddt data is primarily unused, only subcloud, route
    and endpoints were being loaded.

    The information in the data files was out of date,
    and not necessarily matching the current product model.

    Story: 2004515
    Task: 39160
    Change-Id: Iddd7ed4664b0d59dbc58aae5c3fedd74c9a138c0
    Signed-off-by: albailey <email address hidden>

commit 7f3827f24d2fb3cb546d3caf71d505d23187b0dc
Author: Tao Liu <email address hidden>
Date: Thu Mar 12 09:46:29 2020 -0400

    Keystone token and resource caching

    Add the following misc. changes to dcorch and dcmanager components:
    - Cache the master resource in dcorch audit
    - Consolidate the openstack drivers into a common module, combine the
      dcmanager and dcorch sysinv clients. (Note: the sdk driver that is
      used by nova, neutron and cinder will be cleaned up as part of
      story 2006588).
    - Update the common sdk driver:
      . in order to avoid creating a new keystone client multiple times
      . to add an option for caching region clients, in addition to the
        keystone client
      . finally, to randomize the token early renewal duration
    - Change subcloud audit manager, patch audit manager,
      and sw update manager to:
      utilize the sdk driver which caches the keystone client and token

    Test cases:
    1. Manage/unmanage subclouds
    2. Platform resources sync and audit
    3. Verify the keystone token is cached until the token is
       expired
    4. Add/delete subclouds
    5. Managed subcloud goes offline/online (power off/on)
    6. Managed subcloud goes offline/online (delete/add a static route)
    7. Apply a patch to all subclouds via patch Orchestration

    Story: 2007267
    Task: 38865

    Change-Id: I75e0cf66a797a65faf75e7c64dafb07f54c2df06
    Signed-off-by: Tao Liu <email address hidden>
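
As a rough illustration of the token caching with randomized early renewal described in this commit (the class and method names are invented for the sketch, not the actual dcorch/dcmanager driver API):

    import random
    import time

    class CachedKeystoneToken(object):
        """Reuse one keystone token until shortly before it expires."""

        def __init__(self, renew_window=(300, 480)):
            self._token = None
            self._expires_at = 0
            # Randomize the early-renewal margin so that many clients do not
            # all refresh their tokens at the same moment.
            self._early_renewal = random.randint(*renew_window)

        def get_token(self):
            if (self._token is None or
                    time.time() >= self._expires_at - self._early_renewal):
                self._token, self._expires_at = self._fetch_new_token()
            return self._token

        def _fetch_new_token(self):
            # Placeholder: perform a real keystone authentication here and
            # return (token_id, expiry_timestamp).
            raise NotImplementedError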

commit 3a1bf60caddfa2e807d4f5996ff94fea7dde5477
Author: Jessica Castelino <email address hidden>
Date: Wed Mar 11 16:23:21 2020 -0400

    Cleanup subcloud details when subcloud add fails

    A failure during subcloud add prevents the subcloud from being added again
    with the same name, as the subcloud details are not cleaned up
    properly. Fixes have been added for proper cleanup of dcorch database
    tables, ansible subcloud inventory files, keystone endpoints, keystone
    region, and addn_hosts_dc file when failure is encountered.

    Test cases:
    1. Add subcloud
    2. Add subcloud with "--deploy-playbook"
    3. Delete subcloud
    4. Raise explicit exception in dcorch/objects/subcloud.py
    5. Raise explicit exception in dcmanager/manager/subcloud_manager.py

    Change-Id: Iedf172c3e9c3c4bdb9b9482dc5d46f072b3ccf61
    ...

tags: added: in-f-centos8