Identity sync issues after subcloud is first managed

Bug #1887849 reported by Bart Wensley
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Andy

Bug Description

Brief Description
-----------------
When a subcloud is first managed, it can sometimes take up to an hour for the identity and platform endpoints to go in-sync. This is intermittent.

Severity
--------
Major: the subcloud can stay out of sync for up to an hour, which is likely to confuse the user.

Steps to Reproduce
------------------
Configure a DC system
Install and add a subcloud
Manage the subcloud

Expected Behavior
------------------
All the endpoints for the subcloud should go in-sync within a minute or two.

Actual Behavior
----------------
Sometimes the identity and platform endpoints don't go in sync for up to an hour.

Reproducibility
---------------
Intermittent - this may have something to do with how soon the subcloud is managed after it goes online. I suspect that the sooner it is managed, the more likely this is to happen, but that is just a theory.

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
Designer load built from starlingx master:
BUILD_DATE="2020-07-10 12:28:29 -0500"

Last Pass
---------
I suspect this was broken by https://review.opendev.org/#/c/716951, which was pushed on April 13, 2020.

Timestamp/Logs
--------------
When the failure occurs, the dcorch initial identity sync is failing every minute with logs like this:
2020-07-16 13:15:06.295 102948 INFO dcorch.engine.generic_sync_manager [-] updating state for subcloud subcloud3 - management_state: None availability_status: None initial_sync_state: requested
2020-07-16 13:15:06.342 102948 INFO dcorch.engine.initial_sync_manager [-] Initial sync for subcloud subcloud3
2020-07-16 13:15:06.342 102948 INFO dcorch.engine.generic_sync_manager [-] updating state for subcloud subcloud3 - management_state: None availability_status: None initial_sync_state: in-progress
2020-07-16 13:15:06.351 102948 INFO dcorch.engine.generic_sync_manager [-] Initial sync subcloud subcloud3
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager [-] Initial sync failed for subcloud3: Authorization failed: Could not recognize Fernet token (HTTP 404) (Request-ID: req-fab18f16-16df-4928-b7b8-3d96d0bb53d7) (HTTP 404): AuthorizationFailure: Authorization failed: Could not recognize Fernet token (HTTP 404) (Request-ID: req-fab18f16-16df-4928-b7b8-3d96d0bb53d7) (HTTP 404)
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager Traceback (most recent call last):
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager File "/usr/lib/python2.7/site-packages/dcorch/engine/initial_sync_manager.py", line 143, in _initial_sync_subcloud
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager self.gsm.initial_sync(self.context, subcloud_name)
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager File "/usr/lib/python2.7/site-packages/dcorch/engine/generic_sync_manager.py", line 151, in initial_sync
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager subcloud_engine.initial_sync()
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager File "/usr/lib/python2.7/site-packages/dcorch/engine/subcloud.py", line 164, in initial_sync
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager thread.initial_sync()
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager File "/usr/lib/python2.7/site-packages/dcorch/engine/sync_services/identity.py", line 319, in initial_sync
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager self._initial_sync_users(m_users, sc_users)
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager File "/usr/lib/python2.7/site-packages/dcorch/engine/sync_services/identity.py", line 205, in _initial_sync_users
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager self.reauthenticate_sc_clients()
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager File "/usr/lib/python2.7/site-packages/dcorch/engine/sync_services/identity.py", line 124, in reauthenticate_sc_clients
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager self.reauthenticate_sc_ks_client()
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager File "/usr/lib/python2.7/site-packages/dcorch/engine/sync_services/identity.py", line 138, in reauthenticate_sc_ks_client
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager project_domain_name=self.sc_admin_session.auth._project_domain_name,
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager File "/usr/lib/python2.7/site-packages/keystoneclient/httpclient.py", line 583, in authenticate
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager resp = self.get_raw_token_from_identity_service(**kwargs)
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager File "/usr/lib/python2.7/site-packages/keystoneclient/v3/client.py", line 349, in get_raw_token_from_identity_service
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager _('Authorization failed: %s') % e)
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager AuthorizationFailure: Authorization failed: Could not recognize Fernet token (HTTP 404) (Request-ID: req-fab18f16-16df-4928-b7b8-3d96d0bb53d7) (HTTP 404)
2020-07-16 13:15:07.770 102948 ERROR dcorch.engine.initial_sync_manager
2020-07-16 13:15:07.772 102948 INFO dcorch.engine.generic_sync_manager [-] updating state for subcloud subcloud3 - management_state: None availability_status: None initial_sync_state: failed

Test Activity
-------------
Developer Testing

Workaround
----------
Wait for about an hour and the subcloud should go in sync.

tags: added: stx.distcloud
Frank Miller (sensfan22)
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
tags: added: stx.5.0
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Andy (andy.wrs)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/742974

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/742974
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=9a4ad4921a56556eb47968257f80da49cb2e586b
Submitter: Zuul
Branch: master

commit 9a4ad4921a56556eb47968257f80da49cb2e586b
Author: Andy Ning <email address hidden>
Date: Fri Jul 24 14:19:34 2020 -0400

    Fix subcloud slow initial sync caused by keystone reauthentication

    Under certain conditions subcloud keystone client reauthentication
    will fail during initial sync. It will keep on failing in reattempts,
    causing endpoints to go in-sync slow after the subcloud is managed.

    This is caused by the keystone reauthentication reusing the existing
    token that is already invalid. Once that existing token expires
    (an hour at maximum), the reauthentication will succeed and endpoints
    go in-sync.

    This update fixed this by recreating subcloud admin session (and
    keystone, dcdbsync clients) by using username/password. With this
    the reauthentication will cover cases where subcloud admin session
    user changes (e.g., from admin to dcmanager or vice versa).

    Change-Id: I8a49bf8c55e3538fc47b833ae648b667b5e9e9e5
    Closes-Bug: 1887849
    Signed-off-by: Andy Ning <email address hidden>
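The difference between the two reauthentication strategies described in the commit message can be illustrated with a toy model. This is only a sketch: `FakeIdentityService` and `SubcloudSession` are hypothetical stand-ins written for this example, not dcorch or keystoneclient classes, and the real fix operates on keystone sessions rather than on these objects.

```python
import itertools


class FakeIdentityService:
    """Hypothetical stand-in for a subcloud identity service (not the keystone API)."""

    def __init__(self):
        self._valid_tokens = set()
        self._ids = itertools.count(1)

    def issue_token(self, username, password):
        # Assumption for the sketch: any username/password pair is accepted.
        token = "token-%d" % next(self._ids)
        self._valid_tokens.add(token)
        return token

    def invalidate_all(self):
        # Models the condition in which previously issued tokens stop being recognized.
        self._valid_tokens.clear()

    def validate(self, token):
        return token in self._valid_tokens


class SubcloudSession:
    """Hypothetical session holding credentials and a cached token."""

    def __init__(self, service, username, password):
        self.service = service
        self.username = username
        self.password = password
        self.token = service.issue_token(username, password)

    def reauthenticate_reusing_token(self):
        # Failure mode: keeps presenting the cached token, so it fails on
        # every retry for as long as that token remains invalid.
        if not self.service.validate(self.token):
            raise RuntimeError("Authorization failed: token not recognized")
        return self.token

    def reauthenticate_with_password(self):
        # Approach described in the fix: discard the cached token and obtain
        # a new one from the username and password.
        self.token = self.service.issue_token(self.username, self.password)
        return self.token
```

In this model, once `invalidate_all()` has been called, `reauthenticate_reusing_token()` raises on every attempt, while `reauthenticate_with_password()` succeeds immediately because it never depends on the stale token.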

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298
Reason: Updated merge soon

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/796528

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/796528
Committed: https://opendev.org/starlingx/distcloud/commit/4c5344f8765b372cb84d2b1181589c16db2ae6e4
Submitter: "Zuul (22348)"
Branch: f/centos8

commit cb979811017bd193fc1f06e53bb7830fd3184859
Author: Yuxing Jiang <email address hidden>
Date: Wed Jun 9 11:11:27 2021 -0400

    Format the IP addresses in payload before adding a subcloud

    The IPv6 addresses can be represented in multiple formats. As IP
    addresses are stored as text in database, ansible inventory and
    overrides, this commit converts the IP addresses in payload to
    standard text format of IPv6 address during adding a new subcloud.

    Tested with installing and bootstrapping a new subcloud(RVMC
    configured) with the correct IPv6 address values, but with
    unrecommended upper case letters and '0'. The addresses are
    converted to standard format in database, ansible inventory and
    overrides files.

    Partial-Bug: 1931459
    Signed-off-by: Yuxing Jiang <email address hidden>
    Change-Id: I6c26e749941f1ea2597f91886ad8f7da64521f0d

commit 2cf5d6d5cef0808c354f7575336aec34253993b3
Author: albailey <email address hidden>
Date: Thu May 20 14:19:24 2021 -0500

    Delete existing vim strategy from subcloud during patch orch

    When dcmanager creates a patch strategy, if a subcloud has an
    existing vim patch strategy, it will attempt to re-use
    that strategy during its patching phase, which may result in an
    error.

    This commit deletes the existing vim patch strategy in
    a subcloud, if it exists, so it can be re-created.
    If the strategy cannot be deleted, orchestration fails.

    Change-Id: Id35ef26ed3ddae6d71874fc6bac11df147f72323
    Closes-Bug: 1929221
    Signed-off-by: albailey <email address hidden>

commit 9e14c83f0162549a2a94cb8bc1e73dbc4f4d4887
Author: albailey <email address hidden>
Date: Tue Jun 1 14:37:14 2021 -0500

    Adding activation retry to upgrade orchestration

    When performing an activation, the keystone endpoints may not
    be accessible in the subcloud due to the asynchronous way that
    cert-mon can trigger a restart of keystone.

    This would have occasionally resulted in the upgrade activation
    failing to be initiated, and orchestration needing to be invoked
    again to resume.

    This 'hack' adds retries and sleeps to the initial
    activation action.

    Change-Id: Ic757521dec7bdc248a51a70b5463caafe7927360
    Partial-Bug: 1927550
    Signed-off-by: albailey <email address hidden>

commit bb604c0a9b872efd65fa45f1e2269995818c6262
Author: Tee Ngo <email address hidden>
Date: Thu May 27 22:17:16 2021 -0400

    Fix subcloud show --detail command related issues

    If the subcloud is offline, the command stalls and eventually returns
    the "ERROR (app)" output. If the subcloud is online, the oam_floating_ip
    info is excluded from the output when the subcloud id instead of subcloud
    name is specified.

    This commit fixes both of the above issues.

    Closes-Bug: 1929893
    Change-Id: I995591368564539b0e6af185b1adba2db73e0e46
    Sign...

tags: added: in-f-centos8