Distributed Cloud: Initial keystone sync does not handle failures correctly

Bug #1862946 reported by Bart Wensley
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Dan Voiculeasa

Bug Description

Brief Description
-----------------
The initial sync code for keystone (identity) does not handle failures properly. For example, if an initial sync is triggered, but the request to the subcloud to get the users fails due to “Unauthorized request”, then the initial sync passes without actually doing an initial sync. This will delay the availability of the subcloud - it may go offline due to unsynced identity data.

Severity
--------
Major: Subcloud seems to recover after some time.

Steps to Reproduce
------------------
This happens regularly when a subcloud is installed and managed.

Expected Behavior
------------------
If the initial keystone sync fails for any reason, the sync should indicate the failure so the dcorch code can re-attempt the initial sync until it is successful.

Actual Behavior
----------------
Failures in the initial keystone sync are silent.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Distributed cloud

Branch/Pull Time/Commit
-----------------------
Designer load built from master on Feburary 4, 2020.

Last Pass
---------
Unknown

Timestamp/Logs
--------------
The following logs show the initial sync passing when it should have failed:

2020-02-11 15:24:26.036 1779411 INFO dcorch.engine.generic_sync_manager [-] Initial sync subcloud subcloud3
2020-02-11 15:24:27.023 1779411 INFO dcorch.engine.sync_services.identity [-] subcloud3/identity: Get resource [users] request failed for subcloud3: Unauthorized request..
2020-02-11 15:24:28.018 1779411 INFO dcorch.drivers.openstack.sdk_platform [-] get new keystone client for subcloud RegionOne
2020-02-11 15:24:28.727 1779411 INFO dcorch.drivers.openstack.sdk_platform [-] get new keystone client for subcloud subcloud3
2020-02-11 15:24:30.042 1779411 INFO dcorch.drivers.openstack.sysinv_v1 [-] post_fernet_repo driver region=subcloud3 fernet_repo_list=[{'id': 0, 'key': u'Glxb7qe-NAxWagc-Mg8Cm9iyAO4knyrGkgmq7mni-Po='}, {'id': 1, 'key': u'ZoiHpBPAL0sMWJ40Nbc7sUxzbxJQRfmgy6kpVzYJH0I='}]
2020-02-11 15:24:30.181 1779411 INFO dcorch.engine.alarm_aggregate_manager [-] Enabling fm-aggregation trap for region_name=subcloud3
2020-02-11 15:24:30.182 1779411 INFO dcorch.drivers.openstack.sdk_platform [-] get new keystone client for subcloud subcloud3
2020-02-11 15:24:31.358 1779411 INFO dcorch.drivers.openstack.sysinv_v1 [-] snmp_trapdest_create driver region=subcloud3trapdest_create_dict={'ip_address': '192.168.204.2', 'community': 'dcorchAlarmAggregator'}
2020-02-11 15:24:31.454 1779411 WARNING cgtsclient.common.http [-] Request returned failure status.
2020-02-11 15:24:31.455 1779411 INFO dcorch.drivers.openstack.sysinv_v1 [-] snmp_trapdest_create exists region=subcloud3trapdest_dict={'ip_address': '192.168.204.2', 'community': 'dcorchAlarmAggregator'}
2020-02-11 15:24:31.475 1779411 INFO dcorch.drivers.openstack.sysinv_v1 [-] snmp_trapdest_create found existing <iTrapdest {u'uuid': u'4f09545d-21ea-4d0f-9925-0cb976701961', u'links': [{u'href': u'http://192.168.102.2:6385/v1/itrapdest/4f09545d-21ea-4d0f-9925-0cb976701961', u'rel': u'self'}, {u'href': u'http://192.168.102.2:6385/itrapdest/4f09545d-21ea-4d0f-9925-0cb976701961', u'rel': u'bookmark'}], u'ip_address': u'192.168.204.2', u'community': u'dcorchAlarmAggregator', u'type': u'snmpv2c_trap', u'port': 162, u'transport': u'udp'}>for region: subcloud3
2020-02-11 15:24:31.476 1779411 INFO dcorch.engine.alarm_aggregate_manager [-] Updating alarm summary for subcloud3
2020-02-11 15:24:31.611 1779411 INFO dcorch.drivers.openstack.sdk [-] Using cached OS client objects subcloud3
2020-02-11 15:24:31.611 1779411 INFO dcorch.drivers.openstack.fm [-] get_alarm_summary region subcloud3
2020-02-11 15:24:32.680 1779411 INFO dcorch.engine.generic_sync_manager [-] enabling subcloud subcloud3
2020-02-11 15:24:33.193 1779411 INFO dcorch.engine.sync_thread [-] subcloud3/platform: Got 0 sync request(s)
2020-02-11 15:24:33.196 1779411 INFO dcorch.engine.sync_thread [-] subcloud3/platform: subcloud is disabled
2020-02-11 15:24:33.697 1779411 INFO dcorch.engine.sync_services.identity [-] subcloud3/identity: Identity session and clients initialized
2020-02-11 15:24:33.699 1779411 INFO dcorch.engine.generic_sync_manager [-] updating state for subcloud subcloud3 - management_state: None availability_status: None initial_sync_state: completed
2020-02-11 15:24:33.712 1779411 INFO dcorch.engine.sync_thread [-] subcloud3/identity: Got 0 sync request(s)

In this case, I think the bug is in get_subcloud_resources (dcorch/engine/sync_services/identity.py) - when handling the Unauthorized exception, it calls an audit function, which will just fail and return None if there is another auth failure. There are probably similar bugs in other places.

Test Activity
-------------
Developer Testing

Workaround
----------
Wait for audit to correctly sync keystone data. I am not sure that this will resolve the issue in all cases.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - issue with handling of failure condition

tags: added: stx.distcloud
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.4.0
Changed in starlingx:
assignee: nobody → Tao Liu (tliu88)
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Tao Liu (tliu88) → Dan Voiculeasa (dvoicule)
Revision history for this message
Dan Voiculeasa (dvoicule) wrote :

Using 2020-03-27 load.

In an initial sync only `users` and `projects` resources are synced which are handled by dbsync_client.

During investigation the only thing that I found is that a resource[2] which is not used during initial sync does not produce a proper error message during an artificially induced retry[1], although an exception is still thrown.

Other attemps to reproduce an uncaught Exception:
1) Constantly rotating fernet keys on subcloud
 - Gives one Unauthorized, the retry mechanism[1] in get_master/subcloud_resource fixes it and the retry passes

2) Power off DC just before `manage` command
 - Caught by InitialSyncManager

3) Artificially raised exceptions all caught by InitialSyncManager
 - dbsync_exceptions.Unauthorized + on retry[1]: dbsync_exceptions.Unauthorized
 - dbsync_exceptions.Unauthorized + on retry[1]: custom
 - custom

[1]when dbsync_exceptions.Unauthorized in get_master/subcloud_resource, then retry once

[2]RESOURCE_TYPE_IDENTITY_PROJECT_ROLE_ASSIGNMENTS = "project_role_assignments"

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/716951

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/716951
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=bd0d0c54e4c5d6fad368970622ecbb6b356648aa
Submitter: Zuul
Branch: master

commit bd0d0c54e4c5d6fad368970622ecbb6b356648aa
Author: Dan Voiculeasa <email address hidden>
Date: Wed Apr 1 19:29:16 2020 +0300

    Improve handling of token expiry

    Extract common code.

    Reauthenticate ks_client.

    Retry once when Unauthorized, using the correct exception.
      dbs_client <-> dbsync_exceptions.Unauthorized
      ks_client <-> keystone_exceptions.Unauthorized

    Partial-Bug: 1862946

    Change-Id: Ife84b84a26ce9b82c2d9e9ca140f2a5fac000b7c
    Signed-off-by: Dan Voiculeasa <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729815

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)
Download full text (19.6 KiB)

Reviewed: https://review.opendev.org/729815
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=a39415ff84d71a54580eb8cbe885647e69279306
Submitter: Zuul
Branch: f/centos8

commit 5c8377047ba679ce88a0360315df494a74208bbf
Author: Tao Liu <email address hidden>
Date: Tue May 5 08:59:59 2020 -0500

    Move subcloud audit to separate process

    Remove subcloud audit from dcmanager-manager process.
    Create dcmanager-audit process & associated files.
    Add new RPC calls for dcmanager-audit to notify dcmanager
    subcloud availability and sync endpoint type changes.
    Update dcmanager to handle availability and sync endpoint
    type updates from dcmanager-audit.
    Subcloud audit interval will be reduced to 20 seconds.
    Create/update unit tests, to verify the implementation
    changes.

    Story: 2007267
    Task: 39637

    Change-Id: Iff408166753f22ce3616d34e267ca1155ac43042
    Signed-off-by: Tao Liu <email address hidden>

commit d46516c46d24f8fea6ea71fdbe2f3fa2d296eb4d
Author: albailey <email address hidden>
Date: Wed May 13 14:00:12 2020 -0500

    Enable python3 unit tests as part of zuul

    The existing py27 unit tests were not all passing in py36,
    however now they are and so the zuul check and gate for py36
    have been added.

    Change-Id: Ie293ec69a04e6fd657f960aa9a135c428138b4b4
    Story: 2004515
    Task: 39768
    Signed-off-by: albailey <email address hidden>

commit dbf603b4c45e865e56a21d3b14d94bcc8d5f455c
Author: albailey <email address hidden>
Date: Mon May 11 10:29:45 2020 -0500

    Reduce the number of suppressed pylint warnings

    All pylint warnings were being suppressed by a wildcard.
    This commit only suppresses the warnings that are failing and
    prevents checks that would pass from being broken in later commits.

    The warnings being suppressed can be resolved individually
    by later submissions based on priority where appropriate.

    This commit also specifies python3 for pylint which has
    more recent checks.

    Change-Id: Ie29aeb0ea3e9dcb671af67f38e9a3f919ea7111e
    Story: 2004515
    Task: 39734
    Signed-off-by: albailey <email address hidden>

commit 15fb58f45c0f552eccd9c27ba023dbea560f27f2
Author: albailey <email address hidden>
Date: Fri May 8 13:20:23 2020 -0500

    Enhance Upgrade strategy to use endpoint audit status

    The distributed cloud audit was updated to include 'load'
    endpoint status, so the upgrade strategy is now able to
    make use of that information when constructing a strategy.

    Change-Id: I69eb4d98b9abf38b329e13fb116fc098db2bd736
    Story: 2007403
    Task: 39736
    Signed-off-by: albailey <email address hidden>

commit acc710093bb1c9581670d20b62c68ad669b3a3ad
Author: MCamp859 <email address hidden>
Date: Mon May 11 14:28:28 2020 -0400

    Minor edits to test docs promote issue

    Change-Id: I08ffc5e57b5b04c59102a6491f2fcdc256f16e0f
    Signed-off-by: MCamp859 <email address hidden>

commit 6d4fa855462cc6faf2e962f9d825b832f2885aa3
Author: Tee Ngo <email address hidden>
Date: Mon May 4 23:53:20 2020 -0400

    Extend subcloud audit...

tags: added: in-f-centos8
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.