Distributed Cloud - 5 of 9 subclouds can't sync on identity after parallel manage

Bug #1855359 reported by Gerry Kopec on 2019-12-06
Affects: StarlingX
Status: Triaged
Importance: Medium
Assigned to: Dariush Eslimi

Bug Description

Brief Description
-----------------
Attempted to manage 9 subclouds in parallel. All 9 were able to sync on platform, but 5 of 9 failed to sync on identity (note that it was expected that patching would not sync).

Thu Dec 5 00:07:17 UTC 2019
| name       | identity_sync_status | patching_sync_status | platform_sync_status |
|------------|----------------------|----------------------|----------------------|
| subcloud1  | out-of-sync          | out-of-sync          | in-sync              |
| subcloud2  | unknown              | unknown              | unknown              |
| subcloud3  | out-of-sync          | out-of-sync          | in-sync              |
| subcloud4  | out-of-sync          | out-of-sync          | in-sync              |
| subcloud5  | in-sync              | out-of-sync          | in-sync              |
| subcloud6  | out-of-sync          | out-of-sync          | in-sync              |
| subcloud7  | out-of-sync          | out-of-sync          | in-sync              |
| subcloud8  | in-sync              | out-of-sync          | in-sync              |
| subcloud9  | in-sync              | out-of-sync          | in-sync              |
| subcloud10 | in-sync              | out-of-sync          | in-sync              |

Severity
--------
Major

Steps to Reproduce
------------------
Set up a DC system with 9 subclouds, all in the unmanaged/online/complete state.
Run "dcmanager subcloud manage" for all 9 subclouds one after another.

Expected Behavior
------------------
Expect all to go in-sync on platform and identity.

Actual Behavior
----------------
5 of 9 subclouds fail to go in-sync on identity.

Reproducibility
---------------
first attempt

System Configuration
--------------------
All-in-one duplex plus worker, DC system controller

Branch/Pull Time/Commit
-----------------------
2019-11-06_10-52-51

Last Pass
---------
first attempt

Timestamp/Logs
--------------
Manage commands were run from Wed Dec 4 23:02:14 UTC 2019 to 23:05:02
Attached logs for active system controller and subcloud1.

Test Activity
-------------
DC engineering

Ghada Khalil (gkhalil) on 2019-12-06
tags: added: stx.distcloud
Changed in starlingx:
assignee: nobody → Andy (andy.wrs)
Ghada Khalil (gkhalil) wrote :

Marking as stx.4.0 / medium priority given this is an issue with parallel operations, so it can be avoided by managing the clouds serially

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.4.0
Andy (andy.wrs) wrote :

From the dcorch log we can see that dcorch got the following sync requests from its audit during the time period in question. Based on the algorithm, each of these logging points triggers a sync status update request to dcmanager to update the identity endpoint status (a rough sketch of this behavior follows the log excerpt below).

[root@controller-0 log(keystone_admin)]# zgrep "subcloud1\/identity: Got .* sync request" /home/sysadmin/andy_test/dcorch.log.17.gz
2019-12-04 23:02:28.244 2661230 INFO dcorch.engine.sync_thread [-] subcloud1/identity: Got 0 sync request(s)
2019-12-04 23:02:29.586 2661230 INFO dcorch.engine.sync_thread [-] subcloud1/identity: Got 1 sync request(s)
2019-12-04 23:02:29.737 2661230 INFO dcorch.engine.sync_thread [-] subcloud1/identity: Got 6 sync request(s)
2019-12-04 23:02:30.402 2661230 INFO dcorch.engine.sync_thread [-] subcloud1/identity: Got 0 sync request(s)

2019-12-04 23:09:07.830 2661230 INFO dcorch.engine.sync_thread [-] subcloud1/identity: Got 1 sync request(s)
2019-12-04 23:09:07.992 2661230 INFO dcorch.engine.sync_thread [-] subcloud1/identity: Got 0 sync request(s)
2019-12-04 23:09:08.142 2661230 INFO dcorch.engine.sync_thread [-] subcloud1/identity: Got 1 sync request(s)
2019-12-04 23:09:08.411 2661230 INFO dcorch.engine.sync_thread [-] subcloud1/identity: Got 2 sync request(s)
2019-12-04 23:09:08.775 2661230 INFO dcorch.engine.sync_thread [-] subcloud1/identity: Got 2 sync request(s)
2019-12-04 23:09:09.119 2661230 INFO dcorch.engine.sync_thread [-] subcloud1/identity: Got 0 sync request(s)
2019-12-04 23:09:09.457 2661230 INFO dcorch.engine.sync_thread [-] subcloud1/identity: Got 1 sync request(s)
2019-12-04 23:09:10.027 2661230 INFO dcorch.engine.sync_thread [-] subcloud1/identity: Got 0 sync request(s)
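
As a rough illustration of that algorithm (a sketch with hypothetical names, not the actual dcorch code): each audit pass logs how many sync requests it found, reports out-of-sync while work is still queued, and reports in-sync once the queue drains, which matches the status sequence correlated with the dcmanager log below.

import logging

LOG = logging.getLogger(__name__)

def audit_pass(subcloud, endpoint, pending_requests, update_status, apply_request):
    # Mirrors the "Got N sync request(s)" logging points above; the
    # pending count drives the status pushed to dcmanager.
    LOG.info("%s/%s: Got %d sync request(s)",
             subcloud, endpoint, len(pending_requests))
    if pending_requests:
        update_status(subcloud, endpoint, 'out-of-sync')  # work still queued
        for req in pending_requests:
            apply_request(req)  # replicate the change to the subcloud
    else:
        update_status(subcloud, endpoint, 'in-sync')  # caught up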

Then from the dcmanager log, we can see that the first 4 entries in dcorch match the first 3 "Updating" entries in dcmanager. But the remaining 8 requests in dcorch do not match the remaining 5 "Updating" entries in dcmanager. The sync status sequence in dcmanager should be:

out-of-sync -> in-sync -> out-of-sync -> in-sync -> out-of-sync -> in-sync

And there are 2 consecutive "out-of-sync" updates in dcmanager, which should not happen at all.

[root@controller-0 log(keystone_admin)]#
[root@controller-0 log(keystone_admin)]# grep "Updating subcloud:subcloud1 endpoint:identity sync" /home/sysadmin/andy_test/dcmanager.log
2019-12-04 23:02:28.260 728803 INFO dcmanager.manager.subcloud_manager [req-fbcd0a36-5137-41a0-8bb1-91062b72f8b5 - - - - -] Updating subcloud:subcloud1 endpoint:identity sync:in-sync
2019-12-04 23:02:29.601 728803 INFO dcmanager.manager.subcloud_manager [req-fbcd0a36-5137-41a0-8bb1-91062b72f8b5 - - - - -] Updating subcloud:subcloud1 endpoint:identity sync:out-of-sync
2019-12-04 23:02:30.419 728803 INFO dcmanager.manager.subcloud_manager [req-fbcd0a36-5137-41a0-8bb1-91062b72f8b5 - - - - -] Updating subcloud:subcloud1 endpoint:identity sync:in-sync

2019-12-04 23:09:09.303 728803 INFO dcmanager.manager.subcloud_manager [req-fbcd0a36-5137-41a0-8bb1-91062b72f8b5 - - - - -] Updating subcloud:subcloud1 endpoint:identity sync:out-of-sync
2019-12-04 23:09:09.346 728803 INFO dcmanager.manager.subcloud_manager [req-fbcd0a36-5137-41a0-8bb1-91062b72f8b5 - - - - -] Updating subcloud:subcloud1 endpoint:identity sync:...


Andy (andy.wrs) wrote :

Here is some input from Bart Wensley:

I took a look at the logs attached to LP 1855359 and I think I know what the problem is. The dcmanager (and dcorch) is based on oslo_service and is being created with two workers (greenthreads):

2019-11-21 15:48:59.028 122427 INFO dcmanager.api [-] Server on http://fd01:6::2:8119 with 2
2019-11-21 15:48:59.057 122427 INFO oslo.service.wsgi [-] DCManager listening on fd01:6::2:8119
2019-11-21 15:48:59.058 122427 INFO oslo_service.service [-] Starting 2 workers

In the case Kevin raised below, the trigger for the issue was the problems we’re seeing with the subcloud_audit taking too long - in this case the audit ran for 90 seconds:

2019-12-04 23:08:43.853 728803 WARNING oslo.service.loopingcall [-] Function 'dcmanager.manager.service.DCManagerService.subcloud_audit' run outlasted interval by 90.96 sec
2019-12-04 23:08:43.854 728803 INFO dcmanager.manager.subcloud_audit_manager [-] Triggered subcloud audit.

I assume the audit tied up both workers somehow. That resulted in a large number of update_subcloud_endpoint RPCs being queued by the dcorch to be processed by dcmanager. Once dcmanager was unblocked, it processed more than 50 of these requests in 3 seconds. The code that handles these requests (_update_endpoint_status_for_subcloud in dcmanager/manager/subcloud_manager.py) does multiple calls to the DB and to the FM API - any of these calls can cause the eventlet to swap out and another one to run. That can result in the update_subcloud_endpoint RPCs being processed in the wrong order (or in parallel).

Note:
The issue raised by Kevin Smith is:
The dcmanager first checks whether the status it is asked to write is the same as the status already in the database, and if so it does not update the database (only a debug log is produced). It should therefore be _impossible_ to see 2 consecutive logs (like the two consecutive "out-of-sync" updates shown above) updating the same endpoint status on the same subcloud to the same value.
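
A minimal sketch of the interleaving Bart describes, and of one way to serialize it with a per-subcloud lock. This assumes eventlet; the helper names (db_get_status, db_set_status, _get_lock) are hypothetical illustrations, not the actual dcmanager code:

import eventlet

# In-memory stand-ins for the real DB and FM API calls (hypothetical).
_status_db = {}

def db_get_status(subcloud, endpoint):
    eventlet.sleep(0)  # simulate I/O: each call is a greenthread switch point
    return _status_db.get((subcloud, endpoint))

def db_set_status(subcloud, endpoint, status):
    eventlet.sleep(0)  # another potential switch point
    _status_db[(subcloud, endpoint)] = status

# One semaphore per subcloud: updates for different subclouds can still
# run concurrently, but updates for the same subcloud are applied in the
# order their greenthreads acquire the lock.
_subcloud_locks = {}

def _get_lock(subcloud):
    return _subcloud_locks.setdefault(subcloud, eventlet.semaphore.Semaphore())

def update_endpoint_status(subcloud, endpoint, status):
    # Without the lock, the switch points above let another greenthread
    # interleave its own read-compare-write, producing the out-of-order
    # and duplicate updates seen in the dcmanager log.
    with _get_lock(subcloud):
        if db_get_status(subcloud, endpoint) == status:
            return  # unchanged; mirrors the existing "skip if equal" check
        db_set_status(subcloud, endpoint, status)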

Andy (andy.wrs) wrote :

With some debugging logs in a newly installed DC lab, here is another summary of the findings:

1. After being managed, among the 10 subclouds:

   identity: 3 in-sync, 3 out-of-sync, 4 unknown

   platform: 6 in-sync, 4 unknown (same subclouds as identity)

   patching: 10 in-sync

2. From the debugging log we can see that all the sync requests from dcorch reached dcmanager. No requests were lost.

3. For those with identity out-of-sync, I see a similar pattern to the one discussed before in terms of requests sent from dcorch and handled by dcmanager. I still see 2 "out-of-sync" updates in a row.

4. For those with identity and platform in unknown.

   After some rounds of dcorch audit, dcmanager notified dcorch that the subcloud was offline, so dcorch stopped auditing. Later, when dcmanager notified dcorch that the subcloud was online again, the notification failed, so the dcorch audit remained stopped (see the sketch below).
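
A minimal sketch of the audit-gating behavior described in point 4 (hypothetical names, not the actual dcorch code). Because the availability notification is delivered only once per state change, a single lost "online" RPC leaves the audit stopped and the endpoint statuses stuck at unknown:

class SubcloudAuditState:
    """Toy model of dcorch's per-subcloud audit gating (hypothetical)."""

    def __init__(self):
        self.audit_enabled = True  # subclouds start out audited

    def on_availability_update(self, online):
        # dcmanager -> dcorch notification, sent once per state change;
        # if the "online" notification is lost, the flag never flips
        # back to True and no further audits run.
        self.audit_enabled = online

    def run_audit(self, do_audit_work):
        if not self.audit_enabled:
            return  # no audit runs; endpoint statuses stay "unknown"
        do_audit_work()  # normal identity/platform/patching audit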

Ghada Khalil (gkhalil) wrote :

Assigning to Dariush Eslimi (DC PL) as the scope of this is large and needs to be scheduled accordingly.

Changed in starlingx:
assignee: Andy (andy.wrs) → Dariush Eslimi (deslimi)