During installation, subcloud1 load_sync_status didn't reach in-sync within 15 minutes after it got managed

Bug #1891770 reported by Difu Hu
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Chris Friesen

Bug Description

Brief Description
-----------------
During installation, subcloud1 load_sync_status didn't reach in-sync within 15 minutes after it got managed

Severity
--------
Major

Steps to Reproduce
------------------
During subcloud1 installation, dcmanager subcloud manage subcloud1

Expected Behavior
------------------
subcloud1 load_sync_status reached in-sync within 15 minutes

Actual Behavior
----------------
subcloud1 load_sync_status was still out-of-sync after 15 minutes

Reproducibility
---------------
1/1

System Configuration
--------------------
Lab-name: DC-3 subcloud1

Branch/Pull Time/Commit
-----------------------
2020-06-27_18-35-20 with PATCH_0002

Last Pass
---------
2020-08-10_20-00-00

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud manage subcloud1
+-----------------------------+----------------------------+
| Field | Value |
+-----------------------------+----------------------------+
| id | 3 |
| name | subcloud1 |
| description | None |
| location | None |
| software_version | 20.06 |
| management | managed |
| availability | online |
| deploy_status | complete |
| management_subnet | fd01:12::0/64 |
| management_start_ip | fd01:12::2 |
| management_end_ip | fd01:12::11 |
| management_gateway_ip | fd01:12::1 |
| systemcontroller_gateway_ip | fd01:11::1 |
| group_id | 1 |
| created_at | 2020-08-15T18:18:56.697508 |
| updated_at | 2020-08-15T18:54:40.594590 |
+-----------------------------+----------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud show subcloud1
+-----------------------------+----------------------------+
| Field | Value |
+-----------------------------+----------------------------+
| id | 3 |
| name | subcloud1 |
| description | None |
| location | None |
| software_version | 20.06 |
| management | managed |
| availability | online |
| deploy_status | complete |
| management_subnet | fd01:12::0/64 |
| management_start_ip | fd01:12::2 |
| management_end_ip | fd01:12::11 |
| management_gateway_ip | fd01:12::1 |
| systemcontroller_gateway_ip | fd01:11::1 |
| group_id | 1 |
| created_at | 2020-08-15 18:18:56.697508 |
| updated_at | 2020-08-15 18:54:40.594590 |
| dc-cert_sync_status | in-sync |
| firmware_sync_status | in-sync |
| identity_sync_status | in-sync |
| load_sync_status | unknown |
| patching_sync_status | unknown |
| platform_sync_status | in-sync |
+-----------------------------+----------------------------+

Test Activity
-------------
Functional Testing

Revision history for this message
Difu Hu (difuhu) wrote :
summary: During installation, subcloud1 load_sync_status didn't reach in-sync
- within 15 minutes it got managed
+ within after 15 minutes it got managed
summary: During installation, subcloud1 load_sync_status didn't reach in-sync
- within after 15 minutes it got managed
+ within 15 minutes after it got managed
Revision history for this message
Difu Hu (difuhu) wrote :

Similar issue was seen during FPGA fw orchestration.
After a subcloud was image updated, lock/unlock and went online again, mostly, it took 15 minutes for load_sync_status changed from unknown to in-sync. But sometimes, it took 30 minutes for load_sync_status to become in-sync.

Ghada Khalil (gkhalil)
tags: added: stx.distcloud
Revision history for this message
Ghada Khalil (gkhalil) wrote :

The audit runs every 30mins to sync the load status only and every 60mins to sync the f/w status. So this is currently expected behavior.

Changed in starlingx:
assignee: nobody → Bart Wensley (bartwensley)
Revision history for this message
Bart Wensley (bartwensley) wrote :

Taking a closer look at the code, we do trigger an audit for all resources each time a subcloud transitions from unmanaged to managed. In this case, it looks like the problem was that the patch/load audits were run before the initial sync for the subcloud was done and the initial sync (which updates keystone data in the subcloud) caused the token for the subcloud to be invalidated.

Here is where the subcloud was managed:
2020-08-15T18:54:39.000 controller-0 -sh: info HISTORY: PID=3334493 UID=42425 dcmanager --os-username 'admin' --os-password 'Li69nux*' --os-tenant-name admin --os-auth-url http://[fd01:11::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL subcloud manage subcloud1

The initial sync was done here:
2020-08-15 18:54:50.185 204600 INFO dcorch.engine.initial_sync_manager [-] Initial sync for subcloud subcloud1
2020-08-15 18:54:54.750 204600 INFO dcorch.engine.generic_sync_manager [-] updating state for subcloud subcloud1 - management_state: None availability_status: None initial_sync_state: completed

The patch audit started while the initial sync was being done and failed due to a 401 Unauthorized response from keystone in the subcloud:
2020-08-15 18:54:52.533 204590 INFO dcmanager.audit.patch_audit [-] Triggered patch audit for subcloud: subcloud1.
2020-08-15 18:54:52.653 204590 ERROR dccommon.drivers.openstack.patching_v1 [-] query failed with RC: 401
2020-08-15 18:54:52.653 204590 WARNING dcmanager.audit.patch_audit [-] Cannot retrieve patches for subcloud: subcloud1, skip patch audit: Exception: query failed with RC: 401

Looking at subcloud2, which was managed later, the same problem doesn't happen.

Here is where the subcloud was managed:
2020-08-15T19:34:24.000 controller-0 -sh: info HISTORY: PID=3818089 UID=42425 dcmanager --os-username 'admin' --os-password 'Li69nux*' --os-tenant-name admin --os-auth-url http://[fd01:11::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL subcloud manage subcloud2

The initial sync was done here:
2020-08-15 19:34:26.165 204600 INFO dcorch.engine.initial_sync_manager [-] Initial sync for subcloud subcloud2
2020-08-15 19:34:30.652 204600 INFO dcorch.engine.generic_sync_manager [-] updating state for subcloud subcloud2 - management_state: None availability_status: None initial_sync_state: completed

The patch audit was done after the initial sync completed - the audit was successful:
2020-08-15 19:34:30.677 204590 INFO dcmanager.audit.patch_audit [-] Triggered patch audit for subcloud: subcloud2.

The solution here may be to delay the patch/load/firmware audits that are triggered after the manage, to give the initial sync time to complete.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - the subcloud goes in-sync, but is slow. Would be nice to fix in stx.5.0

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0
Changed in starlingx:
assignee: Bart Wensley (bartwensley) → Jessica Castelino (jcasteli)
Changed in starlingx:
assignee: Jessica Castelino (jcasteli) → Bart Wensley (bartwensley)
Changed in starlingx:
assignee: Bart Wensley (bartwensley) → Chris Friesen (cbf123)
Revision history for this message
Chris Friesen (cbf123) wrote :

As of https://review.opendev.org/c/starlingx/distcloud/+/777208 we will trigger all the sub-audits for a subcloud when it transitions from offline to online, so this should no longer be an issue.

Changed in starlingx:
status: Triaged → Fix Released
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.