delete host route from subcloud and add back, subcloud didn't reach in-sync within 660s

Bug #1891102 reported by Difu Hu
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Chris Friesen

Bug Description

Brief Description
-----------------
delete host route from subcloud and add back, subcloud didn't reach in-sync within 660s

Severity
--------
Major

Steps to Reproduce
------------------
system --os-region-name subcloud4 host-route-list 1
system --os-region-name subcloud4 host-route-delete f52f8a45-d168-4e68-b50e-485a6e862066
(wait subcloud offline)
system --os-region-name subcloud4 host-route-add 1 mgmt0 fd01:11:: 64 fd01:304::1

Expected Behavior
------------------
subcloud should reach in-sync within 660s

Actual Behavior
----------------
subcloud didn't reach in-sync within 660s

Reproducibility
---------------
3/3

System Configuration
--------------------
Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
2020-08-07_20-00-00

Last Pass
---------
2020-07-17_20-00-00

Timestamp/Logs
[2020-08-09 07:30:00,564] 314 DEBUG MainThread ssh.send :: Send 'system --os-region-name subcloud4 host-route-list 1'
[2020-08-09 07:30:01,708] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+--------+-----------+--------+-------------+--------+
| uuid | ifname | network | prefix | gateway | metric |
+--------------------------------------+--------+-----------+--------+-------------+--------+
| f52f8a45-d168-4e68-b50e-485a6e862066 | mgmt0 | fd01:11:: | 64 | fd01:304::1 | 1 |
+--------------------------------------+--------+-----------+--------+-------------+--------+

[2020-08-09 07:30:01,813] 314 DEBUG MainThread ssh.send :: Send 'system --os-region-name subcloud4 host-route-delete f52f8a45-d168-4e68-b50e-485a6e862066'
[2020-08-09 07:30:05,518] 436 DEBUG MainThread ssh.expect :: Output:
Deleted Route: f52f8a45-d168-4e68-b50e-485a6e862066

[2020-08-09 07:31:39,059] 314 DEBUG MainThread ssh.send :: Send 'dcmanager subcloud list'
[2020-08-09 07:31:40,034] 436 DEBUG MainThread ssh.expect :: Output:
+----+-----------+------------+--------------+---------------+---------+
| id | name | management | availability | deploy status | sync |
+----+-----------+------------+--------------+---------------+---------+
| 1 | subcloud3 | managed | online | complete | in-sync |
| 2 | subcloud4 | managed | offline | complete | unknown |
| 3 | subcloud1 | managed | online | complete | in-sync |
| 4 | subcloud2 | managed | online | complete | in-sync |
+----+-----------+------------+--------------+---------------+---------+

[2020-08-09 07:31:47,565] 314 DEBUG MainThread ssh.send :: Send 'system --os-region-name subcloud4 host-route-add 1 mgmt0 fd01:11:: 64 fd01:304::1'
[2020-08-09 07:31:51,886] 436 DEBUG MainThread ssh.expect :: Output:
+----------------+--------------------------------------+
| Property | Value |
+----------------+--------------------------------------+
| uuid | 0fab61e1-d58a-48ce-a01a-1fe63f822d98 |
| interface_uuid | e149d59d-5c46-4085-9032-68036cacb1c9 |
| ifname | mgmt0 |
| forihostid | 1 |
| network | fd01:11:: |
| prefix | 64 |
| gateway | fd01:304::1 |
| metric | 1 |
+----------------+--------------------------------------+

[2020-08-09 07:43:16,759] 314 DEBUG MainThread ssh.send :: Send 'dcmanager subcloud list'
[2020-08-09 07:43:17,753] 436 DEBUG MainThread ssh.expect :: Output:
+----+-----------+------------+--------------+---------------+-------------+
| id | name | management | availability | deploy status | sync |
+----+-----------+------------+--------------+---------------+-------------+
| 1 | subcloud3 | managed | online | complete | in-sync |
| 2 | subcloud4 | managed | online | complete | out-of-sync |
| 3 | subcloud1 | managed | online | complete | in-sync |
| 4 | subcloud2 | managed | online | complete | in-sync |
+----+-----------+------------+--------------+---------------+-------------+

Test Activity
-------------
Regression Testing

Revision history for this message
Difu Hu (difuhu) wrote :
summary: - delete host route to subcloud and add back, subcloud didn't reach in-
+ delete host route from subcloud and add back, subcloud didn't reach in-
sync within 660s
description: updated
Revision history for this message
Bart Wensley (bartwensley) wrote :
Download full text (3.8 KiB)

The issue seen here is that it can take up to 15 minutes for the patch audit to run after a subcloud goes online, so the patching endpoint can stay out of sync for longer than the 10 minutes (600s) expected in the testcase. A change was recently made in this area that caused this failure:
https://review.opendev.org/#/c/740963

Prior to the above change, the frequency of the patch audit was based on the number of subclouds configured. With a small number of subclouds, the patch audit could happen as often as every 5 minutes, but with a large number of subclouds (e.g. 100 or more), the frequency would drop to 13 minutes or more. The change above made the audit frequency predictable and set it to every 15 minutes. Note that this testcase would have failed even without the above change, if it was run in a system with a large number of subclouds.

The solution here is probably to trigger the patch audit (and other dcmanager based audits like the firmware audit) whenever a subcloud goes online. This would line up with the way the dcorch-engine triggers its audits when a subcloud goes online. One limitation is that our current trigger mechanism for the patch audit results in all subclouds being audited. This should be OK though - in the case where multiple subclouds go online at roughly the same time, the trigger will only result in one (or possibly two) patch audits being triggered as the audits happen at a fixed interval and multiple triggers in the same audit interval only result in a single audit being done.

Here are the key logs...

# Patch audit done here
2020-08-09 07:31:03.625 128417 INFO dcmanager.audit.subcloud_audit_manager [-] Trigger patch audit

# subcloud4 goes offline
2020-08-09 07:31:24.769 128417 INFO dcmanager.audit.subcloud_audit_manager [-] Setting new availability status: offline on subcloud: subcloud4

# subcloud4 goes online
2020-08-09 07:31:58.718 128417 INFO dcmanager.audit.subcloud_audit_manager [-] Setting new availability status: online on subcloud: subcloud4
2020-08-09 07:31:58.721 128090 INFO dcmanager.manager.service [req-80835abd-8876-4ec3-a7bb-eb9b2e5a7177 - - - - -] Handling update_subcloud_availability request for: subcloud4
2020-08-09 07:31:58.922 128090 INFO dcmanager.manager.subcloud_manager [req-80835abd-8876-4ec3-a7bb-eb9b2e5a7177 - - - - -] Updating subcloud:subcloud4 endpoint:dc-cert sync:in-sync

# dcorch is notified that subcloud4 is online
2020-08-09 07:31:59.107 128090 INFO dcmanager.manager.subcloud_manager [req-80835abd-8876-4ec3-a7bb-eb9b2e5a7177 - - - - -] Notifying dcorch, subcloud:subcloud4 management: managed, availability:online

# dcorch updates endpoints to in-sync
2020-08-09 07:32:08.897 128090 INFO dcmanager.manager.service [req-598387ff-692e-4236-9aaa-e4ae32f521ed - - - - -] Handling update_subcloud_endpoint_status request for: subcloud4
2020-08-09 07:32:08.912 128090 INFO dcmanager.manager.subcloud_manager [req-598387ff-692e-4236-9aaa-e4ae32f521ed - - - - -] Updating subcloud:subcloud4 endpoint:identity sync:in-sync
2020-08-09 07:32:09.063 128090 INFO dcmanager.manager.service [req-8f6695b3-44fa-42c6-b267-da79ef90d5d1 - - - - -] Handling update_subcloud_endpoint_status request for: subclo...

Read more...

tags: added: stx.distcloud
Revision history for this message
Difu Hu (difuhu) wrote :

On build 2020-08-28_20-00-00, observed DC-4 subcloud1 was not in-sync even after 22 minutes.
log: https://files.starlingx.kube.cengn.ca/launchpad/1891102

[2020-08-29 23:28:29,653] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:82::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud1 host-route-add 1 mgmt0 fd01:81:: 64 fd01:82::1'

[2020-08-29 23:51:03,283] 314 DEBUG MainThread ssh.send :: Send 'dcmanager --os-username 'admin' --os-password 'Li69nux*' --os-tenant-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL subcloud list'
| 10 | subcloud1 | managed | online | complete | out-of-sync |

Revision history for this message
Bart Wensley (bartwensley) wrote :

Difu - please show the full subcloud to show what endpoing is out of sync "dcmanager subcloud show subcloud1". Also - did the subcloud go back in sync eventually?

Revision history for this message
Difu Hu (difuhu) wrote :

The log is from automation log where it didn't execute "dcmanager subcloud show subcloud1". Can the info be found from log?
And yes, the subcloud eventually went back to in-sync.

Revision history for this message
Bart Wensley (bartwensley) wrote :

Yes - the info can be found from the logs. If it isn't too much effort, would be nice if the automation could be extended to do a "dcmanager subcloud show" for the subcloud that had the failure.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium priority - the subcloud eventually goes in-sync, but is slow. Would be nice to fix in stx.5.0

tags: added: stx.5.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Bart Wensley (bartwensley)
Revision history for this message
Difu Hu (difuhu) wrote :

On build 2020-11-13_20-00-10

dcmanager subcloud show subcloud6
+-----------------------------+----------------------------+
| Field | Value |
+-----------------------------+----------------------------+
| id | 8 |
| name | subcloud6 |
| description | None |
| location | None |
| software_version | 20.06 |
| management | managed |
| availability | online |
| deploy_status | complete |
| management_subnet | fd01:15::0/64 |
| management_start_ip | fd01:15::2 |
| management_end_ip | fd01:15::11 |
| management_gateway_ip | fd01:15::1 |
| systemcontroller_gateway_ip | fd01:1::1 |
| group_id | 1 |
| created_at | 2020-11-14 18:06:34.833667 |
| updated_at | 2020-11-15 01:36:36.702453 |
| dc-cert_sync_status | in-sync |
| firmware_sync_status | unknown |
| identity_sync_status | in-sync |
| load_sync_status | unknown |
| patching_sync_status | in-sync |
| platform_sync_status | in-sync |
+-----------------------------+----------------------------+

Revision history for this message
Bart Wensley (bartwensley) wrote :

Thanks Difu. That shows that the firmware and load endpoint sync status hasn't been updated yet. That is because these are only audited every 30 minutes (for the load endpoint) and every 60 minutes (for the firmware). We have plans to improve this in stx.5.0.

Changed in starlingx:
assignee: Bart Wensley (bartwensley) → Chris Friesen (cbf123)
Revision history for this message
Chris Friesen (cbf123) wrote :

As of https://review.opendev.org/c/starlingx/distcloud/+/777208 we will trigger all of the various sub-audits when a subcloud transitions from offline to online, so this should no longer be an issue.

Changed in starlingx:
status: Triaged → Fix Released
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.