Route creation requests timed out

Bug #1964267 reported by John Kung
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      John Kung

Bug Description

Brief Description

About 10% of the subclouds failed a batch subcloud deployment trial in the DC lab because route creation requests timed out.

Severity

Major

Steps to Reproduce

Deploy a large number of virtual subclouds (non-Redfish) in parallel.

Expected Behavior

Batch deployment completes

Actual Behavior

About 10% of the subclouds failed at the pre-deploy-prep stage with a timeout error such as the following:

2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager [req-0c6cd884-4440-4de3-aa70-dd6595363454 2a5c12fdc5cb4bceb829feb719f5998c - - default default] Failed to create subcloud subcloud260: CommunicationError: timed out
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager Traceback (most recent call last):
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python2.7/site-packages/dcmanager/manager/subcloud_manager.py", line 327, in add_subcloud
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager 1)
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python2.7/site-packages/dccommon/drivers/openstack/sysinv_v1.py", line 256, in create_route
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager metric=metric)
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager File "/usr/lib64/python2.7/site-packages/cgtsclient/v1/route.py", line 53, in create
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager return self._create(path, new)
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager File "/usr/lib64/python2.7/site-packages/cgtsclient/common/base.py", line 51, in _create
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager _, body = self.api.json_request('POST', url, body=body)
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager File "/usr/lib64/python2.7/site-packages/cgtsclient/common/http.py", line 269, in json_request
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager method, **kwargs)
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager File "/usr/lib64/python2.7/site-packages/cgtsclient/common/http.py", line 218, in _cs_request
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager raise exceptions.CommunicationError(str(e))
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager CommunicationError: timed out
2021-09-01 20:49:54.246 1979526 ERROR dcmanager.manager.subcloud_manager

Another route creation request, issued close in time to the one for subcloud260, went through, but it took about 107 s:

sysinv 2021-09-01 20:49:54.978 1979172 INFO sysinv.api.hooks.auditor [req-6998ad34-1389-4de3-bee9-f0ace8c6c8f7 7f150088e71e47fcab2e987dd0d4a72c 2c3a5e53a1f9441dbb837ded3498dd0f] fd01:6::2 "POST /v1/routes HTTP/1.0" status: 200 len: 299 time: 107.269939899 POST: {u'prefix': 64, u'interface_uuid': u'7d53443a-bd3f-48e2-8c32-b7a02fa5dec4', u'network': u'fd01:241::', u'metric': 1, u'gateway': u'fd01:6::1'} host:[fd01:6::2]:6385 agent:Python-httplib2/0.9.2 (gzip) user: dcmanager tenant: services domain: Default
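
To spot other slow requests of this kind, the `time:` field of the sysinv auditor lines can be extracted. The following is only a minimal sketch: the regular expression is inferred from the single log line above, and the function name `slow_requests` and the 60 s threshold are illustrative assumptions, not part of sysinv.

```python
import re

# Pattern inferred from the auditor line above (an assumption, not a sysinv spec):
# "<METHOD> <PATH> HTTP/x.y" status: <code> len: <bytes> time: <seconds>
AUDIT_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" '
    r'status: (?P<status>\d+) len: \d+ time: (?P<seconds>[\d.]+)'
)


def slow_requests(lines, threshold_s=60.0):
    """Yield (method, path, status, seconds) for requests slower than threshold_s.

    Hypothetical helper; lines that do not look like auditor entries are skipped.
    """
    for line in lines:
        match = AUDIT_RE.search(line)
        if match is None:
            continue
        seconds = float(match.group("seconds"))
        if seconds > threshold_s:
            yield (match.group("method"), match.group("path"),
                   int(match.group("status")), seconds)
```

Feeding it the lines of a sysinv log file, for example `list(slow_requests(open(path)))` with `path` pointing at the log, lists the `POST /v1/routes` calls that came close to or exceeded the client timeout.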

Reproducibility

Reproducible in some DC labs.

System Configuration

Distributed cloud

Branch/Pull Time/Commit

stx6.0
BUILD_ID="2021-08-30_00-00-07"
SRC_BUILD_ID="994"

Last Pass
N/A. This is the first time we have run a batch deployment of 100 virtual subclouds.

Timestamp/Logs: See logs above.

Test Activity: Evaluation

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/832869

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/832869
Committed: https://opendev.org/starlingx/config/commit/3163318694fb4ae5136856918a6b911961988705
Submitter: "Zuul (22348)"
Branch: master

commit 3163318694fb4ae5136856918a6b911961988705
Author: John Kung <email address hidden>
Date: Wed Mar 9 16:28:30 2022 -0600

    Config scale: improve route config scalability

    As runtime puppet manifests apply can take significant time to complete,
    when there are multiple runtime puppet manifest apply required,
    the time required to enqueue and handle the manifests apply
    can overwhelm the runtime config handling in larger DC system.

    In order to handle this more efficiently, the runtime config system is
    updated to allow for filter on whether a certain runtime class is
    under apply and queue for its completion accordingly.
    Duplicate config are discarded, since the config will be
    generated with the latest hieradata.

    The route api semantic checks factor out the non-critical
    region checks to allow better concurrent processing.
    The config applied update is made more efficient by also including
    the config status update when requested.

    Test Plan:
    PASSED bootstrap and deploy SystemControllers
    PASSED stress test create multiple routes on multiple hosts
    PASSED verify created routes against ip route
    PASSED check alarms config out of date
    PASSED DC scaling add and manage large number of subclouds

    Closes-Bug: 1964267
    Signed-off-by: John Kung <email address hidden>
    Change-Id: I9d6cf49a8a2b266c74aa1635010a11ca26e839b9
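
The queueing behaviour described in the commit message (track which runtime class is under apply, queue one follow-up for its completion, discard further duplicates) can be sketched roughly as below. This is an illustrative model only, not the code from the review: the class `RuntimeConfigQueue`, its methods, and the manifest class name used in the usage note are all hypothetical.

```python
from collections import deque


class RuntimeConfigQueue:
    """Hypothetical sketch: one apply per class at a time, duplicates coalesced."""

    def __init__(self, apply_fn):
        self._apply_fn = apply_fn    # callable that starts an apply for a class
        self._in_progress = set()    # classes currently under apply
        self._pending = deque()      # classes waiting for the running apply to end

    def request(self, puppet_class):
        """Return 'applied', 'queued', or 'discarded' for this request."""
        if puppet_class in self._in_progress:
            if puppet_class in self._pending:
                # A follow-up apply is already queued; it will be generated
                # from the latest data, so this duplicate can be dropped.
                return "discarded"
            self._pending.append(puppet_class)
            return "queued"
        self._in_progress.add(puppet_class)
        self._apply_fn(puppet_class)
        return "applied"

    def complete(self, puppet_class):
        """Mark an apply as finished and start the queued follow-up, if any."""
        self._in_progress.discard(puppet_class)
        if puppet_class in self._pending:
            self._pending.remove(puppet_class)
            self.request(puppet_class)
```

In this model, three back-to-back requests for a made-up class such as `example::routes::runtime` result in one immediate apply, one queued follow-up, and one discarded duplicate, so a burst of N route creations costs at most two applies per class rather than N.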

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → John Kung (john-kung)
tags: added: stx.7.0 stx.config stx.distcloud