Subcloud add failure due to "Secret for certificate admin endpoint cert is not ready"

Bug #2037298 reported by Manoel Benedito Neto
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Jerry Sun

Bug Description

Brief Description
-----------------
Parallel subcloud adds are blocked by a failure to create the admin endpoint certificate.

Severity
--------
Critical

Steps to Reproduce
------------------
Deploy 250 subclouds in parallel

Expected Behavior
-----------------
250 subclouds are deployed successfully

Actual Behavior
---------------
A large number of the subclouds failed to be created due to the following error:

2023-09-06 21:53:40.671 1645420 ERROR dcmanager.manager.subcloud_manager [-] Failed to create subcloud subcloud248: Exception: Secret for certificate subcloud248-adminep-ca-certificate is not ready.
2023-09-06 21:53:40.671 1645420 ERROR dcmanager.manager.subcloud_manager Traceback (most recent call last):
2023-09-06 21:53:40.671 1645420 ERROR dcmanager.manager.subcloud_manager   File "/usr/lib/python3/dist-packages/dcmanager/manager/subcloud_manager.py", line 956, in subcloud_deploy_create
2023-09-06 21:53:40.671 1645420 ERROR dcmanager.manager.subcloud_manager     self._create_intermediate_ca_cert(payload)
2023-09-06 21:53:40.671 1645420 ERROR dcmanager.manager.subcloud_manager   File "/usr/lib/python3/dist-packages/dcmanager/manager/subcloud_manager.py", line 227, in _create_intermediate_ca_cert
2023-09-06 21:53:40.671 1645420 ERROR dcmanager.manager.subcloud_manager     raise Exception("Secret for certificate %s is not ready." % cert_name)

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Distributed Cloud

Load info (eg: 2022-03-10_20-00-07)
-----------------------------------
2023-08-24_18-00-10

Last Pass
---------
N/A

Timestamp/Logs
--------------
N/A

Alarms
------
N/A

Test Activity
-------------
Developer Testing

Workaround
----------
N/A

Changed in starlingx:
assignee: nobody → Manoel Benedito Neto (mbenedit)
Changed in starlingx:
status: New → In Progress
information type: Private Security → Public Security
information type: Public Security → Public
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Manoel Benedito Neto (mbenedit) → Jerry Sun (jerry-sun-u)
Jerry Sun (jerry-sun-u) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/895866
Committed: https://opendev.org/starlingx/distcloud/commit/c3f8d82d9b2ef3d78e59aca9b72c05e53768e7b5
Submitter: "Zuul (22348)"
Branch: master

commit c3f8d82d9b2ef3d78e59aca9b72c05e53768e7b5
Author: Manoel Benedito Neto <email address hidden>
Date: Tue Sep 19 20:16:52 2023 -0300

    Increase the timeout to retrieve subcloud cert secret

    The current 1-second delay between each of the (at most) 20 attempts
    to retrieve the subcloud's certificate secret could further aggravate
    stress scenarios because of the consecutive API requests sent to
    Kubernetes.

    This commit implements a pseudorandom time delay and an exponential
    back-off retry before each request to the Kubernetes API, to
    minimize the number of sequential requests issued to the API during
    stress scenarios involving the parallel addition of many subclouds.

    The exponential back-off retry is designed to run for a total
    maximum wait of 210s, with a maximum wait of ~38s in the last
    attempt of the loop in the worst-case scenario and a minimum wait
    of 2s in the first loop in the best-case scenario. This lets the
    wait time per request grow to match a possible stress scenario in
    the system.

    Test Plan:
    PASS: Full build, system install, bootstrap and unlock DC system w/
          unlocked enabled available status. Add a SX subcloud via the
          SystemController and wait until the deploy is complete.
          Observe that the subcloud is online and with in-sync status.
    PASS: Deploy 250 subclouds and ensure no failures due to
          certificate creation

    Closes-Bug: 2037298

    Change-Id: Idddba12cc08c98bda5ef1c44511e525d0188d048
    Signed-off-by: Manoel Benedito Neto <email address hidden>
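
For readers unfamiliar with the technique named in the commit message, the following is a minimal sketch of an exponential back-off retry with random jitter. It is not the code from the commit. The function names, the `fetch_secret` callable, and the parameter values (base 1.2, up to 1s of jitter, 20 attempts) are assumptions made for illustration. With these values the last delay is about 1.2^20 ≈ 38.3s plus jitter, but the totals are not claimed to match the figures quoted above.

```python
import random
import time


def backoff_delays(attempts=20, base=1.2, jitter=1.0, rng=random):
    """Yield one wait time in seconds per attempt: base**n plus random jitter.

    All parameter values are illustrative assumptions, not the commit's.
    """
    for n in range(1, attempts + 1):
        yield base ** n + rng.uniform(0.0, jitter)


def wait_for_secret(fetch_secret, cert_name, attempts=20, sleep=time.sleep):
    """Poll fetch_secret(cert_name) until it returns a value.

    fetch_secret is a hypothetical callable standing in for the Kubernetes
    secret lookup; it must return None while the secret is not ready.
    The delay is applied before each request, so parallel callers spread
    their requests out instead of hitting the API at the same instant.
    """
    for delay in backoff_delays(attempts):
        sleep(delay)
        secret = fetch_secret(cert_name)
        if secret is not None:
            return secret
    raise Exception("Secret for certificate %s is not ready." % cert_name)
```

The jitter term matters as much as the exponential growth here: when many subclouds are added in parallel, a purely deterministic schedule would make every worker retry at the same moments.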

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.distcloud stx.security