DC subcloud group restore: Keystone token expired for subclouds that were in the queue for more than an hour

Bug #2045139 reported by Victor Romano
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Victor Romano

Bug Description

Brief Description

DC subcloud group restore: Keystone token expired for subclouds that were in the line to be restored after the first 100 ones get restored

CMD: 2023-11-25T13:42:27.053 controller-0 -bash: info HISTORY: PID=2959710 UID=42425 dcmanager subcloud-backup restore --group upgrade-group --sysadmin-password Li69nux* --sysadmin-password Li69nux* --release 21.12 --with-install

Number of subclouds under testing:

dcmanager subcloud-group list-subclouds upgrade-group | grep subcloud | wc -l
106

Severity

<Major: System/Feature is usable but degraded>

Note:

Subcloud group restore with more than 100 subclouds is affected only.
The issue is only reproducible if the restore of the first 100 subclouds take more than 01h00. The likelihood to reproduce this issue for subclouds running previous release 21.12 is higher than we have for 22.12 as the restore operation for 21.12 is much slower.
The issue was reproduced twice. Here it is the other logs timestamp we got from another test run:
restore command timestamp:
2023-11-24T23:08:32.867 controller-0 -bash: info HISTORY: PID=937163 UID=42425 dcmanager subcloud-backup restore --group upgrade-group --sysadmin-password Li69nux* --sysadmin-password Li69nux* --release 21.12 --with-install

Keystone related exception:
2023-11-25 00:09:29.257 151476 INFO dcmanager.manager.subcloud_manager [-] Preparing remote install of subcloud2041, version: 21.12
2023-11-25 00:09:29.270 151476 INFO dccommon.endpoint_cache [-] The cached keystone token for RegionOne will expire soon 2023-11-25T00:08:37.000000Z
2023-11-25 00:09:31.193 151476 INFO dccommon.endpoint_cache [-] Generated Master keystone client and master token as they are expiring soon
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager [-] Failed to validate token (HTTP 404) (Request-ID: req-15e45603-6260-4f40-8231-5cb8143b29a5): keystoneauth1.exceptions.http.NotFound: Failed to validate token (HTTP 404) (Request-ID: req-15e45603-6260-4f40-8231-5cb8143b29a5)
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager Traceback (most recent call last):
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python3/dist-packages/dcmanager/manager/subcloud_manager.py", line 1452, in _run_subcloud_install
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager install = SubcloudInstall(context, subcloud.name)
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python3/dist-packages/dccommon/subcloud_install.py", line 112, in __init__
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager self.sysinv_client = SysinvClient(consts.CLOUD_0,
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python3/dist-packages/dccommon/drivers/openstack/sysinv_v1.py", line 142, in __init__
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager token = session.get_token()
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python3/dist-packages/keystoneauth1/session.py", line 1213, in get_token
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager return (self.get_auth_headers(auth) or {}).get('X-Auth-Token')
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python3/dist-packages/keystoneauth1/session.py", line 1191, in get_auth_headers
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager return auth.get_headers(self, **kwargs)
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python3/dist-packages/keystoneauth1/plugin.py", line 95, in get_headers
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager token = self.get_token(session)
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python3/dist-packages/keystoneauth1/identity/base.py", line 88, in get_token
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager return self.get_access(session).auth_token
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python3/dist-packages/keystoneauth1/identity/base.py", line 134, in get_access
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager self.auth_ref = self.get_auth_ref(session)
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python3/dist-packages/keystoneauth1/identity/generic/base.py", line 208, in get_auth_ref
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager return self._plugin.get_auth_ref(session, **kwargs)
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python3/dist-packages/keystoneauth1/identity/v3/base.py", line 187, in get_auth_ref
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager resp = session.post(token_url, json=body, headers=headers,
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python3/dist-packages/keystoneauth1/session.py", line 1139, in post
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager return self.request(url, 'POST', **kwargs)
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager File "/usr/lib/python3/dist-packages/keystoneauth1/session.py", line 976, in request
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager raise exceptions.from_response(resp, method, url)
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager keystoneauth1.exceptions.http.NotFound: Failed to validate token (HTTP 404) (Request-ID: req-15e45603-6260-4f40-8231-5cb8143b29a5)
2023-11-25 00:09:31.308 151476 ERROR dcmanager.manager.subcloud_manager
2023-11-25 00:09:31.323 151476 ERROR dcmanager.manager.subcloud_manager [-] Failed to validate token (HTTP 404) (Request-ID: req-15e45603-6260-4f40-8231-5cb8143b29a5): keystoneauth1.exceptions.http.NotFound: Failed to validate token (HTTP 404) (Request-ID: req-15e45603-6260-4f40-8231-5cb8143b29a5)
Steps to Reproduce

Deploy systemcontroller and subclouds with 21.12

Upgrade systemcontroller

Create backup for subclouds

Upgrade subcloud to 22.12

Restore subclouds back to 21.12

Expected Behavior

All subclouds should be restored successfully

Actual Behavior

The subclouds exceeding the first 100 ones, failed to have the keystone token validaded.

Reproducibility

100 %

2 out of 2 attempts

System Configuration

DC

Load info (eg: 2022-03-10_20-00-07)

 cat /etc/build.info
SW_VERSION="22.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2022-12-19_02-22-00"
SRC_BUILD_ID="38"JOB="wrcp-22.12-debian"
BUILD_BY="jenkins"
BUILD_NUMBER="50"
BUILD_HOST="yow-wrcp-lx.wrs.com"
BUILD_DATE="2022-12-19 07:22:00 +0000"

Last Pass

This issue is only reproduced when we have ALL 100 subclouds taking more than one hour to complete.

Alarms

no alarms

Test Activity

Regression Testing

Workaround

Create a subcloud group for restore operation with a maximum of 100 subclouds

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/902140

Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/902140
Committed: https://opendev.org/starlingx/distcloud/commit/37b0ef020c5ae64548be21a03f65486a0f664f58
Submitter: "Zuul (22348)"
Branch: master

commit 37b0ef020c5ae64548be21a03f65486a0f664f58
Author: Victor Romano <email address hidden>
Date: Wed Nov 29 09:42:05 2023 -0300

    Change SubcloudInstall class to use admin context

    The SubcloudInstall class have the context as a parameter, using it
    to access the keystone and sysinv clients. If the context expired,
    which can happen in a group restore with a group with more than 100
    subclouds, an invalid token is loaded from endpoint cache and the
    SysinvClient initialization fails. This commit changes the
    SubcloudInstall class to always use the admin context to allow
    installation on subclouds that have been on queue for more time than
    the context expiry time.

    Additionally, a small bug regarding install cleanup was fix. If there
    was an error while initializing the SubcloudInstall class, the
    cleanup would be executed regardless and fail, because the class
    wasn't initialized correctly. A conditional was added to only run
    cleanup if it was initialized correctly.

    Test plan:
      - Note: upgrade from stx8 to master is currently not functional,
              but since the changes were made on the installation part,
              the test could still be performed.
      - PASS: Run "dcmanager subcloud deploy install" normally and verify
              the installation completed successfully.
      - PASS: Run "dcmanager subcloud deploy install" adding a 2h sleep
              between the request and SubcloudInstall initialization and
              verify the installation completed successfully.
      - PASS: Upgrade a subcloud from stx8 to master normally and verify
              the installation part completed sucessfully.
      - PASS: Upgrade a subcloud from stx8 to master adding a 2h sleep
              between the request and SubcloudInstall initialization and
              verify the installation part completed sucessfully.

    Closes-Bug: 2045139

    Change-Id: Idcdf8faf755e655066927f543b75071afa9c43fc
    Signed-off-by: Victor Romano <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.distcloud
Changed in starlingx:
assignee: nobody → Victor Romano (vgluzrom)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/903423

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/903423
Committed: https://opendev.org/starlingx/distcloud/commit/153e2bac90cd94640eaa578c45b37d25ff09275f
Submitter: "Zuul (22348)"
Branch: master

commit 153e2bac90cd94640eaa578c45b37d25ff09275f
Author: Li Zhu <email address hidden>
Date: Mon Dec 11 18:52:57 2023 -0500

    Use keystone client cache in SubcloudInstall class

    A minor enhancement of previous commit:
    https://review.opendev.org/c/starlingx/distcloud/+/902140
    for supporting parallel subcloud installation.
    This enhancement aims to prevent the repeated creation of Keystone
    client and endpoint caches for each subcloud installation.

    Test plan:
      - PASS: Successful parallel subcloud installation without any keystone
              connection failured in a scale lab by executing the
              "dcmanager subcloud deploy install" command.
      - PASS: Successful parallel subcloud installation without any keystone
              connection failure in a scale lab by executing the
              "dcmanager subcloud add" command.

    Closes-Bug: 2045139

    Change-Id: I87cf4aca0c120d4c7c4bcc0e9162ea1372ac4c7d
    Signed-off-by: lzhu1 <email address hidden>

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.