Orchestrated subcloud upgrade failed at unlocking host step

Bug #1965170 reported by Gabriel Silva Trevisan
This bug affects 1 person
Affects    Status        Importance  Assigned to             Milestone
StarlingX  Fix Released  Medium      Gabriel Silva Trevisan

Bug Description

Brief Description
-----------------
While performing the unlock_simplex step during a subcloud upgrade, the subcloud platform endpoints became unavailable, causing the step to fail.

Retrying the platform requests in the host unlock step may be needed to work around the instability of platform services right after data migration.

Severity
--------
Major

Steps to Reproduce
------------------
1. Prepare a distributed cloud system for orchestrated subcloud upgrade.
2. Create and apply an upgrade strategy for a large number of simplex subclouds in parallel.

Expected Behavior
------------------
The upgrade strategy completes successfully.

Actual Behavior
----------------
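The upgrade failed for one subcloud (subcloud2008 in the logs below) at the unlocking-host step because the orchestrator could not retrieve host data from the subcloud: initializing the Keystone client for that region returned HTTP 503 (Service Unavailable).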

Reproducibility
---------------
Seen once

System Configuration
--------------------
Distributed cloud

Branch/Pull Time/Commit
-----------------------
2022-01-19 18:42:01

Last Pass
---------
Not sure if an upgrade test with a large number of subclouds was previously performed for the stx/6.0 load.

Timestamp/Logs
--------------

2022-03-10 21:10:16.655 3236316 INFO dcmanager.orchestrator.states.base [req-925ded40-e63f-4167-9f2e-a2d05f85d810 - - - - -] Stage: 1, State: migrating data, Subcloud: subcloud2030, Details: Host: controller-0 is now: unlocked enabled
2022-03-10 21:10:16.667 3236316 INFO dcmanager.orchestrator.states.base [req-925ded40-e63f-4167-9f2e-a2d05f85d810 - - - - -] Stage: 1, State: migrating data, Subcloud: subcloud2030, Details: Data migration completed.
2022-03-10 21:10:17.167 3236316 ERROR dccommon.drivers.openstack.sdk_platform [req-cbd71502-0865-4e10-8d19-a2469cb7d30c - - - - -] keystone_client region subcloud2008 error: Service Unavailable (HTTP 503): ServiceUnavailable: Service Unavailable (HTTP 503)
2022-03-10 21:10:17.170 3236316 WARNING dcmanager.orchestrator.states.base [req-cbd71502-0865-4e10-8d19-a2469cb7d30c - - - - -] Failure initializing KeystoneClient for region: subcloud2008: ServiceUnavailable: Service Unavailable (HTTP 503)
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread [req-cbd71502-0865-4e10-8d19-a2469cb7d30c - - - - -] (upgrade) Failed! Stage: 1, State: unlocking controller-0, Subcloud: subcloud2008: ServiceUnavailable: Service Unavailable (HTTP 503)
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread Traceback (most recent call last):
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/dcmanager/orchestrator/orch_thread.py", line 512, in perform_state_action
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread next_state = state_operator.perform_state_action(strategy_step)
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/dcmanager/orchestrator/states/unlock_host.py", line 63, in perform_state_action
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread sysinv_client = self.get_sysinv_client(strategy_step.subcloud.name)
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/dcmanager/orchestrator/states/base.py", line 107, in get_sysinv_client
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread keystone_client = self.get_keystone_client(region_name)
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/dcmanager/orchestrator/states/base.py", line 96, in get_keystone_client
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread region_clients=['sysinv'])
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/dccommon/drivers/openstack/sdk_platform.py", line 101, in __init__
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread raise exception
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread ServiceUnavailable: Service Unavailable (HTTP 503)
2022-03-10 21:10:17.171 3236316 ERROR dcmanager.orchestrator.orch_thread

Test Activity
-------------
Feature Testing

Workaround
----------
Delete the failed strategy, then create a new one and apply it.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/834067

Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Gabriel Silva Trevisan (g-trevisan)
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/834067
Committed: https://opendev.org/starlingx/distcloud/commit/651fabec40c806eb07ecc7e9e24ac7116ba4ef97
Submitter: "Zuul (22348)"
Branch: master

commit 651fabec40c806eb07ecc7e9e24ac7116ba4ef97
Author: Gabriel Silva Trevisan <email address hidden>
Date: Wed Mar 16 17:42:22 2022 -0300

    Add retry for platform requests on host unlock

    Similarly to completing state [1], subcloud platform services may become
    temporarily unavailable during host unlock, following data migration.

    Add retry wrapper to sysinv calls when retrieving host information
    before unlock. No wrapping needed for the subsequent calls, since the
    endpoint should already be stable by then.

    Move retry constants to dcmanager constants module, to be reusable by
    other states.

    Test Plan:

    PASS:
    - Apply upgrade orchestration to a large number of simplex subclouds in
    parallel

    [1] Previous change for completing state
    https://opendev.org/starlingx/distcloud/commit/7c88723d0624f9fcea0e47aa626f335731c79ffc

    Closes-bug: 1965170
    Signed-off-by: Gabriel Silva Trevisan <email address hidden>
    Change-Id: If413525ffd0eae3edc5aabc70b75054aa1383a4e
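
The retry approach described in the commit message above can be illustrated with a minimal sketch. This is not the merged distcloud code: the ServiceUnavailable class, the call_with_retries helper, the FlakyClient stand-in and the default values for max_attempts and delay are all hypothetical names and numbers chosen for illustration. The sketch assumes that both client creation and the host lookup are wrapped together, since the traceback shows the 503 being raised while the Keystone client was being initialized.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for the HTTP 503 error seen in the traceback."""


def call_with_retries(func, max_attempts=5, delay=10.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is exhausted.

    Only ServiceUnavailable is retried; any other exception propagates
    immediately. The defaults are illustrative, not the real constants.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ServiceUnavailable as exc:
            last_error = exc
            # Do not sleep after the final failed attempt.
            if attempt < max_attempts:
                sleep(delay)
    raise last_error


class FlakyClient(object):
    """Hypothetical client that returns 503 for the first few calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def get_host(self, name):
        self.calls += 1
        if self.calls <= self.failures:
            raise ServiceUnavailable("Service Unavailable (HTTP 503)")
        return {"hostname": name, "administrative": "locked"}


if __name__ == "__main__":
    client = FlakyClient(failures=2)
    # delay=0 keeps the demonstration fast; a real caller would wait
    # between attempts to give the platform services time to recover.
    host = call_with_retries(lambda: client.get_host("controller-0"), delay=0)
    print(host["hostname"], "retrieved after", client.calls, "calls")
```

Wrapping only the first platform request matches the reasoning in the commit message: once the host information has been retrieved, the endpoint is expected to be stable, so later calls in the same step do not need the wrapper.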

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0