DC Upgrade Orchestration cannot recover from failed subcloud lock or failed upgrade activate

Bug #1924774 reported by Tee Ngo
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  High        Tee Ngo

Bug Description

Brief Description
-----------------
Unable to retry DC upgrade orchestration following a failed subcloud lock or failed upgrade activation.

Severity
--------
Critical

Steps to Reproduce
------------------
Bring up a DC system with at least one subcloud
Upgrade the system controller
On the subcloud, unmanage and shut down vim so that a host lock fails
Perform subcloud upgrade using dcmanager orchestration
The subcloud upgrade fails at the host-lock step
Restore vim service on the subcloud
Delete the failed upgrade strategy and try again by creating and applying a new one

Expected Behavior
------------------
Upgrade orchestration can be recovered and completes

Actual Behavior
----------------
Upgrade orchestration retry failed

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
April 5th StarlingX master build

Last Pass
---------
These 2 particular failed scenario was

Timestamp/Logs
--------------
Orchestrator log on the system controller:

2021-04-07 20:48:04.579 554197 ERROR dcmanager.orchestrator.sw_upgrade_orch_thread [req-2875a182-d78c-4781-aa7b-0c576ce0a811 - - - - -] Failed! Stage: 2, State: pre check, Subcloud: subcloud8: PreCheckFailedException: Subcloud: subcloud8 upgrade precheck failed: System health check failed. Please run 'system health-query' command on the subcloud for more details.
2021-04-07 20:48:04.579 554197 ERROR dcmanager.orchestrator.sw_upgrade_orch_thread Traceback (most recent call last):

On subcloud:
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list

+----------+-----------------------------+-----------------+----------+----------------------------+
| Alarm ID | Reason Text                 | Entity ID       | Severity | Time Stamp                 |
+----------+-----------------------------+-----------------+----------+----------------------------+
| 900.005  | System Upgrade in progress. | host=controller | minor    | 2021-04-12T22:13:40.073699 |
+----------+-----------------------------+-----------------+----------+----------------------------+

Test Activity
-------------
Developer Testing

Workaround
----------
None

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/786688

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/786688
Committed: https://opendev.org/starlingx/distcloud/commit/a9157c51d5562eb395815d4653d8e7d39b38920e
Submitter: "Zuul (22348)"
Branch: master

commit a9157c51d5562eb395815d4653d8e7d39b38920e
Author: Tee Ngo <email address hidden>
Date: Fri Apr 16 11:00:26 2021 -0400

    Filter out skippable alarms in precheck

    Subcloud online checks are now skipped if the subcloud is
    already in the 'migrated' state. The precheck also skips
    upgrade alarm as well as the host lock alarm if upgrade
    has started.

    In addition, the commit includes the code to handle bmc_password
    with None value which the previous commit
    (69a74499884a8a73a3b705ce074f46202c4aa278) did not handle.

    Closes-Bug: 1924774
    Change-Id: Ifdfe1e44a34c2c2561c299a82d7178dce6063daf
    Signed-off-by: Tee Ngo <email address hidden>
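
The precheck behaviour described in this commit message can be illustrated with a minimal sketch. This is not the distcloud code: the `Alarm` record, the function names, and the `HOST_LOCK_ALARM_ID` value are assumptions made for illustration. Only alarm ID 900.005 ("System Upgrade in progress.") is taken from the alarm list in this report.

```python
from collections import namedtuple

# Hypothetical alarm record; a real fault-management client returns richer objects.
Alarm = namedtuple("Alarm", ["alarm_id", "reason_text"])

UPGRADE_ALARM_ID = "900.005"    # "System Upgrade in progress." (from this report)
HOST_LOCK_ALARM_ID = "200.001"  # assumed ID for the host-locked alarm


def blocking_alarms(alarms, upgrade_started):
    """Return the alarms that should still fail the precheck.

    Once an upgrade has started, the upgrade alarm and the host lock
    alarm are expected, so they are filtered out.
    """
    skippable = set()
    if upgrade_started:
        skippable = {UPGRADE_ALARM_ID, HOST_LOCK_ALARM_ID}
    return [a for a in alarms if a.alarm_id not in skippable]


def needs_online_check(deploy_status):
    """Skip the subcloud online check once the subcloud is already 'migrated'."""
    return deploy_status != "migrated"
```

With a filter of this kind, a retry after a failed host lock would no longer be rejected solely because the previous attempt left the upgrade alarm raised.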

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

Tee, Given you marked this issue as critical, please cherrypick this change to the r/stx.5.0 release branch once it's open for submissions.

tags: added: stx.distcloud
Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
importance: Undecided → High
tags: added: stx.5.0
Ghada Khalil (gkhalil)
tags: added: stx.cherrypickneeded
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (r/stx.5.0)

Fix proposed to branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/distcloud/+/788765

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (r/stx.5.0)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/788765
Committed: https://opendev.org/starlingx/distcloud/commit/de0fef663a556b1d4977132efce5bfab79f37e1d
Submitter: "Zuul (22348)"
Branch: r/stx.5.0

commit de0fef663a556b1d4977132efce5bfab79f37e1d
Author: Tee Ngo <email address hidden>
Date: Fri Apr 16 11:00:26 2021 -0400

    Filter out skippable alarms in precheck

    Subcloud online checks are now skipped if the subcloud is
    already in the 'migrated' state. The precheck also skips
    upgrade alarm as well as the host lock alarm if upgrade
    has started.

    In addition, the commit includes the code to handle bmc_password
    with None value which the previous commit
    (69a74499884a8a73a3b705ce074f46202c4aa278) did not handle.

    Closes-Bug: 1924774
    Change-Id: Ifdfe1e44a34c2c2561c299a82d7178dce6063daf
    Signed-off-by: Tee Ngo <email address hidden>
    (cherry picked from commit a9157c51d5562eb395815d4653d8e7d39b38920e)

Ghada Khalil (gkhalil)
tags: added: in-r-stx50
removed: stx.cherrypickneeded
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298
Reason: Updated merge soon

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/796528

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/796528
Committed: https://opendev.org/starlingx/distcloud/commit/4c5344f8765b372cb84d2b1181589c16db2ae6e4
Submitter: "Zuul (22348)"
Branch: f/centos8

commit cb979811017bd193fc1f06e53bb7830fd3184859
Author: Yuxing Jiang <email address hidden>
Date: Wed Jun 9 11:11:27 2021 -0400

    Format the IP addresses in payload before adding a subcloud

    The IPv6 addresses can be represented in multiple formats. As IP
    addresses are stored as text in database, ansible inventory and
    overrides, this commit converts the IP addresses in payload to
    standard text format of IPv6 address during adding a new subcloud.

    Tested with installing and bootstrapping a new subcloud(RVMC
    configured) with the correct IPv6 address values, but with
    unrecommended upper case letters and '0'. The addresses are
    converted to standard format in database, ansible inventory and
    overrides files.

    Partial-Bug: 1931459
    Signed-off-by: Yuxing Jiang <email address hidden>
    Change-Id: I6c26e749941f1ea2597f91886ad8f7da64521f0d
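
The normalization this commit message describes can be sketched with Python's standard `ipaddress` module. The helper name `normalize_ip` is hypothetical, and the actual commit may rely on a different library; the sketch only shows that differently written forms of one IPv6 address collapse to a single text form.

```python
import ipaddress


def normalize_ip(address):
    """Return the canonical text form of an IPv4 or IPv6 address.

    For IPv6 this lower-cases the hex digits, drops leading zeros and
    compresses the longest run of zero groups to '::'.
    """
    return str(ipaddress.ip_address(address))


print(normalize_ip("FD01:0:0:0:0:0:0:2"))                       # fd01::2
print(normalize_ip("2001:0DB8:0000:0000:0000:0000:0000:0001"))  # 2001:db8::1
```

Storing only the normalized form means that later text comparisons against the database, inventory, or overrides do not depend on how the operator typed the address.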

commit 2cf5d6d5cef0808c354f7575336aec34253993b3
Author: albailey <email address hidden>
Date: Thu May 20 14:19:24 2021 -0500

    Delete existing vim strategy from subcloud during patch orch

    When dcmanager creates a patch strategy, if a subcloud has an
    existing vim patch strategy, it will attempt to re-use
    that strategy during its patching phase, which may result in an
    error.

    This commit deletes the existing vim patch strategy in
    a subcloud, if it exists, so it can be re-created.
    If the strategy cannot be deleted, orchestration fails.

    Change-Id: Id35ef26ed3ddae6d71874fc6bac11df147f72323
    Closes-Bug: 1929221
    Signed-off-by: albailey <email address hidden>

commit 9e14c83f0162549a2a94cb8bc1e73dbc4f4d4887
Author: albailey <email address hidden>
Date: Tue Jun 1 14:37:14 2021 -0500

    Adding activation retry to upgrade orchestration

    When performing an activation, the keystone endpoints may not
    be accessible in the subcloud due to the asynchronous way that
    cert-mon can trigger a restart of keystone.

    This would have occasionally resulted in the upgrade activation
    failing to be initiated, and orchestration needing to be invoked
    again to resume.

    This 'hack' adds retries and sleeps to the initial
    activation action.

    Change-Id: Ic757521dec7bdc248a51a70b5463caafe7927360
    Partial-Bug: 1927550
    Signed-off-by: albailey <email address hidden>
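
The "retries and sleeps" approach mentioned in this commit message could look roughly like the sketch below. It is an illustration only: `retry_with_sleep`, its parameters, and the default values are assumptions, not the names or timings used in distcloud.

```python
import time


def retry_with_sleep(action, max_attempts=3, sleep_seconds=10, sleep=time.sleep):
    """Call action() until it succeeds or max_attempts is reached.

    Sleeps between attempts so that a temporarily unavailable service
    (for example, keystone restarting) has time to come back. The last
    exception is re-raised if every attempt fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as exc:  # a real caller would catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                sleep(sleep_seconds)
    raise last_error
```

The `sleep` parameter is injectable so the helper can be exercised in tests without actually waiting.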

commit bb604c0a9b872efd65fa45f1e2269995818c6262
Author: Tee Ngo <email address hidden>
Date: Thu May 27 22:17:16 2021 -0400

    Fix subcloud show --detail command related issues

    If the subcloud is offline, the command stalls and eventually returns
    the "ERROR (app)" output. If the subcloud is online, the oam_floating_ip
    info is excluded from the output when the subcloud id instead of subcloud
    name is specified.

    This commit fixes both of the above issues.

    Closes-Bug: 1929893
    Change-Id: I995591368564539b0e6af185b1adba2db73e0e46
    Sign...

tags: added: in-f-centos8