Unlock failed during distributed cloud orchestrated upgrade

Bug #1914836 reported by Adriano Oliveira
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
| StarlingX | Fix Released | Medium | Adriano Oliveira | |

Bug Description

Brief Description
-----------------
During distributed cloud orchestrated upgrade, worker node unlock failed.
Output indicates sriov_numvfs configuration might need more time to be applied.

Severity
--------
Major

Steps to Reproduce
------------------
Follow upgrade procedure as per upgrade orchestration.
The issue is seen when orchestration attempts to unlock worker-0.

Expected Behavior
------------------
No failure on host unlock on any node during upgrade orchestration.

Actual Behavior
----------------
Unlock of worker-0 fails.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
stx4.0 as of "2020-06-27_18-35-20"

Last Pass
---------

Timestamp/Logs
--------------

+----------+-------------+-----------+----------+------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------+-----------+----------+------------+
| 900.203 | Software upgrade auto-apply failed | orchestration=sw-upgrade | critical | 2020-07-02T18:24:53.413091 |
| 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=04a34a69-8c18-4494-a97f-5035682d7427 | warning | 2020-07-02T18:11:32.397315 |
| 200.001 | worker-0 was administratively locked to take it out-of-service. | host=worker-0 | warning | 2020-07-02T18:10:53.991463 |
| 750.006 | A configuration change requires a reapply of the oidc-auth-apps application. | k8s_application=oidc-auth-apps | warning | 2020-07-02T17:31:13.243285 |
| 750.006 | A configuration change requires a reapply of the platform-integ-apps application. | k8s_application=platform-integ-apps | warning | 2020-07-02T17:31:13.059851 |
| 900.005 | System Upgrade in progress. | host=controller | minor | 2020-07-02T17:30:08.041908 |
+----------+-------------+-----------+----------+------------+

[2020-11-16 23:55:01,419] 314 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'
[2020-11-16 23:55:07,956] 436 DEBUG MainThread ssh.expect :: Output:
Expecting number of interface sriov_numvfs=32. Please wait a few minutes for inventory update and retry host-unlock.

[2020-11-16 23:56:15,366] 314 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'
[2020-11-16 23:56:24,457] 436 DEBUG MainThread ssh.expect :: Output:
+-----------------------+--------------------------------------------+
| Property | Value |
+-----------------------+--------------------------------------------+
| action | none |
| administrative | locked |
| availability | online |
| bm_ip | None |
| bm_type | none |
| bm_username | None |
| boot_device | /dev/disk/by-path/pci-0000:00:1f.2-ata-5.0 |
| capabilities | {u'stor_function': u'monitor'} |
| clock_synchronization | ntp |
| config_applied | 1d1484c5-dd15-49b3-ab87-0ed0fc4c4a3d |
| config_status | None |
| config_target | 1d1484c5-dd15-49b3-ab87-0ed0fc4c4a3d |
| console | ttyS0,115200n8 |
| created_at | 2020-11-16T14:59:44.949842+00:00 |
| device_image_update | None |
| hostname | controller-0 |
| id | 1 |
| install_output | text |
| install_state | completed |
| install_state_info | None |
| inv_state | inventoried |
| invprovision | provisioned |
| location | {} |
| mgmt_ip | abcd:204::2 |
| mgmt_mac | 3c:fd:fe:a0:16:78 |
| operational | disabled |
| personality | controller |
| reboot_needed | False |
| reserved | False |
| rootfs_device | /dev/disk/by-path/pci-0000:00:1f.2-ata-5.0 |
| serialid | None |
| software_load | stx 4.0 |
| subfunction_avail | online |
| subfunction_oper | disabled |
| subfunctions | controller,worker,lowlatency |
| task | Unlocking |
| tboot | false |
| ttys_dcd | None |
| updated_at | 2020-11-16T23:55:38.864171+00:00 |
| uptime | 111 |
| uuid | 20167fc8-c125-4b72-aace-fc10bb8de147 |
| vim_progress_status | services-disabled |
+-----------------------+--------------------------------------------+

Test Activity
-------------
Regression Testing

Workaround
----------
Wait a couple of minutes and try to unlock the node again.
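
The workaround can be automated as a bounded retry loop. The following is a minimal sketch, not the fix that was merged for this bug: the function name, the retry count, and the delay are assumptions chosen for illustration, and the only string taken from this report is the rejection message quoted in the log above. The caller supplies `run_unlock`, a hypothetical callable that issues the unlock and returns the command output as text.

```python
import time
from typing import Callable

# Substring of the rejection message quoted in the log in this report.
RETRYABLE_MARKER = (
    "Please wait a few minutes for inventory update and retry host-unlock"
)


def unlock_with_retry(
    run_unlock: Callable[[], str],
    attempts: int = 5,            # assumed value, not from the report
    delay_seconds: float = 60.0,  # assumed value, not from the report
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call run_unlock until its output no longer asks the caller to retry.

    Returns the first output that does not contain the retryable marker.
    Raises TimeoutError if every attempt is rejected.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    output = ""
    for attempt in range(1, attempts + 1):
        output = run_unlock()
        if RETRYABLE_MARKER not in output:
            return output
        # Do not sleep after the final attempt; fail right away instead.
        if attempt < attempts:
            sleep(delay_seconds)
    raise TimeoutError(
        f"host-unlock still rejected after {attempts} attempts: {output.strip()}"
    )
```

In practice `run_unlock` might wrap the `system host-unlock <hostname>` command shown in the log. Passing `sleep` as a parameter keeps the loop testable without real delays.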

Adriano Oliveira (aoliveir) wrote :

Issue initially seen in worker node, but also reproduced in controller node.

description: updated
description: updated
Changed in starlingx:
assignee: nobody → Adriano Oliveira (aoliveir)
status: New → In Progress
description: updated
description: updated
Ghada Khalil (gkhalil) wrote :

stx.5.0 / medium - issue affecting the upgrade operations

tags: added: stx.5.0 stx.networking stx.update
Changed in starlingx:
importance: Undecided → Medium
Al Bailey (albailey1974) wrote :
Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/792239

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298
Reason: Updated merge soon

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nfv (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/792239

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/796295

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/796327

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nfv (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/nfv/+/796295

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/nfv/+/796327
Committed: https://opendev.org/starlingx/nfv/commit/96fa4281d73e701e58388228c8e8e85491785c38
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 73c683d5337beff6062b40f011f3b775f3c70107
Author: Eric MacDonald <email address hidden>
Date: Fri May 21 17:25:38 2021 -0400

    Update fw-update-strategy steps to load wait_time from_dict

    The sw-manager fw-update-strategy feature is seen
    to fail in a traceback.

    The __wait_time member of the FwUpdateHostsStep and
    FwUpdateAbortHostsStep objects are not de-serialized
    from the DB using the ‘from_dict’ methods. This means
    it does not run the ‘init’ method for those classes,
    but instead attempts to re-constitute the object
    directly which can lead to an exception\traceback.

    This update adds the _wait_time member to each of these
    fw-update-strategy class objects' 'from_dict' function.

    This update also removes another object member, this one
    currently unused, that would also not be de-serialized
    if it were to be put to use as is in the future.

    Test Plan:

    PASS: Verify end-to-end orchestrated fw update (x2)

    Closes-Bug: 1929251
    Change-Id: I4540d1712f4dfee74e592c4f3ebce9c7cc913ab2
    Signed-off-by: Eric MacDonald <email address hidden>
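
The failure mode described in this commit message can be illustrated with a small, self-contained sketch. `WaitStep` below is a hypothetical class written for this example and is not the actual nfv `FwUpdateHostsStep` code; it only assumes, as the message states, that `from_dict` rebuilds the object without running the initializer, so each member has to be restored explicitly.

```python
class WaitStep:
    """Hypothetical strategy step used only to illustrate the pattern."""

    def __init__(self, host_names, wait_time=60):
        self._host_names = list(host_names)
        self._wait_time = wait_time

    def as_dict(self):
        # Serialize every member that from_dict must be able to restore.
        return {"host_names": self._host_names, "wait_time": self._wait_time}

    @classmethod
    def from_dict(cls, data):
        # Build the object without running __init__, so no member is set
        # unless it is assigned here.
        step = cls.__new__(cls)
        step._host_names = list(data["host_names"])
        # Without this assignment the member would be missing and the first
        # read of self._wait_time would raise AttributeError.
        step._wait_time = data.get("wait_time", 60)
        return step

    def remaining(self, elapsed):
        # Reads _wait_time, so it depends on from_dict having restored it.
        return max(self._wait_time - elapsed, 0)
```

Using `data.get` with a default is a design choice of this sketch: it lets a record written before the member existed still load.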

commit 5ff24cf13f9d8cacab9ec15ff193fc8c819d31f4
Author: albailey <email address hidden>
Date: Fri May 21 17:51:38 2021 -0500

    Specify the nodeset for zuul jobs

    The py2.7 jobs need to specify xenial
    Changed py37 to py36 and specify bionic.

    The un-specified python3 jobs work fine on either
    focal or bionic.

    zuul is not setup to trigger off code changes in this repo
    so no source code changes are required to trigger the zuul
    jobs

    Partial-Bug: 1928978
    Signed-off-by: albailey <email address hidden>
    Change-Id: Iab9c8727a0f16fa7ff02c20ca3bec5622abe7bd7

commit 98d66c7f3bc46e1a990907db1c8f498f9841c885
Author: albailey <email address hidden>
Date: Thu May 6 12:03:15 2021 -0500

    Fix swact issue when deserializing an old patch strategy

    If a patch strategy in a previous release is de-serialized
    in the vim running a load that contains this commit
    https://review.opendev.org/c/starlingx/nfv/+/780310

    the vim would fail to startup due to key errors as it
    expected fields that did not exist in the previous release.

    Closes-Bug: 1927526
    Signed-off-by: albailey <email address hidden>
    Change-Id: Ia72463feb50f7d6a2491242ec865f7c854c75419

commit e5856549e51f10ae6818ec1d0ec43568225e9bd9
Author: albailey <email address hidden>
Date: Thu May 6 12:46:29 2021 -0500

    Increase the patching apply_patch REST API timeout

    During a kubernetes upgrade orchestration, the kubernetes
    patch needs to be applied. The default timeout was 20 seconds
    but a lab took 24 seconds.

    This update increases the timeout for that API call.

    Closes-Bug: 1927532
    Signed-off-by: albailey <email address hidden>
    Change-Id: I63a6c5616f6abf7a5b6879e5ebd458a8ecc52ba7

commit 4ffec1...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/796528

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/796528
Committed: https://opendev.org/starlingx/distcloud/commit/4c5344f8765b372cb84d2b1181589c16db2ae6e4
Submitter: "Zuul (22348)"
Branch: f/centos8

commit cb979811017bd193fc1f06e53bb7830fd3184859
Author: Yuxing Jiang <email address hidden>
Date: Wed Jun 9 11:11:27 2021 -0400

    Format the IP addresses in payload before adding a subcloud

    The IPv6 addresses can be represented in multiple formats. As IP
    addresses are stored as text in database, ansible inventory and
    overrides, this commit converts the IP addresses in payload to
    standard text format of IPv6 address during adding a new subcloud.

    Tested with installing and bootstrapping a new subcloud(RVMC
    configured) with the correct IPv6 address values, but with
    unrecommended upper case letters and '0'. The addresses are
    converted to standard format in database, ansible inventory and
    overrides files.

    Partial-Bug: 1931459
    Signed-off-by: Yuxing Jiang <email address hidden>
    Change-Id: I6c26e749941f1ea2597f91886ad8f7da64521f0d
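
The kind of normalization this commit message describes can be sketched with Python's standard `ipaddress` module. This is an illustration only and not the distcloud change itself: the helper name `normalize_ip` is an assumption, and the example address is simply an expanded, upper-case spelling of the `mgmt_ip` value shown earlier in this report.

```python
import ipaddress


def normalize_ip(value: str) -> str:
    """Return the canonical text form of an IPv4 or IPv6 address.

    For IPv6 this lower-cases the hex digits, drops leading zeros in each
    group, and compresses the longest run of zero groups to '::'.
    Raises ValueError if the input is not a valid IP address.
    """
    return str(ipaddress.ip_address(value))


# An expanded, upper-case IPv6 address and its canonical form compare equal
# once both have been normalized.
expanded = "ABCD:0204:0000:0000:0000:0000:0000:0002"
assert normalize_ip(expanded) == normalize_ip("abcd:204::2")
```

Normalizing before the value is stored means later text comparisons against the database, inventory, or override files do not depend on how the operator typed the address.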

commit 2cf5d6d5cef0808c354f7575336aec34253993b3
Author: albailey <email address hidden>
Date: Thu May 20 14:19:24 2021 -0500

    Delete existing vim strategy from subcloud during patch orch

    When dcmanager creates a patch strategy, if a subcloud has an
    existing vim patch strategy, it will attempt to re-use
    that strategy during its patching phase, which may result in an
    error.

    This commit deletes the existing vim patch strategy in
    a subcloud, if it exists, so it can be re-created.
    If the strategy cannot be deleted, orchestration fails.

    Change-Id: Id35ef26ed3ddae6d71874fc6bac11df147f72323
    Closes-Bug: 1929221
    Signed-off-by: albailey <email address hidden>

commit 9e14c83f0162549a2a94cb8bc1e73dbc4f4d4887
Author: albailey <email address hidden>
Date: Tue Jun 1 14:37:14 2021 -0500

    Adding activation retry to upgrade orchestration

    When performing an activation, the keystone endpoints may not
    be accessible in the subcloud due to the asynchronous way that
    cert-mon can trigger a restart of keystone.

    This would have occasionally resulted in the upgrade activation
    failing to be initiated, and orchestration needing to be invoked
    again to resume.

    This 'hack' adds retries and sleeps to the initial
    activation action.

    Change-Id: Ic757521dec7bdc248a51a70b5463caafe7927360
    Partial-Bug: 1927550
    Signed-off-by: albailey <email address hidden>

commit bb604c0a9b872efd65fa45f1e2269995818c6262
Author: Tee Ngo <email address hidden>
Date: Thu May 27 22:17:16 2021 -0400

    Fix subcloud show --detail command related issues

    If the subcloud is offline, the command stalls and eventually returns
    the "ERROR (app)" output. If the subcloud is online, the oam_floating_ip
    info is excluded from the output when the subcloud id instead of subcloud
    name is specified.

    This commit fixes both of the above issues.

    Closes-Bug: 1929893
    Change-Id: I995591368564539b0e6af185b1adba2db73e0e46
    Sign...
