fw orchestration on two subcloud-groups, group 1 updated complete but stayed out-of-sync until group 2 complete

Bug #1890952 reported by Difu Hu
This bug affects 1 person
Affects   | Status       | Importance | Assigned to | Milestone
StarlingX | Fix Released | Medium     | Al Bailey   |

Bug Description

Brief Description
-----------------
During firmware (fw) orchestration across two subcloud groups, group 1 finished updating but its firmware_sync_status stayed out-of-sync until group 2 had also finished updating.
The firmware_sync_status of both groups then became in-sync at the same time.

Severity
--------
Major

Steps to Reproduce
------------------
dcmanager subcloud-group add --name group007
dcmanager subcloud update --group=1 subcloud1
dcmanager subcloud update --group=2 subcloud3
system --os-auth-url https://[fd01:12::2]:5001/v3 --os-region-name subcloud1 host-device-label-assign controller-0 0000:b2:00.0 subcloud1=29
system --os-auth-url https://[fd01:303::2]:5001/v3 --os-region-name subcloud3 host-device-label-assign controller-0 0000:b4:00.0 subcloud3=29
system --os-region-name SystemController device-image-upload 5gldpc_1x2x25g_20ww2.3_swap_ddr4_2xrefresh-signed-ssl-csk1.bin functional
system --os-region-name SystemController device-image-apply 07588e41-bf0f-44f3-a969-fc5ffa75bf30 subcloud1=29 subcloud3=29 nonsubcloud=abc
dcmanager fw-update-strategy create
dcmanager fw-update-strategy apply

Expected Behavior
------------------
Group 1 completes its update, then group 2 starts updating.
subcloud1 (group 1) reports firmware_sync_status in-sync as soon as group 1 completes.

Actual Behavior
----------------
Group 1 completes its update, then group 2 starts updating.
subcloud1 (group 1) does not report firmware_sync_status in-sync until group 2 has also completed.
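
To put a number on the delay, the timestamps recorded under Timestamp/Logs can be compared directly. This is only a small sketch: the variable names are illustrative, and the values are copied from the `dcmanager strategy-step list` and `date` output in this report.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# finished_at values from `dcmanager strategy-step list`
subcloud1_finished = datetime.strptime("2020-08-08 23:11:05.067935", FMT)  # group 1
subcloud3_finished = datetime.strptime("2020-08-08 23:55:57.241718", FMT)  # group 2

# `date` output of the first snapshot, where subcloud1 was not yet in-sync
first_snapshot = datetime(2020, 8, 8, 23, 55, 8)

# How long subcloud1 had been complete without showing in-sync
stale_for = first_snapshot - subcloud1_finished
# How long after group 1 the second group finished
gap_between_groups = subcloud3_finished - subcloud1_finished

print(f"subcloud1 complete but not in-sync for at least {stale_for}")
print(f"group 2 finished {gap_between_groups} after group 1")
```

Both intervals come out at roughly 44 minutes, i.e. subcloud1 showed a stale status for nearly the whole duration of the group 2 update.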

Reproducibility
---------------
1/1

System Configuration
--------------------
Lab-name: DC-3

Branch/Pull Time/Commit
-----------------------
2020-08-07_20-00-00

Last Pass
---------
N/A

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ date
Sat Aug 8 23:55:08 UTC 2020
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager strategy-step list
+-----------+-------+-----------------------------+-----------------------------+----------------------------+----------------------------+
| cloud | stage | state | details | started_at | finished_at |
+-----------+-------+-----------------------------+-----------------------------+----------------------------+----------------------------+
| subcloud1 | 2 | complete | | 2020-08-08 22:26:21.267595 | 2020-08-08 23:11:05.067935 |
| subcloud3 | 3 | applying fw update strategy | apply phase is 83% complete | 2020-08-08 23:11:13.650140 | None |
+-----------+-------+-----------------------------+-----------------------------+----------------------------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud show subcloud1
+-----------------------------+----------------------------+
| Field | Value |
+-----------------------------+----------------------------+
| id | 3 |
| name | subcloud1 |
| description | None |
| location | None |
| software_version | 20.06 |
| management | managed |
| availability | online |
| deploy_status | complete |
| management_subnet | fd01:12::0/64 |
| management_start_ip | fd01:12::2 |
| management_end_ip | fd01:12::11 |
| management_gateway_ip | fd01:12::1 |
| systemcontroller_gateway_ip | fd01:11::1 |
| group_id | 1 |
| created_at | 2020-08-08 12:01:37.611617 |
| updated_at | 2020-08-08 23:09:05.223035 |
| dc-cert_sync_status | in-sync |
| firmware_sync_status | unknown |
| identity_sync_status | in-sync |
| load_sync_status | in-sync |
| patching_sync_status | in-sync |
| platform_sync_status | in-sync |
+-----------------------------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud show subcloud3
+-----------------------------+----------------------------+
| Field | Value |
+-----------------------------+----------------------------+
| id | 2 |
| name | subcloud3 |
| description | None |
| location | None |
| software_version | 20.06 |
| management | managed |
| availability | online |
| deploy_status | complete |
| management_subnet | fd01:303::0/64 |
| management_start_ip | fd01:303::2 |
| management_end_ip | fd01:303::11 |
| management_gateway_ip | fd01:303::1 |
| systemcontroller_gateway_ip | fd01:11::1 |
| group_id | 2 |
| created_at | 2020-08-08 11:37:58.039134 |
| updated_at | 2020-08-08 23:52:58.371479 |
| dc-cert_sync_status | in-sync |
| firmware_sync_status | unknown |
| identity_sync_status | in-sync |
| load_sync_status | unknown |
| patching_sync_status | unknown |
| platform_sync_status | in-sync |
+-----------------------------+----------------------------+

[sysadmin@controller-0 ~(keystone_admin)]$ date
Sat Aug 8 23:56:20 UTC 2020
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager strategy-step list
+-----------+-------+----------+---------+----------------------------+----------------------------+
| cloud | stage | state | details | started_at | finished_at |
+-----------+-------+----------+---------+----------------------------+----------------------------+
| subcloud1 | 2 | complete | | 2020-08-08 22:26:21.267595 | 2020-08-08 23:11:05.067935 |
| subcloud3 | 3 | complete | | 2020-08-08 23:11:13.650140 | 2020-08-08 23:55:57.241718 |
+-----------+-------+----------+---------+----------------------------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud show subcloud1
+-----------------------------+----------------------------+
| Field | Value |
+-----------------------------+----------------------------+
| id | 3 |
| name | subcloud1 |
| description | None |
| location | None |
| software_version | 20.06 |
| management | managed |
| availability | online |
| deploy_status | complete |
| management_subnet | fd01:12::0/64 |
| management_start_ip | fd01:12::2 |
| management_end_ip | fd01:12::11 |
| management_gateway_ip | fd01:12::1 |
| systemcontroller_gateway_ip | fd01:11::1 |
| group_id | 1 |
| created_at | 2020-08-08 12:01:37.611617 |
| updated_at | 2020-08-08 23:09:05.223035 |
| dc-cert_sync_status | in-sync |
| firmware_sync_status | in-sync |
| identity_sync_status | in-sync |
| load_sync_status | in-sync |
| patching_sync_status | in-sync |
| platform_sync_status | in-sync |
+-----------------------------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud show subcloud3
+-----------------------------+----------------------------+
| Field | Value |
+-----------------------------+----------------------------+
| id | 2 |
| name | subcloud3 |
| description | None |
| location | None |
| software_version | 20.06 |
| management | managed |
| availability | online |
| deploy_status | complete |
| management_subnet | fd01:303::0/64 |
| management_start_ip | fd01:303::2 |
| management_end_ip | fd01:303::11 |
| management_gateway_ip | fd01:303::1 |
| systemcontroller_gateway_ip | fd01:11::1 |
| group_id | 2 |
| created_at | 2020-08-08 11:37:58.039134 |
| updated_at | 2020-08-08 23:52:58.371479 |
| dc-cert_sync_status | in-sync |
| firmware_sync_status | in-sync |
| identity_sync_status | in-sync |
| load_sync_status | unknown |
| patching_sync_status | unknown |
| platform_sync_status | in-sync |
+-----------------------------+----------------------------+
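
For anyone re-running this check, the sync fields can be pulled out of the `dcmanager subcloud show` table output with a few lines of Python. This is a rough sketch, not part of dcmanager: `parse_show_table` and `sync_statuses` are hypothetical helper names, and the parser assumes the two-column `| Field | Value |` layout shown above with no `|` characters inside values.

```python
def parse_show_table(text):
    """Parse a '| Field | Value |' table into a dict, skipping borders and the header."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border rows start with '+', shell prompts with '['
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 2 or cells == ["Field", "Value"]:
            continue
        fields[cells[0]] = cells[1]
    return fields


def sync_statuses(fields):
    """Keep only the *_sync_status entries."""
    return {k: v for k, v in fields.items() if k.endswith("_sync_status")}


# A few rows copied from the first subcloud1 snapshot in this report
sample = """
+-----------------------------+----------------------------+
| Field | Value |
+-----------------------------+----------------------------+
| name | subcloud1 |
| group_id | 1 |
| firmware_sync_status | unknown |
| load_sync_status | in-sync |
+-----------------------------+----------------------------+
"""

print(sync_statuses(parse_show_table(sample)))
```

Running the same helper on the output captured before and after the second group completes makes the status transition easy to diff.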

Test Activity
-------------
Functional Testing

Revision history for this message
Difu Hu (difuhu) wrote :
Al Bailey (albailey1974) wrote :

The existing code explicitly triggers a firmware audit of all subclouds only once the strategy reaches a terminal state (Aborted, Completed, or Failed).

It does not currently re-trigger the audit when an individual subcloud or subcloud group completes, partly because triggering the audit causes all subclouds to be audited.
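
The behaviour described above can be illustrated with a toy model. This is not the dcmanager code: `ToyOrchestrator`, `audit_all` and `history` are made-up names, and the "audit" is reduced to marking updated subclouds as in-sync. It only shows why a status that is refreshed solely at strategy end stays stale for earlier groups.

```python
class ToyOrchestrator:
    """Toy model: sync status is refreshed only by an audit at strategy end."""

    def __init__(self, groups):
        self.groups = groups  # e.g. [["subcloud1"], ["subcloud3"]]
        self.firmware_sync_status = {
            sc: "out-of-sync" for group in groups for sc in group
        }
        self.updated = set()
        self.history = []  # status snapshot taken after each group finishes

    def audit_all(self):
        # Stand-in for the firmware audit: it looks at every subcloud.
        for sc in self.firmware_sync_status:
            if sc in self.updated:
                self.firmware_sync_status[sc] = "in-sync"

    def apply(self):
        for group in self.groups:
            for sc in group:
                self.updated.add(sc)  # firmware applied on this subcloud
            self.history.append(dict(self.firmware_sync_status))
        # The only audit trigger: the whole strategy reached a terminal state.
        self.audit_all()


orch = ToyOrchestrator([["subcloud1"], ["subcloud3"]])
orch.apply()
print(orch.history[0])            # status seen right after group 1 finished
print(orch.firmware_sync_status)  # status after the whole strategy finished
```

In this model subcloud1 is still reported out-of-sync after group 1 finishes and only flips once the strategy as a whole is done, which matches the symptom in this report.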

Frank Miller (sensfan22) wrote :

Marking as stx.5.0 gating as this capability is being added in stx.5.0.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0
Changed in starlingx:
assignee: nobody → Al Bailey (albailey1974)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/745717

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/745717
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=82fa2e9b2fa70712146de4ba563c91b74d31b2f4
Submitter: Zuul
Branch: master

commit 82fa2e9b2fa70712146de4ba563c91b74d31b2f4
Author: albailey <email address hidden>
Date: Tue Aug 11 11:44:02 2020 -0500

    Set subcloud firmware in-sync immediately after subcloud completes

    When a subcloud completed, it might show out-of-sync until
    the remaining subclouds were also orchestrated since that
    is when the audit is explicitly called.

    Now we update the status as each subcloud completes.

    Change-Id: I0694c7aacfb781aae1a7d22dd777fa75dff2bdfc
    Closes-Bug: 1890952
    Signed-off-by: albailey <email address hidden>
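
As a rough illustration of the approach described in the commit message (not the actual change under review 745717), the toy model below updates a subcloud's status at the moment that subcloud completes instead of waiting for an end-of-strategy audit. `ToyOrchestratorFixed` and `on_subcloud_complete` are hypothetical names.

```python
class ToyOrchestratorFixed:
    """Toy model: each subcloud is marked in-sync as soon as it completes."""

    def __init__(self, groups):
        self.groups = groups  # e.g. [["subcloud1"], ["subcloud3"]]
        self.firmware_sync_status = {
            sc: "out-of-sync" for group in groups for sc in group
        }
        self.history = []  # status snapshot taken after each group finishes

    def on_subcloud_complete(self, subcloud):
        # Per-subcloud update; no audit of the other subclouds is needed.
        self.firmware_sync_status[subcloud] = "in-sync"

    def apply(self):
        for group in self.groups:
            for sc in group:
                # ... firmware update of `sc` would happen here ...
                self.on_subcloud_complete(sc)
            self.history.append(dict(self.firmware_sync_status))


orch = ToyOrchestratorFixed([["subcloud1"], ["subcloud3"]])
orch.apply()
print(orch.history[0])  # status seen right after group 1 finished
```

Here subcloud1 already shows in-sync in the first snapshot while subcloud3 is still out-of-sync, which is the expected behaviour from the bug description.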

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.distcloud
Difu Hu (difuhu) wrote :

Verified on build 2020-06-27_18-35-20 with PATCH_0002.

tags: removed: stx.retestneeded
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298
Reason: Updated merge soon

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/796528

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/796528
Committed: https://opendev.org/starlingx/distcloud/commit/4c5344f8765b372cb84d2b1181589c16db2ae6e4
Submitter: "Zuul (22348)"
Branch: f/centos8

commit cb979811017bd193fc1f06e53bb7830fd3184859
Author: Yuxing Jiang <email address hidden>
Date: Wed Jun 9 11:11:27 2021 -0400

    Format the IP addresses in payload before adding a subcloud

    The IPv6 addresses can be represented in multiple formats. As IP
    addresses are stored as text in database, ansible inventory and
    overrides, this commit converts the IP addresses in payload to
    standard text format of IPv6 address during adding a new subcloud.

    Tested with installing and bootstrapping a new subcloud(RVMC
    configured) with the correct IPv6 address values, but with
    unrecommended upper case letters and '0'. The addresses are
    converted to standard format in database, ansible inventory and
    overrides files.

    Partial-Bug: 1931459
    Signed-off-by: Yuxing Jiang <email address hidden>
    Change-Id: I6c26e749941f1ea2597f91886ad8f7da64521f0d
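
For readers unfamiliar with the problem this commit addresses: the same IPv6 address can be written in several textual forms, so comparing stored strings fails unless they are normalized first. The sketch below uses Python's standard `ipaddress` module to show the idea; whether the commit itself uses this module or another library is not stated here, and `normalize_ip` is a hypothetical helper name.

```python
import ipaddress


def normalize_ip(text):
    """Return the canonical text form: lowercase, zero-compressed for IPv6."""
    return str(ipaddress.ip_address(text))


# Upper case letters and padded zeros, as mentioned in the commit message
print(normalize_ip("FD01:0012:0000::0002"))
# Fully expanded form of a subcloud3 management address from this report
print(normalize_ip("fd01:303:0:0:0:0:0:2"))
```

Both inputs normalize to the compact lowercase form, so they compare equal to the values stored as text.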

commit 2cf5d6d5cef0808c354f7575336aec34253993b3
Author: albailey <email address hidden>
Date: Thu May 20 14:19:24 2021 -0500

    Delete existing vim strategy from subcloud during patch orch

    When dcmanager creates a patch strategy, if a subcloud has an
    existing vim patch strategy, it will attempt to re-use
    that strategy during its patching phase, which may result in an
    error.

    This commit deletes the existing vim patch strategy in
    a subcloud, if it exists, so it can be re-created.
    If the strategy cannot be deleted, orchestration fails.

    Change-Id: Id35ef26ed3ddae6d71874fc6bac11df147f72323
    Closes-Bug: 1929221
    Signed-off-by: albailey <email address hidden>

commit 9e14c83f0162549a2a94cb8bc1e73dbc4f4d4887
Author: albailey <email address hidden>
Date: Tue Jun 1 14:37:14 2021 -0500

    Adding activation retry to upgrade orchestration

    When performing an activation, the keystone endpoints may not
    be accessible in the subcloud due to the asynchronous way that
    cert-mon can trigger a restart of keystone.

    This would have occasionally resulted in the upgrade activation
    failing to be initiated, and orchestration needing to be invoked
    again to resume.

    This 'hack' adds retries and sleeps to the initial
    activation action.

    Change-Id: Ic757521dec7bdc248a51a70b5463caafe7927360
    Partial-Bug: 1927550
    Signed-off-by: albailey <email address hidden>

commit bb604c0a9b872efd65fa45f1e2269995818c6262
Author: Tee Ngo <email address hidden>
Date: Thu May 27 22:17:16 2021 -0400

    Fix subcloud show --detail command related issues

    If the subcloud is offline, the command stalls and eventually returns
    the "ERROR (app)" output. If the subcloud is online, the oam_floating_ip
    info is excluded from the output when the subcloud id instead of subcloud
    name is specified.

    This commit fixes both of the above issues.

    Closes-Bug: 1929893
    Change-Id: I995591368564539b0e6af185b1adba2db73e0e46
    Sign...

tags: added: in-f-centos8