GeoRedundancy: No option to update bootstrap address when subcloud migration fails

Bug #2057981 reported by Li Zhu
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Li Zhu

Bug Description

Brief Description
-----------------
If the subcloud rehome_data contains an incorrect bootstrap-address on Site A and the user migrates the corresponding peer group to Site B, the migration fails. The subcloud then has the 'rehome-failed' deploy-status on Site B and the 'rehome-pending' deploy-status on Site A. At that point the user cannot update the bootstrap-address on Site B. They can update it on Site A, but the change is not synced to Site B.

Severity
--------
Major

Steps to Reproduce
------------------
1. Create the system peer from Site A to Site B
2. Create the system peer from Site B to Site A
3. Create the subcloud peer group on Site A
4. Add subcloud(s) to the peer group
5. Create a peer group association on Site A to associate the system peer with the subcloud peer group
6. Check the current sync status on Sites A and B. Verify they are 'in-sync'.
7. Update the subcloud with an incorrect bootstrap-address on Site A and then sync to Site B
8. Run the migration for the subcloud peer group from Site B
9. After the rehome fails, update the subcloud bootstrap-address on Site A and then sync to Site B; the sync is expected to fail
10. Update the subcloud bootstrap-address on Site B; the update is expected to be rejected

Expected Behavior
------------------
The user can correct the subcloud bootstrap-address either directly on the secondary site, or on the primary site followed by a sync to the secondary site.

Actual Behavior
----------------
Another migration attempt would still fail because of the incorrect bootstrap-address, and the bootstrap-address cannot be corrected on the secondary site.

Reproducibility
---------------
100%

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
master (2024-03-12)

Last Pass
---------
New test scenario

Timestamp/Logs
--------------
Site A dcmanager.log:
2024-03-14 23:15:20.962 3294 ERROR dcmanager.manager.system_peer_manager [req-e6f45fc7-88a9-4acc-a9a8-85eee0c76eb8 f972b4432e5e41cea4e6f26edd51b641 - - default default] Failed to sync subcloud(s) in the Subcloud Peer Group sc1-subcloud-peer-group:

{"subcloud1-sc1": "Ignoring update Peer Site Subcloud subcloud1-sc1 (region_name: 55900716862f4369987c12d50029ce38) as is not in secondary state."}
Site B:
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud update --bootstrap-address 10.10.10.12 subcloud1-sc1
The server could not comply with the request since it is either malformed or otherwise incorrect. Subcloud update is only allowed when its peer group priority value is 0.
ERROR (app) Unable to update subcloud subcloud1-sc1

Test Activity
-------------
Feature Testing

Workaround
----------
None

Li Zhu (lzhu1)
description: updated
Li Zhu (lzhu1)
description: updated
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/913272
Committed: https://opendev.org/starlingx/distcloud/commit/7ad78ea2aea5b7adb1c3055d0f2cc4a708152b27
Submitter: "Zuul (22348)"
Branch: master

commit 7ad78ea2aea5b7adb1c3055d0f2cc4a708152b27
Author: Li Zhu <email address hidden>
Date: Thu Mar 14 17:57:46 2024 -0400

    Allow rehome related data update when subcloud migration fails

    If the subcloud rehome_data contains an incorrect bootstrap-address in
    site A and the user migrates the corresponding peer group to site B,
    the migration would fail. Subsequently, it will have the 'rehome-failed'
    deploy-status in site B and 'rehome-pending' deploy-status in site A.
    Then the user won't be able to update the bootstrap-address in either
    site due to the following restrictions:
    a) Primary site (site A) is not the current leader of the peer group;
    b) Update in non-primary site (site B) is not allowed.

    To fix this issue, the following changes are made:
    1. In the non-primary site, if the subcloud deploy-status is
    rehome-failed and the primary site is unavailable, updating
    the bootstrap-values and bootstrap-address will be allowed, and the PGA
    will be marked as out-of-sync.
    2. Modify audit to automatically sync the rehome_data from non-primary
    site to primary site if subcloud in the non-primary site is managed and
    online and the PGA is out-of-sync.

    Additional fix for the system_leader_id issue: When migrating SPG from
    one site to another, if all of the subclouds rehome fail, the leader id
    of the SPG in the target site has already been updated to the target
    site's UUID. However, in the source site, the leader id is not updated
    to the target UUID. The fix ensures that once the migration completes,
    whether or not the subcloud rehomes succeed, the leader id in both sites
    is updated to the target UUID.

    Test plan:
    Pre-Steps: 1. Create the system peer from Site A to Site B
               2. Create System peer from Site B to Site A
               3. Create the subcloud peer group in the Site A
               4. Add a subcloud with an incorrect bootstrap-address
                  to the peer group
               5. Create peer group association to associate system peer
                  and subcloud peer group - Site A
               6. Check current sync status in sites A and B. Verify
                  they are 'in-sync'.
               7. Run migration for the subcloud peer group from Site B.
               8. Verify 'rehome-failed' deploy-status in both sites.
    PASS: Verify that the bootstrap-address can be updated in site B when
          site A is down, and the PGA sync status is set to out-of-sync
          in site B. Also, verify that the audit will sync the rehome_data
          to site A and change back the PGA to in-sync once the reattempt of
          migration is successful and site A is up.
    PASS: Verify that the bootstrap-values and bootstrap-address are
          the only fields that can be updated in site B when site A is down.
    PASS: Verify that the update of bootstrap-address was rejected in site...
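
The two changes listed in the commit message above can be read as a pair of predicates. The following is only a minimal sketch under stated assumptions, not the distcloud implementation: the names `SubcloudState`, `can_update_on_non_primary`, `audit_should_sync_to_primary`, the field names, and the string constants are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical constants; the real dcmanager values may be named differently.
REHOME_FAILED = "rehome-failed"
OUT_OF_SYNC = "out-of-sync"
# Per the commit message, only these two items may be changed on the
# non-primary site (field names here are assumptions).
ALLOWED_FIELDS = frozenset({"bootstrap_address", "bootstrap_values"})


@dataclass
class SubcloudState:
    """Illustrative view of a subcloud as seen from the non-primary site."""
    deploy_status: str
    managed: bool
    online: bool


def can_update_on_non_primary(subcloud, primary_site_available, requested_fields):
    """Change 1: allow a rehome-data update on the non-primary site only if
    the rehome failed, the primary site is unavailable, and nothing other
    than the bootstrap address/values is being changed."""
    if subcloud.deploy_status != REHOME_FAILED:
        return False
    if primary_site_available:
        return False
    requested = set(requested_fields)
    return bool(requested) and requested <= ALLOWED_FIELDS


def audit_should_sync_to_primary(subcloud, pga_sync_status):
    """Change 2: the audit pushes rehome_data back to the primary site once
    the subcloud is managed and online and the PGA is out-of-sync."""
    return subcloud.managed and subcloud.online and pga_sync_status == OUT_OF_SYNC
```

In this reading, a successful update on the non-primary site would also mark the PGA as out-of-sync, which is what later makes the audit condition true after the retried migration succeeds.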


Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/914087

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/914087
Committed: https://opendev.org/starlingx/distcloud/commit/be4062b08fcefec1d9f7f7e102cf3a59d2e7c5e4
Submitter: "Zuul (22348)"
Branch: master

commit be4062b08fcefec1d9f7f7e102cf3a59d2e7c5e4
Author: Li Zhu <email address hidden>
Date: Mon Mar 25 09:21:07 2024 -0400

    Add additional GEO-Redundancy unit tests

    Add unit tests for peer_group_audit_manager.py and subcloud
    bootstrap-address update.

    Closes-Bug: 2057981

    Change-Id: I7bbc9d26fb698bade7e955b303cdbd30f87c7776
    Signed-off-by: lzhu1 <email address hidden>

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.10.0 stx.distcloud
Changed in starlingx:
assignee: nobody → Li Zhu (lzhu1)