StarlingX

Bug #2057981
Comment #1

OpenStack Infra (hudson-openstack) wrote on 2024-03-20: Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/913272
Committed: https://opendev.org/starlingx/distcloud/commit/7ad78ea2aea5b7adb1c3055d0f2cc4a708152b27
Submitter: "Zuul (22348)"
Branch: master

commit 7ad78ea2aea5b7adb1c3055d0f2cc4a708152b27
Author: Li Zhu <email address hidden>
Date: Thu Mar 14 17:57:46 2024 -0400

Allow rehome related data update when subcloud migration fails

    If the subcloud rehome_data contains an incorrect bootstrap-address in
    site A and the user migrates the corresponding peer group to site B,
    the migration fails. The subcloud then has the 'rehome-failed'
    deploy-status in site B and 'rehome-pending' deploy-status in site A.
    The user cannot update the bootstrap-address in either site because
    of the following restrictions:
    a) Primary site (site A) is not the current leader of the peer group;
    b) Update in non-primary site (site B) is not allowed.

    To fix this issue, the following changes are made:
    1. In the non-primary site, if the subcloud deploy-status is
    rehome-failed and the primary site is unavailable, updating
    the bootstrap-values and bootstrap-address will be allowed, and the PGA
    will be marked as out-of-sync.
    2. Modify audit to automatically sync the rehome_data from the
    non-primary site to the primary site if the subcloud in the
    non-primary site is managed and online and the PGA is out-of-sync.
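
The two rules above can be sketched as a pair of predicates. This is only an illustrative sketch, not the distcloud implementation: the names `SubcloudState`, `can_update_in_non_primary`, `should_sync_to_primary`, and the constants are hypothetical, and the real code may structure these checks differently.

```python
from dataclasses import dataclass

# All names here are hypothetical and exist only for illustration;
# they are not taken from the distcloud code base.
REHOME_FAILED = "rehome-failed"
OUT_OF_SYNC = "out-of-sync"
ALLOWED_FIELDS = frozenset({"bootstrap-values", "bootstrap-address"})


@dataclass
class SubcloudState:
    deploy_status: str
    managed: bool
    online: bool


def can_update_in_non_primary(subcloud, primary_site_available, requested_fields):
    """Rule 1: permit a restricted update in the non-primary site."""
    if subcloud.deploy_status != REHOME_FAILED:
        return False
    if primary_site_available:
        # While the primary site is reachable, updates belong there.
        return False
    # Only the bootstrap-related fields may be changed.
    return bool(requested_fields) and set(requested_fields) <= ALLOWED_FIELDS


def should_sync_to_primary(subcloud, pga_sync_status):
    """Rule 2: audit pushes rehome_data back to the primary site."""
    return subcloud.managed and subcloud.online and pga_sync_status == OUT_OF_SYNC
```

In this sketch, a caller that accepts an update under rule 1 would also mark the PGA as out-of-sync, which is what later makes rule 2 fire once the subcloud is managed and online.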

    Additional fix for the system_leader_id issue: When migrating an SPG
    from one site to another, if all of the subcloud rehomes fail, the
    leader id of the SPG in the target site has already been updated to
    the target site's UUID, but the leader id in the source site has not.
    The fix ensures that once the migration completes, regardless of
    whether it succeeds, the leader id in both sites is updated to the
    target UUID.
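
The intended leader id behaviour can be sketched as follows. This is a minimal illustration under assumed names: `run_migration`, the dict-based SPG records, and the `rehome_all` callable are hypothetical and do not reflect the actual distcloud code.

```python
def run_migration(source_spg, target_spg, target_uuid, rehome_all):
    """Migrate an SPG and update the leader id in both sites.

    rehome_all is a hypothetical callable that rehomes every subcloud
    and returns a list of per-subcloud success flags.
    """
    results = rehome_all()  # the migration runs to completion here

    # Reached only once the migration has completed; the update does not
    # depend on whether any individual rehome succeeded.
    target_spg["system_leader_id"] = target_uuid
    source_spg["system_leader_id"] = target_uuid

    return all(results)
```

For example, calling `run_migration` with a `rehome_all` that returns `[False, False]` reports failure but still leaves both records pointing at the target UUID.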

    Test plan:
    Pre-Steps: 1. Create the system peer from Site A to Site B
               2. Create the system peer from Site B to Site A
               3. Create the subcloud peer group in Site A
               4. Add a subcloud with an incorrect bootstrap-address
                  to the peer group
               5. Create peer group association to associate system peer
                  and subcloud peer group - Site A
               6. Check current sync status in sites A and B. Verify
                  they are 'in-sync'.
               7. Run migration for the subcloud peer group from Site B.
               8. Verify 'rehome-failed' deploy-status in both sites.
    PASS: Verify that the bootstrap-address can be updated in site B when
          site A is down, and the PGA sync status is set to out-of-sync
          in site B. Also, verify that the audit will sync the rehome_data
          to site A and change back the PGA to in-sync once the reattempt of
          migration is successful and site A is up.
    PASS: Verify that the bootstrap-values and bootstrap-address are
          the only fields that can be updated in site B when site A is down.
    PASS: Verify that the update of bootstrap-address is rejected in site B
          when site A is up.
    PASS: Verify that even if all of the subclouds in an SPG experience
          rehome failures, the system_leader_id in both sites is updated to
          the target's UUID.
    PASS: Verify that when site A is always online or recovered during
          the migration to site B, the subcloud deploy_status in both sites
          is "rehome-failed" after the migration completes. In this
          scenario, site A can migrate the subcloud back, even though it's
          still failed. However, after correcting the bootstrap-address in
          site A, the reattempt of migration in site A succeeds.

Closes-Bug: 2057981

Change-Id: I999dbf035e29950fd823e9cdb087160ce40fd4ca
Signed-off-by: lzhu1 <email address hidden>
