Unable to run DC orchestration after rehoming a subcloud that has an existing VIM strategy

Bug #2053065 reported by Gustavo Herzmann
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
| StarlingX | Fix Released | Medium | Gustavo Herzmann | |

Bug Description

Brief Description
-----------------
If a subcloud that contains an existing VIM strategy is rehomed, DC orchestration fails when it tries to apply a different strategy type, because it cannot create a new strategy while another one already exists on the subcloud.
The new system controller has no context about the existing strategy, so the strategy needs to be deleted automatically once the rehome operation is complete.

Severity
--------
<Major: System/Feature is usable but degraded>

Steps to Reproduce
------------------
1. Create a VIM strategy on the subcloud.
2. Rehome the subcloud.
3. Try to run the DC orchestrator by applying a different type of strategy.

Expected Behavior
------------------
Subcloud should not contain any existing strategies after a successful rehome;
The DC orchestrator should be able to create the new strategy without conflicts

Actual Behavior
----------------
Subcloud has an existing strategy after rehoming. DC orchestrator fails due to conflict with existing strategy.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Distributed cloud system

Branch/Pull Time/Commit
-----------------------
Master (2024-02-13)

Last Pass
---------
The behavior has always been like this.

Timestamp/Logs
--------------
$ dcmanager strategy-step list

| cloud | stage | state | details | started_at | finished_at |
| subcloud3 | 3 | failed | creating vim kube rootca update strategy: Operation failed: conflict detected. strategy already exists of type:kube-upgrade | 2024-02-05 13:12:40.429065 | 2024-02-05 13:13:02.345736 |

Test Activity
-------------
Feature Testing

Workaround
----------
Manually delete the existing VIM strategy on the subcloud using the sw-manager CLI
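
As a rough illustration of the workaround, the sketch below extracts the conflicting strategy type from the failure detail in the log above and builds the matching delete command. The helper name `delete_command_for` and the type-to-subcommand table are illustrative assumptions, and only the `kube-upgrade` case from the log is filled in. Check `sw-manager --help` on the subcloud before running anything.

```python
import re

# Assumed mapping from the VIM strategy type reported in the error to the
# sw-manager subcommand. Only the type seen in the log above is listed.
SUBCOMMANDS = {"kube-upgrade": "kube-upgrade-strategy"}


def delete_command_for(details: str) -> list[str]:
    """Return the sw-manager delete command for the strategy named in `details`."""
    match = re.search(r"strategy already exists of type:\s*([\w-]+)", details)
    if match is None:
        raise ValueError("no conflicting strategy type found in details")
    strategy_type = match.group(1)
    if strategy_type not in SUBCOMMANDS:
        raise ValueError(f"unmapped strategy type: {strategy_type}")
    return ["sw-manager", SUBCOMMANDS[strategy_type], "delete"]


# Failure detail copied from the strategy-step listing in this report.
details = (
    "creating vim kube rootca update strategy: Operation failed: "
    "conflict detected. strategy already exists of type:kube-upgrade"
)
print(" ".join(delete_command_for(details)))
```

The printed command is meant to be run on the subcloud itself, since that is where the leftover VIM strategy lives.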

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/908945
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/b223c3a2674c7c6f16860838c06730d0ba3321b4
Submitter: "Zuul (22348)"
Branch: master

commit b223c3a2674c7c6f16860838c06730d0ba3321b4
Author: Gustavo Herzmann <email address hidden>
Date: Tue Feb 13 16:15:08 2024 -0300

    Handle existing VIM strategies during subcloud rehome

    Introduce a check to prevent rehoming subclouds with VIM strategies
    in 'abort' phase. Existing management affecting alarm check covers
    other ongoing states.

    This commit also auto-deletes VIM strategies in 'applied', 'ready-to-
    apply', 'build-failed', or 'build-timeout' states after rehoming,
    avoiding conflicts during DC orchestration on the new system
    controller.

    Test Plan:
    1. PASS - Attempt to rehome a subcloud with a VIM strategy in the
              'abort' phase, verify that it fails;
    2. PASS - Rehome a subcloud containing a VIM strategy in a valid state
              (applied, ready-to-apply, build-failed or build-timeout) and
              verify that the strategy is automatically deleted;
    3. PASS - Rehome a subcloud without existing VIM strategies and
              verify that it still works as expected;
    4. PASS - Create a DC orchestrator strategy, manually delete the VIM
              strategy in the subcloud and then try to delete the DC orch.
              strategy; verify that it's deleted successfully.

    Closes-Bug: 2053065

    Change-Id: Ia11d88714235845d8e3c274196d6f809400851ea
    Signed-off-by: Gustavo Herzmann <email address hidden>
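
The decision rules described in the commit message above can be summarized in a short sketch. This is not the playbook code: the function name `rehome_action`, the returned labels, and the idea of passing the strategy's phase and state as two strings are assumptions made for illustration. Only the state names are taken from the commit message.

```python
# States the commit message says are deleted automatically after rehoming.
DELETABLE_STATES = {"applied", "ready-to-apply", "build-failed", "build-timeout"}


def rehome_action(current_phase, state):
    """Classify an existing VIM strategy. Pass state=None when none exists."""
    if state is None:
        return "proceed"  # nothing to clean up
    if current_phase == "abort":
        return "block-rehome"  # rehome is refused during the abort phase
    if state in DELETABLE_STATES:
        return "delete-after-rehome"
    # Other ongoing states are expected to be caught by the existing
    # management-affecting alarm check rather than by this logic.
    return "defer-to-alarm-check"


print(rehome_action("abort", "aborting"))
print(rehome_action("build", "ready-to-apply"))
print(rehome_action(None, None))
```

Keeping the deletable states in one set makes it easy to compare the rule against the test plan, which exercises exactly those four states.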

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.distcloud
Changed in starlingx:
assignee: nobody → Gustavo Herzmann (gherzman)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/911378
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/90ddfc82f6903ece63ada1ab60b0c6c1eed4c707
Submitter: "Zuul (22348)"
Branch: master

commit 90ddfc82f6903ece63ada1ab60b0c6c1eed4c707
Author: Gustavo Herzmann <email address hidden>
Date: Tue Mar 5 17:51:00 2024 -0300

    Add retry attempts to the strategy delete task during rehoming

    This commit adds retry attempts to the strategy delete task executed
    at the end of the rehoming playbook. The existing check ensuring the
    VIM service is 'enabled-active' is insufficient, as the service may
    still be starting (not ready) when the delete request is made.

    Test Plan:
    1. PASS - Rehome a subcloud containing a VIM strategy in a valid state
              (applied, ready-to-apply, build-failed or build-timeout) and
              verify that the strategy is automatically deleted;
    2. PASS - Introduce a 10s delay on the vim service startup function
              and repeat the previous test. Verify successful retry
              attempts leading to the eventual deletion of the strategy.

    Related commit:
    https://review.opendev.org/c/starlingx/ansible-playbooks/+/908945

    Closes-Bug: 2053065

    Change-Id: Iff5ae7d7a6da6adf23dc0aaa07d1de46a244d8b2
    Signed-off-by: Gustavo Herzmann <email address hidden>
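
The retry behavior described in the commit message above follows a common pattern: keep issuing the delete request until the service is ready to accept it. The actual change lives in the rehoming playbook; the Python sketch below only illustrates the pattern. The names `delete_with_retries` and `flaky_delete`, the retry count, and the delay are all hypothetical.

```python
import time


def delete_with_retries(delete_fn, retries=5, delay=1.0, sleep=time.sleep):
    """Call delete_fn until it succeeds; return the number of attempts used."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            delete_fn()
            return attempt
        except RuntimeError:
            if attempt == retries:
                raise  # out of attempts, surface the last failure
            sleep(delay)  # give the service time to finish starting


# Stand-in for a delete request against a VIM service that reports
# 'enabled-active' but is not ready until the third call.
calls = {"n": 0}


def flaky_delete():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("VIM not ready")


attempts = delete_with_retries(flaky_delete, delay=0.0)
print(attempts)  # prints 3
```

Injecting `sleep` as a parameter keeps the sketch testable without real waiting, which mirrors the second test-plan item where a startup delay is introduced artificially.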
