Backup & Restore: Distributed Cloud System (AIO-DX) active controller restore fails - Identified as SIMPLEX

Bug #1873617 reported by Senthil Mukundakumar
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      Dan Voiculeasa

Bug Description

Brief Description
-----------------

In a Distributed Cloud configuration, restoring the active controller fails because the system mode is recognized as SIMPLEX instead of duplex.

E TASK [bootstrap/validate-config : Validate system type and system mode if distributed cloud role is system controller] ***********************************************************************************************************************************
E fatal: [localhost]: FAILED! => {"changed": false, "msg": "A simplex All-in-one controller cannot be configured as Distributed Cloud System Controller"}
E

localhost.yml file before restore:

system_mode: duplex
distributed_cloud_role: systemcontroller

management_start_address: abcd:204::2
management_end_address: abcd:204::ffff
management_subnet: abcd:204::/64
management_multicast_subnet: ff05::1b:0/124
...

Severity
--------
Critical: Unable to restore active controller in DC system

Steps to Reproduce
------------------
1. Bring up the Distributed cloud system (AIO-DX)
2. Backup the system using ansible locally
3. Re-install the controller with the same load
4. Restore the active controller from backup file
5. Unlock active controller

Expected Behavior
------------------
The active controller should be successfully restored and become active

Actual Behavior
----------------
Active controller failed to restore

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIO-DX(Distributed System) - r430_3_4

Logs
----
https://files.starlingx.kube.cengn.ca/launchpad/1873617

Branch/Pull Time/Commit
-----------------------
 BUILD_ID="2020-04-17_00-10-00"

Test Activity
-------------
Feature Testing

tags: added: stx.retestneeded
description: updated
Yang Liu (yliu12)
description: updated
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - specific B&R failure related to distcloud

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Dan Voiculeasa (dvoicule)
tags: added: stx. stx.update
tags: added: stx.4.0
removed: stx.
tags: added: stx.distcloud
Dan Voiculeasa (dvoicule) wrote :

`system_mode: duplex` is missing from the localhost.yml generated for the backup
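
The diagnosis above can be illustrated with a minimal Python sketch. It is not the playbook's code: the function name and the DEFAULTS table are hypothetical, and it assumes that an absent system_mode falls back to "simplex", which is consistent with the error message in the description.

```python
# Hypothetical defaults; the assumption is that an absent system_mode
# falls back to "simplex", which matches the error in this report.
DEFAULTS = {"system_mode": "simplex", "distributed_cloud_role": "none"}


def validate_system_controller(overrides):
    """Return the effective config, or raise ValueError like the failed task."""
    config = {**DEFAULTS, **overrides}
    if (config["distributed_cloud_role"] == "systemcontroller"
            and config["system_mode"] == "simplex"):
        raise ValueError(
            "A simplex All-in-one controller cannot be configured as "
            "Distributed Cloud System Controller")
    return config


# Overrides as generated for the backup: system_mode is absent.
backup_overrides = {"distributed_cloud_role": "systemcontroller"}
# Overrides as they looked before the restore.
original_overrides = {"system_mode": "duplex",
                      "distributed_cloud_role": "systemcontroller"}

try:
    validate_system_controller(backup_overrides)
    backup_result = "ok"
except ValueError as err:
    backup_result = str(err)

original_result = validate_system_controller(original_overrides)["system_mode"]
print(backup_result)
print(original_result)
```

With the key missing, the fallback value is what gets validated, so a duplex system controller is rejected as if it were simplex.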

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/721611

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/721611
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=4ccb11cb4019734e424362d677afb00dd6ecc4b6
Submitter: Zuul
Branch: master

commit 4ccb11cb4019734e424362d677afb00dd6ecc4b6
Author: Dan Voiculeasa <email address hidden>
Date: Tue Apr 21 11:47:37 2020 +0300

    Improve host-overrides

    Add missing variables for DC.

    Central+Subcloud:
    system_mode
    location
    description

    Subcloud:
    region_config
    region_name
    system_controller_oam_subnet
    system_controller_oam_floating_address
    system_controller_subnet
    system_controller_floating_address

    Partial-Bug: 1870389
    Closes-Bug: 1873617
    Change-Id: Ieb12ffc0ad769dd6ca22eb4c15f9d6d55778fd4b
    Signed-off-by: Dan Voiculeasa <email address hidden>
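
As a rough aid for verifying a backup, the variables named in this commit message can be checked against a generated localhost.yml. The sketch below is an illustration only: the helper names are hypothetical, and the key scan is a simplification that reads top-level "key: value" lines rather than parsing YAML properly.

```python
# Keys named in the commit message above, grouped by the role they apply to.
COMMON_KEYS = ["system_mode", "location", "description"]
SUBCLOUD_KEYS = [
    "region_config", "region_name",
    "system_controller_oam_subnet", "system_controller_oam_floating_address",
    "system_controller_subnet", "system_controller_floating_address",
]


def top_level_keys(text):
    """Collect top-level 'key: value' names; a simplification, not a YAML parser."""
    keys = set()
    for line in text.splitlines():
        # Skip blanks, indented lines, comments, list items, and non-mappings.
        if not line or line[0] in " \t#-" or ":" not in line:
            continue
        keys.add(line.split(":", 1)[0].strip())
    return keys


def missing_keys(text, is_subcloud=False):
    """List the expected keys that do not appear in the overrides text."""
    expected = COMMON_KEYS + (SUBCLOUD_KEYS if is_subcloud else [])
    present = top_level_keys(text)
    return [key for key in expected if key not in present]


# Illustrative overrides text resembling the excerpt in the description.
sample = (
    "distributed_cloud_role: systemcontroller\n"
    "management_subnet: abcd:204::/64\n"
)
print(missing_keys(sample))
```

For a subcloud backup, calling missing_keys(text, is_subcloud=True) also checks the subcloud-only variables.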

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/723849

Senthil Mukundakumar (smukunda) wrote :

Verified in r430_3_4 using 2020-04-27_20-00-00

Senthil Mukundakumar (smukunda) wrote :

The DC system passed this failure point and hit a new issue, 1875664

tags: removed: stx.retestneeded
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/723849
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=acd84841d201f1d5777edd2996086732cb3a3104
Submitter: Zuul
Branch: master

commit acd84841d201f1d5777edd2996086732cb3a3104
Author: Dan Voiculeasa <email address hidden>
Date: Thu Apr 23 17:37:23 2020 +0300

    Fix SystemController filesystem at restore

    The filesystem `dc-vault` is created at unlock.
    It doesn't exist at restore time to be resized.
    It will be correctly sized during unlock.

    It is not mounted into /dev/cgts-vg/dc-vault-lv.

    Closes-Bug: 1873617
    Change-Id: Ia2748756eaa8109065af1848374cc058c447910e
    Signed-off-by: Dan Voiculeasa <email address hidden>
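
The idea in this commit message, skipping a filesystem that does not exist yet at restore time, can be sketched as follows. This is not the playbook's implementation: the function name, the CREATED_AT_UNLOCK set, and the sample names and sizes are all hypothetical; only dc-vault comes from the commit message.

```python
# Hypothetical: filesystems assumed to be created at unlock rather than at
# restore time. Only dc-vault is named in the commit message above.
CREATED_AT_UNLOCK = {"dc-vault"}


def filesystems_to_resize(backed_up_sizes):
    """Drop entries that cannot be resized yet because they do not exist."""
    return {name: size for name, size in backed_up_sizes.items()
            if name not in CREATED_AT_UNLOCK}


# Illustrative names and sizes (GiB); not taken from the reported system.
sizes = {"backup": 50, "database": 10, "dc-vault": 15}
resizable = filesystems_to_resize(sizes)
print(sorted(resizable))
```

Under this assumption the skipped filesystem is left alone during restore and gets its size when the controller is unlocked.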

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729809

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729812

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/729812
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=539d476456277c22d0dcbc3cbbc832e623242264
Submitter: Zuul
Branch: f/centos8

commit 320cc40de8518787c2be234d7fdf88ec0a462df2
Author: Don Penney <email address hidden>
Date: Wed May 13 13:06:11 2020 -0400

    Add auto-versioning to starlingx/config packages

    This update makes use of the PKG_GITREVCOUNT variable to auto-version
    the packages in this repo.

    Change-Id: I3a2c8caeb4b4647608978b1f2ccfcf0661508803
    Depends-On: https://review.opendev.org/727837
    Story: 2006166
    Task: 39766
    Signed-off-by: Don Penney <email address hidden>

commit d9f2aea0fb228ed69eb9c9262e29041eedabc15d
Author: Sharath Kumar K <email address hidden>
Date: Wed Apr 22 16:22:22 2020 +0200

    De-branding in starlingx/config: CGCS -> StarlingX

    1. Rename CGCS to StarlingX for .spec files

    Test:
    After the de-brand change, bootimage.iso has been built in the flock
    Layer and installed on the dev machine to validate the changes.

    Please note, doing de-brand changes in batches, this is batch9 changes.

    Story: 2006387
    Task: 39524

    Change-Id: Ia1fe0f2baafb78c974551100f16e6a7d99882f15
    Signed-off-by: Sharath Kumar K <email address hidden>

    De-branding in starlingx/config: CGCS -> StarlingX

    1. Rename CGCS to StarlingX for .spec file
    2. Rename TIS to StarlingX for .service files

    Test:
    After the de-brand change, bootimage.iso has been built in the flock
    Layer and installed on the dev machine to validate the changes.

    Please note, doing de-brand changes in batches, this is batch10 changes.

    Story: 2006387
    Task: 36202

    Change-Id: I404ce0da2621495175ad31489e9ad6f7b0211e26
    Signed-off-by: Sharath Kumar K <email address hidden>

commit d141e954fa6bbf688929ec90d1b6604a97792c43
Author: Teresa Ho <email address hidden>
Date: Tue Mar 31 10:08:57 2020 -0400

    Sysinv extensions for FPGA support

    This update adds cli and restapi to support FPGA device
    programming.

    CLI commands:
    system device-image-apply
    system device-image-create
    system device-image-delete
    system device-image-list
    system device-image-remove
    system device-image-show
    system device-image-state-list
    system device-label-list
    system host-device-image-update
    system host-device-image-update-abort
    system host-device-label-assign
    system host-device-label-list
    system host-device-label-remove

    Story: 2006740
    Task: 39498

    Change-Id: I556c2e7a51b3931b5a66ab27b67f51e3a8aebd9f
    Signed-off-by: Teresa Ho <email address hidden>

commit 491cca42ed854d2cb3ee3646b93c56a4f45f563c
Author: Elena Taivan <email address hidden>
Date: Wed Apr 29 11:25:26 2020 +0000

    Qcow2 conversion to raw can be done using 'image-conversion' filesystem

    1. Conversion filesystem can be added before/after
       stx-openstack is applied
    2. If conversion filesystem is added after stx-openstack
       is applied, changes to stx-openstack will only take effec...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/729809
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=73027425d4501a6b7785e91024c9e8ddbc03115d
Submitter: Zuul
Branch: f/centos8

commit 55c9afd075194f7669fa2a87e546f61034679b04
Author: Dan Voiculeasa <email address hidden>
Date: Wed May 13 14:19:52 2020 +0300

    Restore: disconnect etcd from ceph

    At the moment etcd is restored only if ceph data is kept.
    Etcd should be restored regardless if ceph data is kept or wiped.

    Story: 2006770
    Task 39751
    Change-Id: I9dfb1be0a83c3fdc5f1b29cbb974c5e0e2236ad3
    Signed-off-by: Dan Voiculeasa <email address hidden>

commit 003ddff574c74adf11cf8e4758e93ba0eed45a6a
Author: Don Penney <email address hidden>
Date: Fri May 8 11:35:58 2020 -0400

    Add playbook for updating static images

    This commit introduces a new playbook, upgrade-static-images.yml, used
    for downloading updating images and pushing to the local registry.

    Change-Id: I8884440261a5a4e27b40398e5a75c9d03b09d4ba
    Story: 2006781
    Task: 39706
    Signed-off-by: Don Penney <email address hidden>

commit 26fd273cf5175ba4bdd31d6b6b777814f1a6c860
Author: Matt Peters <email address hidden>
Date: Thu May 7 14:29:02 2020 -0500

    Add kube-apiserver port to calico failsafe rules

    An invalid GlobalNetworkPolicy or NetworkPolicy may prevent
    calico-node from communicating with the kube-apiserver.
    Once the communication is broken, calico-node is no longer
    able to update the policies since it cannot communicate to
    read the updated policies. It can also prevent the pod
    from starting since the policies will prevent it from
    reading the configuration.

    To ensure that this scenario does not happen, the kube-apiserver
    port is being added to the failsafe rules to ensure communication
    is always possible, regardless of the network policy configuration.

    Change-Id: I1b065a74e7ad0ba9b1fdba4b63136b97efbe98ce
    Closes-Bug: 1877166
    Related-Bug: 1877383
    Signed-off-by: Matt Peters <email address hidden>

commit bd0f14a7dfb206ccaa3ce0f5e7d9034703b3403c
Author: Robert Church <email address hidden>
Date: Tue May 5 15:11:15 2020 -0400

    Provide an update strategy for Tiller deployment

    In the case of a simplex controller configuration the current patching
    strategy for the Tiller environment will fail as the tiller ports will
    be in use when the new deployment is attempted to be applied. The
    resulting tiller pod will be stuck in a Pending state.

    This will be observed if the node becomes ready after 'helm init'
    installs the initial deployment and before the deployment is patched for
    environment checks.

    The deployment strategy provided by 'helm init' is unspecified. This
    change will allow one additional pod (current + new) and one unavailable
    pod (current) during an update. The maxUnavailable setting allows the
    tiller pod to be deleted which will release its ports, thus allowing the
    patch deployment to spin up an new pod to a Running state.

    Change-Id: I83c43c52a77...
