Backup & Restore: Subcloud AIO-DX active controller restore fails to connect to central registry

Bug #1870389 reported by Senthil Mukundakumar
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Dan Voiculeasa

Bug Description

Brief Description
-----------------

Restoring the subcloud active controller fails. After the reboot, I proceeded with the restore and hit the following error:

ok: [localhost] => (item={u'replaced_url': u'registry.central:9001/docker.elastic.co', u'default_url': u'docker.elastic.co'})

TASK [common/push-docker-images : Log in k8s, gcr, quay, docker registries if credentials exist] *******
failed: [localhost] (item=None) => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
failed: [localhost] (item=None) => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
failed: [localhost] (item=None) => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
failed: [localhost] (item=None) => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
fatal: [localhost]: FAILED! => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}

PLAY RECAP *********************************************************************************************
localhost : ok=281 changed=127 unreachable=0 failed=1

Severity
--------
Major

Steps to Reproduce
------------------
1. Make sure the subcloud system is UP & ACTIVE
2. Do a backup from the active controller
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml -e "ansible_become_pass=Li69nux* admin_password=Li69nux*"
3. Unmanage subcloud
4. Bring down all the nodes in the subcloud and re-install the active controller
5. scp the backup file to the controller
6. Execute config_management
7. Restore the active controller from backup file
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e "backup_filename=localhost_platform_backup_2019_09_12_20_37_23.tgz admin_password=Li69nux* ansible_become_pass=Li69nux* initial_backup_dir=/home/sysadmin"

Actual Behavior
----------------

Subcloud Active controller failed to restore

Reproducibility
---------------
Tried backup & restore first time in a Subcloud environment

System Configuration
--------------------
Distributed Cloud/Subcloud

Branch/Pull Time/Commit
-----------------------
2020-03-29_16-39-59

LOGS:
-----
https://files.starlingx.kube.cengn.ca/launchpad/1870389
New logs have been attached:
post_restore_dc_system_logs.tar
post_restore_subcloud_log.tar
pre_backup_dc_system_logs.tar
pre_backup_subcloud.tar
localhost_platform_backup_2020_04_05_02_49_03.tgz

Last Pass
---------
Not executed previously in Subcloud

Test Activity
-------------
Regression

Ovidiu Poncea (ovidiuponcea) wrote :

This is expected behavior if the setup had patches installed. This is what the logs show. Senthil, please confirm. Thanks!

Senthil Mukundakumar (smukunda) wrote :

Yes, the system was pre-installed with patches. The restore failed again after the controller reboot, as mentioned in the description above. Once I reproduce the issue again, I will attach more logs.

Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - B&R failure with a patch applied

tags: added: stx.4.0 stx.update
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Dan Voiculeasa (dvoicule)
Senthil Mukundakumar (smukunda) wrote :

The issue is reproduced in a different DC system - AIO-SX subcloud without any patches involved in the system.

description: updated
Dan Voiculeasa (dvoicule) wrote :

From the logs:
etc/resolv.conf on the failed subcloud controller shows
nameserver 2620:10a:a001:a103::2

The restore fails to do a docker login to registry.central:9001 because the name can't be resolved.

This is what a lookup against that nameserver would return:
nslookup registry.central 2620:10a:a001:a103::2
Server: 2620:10a:a001:a103::2
Address: 2620:10a:a001:a103::2#53

** server can't find registry.central: NXDOMAIN
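
The same condition can be checked from the controller before retrying the restore. The following is a minimal Python sketch using only the standard library. The helper name `can_resolve` is hypothetical, and it asks the system resolver (which also consults /etc/hosts) rather than the specific nameserver queried above:

```python
import socket


def can_resolve(hostname, port=9001, resolver=socket.getaddrinfo):
    """Return True if the resolver can map hostname to at least one address.

    resolver is injectable so the check can be exercised without network access.
    """
    try:
        return len(resolver(hostname, port)) > 0
    except socket.gaierror:
        # Raised on resolution failures such as NXDOMAIN.
        return False
```

Calling `can_resolve("registry.central")` on the affected controller would be expected to return False while the name is unresolvable, and True once a hosts entry or working DNS is in place.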

Dan Voiculeasa (dvoicule) wrote :

The restore playbook must add a task that creates an /etc/hosts entry mapping registry.central to the OAM address of the central cloud.

`IP_OAM_CENTRAL registry.central`
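
As an illustration of the intended effect only (not the playbook task that was eventually merged), here is a minimal Python sketch of an idempotent hosts-file update. The function name `ensure_hosts_entry` is hypothetical, and it works on the file contents as a string so it can be tried without touching /etc/hosts:

```python
def ensure_hosts_entry(hosts_text, address, hostname):
    """Return hosts_text with exactly one line mapping hostname to address."""
    kept = []
    for line in hosts_text.splitlines():
        # Ignore trailing comments when looking for the hostname.
        fields = line.split("#", 1)[0].split()
        if hostname in fields[1:]:
            # Drop any stale mapping for this hostname. Simplification: a line
            # that carries several names is dropped whole.
            continue
        kept.append(line)
    kept.append(f"{address} {hostname}")
    return "\n".join(kept) + "\n"
```

Running it twice with the same arguments yields the same text, which is the property a restore task needs if the playbook is re-run after a failure. The address passed in would be the central OAM floating address, shown above as the placeholder IP_OAM_CENTRAL.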

This is not the only issue with restoring subclouds. Multiple tasks that will fail are hidden by the fact that the playbook fails at 'Log in k8s' task.

Also, I think "-e distributed_cloud_role=subcloud" should be added to the ansible-playbook command that runs the restore, at least until the restore playbook detects and sets that variable correctly by itself.
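
A minimal Python sketch of how a wrapper script might assemble that command line, assuming the playbook path from the reproduction steps. The helper names `build_restore_command` and `render` are hypothetical, and values containing spaces are not handled:

```python
import shlex

RESTORE_PLAYBOOK = "/usr/share/ansible/stx-ansible/playbooks/restore_platform.yml"


def build_restore_command(extra_vars, overrides=("distributed_cloud_role=subcloud",)):
    """Build the ansible-playbook argv for a subcloud restore.

    extra_vars: dict of the usual -e variables (backup_filename, passwords, ...).
    overrides: additional key=value strings, each passed with its own -e.
    """
    joined = " ".join(f"{key}={value}" for key, value in extra_vars.items())
    argv = ["ansible-playbook", RESTORE_PLAYBOOK, "-e", joined]
    for item in overrides:
        argv += ["-e", item]
    return argv


def render(argv):
    """Quote the argv so it can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

Passing the argv list to `subprocess.run` avoids shell quoting problems; `render` is only for logging or copy-paste.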

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/718749

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/718957

Frank Miller (sensfan22)
summary: - Backup & Restore: Subcloud AIO-DX active controller restore fails
+ Backup & Restore: Subcloud AIO-DX active controller restore fails to
+ connect to central registry
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/718749
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=9080db419d559d3d5d33c0a6459e9f5e8b7700e5
Submitter: Zuul
Branch: master

commit 9080db419d559d3d5d33c0a6459e9f5e8b7700e5
Author: Dan Voiculeasa <email address hidden>
Date: Thu Apr 9 16:07:30 2020 +0300

    Add registry.central host for DC subcloud restore

    During bootstrap management network is temporarly assigned on lo
    interface. Backup archive contains /etc/resolv.conf and /etc/hosts
    of an already unlocked controller. Before backup registry.central is
    resolved through dns (nameserver `floating central management`).

    During restore a temporary host for registry.central must be created.
    Since there is no reference of a backup/shadow management network that
    provides connectivity for such use cases the `floating central oam`
    can be used.

    Partial-Bug: 1870389

    Change-Id: I86166da31491736d6695e04fa287f79871975b55
    Signed-off-by: Dan Voiculeasa <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/718957
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=362d905dad25369bf116bb1e34a659f33b7260af
Submitter: Zuul
Branch: master

commit 362d905dad25369bf116bb1e34a659f33b7260af
Author: Dan Voiculeasa <email address hidden>
Date: Fri Apr 10 11:31:06 2020 +0300

    Improve host-overrides

    Add distributed cloud role information in the host overrides.
    The restore playbook needs this information.

    Partial-Bug: 1870389
    Change-Id: I278f19be32d1fe87687feb75e26b2898237de86f
    Signed-off-by: Dan Voiculeasa <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/719924

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/719924
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=40cfef7c417709c234e50a1a034fb4a11dbf180a
Submitter: Zuul
Branch: master

commit 40cfef7c417709c234e50a1a034fb4a11dbf180a
Author: Dan Voiculeasa <email address hidden>
Date: Tue Apr 14 14:18:29 2020 +0300

    Remove subcloud task from restore mode

    A task supposed to run only during bootstrap is running during restore.

    Keystone dc variables (dc_admin_user_id and dc_admin_project_id) are
    added during bootstrap to hieradata static.yaml file.
    When doing the restore the information is already present in the file in
    the backup archive.

    Partial-Bug: 1870389
    Change-Id: Iebab8dc059435c7e2b0f19947fedce88bd71bb65
    Signed-off-by: Dan Voiculeasa <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/720229

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/720579

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/721611

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/721611
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=4ccb11cb4019734e424362d677afb00dd6ecc4b6
Submitter: Zuul
Branch: master

commit 4ccb11cb4019734e424362d677afb00dd6ecc4b6
Author: Dan Voiculeasa <email address hidden>
Date: Tue Apr 21 11:47:37 2020 +0300

    Improve host-overrides

    Add missing variables for DC.

    Central+Subclod:
    system_mode
    location
    description

    Subcloud:
    region_config
    region_name
    system_controller_oam_subnet
    system_controller_oam_floating_address
    system_controller_subnet
    system_controller_floating_address

    Partial-Bug: 1870389
    Closes-Bug: 1873617
    Change-Id: Ieb12ffc0ad769dd6ca22eb4c15f9d6d55778fd4b
    Signed-off-by: Dan Voiculeasa <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/720579
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=36a01e8ba38f3e0d1e2ea7a2bce31edbedfde04e
Submitter: Zuul
Branch: master

commit 36a01e8ba38f3e0d1e2ea7a2bce31edbedfde04e
Author: Dan Voiculeasa <email address hidden>
Date: Tue Apr 21 17:54:53 2020 +0300

    B&R: Do keystone db backup for subcloud

    Keystone db backup file is missing for subclouds.
    Create the keystone db backup file when running the backup playbook on
    subcloud.

    Partial-Bug: 1870389
    Change-Id: I64c8b38a51bf04714931d70e126e0f63782deb20
    Signed-off-by: Dan Voiculeasa <email address hidden>

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/720229
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=204641a5b3082c9873109169f93ae1845eb79813
Submitter: Zuul
Branch: master

commit 204641a5b3082c9873109169f93ae1845eb79813
Author: Dan Voiculeasa <email address hidden>
Date: Wed Apr 15 15:54:58 2020 +0300

    DC subcloud restore registry.central certs

    During restore a certificate is missing.
    Docker needs the certificate to connect to registry.central.
    Extract it from backup archive.

    Closes-Bug: 1870389

    Depends-On: I64c8b38a51bf04714931d70e126e0f63782deb20
    Depends-On: Ieb12ffc0ad769dd6ca22eb4c15f9d6d55778fd4b
    Depends-On: I86166da31491736d6695e04fa287f79871975b55
    Depends-On: Iebab8dc059435c7e2b0f19947fedce88bd71bb65
    Depends-On: I278f19be32d1fe87687feb75e26b2898237de86f

    Change-Id: Ief65a8963b81ef489171c264964d472a66fec282
    Signed-off-by: Dan Voiculeasa <email address hidden>
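
For readers unfamiliar with the mechanics, pulling a single file out of a backup .tgz can be sketched in Python with the standard `tarfile` module. This is an illustration only, not the merged playbook task; the function name `read_member` is hypothetical, and the member path of the certificate inside the archive is not stated in this report, so it is left as a parameter:

```python
import tarfile


def read_member(archive_path, member_name):
    """Return the bytes of one regular file stored in a .tgz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        member = archive.getmember(member_name)  # raises KeyError if absent
        handle = archive.extractfile(member)
        if handle is None:
            # extractfile returns None for entries that are not regular files.
            raise ValueError(f"{member_name} is not a regular file")
        return handle.read()
```

Reading the member into memory rather than extracting the whole archive keeps the rest of the filesystem untouched until the restore proper runs.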

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729809

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729812

Yang Liu (yliu12)
tags: added: stx.retestneeded
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/729812
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=539d476456277c22d0dcbc3cbbc832e623242264
Submitter: Zuul
Branch: f/centos8

commit 320cc40de8518787c2be234d7fdf88ec0a462df2
Author: Don Penney <email address hidden>
Date: Wed May 13 13:06:11 2020 -0400

    Add auto-versioning to starlingx/config packages

    This update makes use of the PKG_GITREVCOUNT variable to auto-version
    the packages in this repo.

    Change-Id: I3a2c8caeb4b4647608978b1f2ccfcf0661508803
    Depends-On: https://review.opendev.org/727837
    Story: 2006166
    Task: 39766
    Signed-off-by: Don Penney <email address hidden>

commit d9f2aea0fb228ed69eb9c9262e29041eedabc15d
Author: Sharath Kumar K <email address hidden>
Date: Wed Apr 22 16:22:22 2020 +0200

    De-branding in starlingx/config: CGCS -> StarlingX

    1. Rename CGCS to StarlingX for .spec files

    Test:
    After the de-brand change, bootimage.iso has been built in the flock
    Layer and installed on the dev machine to validate the changes.

    Please note, doing de-brand changes in batches, this is batch9 changes.

    Story: 2006387
    Task: 39524

    Change-Id: Ia1fe0f2baafb78c974551100f16e6a7d99882f15
    Signed-off-by: Sharath Kumar K <email address hidden>

    De-branding in starlingx/config: CGCS -> StarlingX

    1. Rename CGCS to StarlingX for .spec file
    2. Rename TIS to StarlingX for .service files

    Test:
    After the de-brand change, bootimage.iso has been built in the flock
    Layer and installed on the dev machine to validate the changes.

    Please note, doing de-brand changes in batches, this is batch10 changes.

    Story: 2006387
    Task: 36202

    Change-Id: I404ce0da2621495175ad31489e9ad6f7b0211e26
    Signed-off-by: Sharath Kumar K <email address hidden>

commit d141e954fa6bbf688929ec90d1b6604a97792c43
Author: Teresa Ho <email address hidden>
Date: Tue Mar 31 10:08:57 2020 -0400

    Sysinv extensions for FPGA support

    This update adds cli and restapi to support FPGA device
    programming.

    CLI commands:
    system device-image-apply
    system device-image-create
    system device-image-delete
    system device-image-list
    system device-image-remove
    system device-image-show
    system device-image-state-list
    system device-label-list
    system host-device-image-update
    system host-device-image-update-abort
    system host-device-label-assign
    system host-device-label-list
    system host-device-label-remove

    Story: 2006740
    Task: 39498

    Change-Id: I556c2e7a51b3931b5a66ab27b67f51e3a8aebd9f
    Signed-off-by: Teresa Ho <email address hidden>

commit 491cca42ed854d2cb3ee3646b93c56a4f45f563c
Author: Elena Taivan <email address hidden>
Date: Wed Apr 29 11:25:26 2020 +0000

    Qcow2 conversion to raw can be done using 'image-conversion' filesystem

    1. Conversion filesystem can be added before/after
       stx-openstack is applied
    2. If conversion filesystem is added after stx-openstack
       is applied, changes to stx-openstack will only take effec...

tags: added: in-f-centos8
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/729809
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=73027425d4501a6b7785e91024c9e8ddbc03115d
Submitter: Zuul
Branch: f/centos8

commit 55c9afd075194f7669fa2a87e546f61034679b04
Author: Dan Voiculeasa <email address hidden>
Date: Wed May 13 14:19:52 2020 +0300

    Restore: disconnect etcd from ceph

    At the moment etcd is restored only if ceph data is kept.
    Etcd should be restored regardless if ceph data is kept or wiped.

    Story: 2006770
    Task 39751
    Change-Id: I9dfb1be0a83c3fdc5f1b29cbb974c5e0e2236ad3
    Signed-off-by: Dan Voiculeasa <email address hidden>

commit 003ddff574c74adf11cf8e4758e93ba0eed45a6a
Author: Don Penney <email address hidden>
Date: Fri May 8 11:35:58 2020 -0400

    Add playbook for updating static images

    This commit introduces a new playbook, upgrade-static-images.yml, used
    for downloading updating images and pushing to the local registry.

    Change-Id: I8884440261a5a4e27b40398e5a75c9d03b09d4ba
    Story: 2006781
    Task: 39706
    Signed-off-by: Don Penney <email address hidden>

commit 26fd273cf5175ba4bdd31d6b6b777814f1a6c860
Author: Matt Peters <email address hidden>
Date: Thu May 7 14:29:02 2020 -0500

    Add kube-apiserver port to calico failsafe rules

    An invalid GlobalNetworkPolicy or NetworkPolicy may prevent
    calico-node from communicating with the kube-apiserver.
    Once the communication is broken, calico-node is no longer
    able to update the policies since it cannot communicate to
    read the updated policies. It can also prevent the pod
    from starting since the policies will prevent it from
    reading the configuration.

    To ensure that this scenario does not happen, the kube-apiserver
    port is being added to the failsafe rules to ensure communication
    is always possible, regardless of the network policy configuration.

    Change-Id: I1b065a74e7ad0ba9b1fdba4b63136b97efbe98ce
    Closes-Bug: 1877166
    Related-Bug: 1877383
    Signed-off-by: Matt Peters <email address hidden>

commit bd0f14a7dfb206ccaa3ce0f5e7d9034703b3403c
Author: Robert Church <email address hidden>
Date: Tue May 5 15:11:15 2020 -0400

    Provide an update strategy for Tiller deployment

    In the case of a simplex controller configuration the current patching
    strategy for the Tiller environment will fail as the tiller ports will
    be in use when the new deployment is attempted to be applied. The
    resulting tiller pod will be stuck in a Pending state.

    This will be observed if the node becomes ready after 'helm init'
    installs the initial deployment and before the deployment is patched for
    environment checks.

    The deployment strategy provided by 'helm init' is unspecified. This
    change will allow one additional pod (current + new) and one unavailable
    pod (current) during an update. The maxUnavailable setting allows the
    tiller pod to be deleted which will release its ports, thus allowing the
    patch deployment to spin up an new pod to a Running state.

    Change-Id: I83c43c52a77...

Senthil Mukundakumar (smukunda) wrote :

Verified in DC3/subcloud1 using 2020-06-24_22-16-59

tags: removed: stx.retestneeded