DOCS: Backup & restore procedure for StarlingX

Bug #1871065 reported by Mihnea Saracin
Affects: StarlingX | Status: Fix Released | Importance: Medium | Assigned to: M Camp

Bug Description

The information in this Launchpad bug is for the following StoryBoard story: https://storyboard.openstack.org/#!/story/2006770

Brief Description
-----------------

This feature provides a last-resort disaster recovery option for cases where the StarlingX software and/or data are compromised. It provides a backup utility that creates a snapshot of the deployment state. This snapshot contains everything needed to restore the deployment to a previously known good working state.

There are three main Backup and Restore options:

    A. Full system restore where both the platform data and applications are re-initialized. (i.e. wipe_ceph_osds=true)

    B. Platform restore where the platform data is re-initialized but the applications are preserved – including Openstack, if previously installed. (i.e. wipe_ceph_osds=false)

    C. Openstack application B&R where only the Openstack application is restored. I.e.: Delete the Openstack application, re-apply Openstack application and restore data from off-box copies (glance, Ceph volumes, database)

The sections below describe the backup procedure and each restore option.

Backing up
============

Local play method
~~~~~~~~~~~~~~~~~~~

Run:

ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml -e "ansible_become_pass=<sysadmin password> admin_password=<sysadmin password>"

The <admin_password> and <ansible_become_pass> must be set correctly via the “-e” option on the command line, in an override file, or in the Ansible secret file.

This outputs a file named in the format <inventory_hostname>_platform_backup_<timestamp>.tgz (and a corresponding Openstack backup when Openstack is installed). The prefixes <platform_backup_filename_prefix> and <openstack_backup_filename_prefix> can be overridden via the “-e” option on the command line or in an override file.
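
For example, the passwords and filename prefixes can be collected into an override file and passed with “-e @<file>” (standard Ansible syntax). A minimal sketch, assuming a hypothetical file name backup-overrides.yml and placeholder values:

    # Create a hypothetical override file with placeholder values
    cat > backup-overrides.yml <<EOF
    ansible_become_pass: <sysadmin password>
    admin_password: <sysadmin password>
    platform_backup_filename_prefix: mysite_platform_backup
    openstack_backup_filename_prefix: mysite_openstack_backup
    EOF

    # Pass the override file instead of inline -e variables
    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml -e @backup-overrides.yml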

The generated backup tar files will look like this: localhost_platform_backup_2019_08_08_15_25_36.tgz and localhost_openstack_backup_2019_08_08_15_25_36.tgz. They are located in the /opt/backups directory on controller-0.
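
Because these tarballs are the disaster recovery source, they should be copied off the controller as soon as the playbook finishes. A minimal sketch using scp; the destination host and path are placeholders:

    # Copy both backup tarballs to a safe off-box location
    scp /opt/backups/localhost_platform_backup_2019_08_08_15_25_36.tgz \
        /opt/backups/localhost_openstack_backup_2019_08_08_15_25_36.tgz \
        <user>@<backup-server>:/path/to/backups/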

Remote play method
~~~~~~~~~~~~~~~~~~~~

1. Log in to the host where Ansible is installed and clone the playbook code from OpenDev at https://opendev.org/starlingx/ansible-playbooks.git

2. Provide an inventory file, either a customized one specified via the ‘-i’ option or the default one in the Ansible configuration directory (i.e. /etc/ansible/hosts). It must specify the IP of the controller host. For example, if the host name is my_vbox, the inventory file should have an entry called my_vbox:

    ---
    all:
      hosts:
        wc68:
          ansible_host: 128.222.100.02
        my_vbox:
          ansible_host: 128.224.141.74

3. Run ansible:

ansible-playbook <path-to-backup-playbook-entry-file> --limit host-name -i <inventory-file> -e <optional-extra-vars>

The generated backup tar files can be found in <host_backup_dir>, which is $HOME by default. It can be overridden via the “-e” option on the command line or in an override file.

The generated backup tarballs follow the same naming convention as in a local play.

Example:

ansible-playbook /localdisk/designer/repo/cgcs-root/stx/stx-ansible-playbooks/playbookconfig/src/playbooks/backup-restore/backup.yml --limit my_vbox -i $HOME/br_test/hosts -e "host_backup_dir=$HOME/br_test ansible_become_pass=Li69nux* admin_password=Li69nux* ansible_ssh_pass=Li69nux*"

Detailed information of the contents of the backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Postgresql config: Backup roles, table spaces and schemas for databases

- Postgresql data:

    o template1, sysinv, barbican db data, fm db data,

    o keystone db for primary region,

    o dcmanager db for dc controller,

    o dcorch db for dc controller

- ETCD database

- LDAP db

- Ceph crushmap

- DNS server list

- System Inventory network overrides. These are needed at restore to correctly set up the OS configuration:

    o addrpool

    o pxeboot_subnet

    o management_subnet

    o management_start_address

    o cluster_host_subnet

    o cluster_pod_subnet

    o cluster_service_subnet

    o external_oam_subnet

    o external_oam_gateway_address

    o external_oam_floating_address

- Docker registries on controller

- Docker no_proxy

- Backed up data:

    o OS configuration

        ok: [localhost] => (item=/etc) - note that although everything here is backed up, not all of the content will be restored.

    o Home directories of the ‘sysadmin’ user and all LDAP user accounts

        ok: [localhost] => (item=/home)

    o Generated platform configuration

        ok: [localhost] => (item=/opt/platform/config/19.09)

        ok: [localhost] => (item=/opt/platform/puppet/19.09/hieradata) - All the hieradata under this directory is backed up. However, only the static hieradata (static.yaml and secure_static.yaml) will be restored to bootstrap controller-0.

    o Keyring

        ok: [localhost] => (item=/opt/platform/.keyring/19.09)

    o Patching and package repositories

        ok: [localhost] => (item=/opt/patching)

        ok: [localhost] => (item=/www/pages/updates)

    o Extension filesystem

        ok: [localhost] => (item=/opt/extension)

    o Patch-vault filesystem for the distributed cloud system controller

        ok: [localhost] => (item=/opt/patch-vault)

    o Armada manifests

        ok: [localhost] => (item=/opt/platform/armada/19.09)

    o Helm charts

        ok: [localhost] => (item=/opt/platform/helm_charts)
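
To confirm what actually went into an archive, the tarball contents can be listed without extracting it. A quick check, reusing the example file name from above:

    # List the first entries of the platform backup archive
    tar tzf /opt/backups/localhost_platform_backup_2019_08_08_15_25_36.tgz | head -20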

Restoring
=============

A. Full system restore
------------------------

No user data is preserved, but the platform configuration is restored from the archive (i.e. wipe_ceph_osds=true).

Steps:

    Backup: The user runs the backup.yml playbook, which produces a platform backup tarball that is then moved outside the cluster for safekeeping.

    Restore:

      a. Power down all nodes.

      b. Reinstall controller-0

      c. Run the Ansible restore_platform.yml playbook to restore the full system from the platform backup tarball. As with the backup procedure, there are two options for this step: local and remote play.

          ~~~~~~~~~~~~

          Local play

          ~~~~~~~~~~~~

          First download the backup to the controller (you can also use an external storage device, e.g. a USB drive). Then run the command:

          ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e "wipe_ceph_osds=true initial_backup_dir=<location_of_tarball> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>"

          ~~~~~~~~~~~~

          Remote play

          ~~~~~~~~~~~~

          1. Log in to the host where Ansible is installed and clone the playbook code from OpenDev at https://opendev.org/starlingx/ansible-playbooks.git

          2. Provide an inventory file, either a customized one specified via the ‘-i’ option or the default one in the Ansible configuration directory (i.e. /etc/ansible/hosts). It must specify the IP of the controller host. For example, if the host name is my_vbox, the inventory file should have an entry called my_vbox:

          ---
          all:
            hosts:
              wc68:
                ansible_host: 128.222.100.02
              my_vbox:
                ansible_host: 128.224.141.74

          3. Run ansible:

            ansible-playbook <path-to-backup-playbook-entry-file> --limit host-name -i <inventory-file> -e <optional-extra-vars>

            Where optional-extra-vars can be:

            o <wipe_ceph_osds> set to wipe_ceph_osds=true (start with an empty ceph cluster)

            o The <backup_filename> is the platform backup tar file. It must be provided via the “-e” option on the command line, e.g. -e “backup_filename=localhost_platform_backup_2019_07_15_14_46_37.tgz”

            o The <initial_backup_dir> is the location on the Ansible control machine where the platform backup tar file is placed to restore the platform. It must be provided via “-e” option on the command line.

            o The <admin_password>, <ansible_become_pass> and <ansible_ssh_pass> need to be set correctly via the “-e” option on the command line or in the Ansible secret file. <ansible_ssh_pass> is the password for the sysadmin user on controller-0.

            o The <ansible_remote_tmp> should be set to a new directory (no need to create it ahead of time) under /home/sysadmin on controller-0 via the “-e” option on the command line

            e.g.

            ansible-playbook /localdisk/designer/jenkins/tis-stx-dev/cgcs-root/stx/ansible-playbooks/playbookconfig/src/playbooks/restore_platform.yml --limit my_vbox -i $HOME/br_test/hosts -e " wipe_ceph_osds=true ansible_become_pass=Li69nux* admin_password=Li69nux* ansible_ssh_pass=Li69nux* initial_backup_dir=$HOME/br_test backup_filename=my_vbox_system_backup_2019_08_08_15_25_36.tgz ansible_remote_tmp=/home/sysadmin/ansible-restore"

      d. After the playbook has executed, the remaining steps depend on the deployment mode:

          AIO-SX

          ~~~~~~~~~

          1. Unlock controller-0 & wait for it to boot

          AIO-DX

          ~~~~~~~~~

          1. Unlock controller-0 & wait for it to boot

          2. Reinstall controller-1 (boot it from PXE, wait for it to become 'online')

          3. Unlock controller-1

          Standard (with and w/o controller storage)

          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

          1. Unlock controller-0 & wait for it to boot. After the unlock, you will see all nodes, including the storage nodes, as offline.

          2. Reinstall controller-1, all storage and compute nodes (boot them from PXE, wait for them to become 'online')

          3. Unlock controller-1 and wait for it to be available

          4. (optional – if system has controller storage) Unlock storage nodes and wait for them to be available

          5. Unlock compute nodes and wait for them to be available
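
          For reference, the unlock steps above use the system host CLI on the active controller. A sketch for a standard configuration, assuming the default host names (controller-0/1, storage-0/1, compute-0/1):

              source /etc/platform/openrc
              system host-unlock controller-0     # step 1: unlock controller-0 and wait for it to boot
              system host-list                    # watch the reinstalled nodes reach 'online'
              system host-unlock controller-1     # step 3
              system host-unlock storage-0        # step 4: only if the deployment includes storage nodes
              system host-unlock compute-0        # step 5: repeat for each compute node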

B. Platform restore
---------------------

User data and configuration are preserved in both k8s (i.e. etcd) and Ceph. Thus, k8s pods and their configuration are restored, and PVC content is preserved (i.e. wipe_ceph_osds=false).

Steps:

    Backup: The user runs the backup.yml playbook, which produces a platform backup tarball that is then moved outside the cluster for safekeeping.

    Restore:

      a. Power down all nodes except the storage ones; note that it is mandatory for the Ceph cluster to remain functional during the restore.

      b. Reinstall controller-0

      c. Run the Ansible restore_platform.yml playbook to restore the full system from the platform backup tarball. As with the backup procedure, there are two options for this step: local and remote play.

          ~~~~~~~~~~~~

          Local play

          ~~~~~~~~~~~~

          First download the backup to the controller (you can also use an external storage device, e.g. a USB drive). Then run the command:

          ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e "initial_backup_dir=<location_of_tarball> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>"

          ~~~~~~~~~~~~

          Remote play

          ~~~~~~~~~~~~

          1. Log in to the host where Ansible is installed and clone the playbook code from OpenDev at https://opendev.org/starlingx/ansible-playbooks.git

          2. Provide an inventory file, either a customized one specified via the ‘-i’ option or the default one in the Ansible configuration directory (i.e. /etc/ansible/hosts). It must specify the IP of the controller host. For example, if the host name is my_vbox, the inventory file should have an entry called my_vbox:

          ---
          all:
            hosts:
              wc68:
                ansible_host: 128.222.100.02
              my_vbox:
                ansible_host: 128.224.141.74

          3. Run ansible:

            ansible-playbook <path-to-backup-playbook-entry-file> --limit host-name -i <inventory-file> -e <optional-extra-vars>

            Where optional-extra-vars can be:

            o <wipe_ceph_osds> set to wipe_ceph_osds=false (the default), which keeps the Ceph data intact

            o The <backup_filename> is the platform backup tar file. It must be provided via the “-e” option on the command line, e.g. -e “backup_filename=localhost_platform_backup_2019_07_15_14_46_37.tgz”

            o The <initial_backup_dir> is the location on the Ansible control machine where the platform backup tar file is placed to restore the platform. It must be provided via “-e” option on the command line.

            o The <admin_password>, <ansible_become_pass> and <ansible_ssh_pass> need to be set correctly via the “-e” option on the command line or in the Ansible secret file. <ansible_ssh_pass> is the password for the sysadmin user on controller-0.

            o The <ansible_remote_tmp> should be set to a new directory (no need to create it ahead of time) under /home/sysadmin on controller-0 via the “-e” option on the command line

            e.g.

            ansible-playbook /localdisk/designer/jenkins/tis-stx-dev/cgcs-root/stx/ansible-playbooks/playbookconfig/src/playbooks/restore_platform.yml --limit my_vbox -i $HOME/br_test/hosts -e " ansible_become_pass=Li69nux* admin_password=Li69nux* ansible_ssh_pass=Li69nux* initial_backup_dir=$HOME/br_test backup_filename=my_vbox_system_backup_2019_08_08_15_25_36.tgz ansible_remote_tmp=/home/sysadmin/ansible-restore"

      d. After the playbook has executed, the remaining steps depend on the deployment mode:

          AIO-SX

          ~~~~~~~~~

          1. Unlock controller-0 & wait for it to boot

          AIO-DX

          ~~~~~~~~~

          1. Unlock controller-0 & wait for it to boot

          2. Reinstall controller-1 (boot it from PXE, wait for it to become 'online')

          3. Unlock controller-1

          Standard w/o controller storage

          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

          1. Unlock controller-0 & wait for it to boot. After the unlock, you will see all nodes, including the storage nodes, as offline.

          2. Reinstall controller-1 and compute nodes (boot them from PXE, wait for them to become 'online')

          3. Unlock controller-1 and wait for it to be available

          4. Unlock compute nodes and wait for them to be available

          Standard with controller storage

          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

          1. Unlock controller-0 & wait for it to boot. After the unlock, you will see all nodes except the storage nodes as offline. The storage nodes have to be powered on and in the ‘available’ state.

          2. Reinstall controller-1 and compute nodes (boot them from PXE, wait for them to become 'online')

          3. Unlock controller-1 and wait for it to be available

          4. Unlock compute nodes and wait for them to be available

          5. (optional) Reinstall storage nodes.

     e. Re-apply applications (e.g. Openstack) to force pods to restart.
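
        For stx-openstack, step e maps roughly to the application CLI:

            system application-apply stx-openstack    # re-apply the application so its pods restart
            system application-list                   # wait until the status returns to 'applied'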

C.Openstack application B&R

---------------------

In this procedure, only the Openstack application will be restored.

Steps:

    Backup: The user runs the same backup.yml playbook as for options A and B. The backup tarballs have to be moved outside the cluster for safekeeping.
    Note: the backup.yml playbook generates a platform backup tarball and an Openstack backup tarball; when Openstack is running, it produces two output tarballs instead of one.

    Restore:

      a. Delete the old Openstack application (note that images and volumes will remain in Ceph) and upload the application again:

          system application-remove stx-openstack

          system application-delete stx-openstack

          system application-upload stx-openstack-<ver>.tgz

      b. (optional – if the user wants to delete the data) Remove the old Glance images and Cinder volumes from the Ceph pools, as sketched below.
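
          A rough sketch of step b using the rbd CLI, assuming the 'images' and 'cinder-volumes' pool names used elsewhere in this procedure; verify the pool names on the target system before deleting anything:

              rbd ls -p images                              # list the old Glance images in the rbd pool
              rbd rm images/<image-uuid>                    # remove an unwanted image (placeholder name)
              rbd ls -p cinder-volumes                      # list the old Cinder volumes
              rbd snap purge cinder-volumes/<volume-name>   # remove any snapshots first (placeholder name)
              rbd rm cinder-volumes/<volume-name>           # then remove the volume itself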

      c. Run the restore_openstack.yml Ansible playbook to restore the Openstack tarball.

          ~~~~~~~~~

          If you do not want to manipulate the Ceph data, execute:

          ~~~~~~~~~

          ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'initial_backup_dir=<location_of_backup_filename> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>'

          e.g.

          ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'initial_backup_dir=/opt/backups ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz'

        ~~~~~~~~~

        If you want to restore Glance images and Cinder volumes from external storage (i.e. step b was executed), or to reconcile newer data in the Glance and Cinder volume pools with older data, the following steps must be executed:

        ~~~~~~~~~

          - Run the restore_openstack playbook with the 'restore_cinder_glance_data' flag enabled. This step brings up the MariaDB services, restores the MariaDB data, and brings up the Cinder and Glance services:

            ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_cinder_glance_data=true initial_backup_dir=<location_of_backup_filename> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>'

            e.g.

            ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_cinder_glance_data=true ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz initial_backup_dir=/opt/backups'

          - Restore Glance images and Cinder volumes using image-backup.sh and tidy_storage_post_restore helper scripts.

            The tidy storage script is used to detect any discrepancies between the Cinder/Glance DBs and the rbd pools.

            Discrepancies between the Glance images DB and the rbd images pool are handled in the following ways:

                - If an image is in the Glance images DB but not in the rbd images pool, list the image and the suggested actions to take in a log file.

                - If an image is in the rbd images pool but not in the Glance images DB, create a Glance image in the Glance images DB to associate with the backend data. Also, list the image and the suggested actions to take in a log file.

            Discrepancies between the Cinder volumes DB and the rbd cinder-volumes pool are handled in the following ways:

                - If a volume is in the Cinder volumes DB but not in the rbd cinder-volumes pool, set the volume state to "error". Also, list the volume and the suggested actions to take in a log file.

                - If a volume is in the rbd cinder-volumes pool but not in the Cinder volumes DB, remove any snapshot(s) associated with this volume in the rbd pool and create a volume in the Cinder volumes DB to associate with the backend data. List the volume and the suggested actions to take in a log file.

                - If a volume is in both the Cinder volumes DB and the rbd cinder-volumes pool and it has snapshot(s) in the rbd pool, re-create the snapshot in Cinder if it doesn't exist.

                - If a snapshot is in the Cinder DB but not in the rbd pool, it will be deleted.

            Usage:

            tidy_storage_post_restore <log_file>

            The image-backup.sh script is used to back up and restore Glance images from the Ceph images pool.

            Usage:

            image-backup export <uuid> - export the image with <uuid> into backup file /opt/backups/image_<uuid>.tgz

            image-backup import image_<uuid>.tgz - import the image from the backup source file at /opt/backups/image_<uuid>.tgz
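
            As a usage sketch, an image is typically exported while Glance is still serving it, the resulting file is moved off-box, and it is imported again once the restore has brought the Glance services back up (the UUID below is a placeholder):

                openstack image list                     # find the UUID of the image to protect
                image-backup export <uuid>               # writes /opt/backups/image_<uuid>.tgz
                # copy image_<uuid>.tgz off-box, and place it back in /opt/backups before restoring
                image-backup import image_<uuid>.tgz     # re-creates the image in the Ceph images pool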

          - Run the playbook again with 'restore_openstack_continue' set to true to bring up the remaining openstack services:

            ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_openstack_continue=true initial_backup_dir=<location_of_backup_filename> ansible_become_pass=<admin_password> admin_password=<admin_password> backup_filename=<backup_filename>'

            e.g.

            ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml -e 'restore_openstack_continue=true ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_openstack_backup_2019_12_13_12_43_17.tgz initial_backup_dir=/opt/backups'
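
            Once this final playbook run completes, a quick status check can confirm that the application came back; a minimal sketch, assuming the 'openstack' namespace used by stx-openstack:

                system application-list          # stx-openstack should return to the 'applied' status
                kubectl get pods -n openstack    # Openstack pods should reach Running/Completed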

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 doc change

tags: added: stx. stx.docs
tags: added: stx.4.0
removed: stx.
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → M Camp (mcamp859)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to docs (master)

Fix proposed to branch: master
Review: https://review.opendev.org/721772

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to docs (master)

Reviewed: https://review.opendev.org/721772
Committed: https://git.openstack.org/cgit/starlingx/docs/commit/?id=2061492e122097ef9905740f209ec8971d14dacd
Submitter: Zuul
Branch: master

commit 2061492e122097ef9905740f209ec8971d14dacd
Author: MCamp859 <email address hidden>
Date: Tue Apr 21 23:14:29 2020 -0400

    Add backup & restore guide

    Story: 2006770
    Task: 38487
    Closes-Bug: 1871065

    Change-Id: I9de2b9a2d925f34291f67fb0ab1c8d46945995d6
    Signed-off-by: MCamp859 <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released