DOCS: Backup & restore procedure for StarlingX
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | M Camp |
Bug Description
The information in this launchpad is for the following StoryBoard: https:/
Brief Description
-----------------
This feature provides a last-resort disaster recovery option for cases where the StarlingX software and/or data are compromised. It supplies a backup utility that creates a snapshot of the deployment state; this snapshot contains everything needed to restore the deployment to a previously known good working state.
There are three main Backup and Restore options:
A. Full system restore, where both the platform data and the applications are re-initialized (i.e. wipe_ceph_osds=true).
B. Platform restore, where the platform data is re-initialized but the applications are preserved – including Openstack, if previously installed (i.e. wipe_ceph_osds=false).
C. Openstack application B&R, where only the Openstack application is restored, i.e.: delete the Openstack application, re-apply it and restore its data from off-box copies (glance images, Ceph volumes, database).
The sections below describe each restore option, as well as the backup procedure.
Backing up
============
Local play method
~~~~~~~~~~~~~~~~~~~
Run:
ansible-playbook /usr/share/
The <admin_password> and <ansible_become_pass> need to be provided via the '-e' option on the command line or in an override file.
This will output files named in the following format: <inventory_hostname>_platform_backup_<timestamp>.tgz (plus <inventory_hostname>_openstack_backup_<timestamp>.tgz when Openstack is installed).
For a local play the generated backup tar files will look like this: localhost_platform_backup_<timestamp>.tgz.
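As a sketch, assuming the playbook ships at the standard StarlingX location, the full local-play command looks like the following (both password values are placeholders):
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml \
    -e "ansible_become_pass=<sysadmin password> admin_password=<admin password>"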
Remote play method
~~~~~~~~~~~~~~~~~~
1. Log in to the host where Ansible is installed and clone the playbook code from OpenDev at https:/
2. Provide an inventory file, either a customized one specified via the '-i' option or the default one that resides in the Ansible configuration directory (i.e. /etc/ansible/hosts), for example:
---
all:
  hosts:
    wc68:
    my_vbox:
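Standard Ansible host variables can be added per host if connection details are needed; a sketch (the host name, address and user are illustrative):
---
all:
  hosts:
    wc68:
      ansible_host: <oam-ip-or-hostname>
      ansible_user: sysadmin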
3. Run ansible:
ansible-playbook <path-to-
The generated backup tar files can be found in <host_backup_dir>, which is $HOME by default. It can be overridden via the '-e' option on the command line or in an override file.
The generated backup tar files follow the same naming convention as in a local play.
Example:
ansible-playbook /localdisk/
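Putting the pieces together, a remote-play backup run might look like the following sketch (the playbook path, host name, inventory location and passwords are illustrative):
ansible-playbook <path-to-cloned-playbooks>/backup.yml --limit wc68 -i $HOME/br_test/hosts \
    -e "host_backup_dir=$HOME/br_test ansible_become_pass=<password> \
        admin_password=<password> ansible_ssh_pass=<password>"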
Detailed information of the contents of the backup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Postgresql config: Backup roles, table spaces and schemas for databases
- Postgresql data:
o template1, sysinv, barbican db data, fm db data,
o keystone db for primary region,
o dcmanager db for dc controller,
o dcorch db for dc controller
- ETCD database
- LDAP db
- Ceph crushmap
- DNS server list
- System Inventory network overrides. These are needed at restore time to correctly set up the OS configuration:
o addrpool
o pxeboot_subnet
o management_subnet
o management_
o cluster_host_subnet
o cluster_pod_subnet
o cluster_service_subnet
o external_oam_subnet
o external_oam_gateway_address
o external_oam_floating_address
- Docker registries on controller
- Docker no_proxy
- Backed-up data:
o OS configuration
ok: [localhost] => (item=/etc) - note: although everything here is backed up, not all of the content will be restored.
o Home directory of the 'sysadmin' user and all LDAP user accounts
ok: [localhost] => (item=/home)
o Generated platform configuration
ok: [localhost] => (item=/
ok: [localhost] => (item=/
o Keyring
ok: [localhost] => (item=/
o Patching and package repositories
ok: [localhost] => (item=/
ok: [localhost] => (item=/
o Extension filesystem
ok: [localhost] => (item=/
o patch-vault filesystem for the distributed cloud system controller
ok: [localhost] => (item=/
o Armada manifests
ok: [localhost] => (item=/
o Helm charts
ok: [localhost] => (item=/
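To verify what a given archive captured, its contents can be listed without extracting it, for example (the file name is a placeholder):
tar -tzf localhost_platform_backup_<timestamp>.tgz | less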
Restoring
=============
A. Full system restore
----------------------
No user data is preserved, but the platform configuration is restored from the archive (i.e. wipe_ceph_osds=true).
Steps:
Backup: the user runs the backup.yml playbook, which produces a platform backup tarball that is then moved outside the cluster for safekeeping.
Restore:
a. Power down all nodes.
b. Reinstall controller-0
c. Run the Ansible restore_platform.yml playbook
Local play
First download the backup to the controller (an external storage device such as a USB drive can also be used), then run the restore command.
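A minimal sketch of the local-play command, assuming the playbook is at the standard StarlingX location and the tarball was copied to /home/sysadmin (the file name and passwords are placeholders):
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
    -e "initial_backup_dir=/home/sysadmin ansible_become_pass=<sysadmin password> \
        admin_password=<admin password> backup_filename=localhost_platform_backup_<timestamp>.tgz \
        wipe_ceph_osds=true"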
Remote play
1. Log in to the host where Ansible is installed and clone the playbook code from OpenDev at https:/
2. Provide an inventory file, either a customized one specified via the '-i' option or the default one that resides in the Ansible configuration directory (i.e. /etc/ansible/hosts):
---
all:
  hosts:
3. Run ansible:
Where optional-extra-vars can be:
o <wipe_ceph_osds> set to wipe_ceph_osds=true (start with an empty Ceph cluster)
o The <backup_filename> is the platform backup tar file. It must be provided via the '-e' option on the command line, e.g. -e "backup_filename=localhost_platform_backup_<timestamp>.tgz"
o The <initial_backup_dir> is the location where the backup tar file is placed to be restored. It must be provided via the '-e' option on the command line.
o The <admin_password>, <ansible_become_pass> and <ansible_ssh_pass> need to be set correctly, either via the '-e' option on the command line or in an override file.
o The <ansible_remote_tmp> should be set to a new directory under /home/sysadmin on the target host (it does not need to be created ahead of time) via the '-e' option.
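As a sketch, a remote-play restore might look like the following (the playbook path, host name, inventory file and passwords are illustrative):
ansible-playbook <path-to-restore_platform.yml> --limit wc68 -i <inventory-file> \
    -e "initial_backup_dir=$HOME/backups backup_filename=wc68_platform_backup_<timestamp>.tgz \
        ansible_become_pass=<password> admin_password=<password> ansible_ssh_pass=<password> \
        wipe_ceph_osds=true"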
d. After the playbook has executed, the remaining steps depend on the deployment mode:
AIO-SX
~~~~~~~~~
1. Unlock controller-0 & wait for it to boot
AIO-DX
~~~~~~~~~
1. Unlock controller-0 & wait for it to boot
2. Reinstall controller-1 (boot it from PXE, wait for it to become 'online')
3. Unlock controller-1
Standard (with and w/o controller storage)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Unlock controller-0 & wait for it to boot. After the unlock you will see all nodes, including storage nodes, as offline.
2. Reinstall controller-1, all storage and compute nodes (boot them from PXE, wait for them to become 'online')
3. Unlock controller-1 and wait for it to be available
4. (optional – on systems with dedicated storage nodes) Unlock the storage nodes and wait for them to become available
5. Unlock compute nodes and wait for them to be available
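The unlocks in the steps above use the standard host CLI, e.g.:
system host-unlock controller-0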
B. Platform restore
-------------------
User data and configuration are preserved in both k8s (i.e. etcd) and Ceph. Thus, k8s pods and their configuration are restored and PVC contents are preserved (i.e. wipe_ceph_osds=false).
Steps:
Backup: the user runs the backup.yml playbook, which produces a platform backup tarball that is then moved outside the cluster for safekeeping.
Restore:
a. Power down all the nodes except the storage ones; note that it is mandatory for the Ceph cluster to remain functional during the restore.
b. Reinstall controller-0
c. Run the Ansible restore_platform.yml playbook
Local play
First download the backup to the controller (an external storage device such as a USB drive can also be used), then run the restore command.
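A minimal sketch of the local-play command, assuming the standard playbook location; note wipe_ceph_osds=false (the default), so the Ceph data is kept (the file name and passwords are placeholders):
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
    -e "initial_backup_dir=/home/sysadmin ansible_become_pass=<sysadmin password> \
        admin_password=<admin password> backup_filename=localhost_platform_backup_<timestamp>.tgz \
        wipe_ceph_osds=false"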
Remote play
1. Log in to the host where Ansible is installed and clone the playbook code from OpenDev at https:/
2. Provide an inventory file, either a customized one specified via the '-i' option or the default one that resides in the Ansible configuration directory (i.e. /etc/ansible/hosts):
---
all:
  hosts:
3. Run ansible:
Where optional-extra-vars can be:
o <wipe_ceph_osds> set to wipe_ceph_osds=false (keep the existing Ceph cluster data; this is the default)
o The <backup_filename> is the platform backup tar file. It must be provided via the '-e' option on the command line, e.g. -e "backup_filename=localhost_platform_backup_<timestamp>.tgz"
o The <initial_backup_dir> is the location where the backup tar file is placed to be restored. It must be provided via the '-e' option on the command line.
o The <admin_password>, <ansible_become_pass> and <ansible_ssh_pass> need to be set correctly, either via the '-e' option on the command line or in an override file.
o The <ansible_remote_tmp> should be set to a new directory under /home/sysadmin on the target host (it does not need to be created ahead of time) via the '-e' option.
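As a sketch, a remote-play restore might look like the following (the playbook path, host name, inventory file and passwords are illustrative):
ansible-playbook <path-to-restore_platform.yml> --limit wc68 -i <inventory-file> \
    -e "initial_backup_dir=$HOME/backups backup_filename=wc68_platform_backup_<timestamp>.tgz \
        ansible_become_pass=<password> admin_password=<password> ansible_ssh_pass=<password> \
        wipe_ceph_osds=false"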
d. After the playbook has executed, the remaining steps depend on the deployment mode:
AIO-SX
~~~~~~~~~
1. Unlock controller-0 & wait for it to boot
AIO-DX
~~~~~~~~~
1. Unlock controller-0 & wait for it to boot
2. Reinstall controller-1 (boot it from PXE, wait for it to become 'online')
3. Unlock controller-1
Standard w/o controller storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Unlock controller-0 & wait for it to boot. After the unlock you will see all nodes, including storage nodes, as offline.
2. Reinstall controller-1 and compute nodes (boot them from PXE, wait for them to become 'online')
3. Unlock controller-1 and wait for it to be available
4. Unlock compute nodes and wait for them to be available
Standard with controller storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Unlock controller-0 & wait for it to boot. After the unlock you will see all nodes except the storage nodes as offline. The storage nodes have to be powered on and in 'available' state.
2. Reinstall controller-1 and the compute nodes (boot them from PXE, wait for them to become 'online')
3. Unlock controller-1 and wait for it to be available
4. Unlock compute nodes and wait for them to be available
5. (optional) Reinstall the storage nodes.
e. Re-apply applications (e.g. Openstack) to force pods to restart.
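For example, to re-apply the Openstack application:
system application-apply stx-openstack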
C. Openstack application B&R
----------------------------
In this procedure, only the Openstack application will be restored.
Steps:
Backup: the user runs the same backup.yml playbook as for options A and B. The backup tarballs have to be moved outside the cluster for safekeeping.
Note: when Openstack is running, the backup.yml playbook generates two output tarballs instead of one: a platform backup tarball and an Openstack backup tarball.
Restore:
a. Delete the old Openstack application [note that images and volumes will remain in Ceph] and upload the application again:
system application-remove stx-openstack
system application-delete stx-openstack
system application-upload stx-openstack-<version>.tgz
b. (optional – if the user wants to delete the data) Remove the old glance images and cinder volumes from the Ceph pools
c. Run the restore_openstack.yml playbook.
If you don't want to manipulate the Ceph data, execute the playbook with no extra data-restore flags.
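A sketch, assuming the standard StarlingX playbook location (the backup file name and passwords are placeholders):
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml \
    -e "initial_backup_dir=/home/sysadmin ansible_become_pass=<sysadmin password> \
        admin_password=<admin password> backup_filename=localhost_openstack_backup_<timestamp>.tgz"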
If you want to restore glance images and cinder volumes from external storage (i.e. step #b was executed), or to reconcile newer data in the glance and cinder volume pools with older data, the following steps must be executed:
- Run the restore_openstack playbook with the 'restore_cinder_glance_data' parameter set to true, as in the sketch below.
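A sketch of this first pass, assuming the standard playbook location (the file name and passwords are placeholders):
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml \
    -e "restore_cinder_glance_data=true initial_backup_dir=/home/sysadmin \
        ansible_become_pass=<sysadmin password> admin_password=<admin password> \
        backup_filename=localhost_openstack_backup_<timestamp>.tgz"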
- Restore glance images and cinder volumes using the image-backup.sh and tidy_storage_post_restore helper scripts.
The tidy storage script is used to detect any discrepancy between the Cinder/Glance DB and the rbd pools:
- If an image is in the Glance images DB but not in the rbd images pool, list the image and suggested actions to take in a log file.
- If an image is in the rbd images pool but not in the Glance images DB, create a Glance image in the Glance DB to associate with the backend data.
- If a volume is in the Cinder volumes DB but not in the rbd cinder-volumes pool, set the volume state to 'error'.
- If a volume is in the rbd cinder-volumes pool but not in the Cinder volumes DB, remove any snapshot(s) associated with this volume in the rbd pool and create a volume in the Cinder DB to associate with the backend data.
- If a volume is in both the Cinder volumes DB and the rbd cinder-volumes pool and it has snapshot(s) in the rbd pool, re-create the snapshot in the Cinder DB if it doesn't exist.
- If a snapshot is in the Cinder DB but not in the rbd pool, it will be deleted.
The image-backup.sh script is used to back up and restore glance images from the Ceph images pool.
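Rough usage sketches for both helper scripts; treat the exact arguments as assumptions and check each script's built-in usage text on the target system:
tidy_storage_post_restore <log_file>      # audit Cinder/Glance DB against the rbd pools (assumed argument)
image-backup.sh export <image-uuid>       # back up one glance image (assumed subcommand)
image-backup.sh import <backup-file>      # restore a previously exported image (assumed subcommand)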
- Run the playbook again with 'restore_openstack_continue' set to true to bring up the remaining Openstack services, as in the sketch below.
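A sketch of this final pass, assuming the standard playbook location (the file name and passwords are placeholders):
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_openstack.yml \
    -e "restore_openstack_continue=true initial_backup_dir=/home/sysadmin \
        ansible_become_pass=<sysadmin password> admin_password=<admin password> \
        backup_filename=localhost_openstack_backup_<timestamp>.tgz"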
stx.4.0 doc change