Reset environment must be less brutal

Bug #1597359 reported by Alexander Bozhenko
This bug affects 1 person
Affects             Status     Importance  Assigned to      Milestone
Fuel for OpenStack  Invalid    Medium      Fuel Sustaining
6.1.x               Won't Fix  Medium      MOS Maintenance
7.0.x               Won't Fix  Medium      MOS Maintenance
8.0.x               Won't Fix  Medium      MOS Maintenance
Mitaka              Invalid    Critical    Sergii Rizvan

Bug Description

So:
1) A customer accidentally reset the environment.
2) Half of the support team then spent tens of hours on WebEx trying to restore it, because Fuel wipes the first several megabytes of each disk during a reset.
This is not the first time a customer's production environment has been reset.

The code in all versions 6.0-8.0 erases several megabytes of disk, which causes data loss.
The current (master) code wipes only the MBR
https://github.com/openstack/fuel-astute/blob/master/mcagents/erase_node.rb
(after this commit: https://github.com/openstack/fuel-astute/commit/e770d4ec7d302e958ffae8db87e633e9d5e3db91)
which leaves the possibility of restoring the data (see comment #6).

Ask:
1) backport this code to previous releases via an MU, to prevent the possibility of data loss for existing customers;
2) create backups of the MBR and store them on the Fuel master before erasing an environment (https://unix.stackexchange.com/a/87657); see the sketch below.
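
For illustration only, a minimal sketch of such a backup following the dd approach from the linked answer (the device name /dev/vda, the node count, and the /root/mbr_backups directory are assumptions, not a proposed implementation):
for i in $(seq 1 5); do ssh node-$i 'dd if=/dev/vda of=/tmp/$(hostname -s)_mbr.img bs=512 count=1' ; done
mkdir -p /root/mbr_backups; for i in $(seq 1 5); do scp node-$i:/tmp/*_mbr.img /root/mbr_backups/ ; done
And to put the first sector back after an accidental reset:
scp /root/mbr_backups/node-1_mbr.img node-1:/root/ && ssh node-1 'dd if=/root/node-1_mbr.img of=/dev/vda bs=512 count=1'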

Related conversation:
http://lists.openstack.org/pipermail/openstack-dev/2016-March/090105.html

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

On 8.0 I tried to back up the GPT and LVM metadata, and even with those backups the restoration does not work very well.
Make a GPT backup:
[root@fuel ~]# for i in $(seq 1 5); do ssh node-$i 'sgdisk /dev/vda --backup=/tmp/$(hostname -s)_vda_backup' ; done
The operation has completed successfully.
The operation has completed successfully.
The operation has completed successfully.
The operation has completed successfully.
The operation has completed successfully.
Copy them to the master node:
for i in $(seq 1 5); do scp node-$i:/tmp/*_vda_backup . ; done

Make an LVM metadata backup:
for i in $(seq 1 5); do scp -r node-$i:/etc/lvm/backup node-${i}_lvm_backup/ ; done

[root@fuel ~]# ls -lad *backup
drwx------ 2 root root 4096 Jun 29 12:10 node-1_lvm_backup
-rw-r--r-- 1 root root 17920 Jun 29 12:07 node-1_vda_backup
drwx------ 2 root root 4096 Jun 29 12:10 node-2_lvm_backup
-rw-r--r-- 1 root root 17920 Jun 29 12:07 node-2_vda_backup
drwx------ 2 root root 4096 Jun 29 12:10 node-3_lvm_backup
-rw-r--r-- 1 root root 17920 Jun 29 12:07 node-3_vda_backup
drwx------ 2 root root 4096 Jun 29 12:10 node-4_lvm_backup
-rw-r--r-- 1 root root 17920 Jun 29 12:07 node-4_vda_backup
drwx------ 2 root root 4096 Jun 29 12:10 node-5_lvm_backup
-rw-r--r-- 1 root root 17920 Jun 29 12:07 node-5_vda_backup

After that I reset the environment and copied the backups back:
[root@fuel ~]# for i in $(seq 1 5); do scp node-${i}_vda_backup node-${i}:/root/ ; done
[root@fuel ~]# for i in $(seq 1 5); do scp -r node-${i}_lvm_backup node-${i}:/etc/lvm/backup ; done

After that I restored the GPT:
[root@fuel ~]# for i in $(seq 1 5); do ssh node-$i "sgdisk --load-backup=/root/node-${i}_vda_backup /dev/vda" ; done

LVM restoration was a little harder; I managed to restore the root partition, which is ext4, but the data on the XFS partitions (/var/lib/nova and the Ceph OSDs) was lost.
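
For reference, a rough sketch of the LVM side of that restore using the standard tools against the copied /etc/lvm/backup files (the VG name "os", the PV /dev/vda3, and the placeholder UUID are assumptions about the layout, not values from this environment):
pvcreate --uuid "<PV-UUID-from-backup-file>" --restorefile /etc/lvm/backup/os /dev/vda3
vgcfgrestore -f /etc/lvm/backup/os os
vgchange -ay os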

description: updated
Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

OK, on 8.0 it does the following:
http://paste.openstack.org/show/524104/

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

With the latest code from here:
https://raw.githubusercontent.com/openstack/fuel-astute/master/mcagents/erase_node.rb
pasted into /usr/share/mcollective/plugins/mcollective/agent/erase_node.rb
on the target node, it does the following:
http://paste.openstack.org/show/524105/
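
For anyone reproducing this, a rough sketch of dropping the upstream agent onto a node (restarting the mcollective service so the replaced agent is picked up is an assumption, not something verified here):
wget -O /usr/share/mcollective/plugins/mcollective/agent/erase_node.rb https://raw.githubusercontent.com/openstack/fuel-astute/master/mcagents/erase_node.rb
service mcollective restart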

Dmitry Klenov (dklenov)
Changed in fuel:
milestone: none → 10.0
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
tags: added: area-library
Changed in fuel:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

Update to comment #3:
this is what is actually executed with the new code:
http://paste.openstack.org/show/524110/

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

And /dev/mapper contains symlinks to the partitions that hold the actual data:
root@node-2:/usr/share/mcollective/plugins/mcollective/agent# ll /dev/mapper/
total 0
drwxr-xr-x 2 root root 160 Jun 29 17:15 ./
drwxr-xr-x 17 root root 4260 Jun 29 17:44 ../
crw------- 1 root root 10, 236 Jun 29 17:16 control
lrwxrwxrwx 1 root root 7 Jun 29 17:44 horizon-horizontmp -> ../dm-0
lrwxrwxrwx 1 root root 7 Jun 29 17:44 logs-log -> ../dm-2
lrwxrwxrwx 1 root root 7 Jun 29 17:44 mysql-root -> ../dm-1
lrwxrwxrwx 1 root root 7 Jun 29 17:44 os-root -> ../dm-3
lrwxrwxrwx 1 root root 7 Jun 29 17:44 os-swap -> ../dm-4

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

OK, the upstream code, which erases just 446+2 bytes, is actually much better.
After resetting a node with it, I was able to restore the node using
sgdisk --load-backup=/root/node-3_vda_backup /dev/vda

After that I rebooted to the node's disk, and it booted as if it had never been reset!
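
To be clear about what "446+2 bytes" means: only the MBR boot code (bytes 0-445) and the boot signature (bytes 510-511) get zeroed; the MBR partition table (bytes 446-509) and the GPT at LBA 1 stay intact, which is why the sgdisk restore works. A dd equivalent of that wipe would look roughly like this (a sketch of the effect, not the actual erase_node.rb code):
dd if=/dev/zero of=/dev/vda bs=446 count=1
dd if=/dev/zero of=/dev/vda bs=1 count=2 seek=510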

summary: - Reset environment should be less brutal
+ Reset environment must be less brutal
description: updated
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Closing as Won't Fix, as this is a medium-importance, non-customer-found bug.

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

Oh, no. It IS a customer-found bug. The customer found it by destroying a prod cluster.
I do not know whether it must be backported in an MU, though. If it is not risky, then I suppose it should be.

Roman Rufanov (rrufanov)
tags: added: customer-found
tags: added: support
Revision history for this message
Roman Rufanov (rrufanov) wrote :

It is customer-found; we have had a third instance of this in prod; it needs to be fixed.

description: updated
Revision history for this message
Sergii Rizvan (srizvan) wrote :

I'm about to close this bug as Invalid because:
> Ask:
> 1) backport this code to previous releases via an MU, to prevent the possibility of data loss for existing customers

Actually, this code is already in 9.1 and the other releases (7.0, 8.0).

> 2) create backups of the MBR and store them on the Fuel master before erasing an environment (https://unix.stackexchange.com/a/87657)
This is actually a feature request.

As a workaround, I can propose disabling the reset feature in the Nailgun API with the attached patch. To do this, just apply the patch and restart the nailgun service on the Fuel master node:

patch -p1 -d /usr/lib/python2.7/site-packages/ < nailgun_api_reset.patch
systemctl restart nailgun.service
