Reset environment must be less brutal

Bug #1597359 reported by Alexander Bozhenko
This bug affects 1 person
Affects             Status     Importance  Assigned to      Milestone
Fuel for OpenStack  Invalid    Medium      Fuel Sustaining
6.1.x               Won't Fix  Medium      MOS Maintenance
7.0.x               Won't Fix  Medium      MOS Maintenance
8.0.x               Won't Fix  Medium      MOS Maintenance
Mitaka              Invalid    Critical    Sergii Rizvan

Bug Description

So:
1) A customer accidentally reset the environment.
2) Half of the support team then spent tens of hours on WebEx trying to restore it, because Fuel wipes the first several megabytes of each disk during a reset.
This is not the first time a customer's production environment has been reset.

The code in all versions 6.0-8.0 erases several megabytes of disk, which causes data loss.
The current (master) code wipes only the MBR
https://github.com/openstack/fuel-astute/blob/master/mcagents/erase_node.rb
(after this commit: https://github.com/openstack/fuel-astute/commit/e770d4ec7d302e958ffae8db87e633e9d5e3db91)
which leaves the possibility of restoring the data (see comment #6).

Ask:
1) backport this code to previous releases via an MU, to prevent the possibility of data loss for existing customers;
2) create backups of the MBR and store them on the Fuel master before erasing an environment (https://unix.stackexchange.com/a/87657); see the sketch below.
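
For illustration only, a minimal sketch of such a backup following the dd approach from the linked answer (the device name /dev/vda, the node count, and the /root/mbr_backups directory are assumptions, not a proposed implementation):
for i in $(seq 1 5); do ssh node-$i 'dd if=/dev/vda of=/tmp/$(hostname -s)_mbr.img bs=512 count=1' ; done
mkdir -p /root/mbr_backups; for i in $(seq 1 5); do scp node-$i:/tmp/*_mbr.img /root/mbr_backups/ ; done
And to put the first sector back after an accidental reset:
scp /root/mbr_backups/node-1_mbr.img node-1:/root/ && ssh node-1 'dd if=/root/node-1_mbr.img of=/dev/vda bs=512 count=1'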

Related conversation:
http://lists.openstack.org/pipermail/openstack-dev/2016-March/090105.html

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

On 8.0 I tried to back up the GPT and LVM metadata, and even with those backups the restoration does not work very well.
Make a GPT backup:
[root@fuel ~]# for i in $(seq 1 5); do ssh node-$i 'sgdisk /dev/vda --backup=/tmp/$(hostname -s)_vda_backup' ; done
The operation has completed successfully.
The operation has completed successfully.
The operation has completed successfully.
The operation has completed successfully.
The operation has completed successfully.
Copy them to the master node:
for i in $(seq 1 5); do scp node-$i:/tmp/*_vda_backup . ; done

Make an LVM metadata backup:
for i in $(seq 1 5); do scp -r node-$i:/etc/lvm/backup node-${i}_lvm_backup/ ; done

[root@fuel ~]# ls -lad *backup
drwx------ 2 root root 4096 Jun 29 12:10 node-1_lvm_backup
-rw-r--r-- 1 root root 17920 Jun 29 12:07 node-1_vda_backup
drwx------ 2 root root 4096 Jun 29 12:10 node-2_lvm_backup
-rw-r--r-- 1 root root 17920 Jun 29 12:07 node-2_vda_backup
drwx------ 2 root root 4096 Jun 29 12:10 node-3_lvm_backup
-rw-r--r-- 1 root root 17920 Jun 29 12:07 node-3_vda_backup
drwx------ 2 root root 4096 Jun 29 12:10 node-4_lvm_backup
-rw-r--r-- 1 root root 17920 Jun 29 12:07 node-4_vda_backup
drwx------ 2 root root 4096 Jun 29 12:10 node-5_lvm_backup
-rw-r--r-- 1 root root 17920 Jun 29 12:07 node-5_vda_backup

After that I reset the environment and copied the backups back:
[root@fuel ~]# for i in $(seq 1 5); do scp node-${i}_vda_backup node-${i}:/root/ ; done
[root@fuel ~]# for i in $(seq 1 5); do scp -r node-${i}_lvm_backup node-${i}:/etc/lvm/backup ; done

After that I restored the GPT:
[root@fuel ~]# for i in $(seq 1 5); do ssh node-$i "sgdisk --load-backup=/root/node-${i}_vda_backup /dev/vda" ; done

LVM restoration was a little harder; I managed to restore the root partition, which is ext4, but the data on the XFS partitions (/var/lib/nova and the Ceph OSDs) was lost.
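
For reference, a rough sketch of the LVM side of that restore using the standard tools against the copied /etc/lvm/backup files (the VG name "os", the PV /dev/vda3, and the placeholder UUID are assumptions about the layout, not values from this environment):
pvcreate --uuid "<PV-UUID-from-backup-file>" --restorefile /etc/lvm/backup/os /dev/vda3
vgcfgrestore -f /etc/lvm/backup/os os
vgchange -ay os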

description: updated
Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

OK, on 8.0 it does the following:
http://paste.openstack.org/show/524104/

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

With the latest code from here:
https://raw.githubusercontent.com/openstack/fuel-astute/master/mcagents/erase_node.rb
pasted into /usr/share/mcollective/plugins/mcollective/agent/erase_node.rb
on the target node, it does the following:
http://paste.openstack.org/show/524105/
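
For anyone reproducing this, a rough sketch of dropping the upstream agent onto a node (restarting the mcollective service so the replaced agent is picked up is an assumption, not something verified here):
wget -O /usr/share/mcollective/plugins/mcollective/agent/erase_node.rb https://raw.githubusercontent.com/openstack/fuel-astute/master/mcagents/erase_node.rb
service mcollective restart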

Dmitry Klenov (dklenov)
Changed in fuel:
milestone: none → 10.0
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
tags: added: area-library
Changed in fuel:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

Update to comment #3:
this is what is actually executed with the new code:
http://paste.openstack.org/show/524110/

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

And /dev/mapper contains symlinks to the partitions that hold the actual data:
root@node-2:/usr/share/mcollective/plugins/mcollective/agent# ll /dev/mapper/
total 0
drwxr-xr-x 2 root root 160 Jun 29 17:15 ./
drwxr-xr-x 17 root root 4260 Jun 29 17:44 ../
crw------- 1 root root 10, 236 Jun 29 17:16 control
lrwxrwxrwx 1 root root 7 Jun 29 17:44 horizon-horizontmp -> ../dm-0
lrwxrwxrwx 1 root root 7 Jun 29 17:44 logs-log -> ../dm-2
lrwxrwxrwx 1 root root 7 Jun 29 17:44 mysql-root -> ../dm-1
lrwxrwxrwx 1 root root 7 Jun 29 17:44 os-root -> ../dm-3
lrwxrwxrwx 1 root root 7 Jun 29 17:44 os-swap -> ../dm-4

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

OK, the upstream code, which erases just 446+2 bytes, is actually much better.
After resetting a node with it, I was able to restore the node using
sgdisk --load-backup=/root/node-3_vda_backup /dev/vda

After that I rebooted to the node's disk, and it booted as if it had never been reset!
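
To be clear about what "446+2 bytes" means: only the MBR boot code (bytes 0-445) and the boot signature (bytes 510-511) get zeroed; the MBR partition table (bytes 446-509) and the GPT at LBA 1 stay intact, which is why the sgdisk restore works. A dd equivalent of that wipe would look roughly like this (a sketch of the effect, not the actual erase_node.rb code):
dd if=/dev/zero of=/dev/vda bs=446 count=1
dd if=/dev/zero of=/dev/vda bs=1 count=2 seek=510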

summary: - Reset environment should be less brutal
+ Reset environment must be less brutal
description: updated
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Closing as Won't Fix, as this is a medium-importance, non-customer-found bug.

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

Oh, no. It IS a customer-found bug. The customer found it by destroying a prod cluster.
I do not know whether it must be backported in an MU, though. If it is not risky, then I suppose it should be.

Roman Rufanov (rrufanov)
tags: added: customer-found
tags: added: support
Revision history for this message
Roman Rufanov (rrufanov) wrote :

It is customer-found; we have had a third instance of this in prod; it needs to be fixed.

description: updated
Revision history for this message
Sergii Rizvan (srizvan) wrote :

I'm about to close this bug as Invalid because:
> Ask:
> 1) backport this code to previous releases via an MU, to prevent the possibility of data loss for existing customers

Actually, this code is already in 9.1 and the other releases (7.0, 8.0).

> 2) create backups of the MBR and store them on the Fuel master before erasing an environment (https://unix.stackexchange.com/a/87657)
This is actually a feature request.

As a workaround, I can propose disabling the reset feature in the Nailgun API with the attached patch. To do this, just apply the patch and restart the nailgun service on the Fuel master node:

patch -p1 -d /usr/lib/python2.7/site-packages/ < nailgun_api_reset.patch
systemctl restart nailgun.service
