LVM Snapshot removal causes intermittent kernel panic
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Expired | Undecided | Unassigned |
Bug Description
Binary package hint: lvm2
We have a script that automates the creation and removal of LVM snapshots on our VMware servers. Three times now we have had machines go down when the snapshot was removed. I have logs showing the machine going down immediately after the snapshot removal script fired (triggered and logged by our backup software).
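For context, the script does roughly the following. This is a minimal sketch rather than our production script; the snapshot size, the mount point /mnt/snap, and the use of a read-only mount are stand-ins, while vg_sys/lv_vmware match the volume layout described further down.

#!/bin/sh
# create a snapshot of the VMware LV (size and names here are placeholders)
lvcreate --snapshot --size 5G --name lv_vmware_snap /dev/vg_sys/lv_vmware

# mount it read-only so the backup software can copy a consistent image
mount -o ro /dev/vg_sys/lv_vmware_snap /mnt/snap
# ... backup runs here ...
umount /mnt/snap

# remove the snapshot -- this is the step after which the machine panics
lvremove --force /dev/vg_sys/lv_vmware_snap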
This has occurred on both an HP Pavilion Desktop (uniprocessor, single disk) and a Sun Sunfire V60X (SMP, md raid1).
The crash leaves the LV in an inconsistent state with device nodes and snapshot names completely out of sync. On all occasions I have been able to recover the volume by following the steps below:
Ubuntu 6.06 LTS, LVM Hard Crash repair
-------
observe kernel oops.
perform hard reset.
machine comes back up with md2 array dirty, starting background reconstruction.
fails on mounting partitions, boots to single-user mode.
login at console.
mount /usr
vi /etc/fstab
comment out snapshotted lvm partition (/vmware)
exit. System boots to multi-user mode.
open ssh shell to system
**** some info before we begin
root@anvil:~# lvscan
ACTIVE '/dev/vg_
ACTIVE '/dev/vg_
ACTIVE '/dev/vg_
ACTIVE '/dev/vg_
inactive Original '/dev/vg_
inactive Snapshot '/dev/vg_
root@anvil:~# pvscan
PV /dev/md2 VG vg_sys lvm2 [67.33 GB / 340.00 MB free]
Total: 1 [67.33 GB] / in use: 1 [67.33 GB] / in no VG: 0 [0 ]
root@anvil:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[1]
70605568 blocks [2/2] [UU]
[
md1 : active raid1 sda2[0] sdb2[1]
979840 blocks [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
96256 blocks [2/2] [UU]
unused devices: <none>
root@anvil:~# uname -a
Linux anvil 2.6.15-27-server #1 SMP Sat Sep 16 02:57:21 UTC 2006 i686 GNU/Linux
root@anvil:~# ls /dev/mapper/
control vg_sys-lv_swap vg_sys-lv_tmp vg_sys-lv_usr vg_sys-lv_var vg_sys-lv_vmware vg_sys-
root@anvil:~# ls /dev/vg_sys/
lv_swap lv_tmp lv_usr lv_var
** let's repair the system
* create some missing device nodes
root@anvil:~# vgmknodes
* fix up the device mapper mess
root@anvil:~# mv /dev/mapper/
root@anvil:~# mv /dev/mapper/
* check that our fs still exists
root@anvil:~# fsck /dev/mapper/
fsck 1.38 (30-Jun-2005)
e2fsck 1.38 (30-Jun-2005)
/dev/mapper/
/dev/mapper/
* remove the snapshot
root@anvil:~# lvremove /dev/vg_
Logical volume "lv_vmware_snap" successfully removed
* re-enable the vmware lvm partition
root@anvil:~# vi /etc/fstab
root@anvil:~# touch /forcefsck
root@anvil:~# reboot
** system fscks, and boots normally.
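As a quick sanity check after the reboot (not part of the original procedure, just repeating the listings from the info section above), the volume group should now look healthy:

root@anvil:~# lvscan           # every LV should show as ACTIVE, with no inactive Original/Snapshot entries
root@anvil:~# ls /dev/mapper/  # mapper nodes should match the LV names, with no leftover snapshot nodes
root@anvil:~# ls /dev/vg_sys/  # should list the same set of logical volumes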
description: updated
Changed in linux-source-2.6.15:
status: New → Confirmed
This is 100% reproducible for us. It seems that lvremove is not syncing before removing the snapshot (or perhaps it is wrong to --force removal of a snapshot). Calling 'sync' immediately before removing the snapshot seems to reduce the chance of a crash; a sketch of that workaround follows the test configuration below.
Tested with Ubuntu 6.06.1 i386, a clean Ubuntu Server install with
/dev/hda1 8G /,
/dev/hda2 32G VG vg_sys
and one lv, lv_vmware 20G
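For anyone hitting the same problem, this is the shape of the workaround; a minimal sketch assuming the vg_sys/lv_vmware layout of the test setup above and the snapshot name reported by lvremove earlier:

# flush dirty buffers to disk before touching the snapshot
sync

# with sync run first, removing the snapshot has not panicked the box for us
lvremove --force /dev/vg_sys/lv_vmware_snap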