LVM Snapshot removal causes intermittent kernel panic

Bug #71567 reported by acutler
Affects         Status   Importance  Assigned to  Milestone
linux (Ubuntu)  Expired  Undecided   Unassigned

Bug Description

Binary package hint: lvm2

We have a script that automates the creation and removal of LVM snapshots on our VMware servers. Three times now we have had machines go down when the snapshot was removed. I have logs showing the machine going down immediately after the snapshot removal script fired (triggered and logged by our backup software).

This has occurred on both an HP Pavilion Desktop (uniprocessor, single disk) and a Sun Sunfire V60X (SMP, md raid1).

The crash leaves the LV in an inconsistent state with device nodes and snapshot names completely out of sync. On all occasions I have been able to recover the volume by following the steps below:

Ubuntu 6.06 LTS, LVM Hard Crash repair
--------------------------------------

observe kernel oops.
perform hard reset.
machine comes back up with md2 array dirty, starting background reconstruction.
fails on mounting partitions, boots to single-user mode.
login at console.
mount /usr
vi /etc/fstab
comment out snapshotted lvm partition (/vmware)
exit. System boots to multi user.
open ssh shell to system

**** some info before we begin

root@anvil:~# lvscan
  ACTIVE '/dev/vg_sys/lv_tmp' [4.00 GB] inherit
  ACTIVE '/dev/vg_sys/lv_swap' [4.00 GB] inherit
  ACTIVE '/dev/vg_sys/lv_var' [1.00 GB] inherit
  ACTIVE '/dev/vg_sys/lv_usr' [1.00 GB] inherit
  inactive Original '/dev/vg_sys/lv_vmware' [52.00 GB] contiguous
  inactive Snapshot '/dev/vg_sys/lv_vmware_snap' [5.00 GB] inherit

root@anvil:~# pvscan
  PV /dev/md2 VG vg_sys lvm2 [67.33 GB / 340.00 MB free]
  Total: 1 [67.33 GB] / in use: 1 [67.33 GB] / in no VG: 0 [0 ]

root@anvil:~# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[1]
      70605568 blocks [2/2] [UU]
      [=====>...............] resync = 26.8% (18967232/70605568) finish=18.0min speed=47688K/sec

md1 : active raid1 sda2[0] sdb2[1]
      979840 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      96256 blocks [2/2] [UU]

unused devices: <none>

root@anvil:~# uname -a
Linux anvil 2.6.15-27-server #1 SMP Sat Sep 16 02:57:21 UTC 2006 i686 GNU/Linux

root@anvil:~# ls /dev/mapper/
control vg_sys-lv_swap vg_sys-lv_tmp vg_sys-lv_usr vg_sys-lv_var vg_sys-lv_vmware vg_sys-lv_vmware-real

root@anvil:~# ls /dev/vg_sys/
lv_swap lv_tmp lv_usr lv_var

** let's repair the system

* create some missing device nodes
root@anvil:~# vgmknodes

* fix up the device mapper mess
root@anvil:~# mv /dev/mapper/vg_sys-lv_vmware /dev/mapper/vg_sys-lv_vmware_snap
root@anvil:~# mv /dev/mapper/vg_sys-lv_vmware-real /dev/mapper/vg_sys-lv_vmware

* check that our fs still exists
root@anvil:~# fsck /dev/mapper/vg_sys-lv_vmware
fsck 1.38 (30-Jun-2005)
e2fsck 1.38 (30-Jun-2005)
/dev/mapper/vg_sys-lv_vmware: recovering journal
/dev/mapper/vg_sys-lv_vmware: clean, 75/6815744 files, 10826547/13631488 blocks

* remove the snapshot
root@anvil:~# lvremove /dev/vg_sys/lv_vmware_snap
  Logical volume "lv_vmware_snap" successfully removed

* re-enable vmware lvm partition
root@anvil:~# vi /etc/fstab
root@anvil:~# touch /forcefsck
root@anvil:~# reboot
** system fscks, and boots normally.
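The mismatch shown before the repair (a vg_sys-lv_vmware entry under /dev/mapper but no lv_vmware node under /dev/vg_sys) can be spotted with a small check. The following is only a sketch: the function name list_missing_lv_nodes is hypothetical, it assumes VG and LV names contain no hyphens (device-mapper doubles hyphens inside names), and it skips the internal -real and -cow devices that LVM creates for snapshots.

```shell
#!/bin/sh
# list_missing_lv_nodes VG [MAPPER_DIR] [DEV_DIR]
# Prints each LV of VG that has a device-mapper node but no /dev/VG/LV node.
# Hypothetical helper for illustration; assumes no hyphens in VG/LV names.
list_missing_lv_nodes() {
    vg=$1
    mapper_dir=${2:-/dev/mapper}
    dev_dir=${3:-/dev}
    for node in "$mapper_dir/$vg"-*; do
        # The glob stays literal when nothing matches; skip that case.
        [ -e "$node" ] || continue
        name=${node##*/}        # e.g. vg_sys-lv_vmware
        lv=${name#"$vg"-}       # e.g. lv_vmware
        case "$lv" in
            *-real|*-cow) continue ;;   # internal snapshot devices
        esac
        [ -e "$dev_dir/$vg/$lv" ] || echo "$lv"
    done
}

# Example (names taken from the report):
#   list_missing_lv_nodes vg_sys
```

Against the listings above this would report lv_vmware. It only reads directory entries, so it is safe to run before attempting vgmknodes or any renaming.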

acutler (acutler) wrote :
acutler (acutler)
description: updated
acutler (acutler) wrote : 100% reproducible, see script

This is 100% reproducible for us. It seems that lvremove does not sync before removing the snapshot (or perhaps it is wrong to use --force when removing a snapshot). Calling 'sync' immediately before removing the snapshot seems to reduce the chance of a crash.
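A minimal sketch of that workaround follows. It is not the attached script: the helper names run and remove_snapshot are hypothetical, and the DRY_RUN switch is an addition so the sequence can be inspected without touching any volume. As later comments note, syncing lowers the crash rate but does not eliminate it.

```shell
#!/bin/sh
# Hypothetical helpers for illustration, not the reporter's script.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# remove_snapshot VG SNAPSHOT_LV
# Flushes dirty buffers, then force-removes the snapshot LV.
remove_snapshot() {
    vg=$1
    snap=$2
    run sync || return 1
    run lvremove -f "/dev/$vg/$snap"
}

# Example (names taken from the report):
#   DRY_RUN=1 remove_snapshot vg_sys lv_vmware_snap
```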

Tested with Ubuntu 6.06.1 i386, clean install Ubuntu server with
/dev/hda1 8G /,
/dev/hda2 32G VG vg_sys
and one lv, lv_vmware 20G

acutler (acutler) wrote : Red Hat ES 4.4 unaffected

For what it's worth, it seems Red Hat ES 4 is unaffected. The test script works as expected on RHES4 with lvm2-2.02.06-6.0.RHEL4, kernel 2.6.9-42.EL.

Ian Jackson (ijackson) wrote :

Since the machine crashes, this is a kernel bug. Can you confirm that you're using a standard Ubuntu kernel, and which version?

I don't think the --force is relevant. It's possible that running sync before lvremove would work around the problem - have you tried that?

acutler (acutler) wrote :

Hi Ian,
I've reproduced the crash with
Linux anvil 2.6.15-27-server #1 SMP Sat Sep 16 02:57:21 UTC 2006 i686 GNU/Linux
and
Linux ws75 2.6.15-26-386 #1 PREEMPT Thu Aug 3 02:52:00 UTC 2006 i686 GNU/Linux
both are bog standard 6.06 Ubuntu kernels (no custom funny business). Other versions are probably affected as well.

Adding sync to the script does seem to reduce the chance of a crash; however, I suspect it can still occur if the machine is sufficiently loaded.

Ben Collins (ben-collins) wrote :

Can I get a full dmesg for the machine?

I also need some information here: you mentioned VMware servers. Is this machine acting as a VMware host, and if so, what sort of VMware is it (ESX, Workstation, etc.)?

If in fact you are using vmware, is this bug reproducible without vmware loaded?

Changed in linux-source-2.6.17:
status: Unconfirmed → Needs Info
acutler (acutler) wrote :

I think this is somehow related to
https://launchpad.net/distros/ubuntu/+source/linux-source-2.6.17/+bug/58627

I've been able to reproduce that bug as well. Recovery protocol is exactly the same.

acutler (acutler) wrote :

Hi Ben,

I've been able to reproduce this on three different machines including one machine I specially prepared for testing. This bug is reproducible without vmware being installed.

If you install dapper with one PV, VG vg_sys and one lv lv_vmware (ext3) then you should be able to run my crash.sh script unmodified.

Let me know if you still require any logs, but I'm sure if you try you'll be able to reproduce this one easily.
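For anyone preparing such a test machine, the layout described above could be created roughly as follows. This is a sketch only and not part of crash.sh: the function name print_test_layout is hypothetical, the 20G size and the /dev/hda2 example come from the earlier test description, and the commands are printed rather than executed because they destroy whatever is on the partition.

```shell
#!/bin/sh
# print_test_layout PV_DEVICE
# Prints (does not run) the commands for one PV, VG vg_sys and an
# ext3-formatted LV lv_vmware. Hypothetical helper for illustration.
print_test_layout() {
    pv_dev=$1
    cat <<EOF
pvcreate $pv_dev
vgcreate vg_sys $pv_dev
lvcreate -L 20G -n lv_vmware vg_sys
mkfs.ext3 /dev/vg_sys/lv_vmware
EOF
}

# Example:
#   print_test_layout /dev/hda2
```

Review the printed commands and run them by hand on a scratch machine only.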

acutler (acutler) wrote :

dmesg irrelevant as this is reproducible on multiple machines, multiple kernels.

Changed in linux-source-2.6.15:
status: Needs Info → Unconfirmed
acutler (acutler) wrote :

Please find attached a full kernel trace showing "kernel BUG at drivers/md/kcopyd.c:145!" (Kernel 2.6.15-23-server)

I've built a 153M VMware machine which reproduces this panic every time. If anyone would like to take a look at it I will make the VM available for download.

(Please note this bug has nothing to do with VMware)

acutler (acutler) wrote :

I've reproduced this panic with a virgin (kernel.org) 2.6.15.7 (I reused the ubuntu 2.6.15-23-server config with make oldconfig). I could NOT reproduce the panic with 2.6.19.1. So logically methinks something was fixed along the way.
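Based on those two data points, a host can be flagged as possibly affected by comparing its kernel release against 2.6.19.1. This is a rough sketch: the function names are hypothetical, the threshold is only the first version reported here as not reproducing (the actual fix may have landed earlier), and it relies on the -V (version sort) option of GNU sort.

```shell
#!/bin/sh
# version_lt A B: succeeds when version string A sorts strictly before B.
# Requires a sort that supports -V (GNU coreutils).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# kernel_may_be_affected [RELEASE]
# RELEASE defaults to the running kernel. The 2.6.19.1 threshold is an
# assumption taken from this thread, not a confirmed fix version.
kernel_may_be_affected() {
    release=${1:-$(uname -r)}
    version_lt "$release" 2.6.19.1
}

# Example:
#   kernel_may_be_affected && echo "kernel predates 2.6.19.1"
```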

Any pointers for what to do next? Anyone want to lend a hand in nailing this bug?

Jared (ubuntu-redjar) wrote : Similar Bug on Red Hat 4.4 and CentOS 4

I just experienced this bug on Ubuntu Dapper. While searching, I came across these similar sounding bugs for Red Hat:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=204791

and CentOS:
http://bugs.centos.org/view.php?id=1634

Dagfinn Ilmari Mannsåker (ilmari) wrote : Oops when removing snapshot

I got the attached oops when removing a snapshot of a MySQL binary log volume; it might be related.

kernel version 2.6.15-27-amd64-server #1 SMP Sat Sep 16 02:04:37 UTC 2006 x86_64 GNU/Linux

nentis (krisa-opensourcery) wrote :

I encountered this bug with the new Dapper kernel that was pushed last week. I have also attached the script I created which does the snapshots.

------------[ cut here ]------------
kernel BUG at drivers/md/kcopyd.c:145!
invalid operand: 0000 [#1]
SMP
Modules linked in: ipt_ULOG ip_tables vmxnet vmhgfs dm_snapshot dm_mod lp ipv6 tsdev parport_pc floppy parport i2c_piix4 i2c_core psmouse serio_raw pcnet32 mii pcspkr intel_agp shpchp pci_hotplug agpgart sg evdev reiserfs ide_generic sd_mod mptspi mptscsih mptbase scsi_mod ide_cd cdrom piix generic thermal processor fan capability commoncap vga16fb vgastate fbcon tileblit font bitblit softcursor
CPU: 0
EIP: 0060:[pg0+946189859/1069167616] Tainted: P VLI
EFLAGS: 00010287 (2.6.15-28-server)
EIP is at client_free_pages+0x33/0x40 [dm_mod]
eax: 00000100 ebx: df84b900 ecx: c2378ac0 edx: 00000000
esi: f8b38080 edi: 00000000 ebp: 00000000 esp: d9cbfec0
ds: 007b es: 007b ss: 0068
Process lvremove (pid: 3380, threadinfo=d9cbe000 task=f3eb0a90)
Stack: f8ab5b3f df84b900 f8ab71ab df84b900 f4840bc0 f8a74a8e df84b900 c2378900
       f8b38080 f1076980 f8ab10f7 f8b38080 f1076980 f4012ec0 00000004 d9cbe000
       f8ab3619 f1076980 00000000 f8abdbe0 f8a84000 f8ab3ec7 f4012ec0 00000000
Call Trace:
 [pg0+946187071/1069167616] dm_io_put+0xf/0x30 [dm_mod]
 [pg0+946192811/1069167616] kcopyd_client_destroy+0x1b/0x32 [dm_mod]
 [pg0+945920654/1069167616] snapshot_dtr+0x6e/0x80 [dm_snapshot]
 [pg0+946168055/1069167616] table_destroy+0x47/0xa0 [dm_mod]
 [pg0+946177561/1069167616] __hash_remove+0x79/0xa0 [dm_mod]
 [pg0+946179783/1069167616] dev_remove+0x47/0xd0 [dm_mod]
 [pg0+946185980/1069167616] ctl_ioctl+0x10c/0x160 [dm_mod]
 [pg0+946179712/1069167616] dev_remove+0x0/0xd0 [dm_mod]
 [do_ioctl+147/160] do_ioctl+0x93/0xa0
 [vfs_ioctl+107/560] vfs_ioctl+0x6b/0x230
 [sys_ioctl+136/160] sys_ioctl+0x88/0xa0
 [sysenter_past_esp+84/117] sysenter_past_esp+0x54/0x75
Code: 8b 43 10 39 43 14 75 23 8b 43 0c 89 04 24 e8 35 ff ff ff c7 43 0c 00 00 00 00 c7 43 10 00 00 00 00 c7 43 14 00 00 00 00 58 5b c3 <0f> 0b 91 00 e4 74 ab f8 eb d3 8d 76 00 83 ec 18 31 c0 89 44 24
 <1>Unable to handle kernel NULL pointer dereference at virtual address 00000010
 printing eip:
c01527b9
*pde = 00401001
Oops: 0000 [#2]
SMP
Modules linked in: ipt_ULOG ip_tables vmxnet vmhgfs dm_snapshot dm_mod lp ipv6 tsdev parport_pc floppy parport i2c_piix4 i2c_core psmouse serio_raw pcnet32 mii pcspkr intel_agp shpchp pci_hotplug agpgart sg evdev reiserfs ide_generic sd_mod mptspi mptscsih mptbase scsi_mod ide_cd cdrom piix generic thermal processor fan capability commoncap vga16fb vgastate fbcon tileblit font bitblit softcursor
CPU: 0
EIP: 0060:[mempool_alloc+41/256] Tainted: P VLI
EFLAGS: 00010206 (2.6.15-28-server)
EIP is at mempool_alloc+0x29/0x100
eax: 00000000 ebx: 00000001 ecx: 00000010 edx: 00011200
esi: 00000000 edi: 00011210 ebp: 0000001c esp: c6b7de88
ds: 007b es: 007b ss: 0068
Process kcopyd (pid: 12062, threadinfo=c6b7c000 task=f3dfca90)
Stack: 00000000 ee04f460 e6a9e394 00000001 f8ab5c80 f8ab5cb0 00000000 ee2e5620
       00000001 e6a9e484 00000001 f8ab67b0 f8ab61d6 00000000 00000010 c200d060...
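To check whether a saved kernel log contains the same signature, a fixed-string search for the BUG line is enough. The sketch below uses a hypothetical function name, find_kcopyd_bug, and assumes the log was saved to a file first (for example with dmesg or from a serial console capture). It matches only the kcopyd.c BUG line, not other oopses in the snapshot code.

```shell
#!/bin/sh
# find_kcopyd_bug LOGFILE
# Prints matching lines with their line numbers; returns non-zero if the
# signature is absent. Hypothetical helper for illustration.
find_kcopyd_bug() {
    grep -n -F 'kernel BUG at drivers/md/kcopyd.c' "$1"
}

# Example:
#   dmesg > /tmp/dmesg.txt && find_kcopyd_bug /tmp/dmesg.txt
```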


Serge van Ginderachter (svg) wrote :

Having the same issue..

Two separate hosts, running Dapper, with LVM over MD, running Zimbra on a separate /opt LVM partition.
Crashes are sometimes hard panics, sometimes soft (dmesg shows a panic, some processes keep running, some become defunct, ...)
The crash happens at the lvremove stage, in a script similar to the ones posted above.

It got better (happens less) when syncing disks just before the lvremove, but it didn't stop happening. Sometimes the crash happens after the sync command and before the lvremove command.

One relevant dmesg output after a soft panic:

[43045674.210000] Unable to handle kernel paging request at virtual address f8ab3004
[43045674.250000] printing eip:
[43045674.260000] f8aa84e2
[43045674.260000] *pde = 00000000
[43045674.280000] Oops: 0000 [#1]
[43045674.280000] SMP
[43045674.280000] Modules linked in: dm_snapshot usb_storage af_packet ppdev ipv6 dm_mod lp ide_floppy i2c_piix4 serio_raw parport_pc psmouse parport i2c_core pcspkr floppy e100 sworks_agp mii agpgart sg evdev ext3 jbd raid1 md_mod ide_generic ehci_hcd ohci_hcd uhci_hcd usbcore 3w_xxxx sd_mod aic7xxx scsi_transport_spi2 scsi_mod ide_cd cdrom serverworks generic thermal processor fan capability commoncap vga16fb vgastate fbcon tileblit font bitblit softcursor
[43045674.280000] CPU: 0
[43045674.280000] EIP: 0060:[<f8aa84e2>] Not tainted VLI
[43045674.280000] EFLAGS: 00010202 (2.6.15-29-server)
[43045674.280000] EIP is at persistent_commit+0x102/0x130 [dm_snapshot]
[43045674.280000] eax: 00000001 ebx: d20bf3e0 ecx: 00000000 edx: f8ab3000
[43045674.280000] esi: 00000001 edi: 00000000 ebp: f8aa6e00 esp: d11c7ed4
[43045674.280000] ds: 007b es: 007b ss: 0068
[43045674.280000] Process kcopyd (pid: 6553, threadinfo=d11c6000 task=ce7fba90)
[43045674.280000] Stack: d20bf3e0 00000001 00000001 0000120e 00000000 00000374 00000000 d205077c
[43045674.280000] 00000000 d357feb8 f8aa6e33 d2b990fc d357feb8 f8aa6df0 d357feb8 f8a7f793
[43045674.280000] 00000000 00000000 d357feb8 00000000 00000246 d205077c f8a86d60 00000000
[43045674.280000] Call Trace:
[43045674.280000] [<f8aa6e33>] copy_callback+0x33/0x50 [dm_snapshot]
[43045674.280000] [<f8aa6df0>] commit_callback+0x0/0x10 [dm_snapshot]
[43045674.280000] [<f8a7f793>] run_complete_job+0x63/0x80 [dm_mod]
[43045674.280000] [<f8a7f9b7>] process_jobs+0x17/0xe0 [dm_mod]
[43045674.280000] [<f8a7fa98>] do_work+0x18/0x50 [dm_mod]
[43045674.280000] [<f8a7f730>] run_complete_job+0x0/0x80 [dm_mod]
[43045674.280000] [<c0136fa3>] worker_thread+0x1b3/0x270
[43045674.280000] [<f8a7fa80>] do_work+0x0/0x50 [dm_mod]
[43045674.280000] [<c011f850>] default_wake_function+0x0/0x20
[43045674.280000] [<c0136df0>] worker_thread+0x0/0x270
[43045674.280000] [<c013bc18>] kthread+0xc8/0xd0
[43045674.280000] [<c013bb50>] kthread+0x0/0xd0
[43045674.280000] [<c0101505>] kernel_thread_helper+0x5/0x10
[43045674.280000] Code: c4 1c 5b 5e 5f c3 8d 76 00 8b 53 28 c7 43 08 00 00 00 00 85 d2 74 b7 31 f6 8b 43 2c 8d 14 f0 31 c0 85 ff 0f 94 c0 46 89 44 24 04 <8b> 42 04 89 04 24 ff 12 39 73 28 77 e1 8b 43 10 39 43 20 c7 43
[43045674.280000]

Serge van Ginderachter (svg) wrote :

As this issue probably will never be resolved, I tried out another idea: installing the latest edgy kernel on dapper.

~# wget http://be.archive.ubuntu.com/ubuntu/pool/main/l/linux-source-2.6.17/linux-image-2.6.17-12-server_2.6.17.1-12.42_i386.deb
~# dpkg -i linux-image-2.6.17-12-server_2.6.17.1-12.42_i386.deb

This installs and boots without any problems. Hopefully this LVM bug is resolved in this kernel.

Julius Bloch (jbloch)
Changed in linux-source-2.6.15:
status: New → Confirmed
Serge van Ginderachter (svg) wrote :

An update, 4 months later: the problem has occurred just one or two times since then, which is a whole lot less.
I can't confirm precisely if those crashes are the same bug.

Launchpad Janitor (janitor) wrote : This bug is now reported against the 'linux' package

Beginning with the Hardy Heron 8.04 development cycle, all open Ubuntu kernel bugs need to be reported against the "linux" kernel package. We are automatically migrating this linux-source-2.6.15 kernel bug to the new "linux" package. We appreciate your patience and understanding as we make this transition. Also, if you would be interested in testing the upcoming Intrepid Ibex 8.10 release, it is available at http://www.ubuntu.com/testing . Please let us know your results. Thanks!

Leann Ogasawara (leannogasawara) wrote :

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Leann Ogasawara (leannogasawara) wrote :

Wow, this bug was reported quite some time ago. I'd be curious to know if anyone still experiences these panics with any of the newer Ubuntu kernels. For example the upcoming Karmic release contains a 2.6.31 based kernel - http://cdimage.ubuntu.com/releases/karmic/ .

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
nentis (krisa-opensourcery) wrote :

Leann,

I use LVM on every host build and I have not had issues with any version after Dapper. This bug should probably be closed with a 'will not fix' tag.

Jeremy Foshee (jeremyfoshee) wrote :

This bug report was marked as Incomplete and has not had any updated comments for quite some time. As a result this bug is being closed. Please reopen if this is still an issue in the current Ubuntu release http://www.ubuntu.com/getubuntu/download . Also, please be sure to provide any requested information that may have been missing. To reopen the bug, click on the current status under the Status column and change the status back to "New". Thanks.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-expired
Changed in linux (Ubuntu):
status: Incomplete → Expired