lvm snapshot causes deadlock in 2.6.35

Reported by Phillip Susi on 2010-06-17
This bug affects 21 people
Affects          Status         Importance   Assigned to    Milestone
Linux            Fix Released   Medium
linux (Ubuntu)                  High         Phillip Susi
Lucid                           High         Stefan Bader
Maverick                        High         Phillip Susi

Bug Description

Attempting to snapshot the root lv causes a deadlock in 2.6.35 when it suspends the root lv device to replace the table. The lvcreate -n snap -s -L 1g lv/root command hangs and cannot be killed, no further IO is possible, and the system must be hard rebooted with magic-sysrq.

First seen with the Ubuntu 2.6.35-rc3-3 kernel; I then tested mainline kernels 2.6.35-rc1, -rc2, and -rc3, and all have the issue. Mainline 2.6.34 does not have the problem. Hopefully I'll be able to do some git bisecting tonight.

Phillip Susi (psusi) on 2010-06-17
tags: added: regression-potential
Jeremy Foshee (jeremyfoshee) wrote :

Hi Phillip,

Please be sure to confirm this issue exists with the latest development release of Ubuntu. ISO CD images are available from http://cdimage.ubuntu.com/daily-live/current/ . If the issue remains, please run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 595489

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Phillip Susi (psusi) wrote :

Traced the cause of the problem to commit 6b0310fbf087ad6: "ext4: don't return to userspace after freezing the fs with a mutex held". Discussing the exact nature of the bug and a fix for it with the upstream authors.

Changed in linux (Ubuntu):
status: Incomplete → In Progress
importance: Undecided → High
assignee: nobody → Phillip Susi (psusi)
tags: removed: needs-kernel-logs needs-upstream-testing
Jeremy Foshee (jeremyfoshee) wrote :

Phillip,
   Is there an upstream bug number we can use to watch the progress of the discussion?

Thanks!

~JFo

On 6/22/2010 3:38 PM, Jeremy Foshee wrote:
> Phillip, Is there an upstream bug number we can use to watch the
> progress of the discussion?

No, and the discussion seems to have stalled. The original author of
the commit said that he was able to reproduce it when snapshotting a non
root lv, but only if there was some io activity going on at the time of
the snapshot. He got a sysrq-w output showing the lvcreate process and
a kernel mode thread stuck with their stack traces, but no idea what the
actual problem is.
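
For anyone who wants to capture the same kind of trace: sysrq-w asks the kernel to log every task stuck in uninterruptible sleep, together with its stack. The following is only a minimal sketch, assuming root access and a console that is still responsive; the function name is hypothetical, and the trigger path is a parameter purely so the helper can be tried against a scratch file.

```shell
# Hypothetical helper: request a dump of blocked (uninterruptible) tasks,
# then print the tail of the kernel log where the stack traces land.
# On a real system this must run as root.
dump_blocked_tasks() {
    # Defaults to the real sysrq trigger; pass another path to test safely.
    trigger=${1:-/proc/sysrq-trigger}

    # 'w' is the sysrq key for "show blocked tasks".
    echo w > "$trigger" || return 1

    # Reading the kernel log may be restricted; ignore failures here.
    dmesg 2>/dev/null | tail -n 100
    return 0
}
```

Calling dump_blocked_tasks with no argument right after the lvcreate hang should show the stuck lvcreate process and any blocked kernel threads, provided the shell running it is not itself blocked on IO.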

Problem found and patch posted to the mailing lists by the original author. I'll give it a few days, and hopefully it gets applied to Linus's tree and merged into Ubuntu's. If not, we'll try to get it applied ourselves in time for alpha 2.

Changed in linux (Ubuntu):
milestone: none → maverick-alpha-2
summary: - lvm snapshot of root lv causes deadlock in 2.6.35
+ lvm snapshot causes deadlock in 2.6.35
Phillip Susi (psusi) wrote :

This is the proposed patch from upstream. It looks like Linus hasn't been merging the last few days and today is kernel freeze so I think we're going to need to apply it ourselves.

Phillip Susi (psusi) wrote :

Oh, and I tested it last night and it seemed to work.

Tim Gardner (timg-tpi) on 2010-06-25
Changed in linux (Ubuntu):
milestone: maverick-alpha-2 → maverick-alpha-3
tags: added: patch
grouch (grouch) wrote :

Looks like this problem started with kernel 2.6.32-24 in the Lucid distro. See bug 604807 which seems to be a duplicate, as well as bug 605551, where I posted my experience with this bug.

Cheers,
Oscar

description: updated
Derek Chen-Becker (dchenbecker) wrote :

I can confirm that the bug starts with 2.6.32-24 in Lucid. I get a hard lock every night on that kernel when my incremental backup runs and snapshots /var. If I switch back to 2.6.32-23, everything works great. I also tried the latest nightly (vmlinuz-2.6.35-999-generic) and I get the same issue there. This is a pretty serious bug :(

It seems the patch from comment #6 is slowly making its way upstream now:

https://patchwork.kernel.org/patch/107676/

http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=summary

Hopefully it'll hit the 2.6.36 merge window and we can then consider it for a straight cherrypick back into Maverick.

Martin Pitt (pitti) wrote :

This does not block installation, and thus is not an alpha-3 blocker. A new kernel is now out of reach for alpha-3 anyway, and it seems that it's easier to cherrypick from upstream a bit later on. Moving milestone.

Changed in linux (Ubuntu Maverick):
milestone: maverick-alpha-3 → ubuntu-10.10-beta
papukaija (papukaija) on 2010-08-04
tags: added: maverick
Stefan Bader (smb) wrote :

At this time the patch is in linux-next but not (yet) in Linus's tree. As we plan to do a special limited-change upload for Lucid, I would propose the fix for SRU there (as it is a regression in Lucid). For Maverick we could think of picking it pre-stable as well. Greg is batching up a first 2.6.35.y release, but unless the patch is upstream he won't pick it.

Changed in linux (Ubuntu Lucid):
importance: Undecided → High
status: New → In Progress
assignee: nobody → Stefan Bader (stefan-bader-canonical)
Stefan Bader (smb) wrote :

SRU Justification (Lucid):

Impact: Changes to ext4 that are part of 2.6.32.17, and which we took in advance to fix other issues, are causing a lockup regression when working with lvm snapshots.

Fix: This patch, which is slowly making its way upstream, was verified to fix this problem and is also small and contained enough to be reasonably safe. The patch has now hit upstream and is expected to be seen in stable soon.

Testcase:
Use lvm to create an LV, mount it, and have it actively doing IO. An attempt to create a snapshot will then hang, with no further IO possible.
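
The test case can be sketched as a small script. This is an illustration only, not the script used for the SRU verification: the helper names, the 1G sizes, the dd writer, and the /mnt mount point are all assumptions. Because an affected kernel will hang hard, the helper only prints the commands unless RUN=1 is set.

```shell
# Print each command by default; execute it only when RUN=1 is set.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Hypothetical helper: create an LV, put ext4 on it, keep it busy with
# writes, then try to snapshot it while the IO is still in flight.
#   $1 = volume group, $2 = new LV name, $3 = mount point (default /mnt)
snapshot_under_io() {
    vg=$1
    lv=$2
    mnt=${3:-/mnt}

    run lvcreate -n "$lv" -L 1G "$vg"
    run mkfs.ext4 "/dev/$vg/$lv"
    run mount "/dev/$vg/$lv" "$mnt"

    # Start a background writer so the filesystem has IO in progress.
    run sh -c 'dd if=/dev/zero of="$1/io.bin" bs=1M count=512 conv=fsync &' sh "$mnt"
    run sleep 1

    # On an affected kernel this is the step that never returns.
    run lvcreate -s -n "${lv}-snap" -L 1G "/dev/$vg/$lv"
}
```

Running snapshot_under_io vg0 test shows the command sequence; RUN=1 snapshot_under_io vg0 test executes it as root and should only be tried on a machine that can be hard reset.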

Changed in linux (Ubuntu Lucid):
status: In Progress → Fix Committed

I just wanted to add the following note for Maverick... It appears this patch is queued for the upcoming v2.6.35.2 upstream stable release [1]. We've already rebased Maverick to upstream stable v2.6.35.1 (see linux-2.6.35-15.21) and I fully expect that v2.6.35.2 will be released in time for us to rebase prior to Maverick Beta (Thurs Sept 02, 2010).

[1] http://lkml.org/lkml/2010/8/11/503

commit 437f88cc031ffe7f37f3e705367f4fe1f4be8b0f
Author: Eric Sandeen <email address hidden>
Date: Sun Aug 1 17:33:29 2010 -0400

    ext4: fix freeze deadlock under IO

Chris Demetriou (cgd) wrote :

Are you sure this is fixed for Lucid?

I'm running the kernel: 2.6.32-24.40~pre201008060902
from https://launchpad.net/~kernel-ppa/+archive/pre-proposed
(as discussed in bug 605551). Looking at its changelog, it appears to contain the fix for this.

I'm still seeing the hang when creating an lvm snapshot volume.

basic info on my system config:

disks sda and sdb, each 2TB SATA.
each has 4 partitions. (bios_grub, raid for swap, raid for root, raid for lvm)

md0 -> RAID1 on sda2, sdb2 -- used for swap
md1 -> RAID1 on sda3, sdb3 -- used for root
md2 -> RAID1 on sda4, sdb4 -- used for lvm VG 'blue'

to repro, reliable for my config (5/5 so far), paste into a root shell:

lvcreate -n TEST -L 10G blue
mkfs.ext4 /dev/mapper/blue-TEST
mount /dev/mapper/blue-TEST /mnt
sleep 3
touch /mnt/foo
lvcreate -s -n TEST2 -L 10G /dev/blue/TEST &

After a couple of minutes, I get the hang messages in dmesg from a flush task and from lvcreate.

see attached dmesg.

BeergutXL (ajmahoney) on 2010-08-16
Changed in linux (Ubuntu Lucid):
status: Fix Committed → Fix Released
status: Fix Released → Fix Committed

Accepted linux into lucid-proposed; the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

tags: added: verification-needed
Jarrett Miller (spook) wrote :

The Lucid SRU update worked for me. I just tested with the server kernel build on AMD64.
I am now able to run:
sudo telinit S
lvcreate -L 500m -n rootsnap -s /dev/vg0/root

that used to hang but now works as expected. Nice job guys :)
I appreciate all the hard work in getting this fixed for Lucid.

The most recent Maverick 2.6.35-16.22 linux kernel has been rebased with the latest 2.6.35.2 upstream stable kernel. As noted in comment #14, 2.6.35.2 contained the patch which should resolve this issue. As a result I'm marking this Fix Released for Maverick. Thanks.

https://edge.launchpad.net/ubuntu/+source/linux/2.6.35-16.22

Changed in linux (Ubuntu Maverick):
status: In Progress → Fix Released
Martin Pitt (pitti) on 2010-08-18
tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.32-24.41

---------------
linux (2.6.32-24.41) lucid-security; urgency=low

  [ Upstream Kernel Changes ]

  * (pre-stable) ext4: fix freeze deadlock under IO
    - LP: #595489
  * drm: Initialize ioctl struct when no user data is present
    - CVE-2010-2803
  * can: add limit for nframes and clean up signed/unsigned variables
    - CVE-2010-2959
  * mm: keep a guard page below a grow-down stack segment
    - CVE-2010-2240
  * mm: fix missing page table unmap for stack guard page failure case
    - CVE-2010-2240
  * mm: fix page table unmap for stack guard page properly
    - CVE-2010-2240
  * mm: fix up some user-visible effects of the stack guard page
    - CVE-2010-2240
  * x86: don't send SIGBUS for kernel page faults
    - CVE-2010-2240
 -- Stefan Bader <email address hidden> Wed, 18 Aug 2010 14:24:07 +0200

Changed in linux (Ubuntu Lucid):
status: Fix Committed → Fix Released
Patrick Pfeifer (patrick2000) wrote :

The fix is much appreciated. Thanks, everybody!

Thomas (tjustleft) wrote :

I am running Lucid with ext3. Is this fix even necessary for me? If not is there a way to keep update manager from retrieving ext4 patches on my system? This is all over my head but I did at least notice the difference in file systems before hitting Install :)

Thank you

@Thomas,

No. Both filesystems are implemented in separate drivers. You cannot prevent
updates to the ext4 driver, but you do not need to care about that. As long as
you are not actually using it (no ext4 filesystem mounted), it is doing nothing.

Thomas (tjustleft) wrote :

Thank you Stefan,

I have wondered about some of these updates that do not seem to apply to my system. Thanks for letting me rest at ease about it. Now I better go ahead and update :)

Changed in linux:
status: Unknown → Fix Released
Changed in linux:
importance: Unknown → Medium
pepa65 (peter-passchier) wrote :

I am experiencing this bug on Lucid 10.04.4 fully updated as of 2013-01-01, kernel 2.6.32-24-pae-latest.
Then I installed 2.6.35-pae thinking this would resolve it, but it still causes a hang.

Which kernel should I use if I want to make an lvm2 snapshot of an ext4 root? Would I avoid this bug if I mount the root filesystem as ext3 instead??

pepa65 (peter-passchier) wrote :

Sorry, wrong version for 2.6.32, it is 2.6.32-45-pae-latest.

Dimitri John Ledkov (xnox) wrote :

@pepa
Please give output of:
$ dpkg -l 'linux-image-*'

The bug is fixed in 2.6.32-24.41 lucid kernel. Can you check that you are booted into that kernel or later?
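
A rough way to script that check is sketched below. It relies on two assumptions: that /proc/version_signature exists and looks like "Ubuntu 2.6.32-24.41-generic ..." (an Ubuntu-specific file; the exact format is assumed here), and that GNU sort -V orders these version strings sensibly. The function names are made up for illustration.

```shell
# First Lucid upload that carries the fix, as stated in this report.
FIXED=2.6.32-24.41

# True if version $1 sorts at or after version $2 (GNU sort -V ordering).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Pull "2.6.32-24.41" out of a signature line such as
# "Ubuntu 2.6.32-24.41-generic 2.6.32.15+drm33.5" (format is an assumption).
upload_version() {
    printf '%s\n' "$1" | awk '{print $2}' | sed 's/-[a-z].*$//'
}

# Usage (not executed here):
#   running=$(upload_version "$(cat /proc/version_signature)")
#   if version_ge "$running" "$FIXED"; then
#       echo "running kernel $running includes the fix"
#   else
#       echo "running kernel $running predates $FIXED"
#   fi
```

The comparison is only meaningful for Lucid's 2.6.32 series. On Debian and Ubuntu, dpkg --compare-versions "$running" ge "$FIXED" is the more faithful comparison, since it follows the package version rules rather than sort's.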
