mdadm with Raid5 stuck in uninterruptible sleep

Bug #208551 reported by DesktopMan on 2008-03-28
This bug affects 2 people
Affects         Importance  Assigned to
linux (Ubuntu)  Undecided   Unassigned
Hardy           Medium      Colin Ian King

Bug Description

Description: Ubuntu hardy (development branch)
Release: 8.04

Linux ubuntu-beta 2.6.24-12-server #1 SMP Wed Mar 12 22:58:36 UTC 2008 x86_64 GNU/Linux

mdadm:
  Installed: 2.6.3+200709292116+4450e59-3ubuntu3

xfsprogs:
  Installed: 2.9.4-2

Raid 5 on five 1TB drives, set up as follows:

mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/sd[b,c,d,e,f]
mkfs.xfs /dev/md0
mount /dev/md0 /mnt/drive

md0 : active raid5 sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
      3907049984 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
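
As a sanity check on the numbers above, here is a minimal Python sketch that parses that mdstat status line and derives the per-device size. It assumes /proc/mdstat reports sizes in 1 KiB blocks and that the bracketed pair is [raid devices/active devices]. The function names parse_md_status and per_device_kib are made up for illustration.

```python
import re

def parse_md_status(status_line):
    """Parse an mdstat detail line, e.g.
    '3907049984 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]'."""
    m = re.search(
        r"(\d+) blocks.*?level (\d+), (\d+)k chunk.*\[(\d+)/(\d+)\] \[([U_]+)\]",
        status_line,
    )
    if m is None:
        raise ValueError("unrecognised mdstat line")
    blocks, level, chunk_k, total, active, flags = m.groups()
    return {
        "blocks_kib": int(blocks),       # usable array size in 1 KiB blocks
        "level": int(level),
        "chunk_kib": int(chunk_k),
        "raid_devices": int(total),
        "active_devices": int(active),
        "degraded": "_" in flags,        # '_' marks a missing member
    }

def per_device_kib(info):
    """Usable size contributed by each member (RAID5: 1 parity, RAID6: 2)."""
    parity = {5: 1, 6: 2}[info["level"]]
    return info["blocks_kib"] // (info["raid_devices"] - parity)

info = parse_md_status(
    "3907049984 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]"
)
print(info["degraded"], per_device_kib(info))
```

For the array above this reports a non-degraded array with 976762496 KiB per member, i.e. four drives' worth of usable space out of five.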

The drives are connected to a 5-to-1 port multiplier, which is in turn connected to a 2-port SiI 3132 based PCI Express SATA controller. The problem does not seem to be related to this, as I can read from and write to the drives individually without any trouble.

Copying data to this partition results in a permanent lock on several processes related to it, which get stuck in the D(+) state. This happened four times in a row after 10-40 GB had been copied. I can't kill any of the processes, nor am I able to reboot; I have to power cycle.
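
For anyone wanting to check for the same symptom, the following is a small Python sketch that picks out processes in uninterruptible sleep (state D) from `ps -eo pid,stat,comm` output. The function name find_uninterruptible and the sample text are illustrative only, not taken from the affected machine.

```python
def find_uninterruptible(ps_output):
    """Return (pid, stat, command) for every process whose state starts with 'D'.

    ps_output is the text printed by `ps -eo pid,stat,comm`,
    including its header line.
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:   # skip the header
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, stat, comm = fields
        if stat.startswith("D"):
            stuck.append((int(pid), stat, comm.strip()))
    return stuck

# Hypothetical sample, only to show the expected input shape.
SAMPLE = """\
  PID STAT COMMAND
    1 Ss   init
 4211 D+   cp
 4250 D    pdflush
 4300 S    bash
"""
print(find_uninterruptible(SAMPLE))
```

On a live system the input could come from `subprocess.run(["ps", "-eo", "pid,stat,comm"], capture_output=True, text=True).stdout`.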

There are no messages related to it in dmesg or any of the logs; as far as the system is concerned nothing is wrong. After power cycling, the array starts rebuilding (as it should), but this rebuild also stops because of the same error.

The problem seems closely related to this:

http://www.issociate.de/board/post/471929/2.6.24-rc6_reproducible_raid5_hang.html

As suggested by this thread, I tried increasing stripe_cache_size. Setting it to 4096 seems to have solved my hang: at the time of writing I have copied 1.7TB without error.

http://kerneltrap.org/mailarchive/linux-kernel/2007/11/8/397727

If this is the same problem and I'm reading it right, it seems like it's supposed to be fixed already. Not sure though.
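
For anyone trying the stripe_cache_size workaround, it is worth knowing roughly what it costs in RAM. The kernel's md documentation gives the memory used as page size × number of disks × stripe_cache_size. The sketch below just does that arithmetic, assuming 4 KiB pages; the helper name stripe_cache_bytes is hypothetical.

```python
def stripe_cache_bytes(stripe_cache_size, nr_disks, page_size=4096):
    """Approximate RAM used by the md stripe cache.

    Follows the formula in the kernel md documentation:
    page_size * nr_disks * stripe_cache_size.
    page_size=4096 is an assumption (typical for x86_64).
    """
    return stripe_cache_size * nr_disks * page_size

MIB = 1024 * 1024
# Default cache (256 entries) versus the value tried here, on five disks.
print(stripe_cache_bytes(256, 5) // MIB, "MiB")
print(stripe_cache_bytes(4096, 5) // MIB, "MiB")
```

With five disks, the default of 256 entries works out to 5 MiB and a setting of 4096 to 80 MiB, so the workaround is cheap on a machine with a few GB of memory.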

DesktopMan (christian-auby) wrote :

Sigh. Spoke too soon. I ran mdadm -D while data was being copied to the array, and it hung again, with 2TB transferred. I guess it's directly related to the number of processes that access the device. I won't be able to restart it until tomorrow, but I can try any suggestions on the hung system.

DesktopMan (christian-auby) wrote :

mdadm -D returned after a couple of minutes, at which point the copy started writing again. While mdadm -D was running, nothing was written.

DesktopMan (christian-auby) wrote :

I was copying from a file set up with losetup + cryptsetup on a raid5 array (the one above) to a raid6 array, both with XFS. During this copy I ran mdadm --examine --scan, and the raid5 (the one I was reading from) crashed, giving me input/output errors. The md device is fine, on the other hand, and remounting the (read-only) filesystem was no problem. dmesg output:

[131405.242868] xfs_force_shutdown(dm-1,0x1) called from line 420 of file /build/buildd/linux-2.6.24/fs/xfs/xfs_rw.c. Return address = 0xffffffff883cdf59
[131405.242892] Filesystem "dm-1": I/O Error Detected. Shutting down filesystem : dm-1
[131405.242932] Please umount the filesystem, and rectify the problem(s)
[131405.242958] xfs_force_shutdown(dm-1,0x1) called from line 420 of file /build/buildd/linux-2.6.24/fs/xfs/xfs_rw.c. Return address = 0xffffffff883cdf59

Not sure if it's related to the first post or not. Any input would be appreciated.

Twigathy (twigathy) wrote :

I'm not certain if I'm having the same trouble as you, but mdadm fell over pretty hard for me on 2.6.24-16-server, mdadm - v2.6.3 - 20th August 2007 when expanding 5x500GB -> 6x500GB. I lost all the data on the raid (oops).

Possibly this is a bug in sata_sil with lots of disk writes? 5 of the 6 disks were on SiI 3512 based SATA cards (the other was on an onboard mobo SATA port). Similarly, I can write to the disks individually fine, and they check out okay with badblocks and smartctl.

Did you get any weirdness in dmesg? I had a couple of odd things about the SATA link going down... so possibly unrelated.

DesktopMan (christian-auby) wrote :

Not sure if it is related, might be. I honestly gave up on it after concluding that the problem was too erratic and virtually impossible for me to debug. If I remember correctly I also got messages about the SATA link going down, then reset and back up.

Twigathy (twigathy) wrote :

Hm, so what did you do instead? Buy new controller cards or give up on raid? ;)

Twigathy (twigathy) wrote :

Hi,

I googled a little further; looks like this is a bug in sata_sil after all.

Check out http://www.ussg.iu.edu/hypermail/linux/kernel/0707.1/0024.html

Doesn't seem to be a fix for it! This isn't too good for me - I have 3 of these cards :-(

Carl Streeter (carl-linux) wrote :

I'm having the same issue pointed to in the thread mentioned above:
http://www.issociate.de/board/post/471929/2.6.24-rc6_reproducible_raid5_hang.html

It seems that this was fixed in kernel version 2.6.25. Would it be possible to backport this to ubuntu kernels? It's basically impossible to use XFS on SW raid without it:
http://marc.info/?l=linux-kernel&m=120027546428622&w=2

At least, it's impossible when dealing with multi terabyte raid5 arrays, which I don't think are particularly uncommon at this point.

I managed to patch the stock Ubuntu kernel (2.6.24-18) with the patches from the second link above (the LKML post). It seems stable: I've been running it in production on two large raid5 arrays without issue. The patches didn't apply perfectly, but they do work.

Hi Guys,

I just wanted to let you know the latest Alpha for the upcoming Intrepid Ibex 8.10 is available. The kernel for Intrepid is based on a 2.6.26 kernel at the moment. This 2.6.26 kernel has the patch which was referenced as having fixed this issue in 2.6.25. For more information regarding the latest Alpha for Intrepid refer to http://www.ubuntu.com/testing. If anyone would be willing to test and confirm this is fixed with the Intrepid kernel, that would be great. But based on the patch existing in the Intrepid kernel and the comment made by Andrew that this patch resolves the issue for him, I'm tentatively marking this "Fix Released" against Intrepid.

I'll additionally open a Hardy SRU nomination but it is really a decision to be made by the kernel team if this fix will be backported. I've included below the upstream git commit id and patch description for the kernel team to reference. Thanks.

commit 6ed3003c19a96fe18edf8179c4be6fe14abbebbc
Author: NeilBrown <email address hidden>
Date: Wed Feb 6 01:40:00 2008 -0800

    md: fix an occasional deadlock in raid5

Changed in linux:
status: New → Fix Released
assignee: nobody → ubuntu-kernel-team
importance: Undecided → Medium
status: New → Triaged
Changed in linux:
assignee: ubuntu-kernel-team → colin-king
status: Triaged → In Progress
Colin Ian King (colin-king) wrote :

Hi,

I've applied commit 6ed3003c19a96fe18edf8179c4be6fe14abbebbc and built for testing linux - 2.6.24-20.39cking4 package - you can download the package from my PPA at: https://launchpad.net/~colin-king/+archive

Please can you test this fix and let me know if it works so that we can add it to the next release of Hardy.

To test, add the following lines to your apt sources.list:

deb http://ppa.launchpad.net/colin-king/ubuntu hardy main
deb-src http://ppa.launchpad.net/colin-king/ubuntu hardy main

alternatively, follow the instructions at: https://help.ubuntu.com/8.04/add-applications/C/extra-repositories-adding.html

Thanks, Colin

I had to swap to Debian as this bug made the server useless. I haven't
had any deadlocks here yet, but it might still apply for all I know.

Or is Debian using different code? I'm running testing, on 2.6.25-2

I am happy someone eventually identified the cause though.

Christian


Since we don't have DesktopMan now to test this fix, marking it as "Won't Fix", unless anyone has the same hardware and is willing to test this for Hardy.

Changed in linux:
status: In Progress → Won't Fix

I used to have this bug, I'd be willing to test out your kernel. Right now I'm using my own kernel w/ patches applied. Would that work Colin? Have you done any testing yourself?

I'm running a 6x1TB raid5 array with XFS on top on a Dell Poweredge 1800 (An older 64 bit xeon).

Colin Ian King (colin-king) wrote :

Hi Andrew,

If you can try out my kernel in the PPA, just to verify this kernel with the single patch to fix this bug, it would give us a clear indication that this issue is fixed against the current Hardy kernel sources. This allows us to OK it for inclusion into the Hardy kernel for the next point release.

Much appreciated if you could test this.

Colin

I've tested Colin's patch and it's live on 2 production 64bit servers. Seems to work just fine.

Colin Ian King (colin-king) wrote :

SRU justification:

Impact: mdadm, Raid5 get stuck in uninterruptible sleep under heavy I/O
load. Copying data to a Raid 5 XFS partition results in a permanent lock
on several processes related to it, which get stuck in the D(+) state.
This occurs when large quantities of data (10-40 GB) are copied, resulting
in processes being unkillable; the system cannot reboot and requires
power cycling the server.

Fix: The patch from commit 6ed3003c19a96fe18edf8179c4be6fe14abbebbc. The
fix is to not make any generic_make_request() calls in raid5
make_request until all waiting has been done. We do this by simply
setting STRIPE_HANDLE instead of calling handle_stripe(). This causes a
performance hit, so this patch also only calls raid5_activate_delayed()
at unplug time, never in raid5. This seems to bring back the
performance numbers. [quoting the commit message]

Testing: Without the patch, Raid 5 using md on an XFS filesystem locks
up under heavy data copying - this is repeatable. With the patch, the
lock up does not occur.

Patch tested from my PPA build by Andrew Cholakian (see previous message)

Changed in linux:
milestone: none → ubuntu-8.04.2
status: Won't Fix → Fix Committed
Martin Pitt (pitti) wrote :

linux 2.6.24-21 copied to hardy-updates.

Changed in linux:
status: Fix Committed → Fix Released
kpolberg (kpolberg) wrote :

I am still having deadlocks; the only thing that fixes it is setting the stripe_cache_size on the md device higher.

echo 16384 > /sys/block/md0/md/stripe_cache_size

Linux sarah 2.6.24-21-generic #1 SMP Mon Aug 25 16:57:51 UTC 2008 x86_64 GNU/Linux

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sde1[0] sdc1[7] sdd1[6] sdf1[5] sdg1[4] sdb1[3] sda1[2] sdh1[1]
      3418686208 blocks level 5, 256k chunk, algorithm 2 [8/8] [UUUUUUUU]
      [========>............] resync = 44.4% (216937632/488383744) finish=84.5min speed=53527K/sec

unused devices: <none>

root@sarah:~# xfs_info /dev/md0
meta-data=/dev/md0 isize=256 agcount=75, agsize=11446528 blks
         = sectsz=4096 attr=1
data = bsize=4096 blocks=854671552, imaxpct=25
         = sunit=64 swidth=192 blks, unwritten=1
naming =version 2 bsize=4096
log =internal bsize=4096 blocks=32768, version=2
         = sectsz=4096 sunit=1 blks, lazy-count=0
realtime =none extsz=786432 blocks=0, rtextents=0

If you need some more information, please ask.
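
One thing that can be checked from the output above is whether the XFS stripe geometry matches the md geometry. The sketch below is only arithmetic on the posted numbers: it assumes xfs_info reports sunit/swidth in filesystem blocks (the "blks" suffix) and that a RAID5 array has one parity disk's worth of overhead. The function names are invented for this example, and nothing in this thread ties the result to the hang.

```python
import re

def parse_xfs_stripe(xfs_info_text):
    """Return (fs block size in bytes, sunit, swidth), the latter two in fs blocks."""
    bsize = int(re.search(r"data\s*=\s*bsize=(\d+)", xfs_info_text).group(1))
    m = re.search(r"sunit=(\d+)\s+swidth=(\d+) blks", xfs_info_text)
    return bsize, int(m.group(1)), int(m.group(2))

def expected_geometry(chunk_kib, raid_devices, level, fs_block_bytes):
    """sunit/swidth (in fs blocks) that would match the md layout."""
    parity = {5: 1, 6: 2}[level]
    sunit = chunk_kib * 1024 // fs_block_bytes
    return sunit, sunit * (raid_devices - parity)

# Values copied from the xfs_info and mdstat output in this comment.
XFS_INFO = """\
data = bsize=4096 blocks=854671552, imaxpct=25
         = sunit=64 swidth=192 blks, unwritten=1
"""
bsize, sunit, swidth = parse_xfs_stripe(XFS_INFO)
print("reported:", sunit, swidth)
print("expected:", expected_geometry(256, 8, 5, bsize))
print("data disks implied by swidth:", swidth // sunit)
```

On these numbers the stripe unit matches the 256k chunk (64 blocks of 4096 bytes), but the reported swidth of 192 corresponds to three data disks, whereas an 8-device RAID5 would give 448. That may simply mean the filesystem was created when the array had fewer members.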

Martin Pitt (pitti) wrote :

New SRU fixes it harder apparently.

Changed in linux:
status: Fix Released → In Progress
Martin Pitt (pitti) wrote :

Accepted into intrepid-proposed, please test and give feedback here. Please see https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Martin Pitt (pitti) wrote :

Accepted into hardy-proposed, please test and give feedback here. Please see https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in linux:
milestone: ubuntu-8.04.2 → none
status: In Progress → Fix Committed
Martin Pitt (pitti) wrote :

Accepted linux into hardy-proposed, please test and give feedback here. Please see https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Tony (tonybaca) wrote :

Martin,

Hope this doesn't repeat.

If you’re willing to help a newbie, I am willing to test this on my fresh intrepid install. I have a very similar system to the one described above, showing the same kind of problems. I posted my problems in another bug (147464), but after reading this bug, I think this is closer to what I am seeing.

I have a fresh install of 8.10 (mythbuntu). I was able to stop this problem, but only if I set rsize=8092 in fstab. This killed throughput! I finally reformatted the array to ext3 and that fixed the problem. Right now I have been testing the array with JFS and so far have not had the system lock up.

I followed the instructions to enable proposed, but I don’t know what I need to update to test this fix. I am willing to do any testing you need; my system is not a production system and there is no important data on the machine.

Tony

Tony [2008-12-03 23:36 -0000]:
> I followed the instruction to enable proposed, but I don’t know what I
> need to update to test this fix.

A normal system upgrade should pull in the new 2.6.27-10 kernel, i.e.
you should get a couple of linux-image and linux-restricted-modules
packages with version 2.6.27-10.20.
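
To confirm that the machine is actually running the new kernel after the upgrade and a reboot, something like the following Python sketch can compare the running release with the version mentioned above. It assumes the release string has the usual Ubuntu shape (e.g. 2.6.27-10-generic); the helper names kernel_tuple and at_least are made up here, and the upload revision (the .20 part) does not appear in the release string, so it is not compared.

```python
import platform
import re

def kernel_tuple(release):
    """'2.6.27-10-generic' -> (2, 6, 27, 10)"""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        raise ValueError("unexpected kernel release format: %r" % release)
    return tuple(int(part) for part in m.groups())

def at_least(release, minimum):
    """True if the release is the same as or newer than the minimum."""
    return kernel_tuple(release) >= kernel_tuple(minimum)

if __name__ == "__main__":
    running = platform.release()   # same string as `uname -r`
    try:
        print(running, "is new enough:", at_least(running, "2.6.27-10"))
    except ValueError as err:
        print(err)
```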

Tony (tonybaca) wrote :

I upgraded to 2.6.27-10. There were other upgrades that occurred at that time too. I reformatted my array to XFS and tried to copy a large amount of data. It failed in the same manner. After reboot, the array is rebuilding, but I found this in the log:

Dec 6 10:44:54 Server kernel: [44205.953002] Call Trace:
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802ac083>] ? find_get_pages+0x43/0x110
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802b6c74>] ? pagevec_lookup+0x24/0x30
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffffa0d9302d>] ? xfs_cluster_write+0xad/0x180 [xfs]
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffffa0d93598>] ? xfs_page_state_convert+0x498/0x760 [xfs]
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffffa0d939c1>] ? xfs_vm_writepage+0x71/0x120 [xfs]
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802b9554>] ? pageout+0x124/0x280
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802ab1da>] ? page_waitqueue+0xa/0x90
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802b9b5d>] ? shrink_page_list+0x34d/0x530
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802b9ee2>] ? shrink_inactive_list+0x1a2/0x4b0
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802ba26b>] ? shrink_zone+0x7b/0x160
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802ba3dd>] ? shrink_zones+0x8d/0x150
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802ba526>] ? do_try_to_free_pages+0x86/0x2e0
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802ba877>] ? try_to_free_pages+0x67/0x70
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802b9380>] ? isolate_pages_global+0x0/0x50
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802b2b49>] ? __alloc_pages_internal+0x239/0x520
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802d5c6d>] ? alloc_pages_current+0xad/0x110
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802ac617>] ? __page_cache_alloc+0x67/0x80
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802ad253>] ? __grab_cache_page+0x63/0xb0
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff803171a9>] ? block_write_begin+0x89/0xf0
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffffa0d9248a>] ? xfs_vm_write_begin+0x2a/0x30 [xfs]
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffffa0d92050>] ? xfs_get_blocks+0x0/0x20 [xfs]
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802ab93c>] ? generic_perform_write+0xbc/0x1c0
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802ad6a2>] ? generic_file_buffered_write+0x92/0x170
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffffa0d9b2f3>] ? xfs_write+0x6b3/0x9b0 [xfs]
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffffa0d96ca8>] ? xfs_file_aio_write+0x58/0x60 [xfs]
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff802e9b79>] ? do_sync_write+0xf9/0x140
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff80267050>] ? autoremove_wake_function+0x0/0x40
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff80387071>] ? aa_file_permission+0x21/0xf0
Dec 6 10:44:54 Server kernel: [44205.953002] [<ffffffff80387198>] ? apparmor_file_...


Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.24-23.46

---------------
linux (2.6.24-23.46) hardy-proposed; urgency=low

  [Alessio Igor Bogani]

  * rt: Updated PREEMPT_RT support to rt21
    - LP: #302138

  [Amit Kucheria]

  * SAUCE: Update lpia patches from moblin tree
    - LP: #291457

  [Andy Whitcroft]

  * SAUCE: replace gfs2_bitfit with upstream version to prevent oops
    - LP: #276641

  [Colin Ian King]

  * isdn: Do not validate ISDN net device address prior to interface-up
    - LP: #237306
  * hwmon: (coretemp) Add Penryn CPU to coretemp
    - LP: #235119
  * USB: add support for Motorola ROKR Z6 cellphone in mass storage mode
    - LP: #263217
  * md: fix an occasional deadlock in raid5
    - LP: #208551

  [Stefan Bader]

  * SAUCE: buildenv: Show CVE entries in printchanges
  * SAUCE: buildenv: Send git-ubuntu-log informational message to stderr
  * Xen: dma: avoid unnecessarily SWIOTLB bounce buffering
    - LP: #247148
  * Update openvz patchset to apply to latest stable tree.
    - LP: #301634
  * XEN: Fix FTBS with stable updates
    - LP: #301634

  [Steve Conklin]

  * Add HID quirk for dual USB gamepad
    - LP: #140608

  [Tim Gardner]

  * Enable CONFIG_AX25_DAMA_SLAVE=y
    - LP: #257684
  * SAUCE: Correctly blacklist Thinkpad r40e in ACPI
    - LP: #278794
  * SAUCE: ALPS touchpad for Dell Latitude E6500/E6400
    - LP: #270643

  [Upstream Kernel Changes]

  * Revert "[Bluetooth] Eliminate checks for impossible conditions in IRQ
    handler"
    - LP: #217659
  * KVM: VMX: Clear CR4.VMXE in hardware_disable
    - LP: #268981
  * iov_iter_advance() fix
    - LP: #231746
  * Fix off-by-one error in iov_iter_advance()
    - LP: #231746
  * USB: serial: ch341: New VID/PID for CH341 USB-serial
    - LP: #272485
  * x86: Fix 32-bit x86 MSI-X allocation leakage
    - LP: #273103
  * b43legacy: Fix failure in rate-adjustment mechanism
    - LP: #273143
  * x86: Reserve FIRST_DEVICE_VECTOR in used_vectors bitmap.
    - LP: #276334
  * openvz: merge missed fixes from vanilla 2.6.24 openvz branch
    - LP: #298059
  * openvz: some autofs related fixes
    - LP: #298059
  * openvz: fix ve stop deadlock after nfs connect
    - LP: #298059
  * openvz: fix netlink and rtnl inside container
    - LP: #298059
  * openvz: fix wrong size of ub0_percpu
    - LP: #298059
  * openvz: fix OOPS while stopping VE started before binfmt_misc.ko loaded
    - LP: #298059
  * x86-64: Fix "bytes left to copy" return value for copy_from_user()
  * NET: Fix race in dev_close(). (Bug 9750)
    - LP: #301608
  * IPV6: Fix IPsec datagram fragmentation
    - LP: #301608
  * IPV6: dst_entry leak in ip4ip6_err.
    - LP: #301608
  * IPV4: Remove IP_TOS setting privilege checks.
    - LP: #301608
  * IPCONFIG: The kernel gets no IP from some DHCP servers
    - LP: #301608
  * IPCOMP: Disable BH on output when using shared tfm
    - LP: #301608
  * IRQ_NOPROBE helper functions
    - LP: #301608
  * MIPS: Mark all but i8259 interrupts as no-probe.
    - LP: #301608
  * ub: fix up the conversion to sg_init_table()
    - LP: #301608
  * x86: adjust enable_NMI_through_LVT0()
    - LP: #301608
  * SCSI ips: handle scsi_add_host() failure, and other err cl...

Changed in linux:
status: Fix Committed → Fix Released

DesktopMan, since you are the original bug reporter, it would be great to get confirmation from you that this newer kernel does indeed fix the bug you had reported here. Thanks.

DesktopMan (christian-auby) wrote :

I do not have this setup anymore (9+ months, needed it operational), but there have been other people reporting the same problem more recently. Hopefully one of them will be able to confirm.

I have a system exhibiting the same/similar symptoms.

Running a fresh install of Ubuntu 9.04 jaunty
uname -a: Linux ServerX 2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:58:03 UTC 2009 x86_64 GNU/Linux

Motherboard: SUPERMICRO MBD-H8DME-2-O
SATA card: SUPERMICRO AOC-SAT2-MV8 (Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller (rev 09))

The system has a SW RAID6 array made of four 1TB disks. Currently the array is degraded and only has 3 disks to work with.

md1 : active raid6 sde1[4] sdd1[0] sdc1[2]
      1953519872 blocks level 6, 64k chunk, algorithm 2 [4/2] [U_U_]
      [==>..................] recovery = 10.8% (105874700/976759936) finish=142.0min speed=102195K/sec
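
The finish estimate in that progress line can be cross-checked from the other fields. Below is a minimal Python sketch, assuming the counts in parentheses are 1 KiB blocks (done/total per device) and the speed is in KiB per second; parse_progress and estimate_finish_minutes are hypothetical names.

```python
import re

def parse_progress(line):
    """Parse an mdstat resync/recovery progress line."""
    m = re.search(
        r"(resync|recovery)\s*=\s*([\d.]+)%\s*\((\d+)/(\d+)\)\s*"
        r"finish=([\d.]+)min\s*speed=(\d+)K/sec",
        line,
    )
    if m is None:
        raise ValueError("not a resync/recovery progress line")
    return {
        "action": m.group(1),
        "percent": float(m.group(2)),
        "done_kib": int(m.group(3)),
        "total_kib": int(m.group(4)),
        "finish_min": float(m.group(5)),
        "speed_kib_s": int(m.group(6)),
    }

def estimate_finish_minutes(progress):
    """Remaining blocks divided by the current speed, in minutes."""
    remaining = progress["total_kib"] - progress["done_kib"]
    return remaining / progress["speed_kib_s"] / 60

line = ("[==>..................] recovery = 10.8% (105874700/976759936) "
        "finish=142.0min speed=102195K/sec")
p = parse_progress(line)
print(round(estimate_finish_minutes(p), 1), "min vs reported", p["finish_min"])
```

The computed value agrees with the reported finish time to within a tenth of a minute, so the estimate is simply remaining data at the current rate. It says nothing about whether the rebuild will stall partway through.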

With the array on the PCI-X card I'm able to recreate the crash by failing a drive and re-adding it to the array. Some time after 50% it will hang and the system is unresponsive.

The system boots from md0, a RAID1 of two 500GB drives on the motherboard's controller. I was able to add a disk plugged into the PCI-X card to md0 and it would sync w/o problems.

With the RAID6 array moved to the motherboard's controller, the rebuild works w/o problems.

kpolberg mentioned adjusting stripe_cache_size. The command he posted:
      echo 16384 > /sys/block/md1/md/stripe_cache_size
looks like it helps: no crash for 24 hours.

If it remains stable I will try with a larger array.

Will post more info if needed.

Forgot to mention: the RAID6 array is using reiserfs.
