RAID1 data-checks cause CPU soft lockups

Bug #212684 reported by Wladimir Mutel
This bug affects 5 people
Affects: linux (Ubuntu)
Status: Invalid
Importance: Medium
Assigned to: Unassigned
Milestone: (none)
Nominated for Hardy by ceg
Nominated for Lucid by ceg

Bug Description

Binary package hint: linux-image-2.6.24-15-generic

I track Hardy development packages on some of my systems.
They share some common hardware and configuration features, in particular:
a Pentium 4 CPU with Hyper-Threading turned on
(so that 2 logical cores are visible); 1 or 2 GB RAM;
an Intel chipset with an ICH5/6/7[R] SATA controller with RAID turned off;
and two SATA disks (Seagate, WD or Samsung) of 200, 320 or 500 GB each.
On each of these disks, two 0xfd partitions (Linux RAID autodetect) are allocated: the first of 100 MB, and the second taking the rest of the disk.
They are assembled by mdadm into RAID1 (mirrored) md arrays:
the first, of 100 MB, holds the /boot filesystem,
and the second, big one holds an LVM2 PV: a volume group with a couple of LVs carrying filesystems, including the root FS (all of type ext3), as well as a swap LV.
The boot loader is LILO. The systems boot and run well, providing the expected fault tolerance for the disks when needed.

However, the mdadm package provides a 'checkarray' script, which cron runs on the first Sunday of each month to check RAID array integrity.
What the script effectively does is
'echo check > $i' for i in /sys/block/*/md/sync_action
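
For reference, the same kind of check can be triggered, watched and aborted by hand through the md sysfs interface; a minimal sketch, with the md0 device name taken from the log below:

echo check > /sys/block/md0/md/sync_action   # start a data-check on one array (as root)
cat /sys/block/md0/md/sync_action            # reads "check" while it runs, "idle" when finished
cat /proc/mdstat                             # shows overall progress
echo idle > /sys/block/md0/md/sync_action    # abort a running check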
The cron-driven integrity check produces the following messages in the kernel log:

Apr 6 01:06:02 hostname kernel: [ 9859.807932] md: data-check of RAID array md0
Apr 6 01:06:02 hostname kernel: [ 9859.808090] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Apr 6 01:06:02 hostname kernel: [ 9859.808222] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Apr 6 01:06:02 hostname kernel: [ 9859.808422] md: using 128k window, over a total of 104320 blocks.
Apr 6 01:06:02 hostname kernel: [ 9859.886364] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
Apr 6 01:06:04 hostname kernel: [ 9862.098900] md: md0: data-check done.
Apr 6 01:06:04 hostname kernel: [ 9862.137205] md: data-check of RAID array md2
Apr 6 01:06:04 hostname kernel: [ 9862.137238] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Apr 6 01:06:04 hostname kernel: [ 9862.137272] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Apr 6 01:06:04 hostname kernel: [ 9862.137327] md: using 128k window, over a total of 312464128 blocks.
Apr 6 01:06:04 hostname kernel: [ 9862.189968] RAID1 conf printout:
Apr 6 01:06:04 hostname kernel: [ 9862.190003] --- wd:2 rd:2
Apr 6 01:06:04 hostname kernel: [ 9862.190035] disk 0, wo:0, o:1, dev:sdb1
Apr 6 01:06:04 hostname kernel: [ 9862.190062] disk 1, wo:0, o:1, dev:sda1

 ... 13 seconds later:

Apr 6 01:06:17 hostname kernel: [ 9875.118427] BUG: soft lockup - CPU#0 stuck for 11s! [md2_raid1:2378]
Apr 6 01:06:17 hostname kernel: [ 9875.118581]
Apr 6 01:06:17 hostname kernel: [ 9875.118671] Pid: 2378, comm: md2_raid1 Not tainted (2.6.24-15-generic #1)
Apr 6 01:06:17 hostname kernel: [ 9875.118811] EIP: 0060:[<f887c9b0>] EFLAGS: 00010282 CPU: 0
Apr 6 01:06:17 hostname kernel: [ 9875.119048] EIP is at raid1d+0x770/0xff0 [raid1]
Apr 6 01:06:17 hostname kernel: [ 9875.119159] EAX: e7ffb000 EBX: c14fff60 ECX: 00000f24 EDX: f4b41800
Apr 6 01:06:17 hostname kernel: [ 9875.119284] ESI: e7ffb0dc EDI: e807e0dc EBP: df92fe40 ESP: f7495e9c
Apr 6 01:06:17 hostname kernel: [ 9875.119448] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Apr 6 01:06:17 hostname kernel: [ 9875.119567] CR0: 8005003b CR2: b7f32480 CR3: 374de000 CR4: 000006d0
Apr 6 01:06:17 hostname kernel: [ 9875.119696] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Apr 6 01:06:17 hostname kernel: [ 9875.119833] DR6: ffff0ff0 DR7: 00000400
Apr 6 01:06:17 hostname kernel: [ 9875.120537] [jbd:schedule+0x20a/0x650] schedule+0x20a/0x600
Apr 6 01:06:17 hostname kernel: [ 9875.121051] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
Apr 6 01:06:17 hostname kernel: [ 9875.121319] [shpchp:schedule_timeout+0x76/0x2d0] schedule_timeout+0x76/0xd0
Apr 6 01:06:17 hostname kernel: [ 9875.121494] [apic_timer_interrupt+0x28/0x30] apic_timer_interrupt+0x28/0x30
Apr 6 01:06:17 hostname kernel: [ 9875.121751] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
Apr 6 01:06:17 hostname kernel: [ 9875.122050] [<f887007b>] mirror_status+0x19b/0x250 [dm_mirror]
Apr 6 01:06:17 hostname kernel: [ 9875.122333] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
Apr 6 01:06:17 hostname kernel: [ 9875.122590] [<f88b5ef3>] md_thread+0x23/0xe0 [md_mod]
Apr 6 01:06:17 hostname kernel: [ 9875.122854] [<c0141b70>] autoremove_wake_function+0x0/0x40
Apr 6 01:06:17 hostname kernel: [ 9875.123122] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
Apr 6 01:06:17 hostname kernel: [ 9875.123356] [kthread+0x42/0x70] kthread+0x42/0x70
Apr 6 01:06:17 hostname kernel: [ 9875.123497] [kthread+0x0/0x70] kthread+0x0/0x70
Apr 6 01:06:17 hostname kernel: [ 9875.123673] [kernel_thread_helper+0x7/0x10] kernel_thread_helper+0x7/0x10
Apr 6 01:06:17 hostname kernel: [ 9875.123937] =======================

... Then these soft lockups repeat roughly every 13-20 seconds. Their stack traces are not identical, but they share a common spot in the 'raid1d' function/thread. The most frequent offset is raid1d+0x770; sometimes it is +0x17b or nearby values. Here is a sample distribution:

      1 raid1d+0x174/0xff0
      5 raid1d+0x17b/0xff0
      1 raid1d+0x18d/0xff0
      1 raid1d+0x669/0xff0
      1 raid1d+0x75f/0xff0
     86 raid1d+0x770/0xff0
      1 raid1d+0x772/0xff0
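
(Such a distribution can be tallied from the kernel log with something along these lines; the log path is an assumption:)

grep -o 'raid1d+0x[0-9a-f]*/0xff0' /var/log/kern.log | sort | uniq -c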

During the check of these two RAID1 arrays (100 MB and 320 GB in total), the lockups happened more than 90 times. In wall-clock time, the check took 1 hour 37 minutes.

I don't know whether I should ignore these lockups or ask you, dear maintainers, to look into the problem; I just decided to inform you. I will provide further details on request. Above I have listed the common traits of my three systems (with identical sets of Hardy packages) on which these lockups are reproducible.

Revision history for this message
dmb (dbyrne-lineone) wrote :

I've been having unexplained lockups on my Ubuntu 7.10 (2.6.22-14-server) installation for some months; the symptom is a complete crash requiring a power cycle to recover the system: the power and reset buttons on the server don't respond, so it needs to be switched off and on at the mains. The server is a (don't laugh!) Pentium III 500 MHz dual-CPU with 1 GB RAM and 2x 300 GB Seagate Barracuda ATA (IDE) disks.

Coincidentally, on checking /var/log/messages today I found the info below, and a subsequent search turned up this bug report. I thought I'd post it in case it's of any use.

...
Apr 5 23:25:25 karanda -- MARK --
Apr 5 23:45:25 karanda -- MARK --
Apr 6 00:05:25 karanda -- MARK --
Apr 6 00:25:26 karanda -- MARK --
Apr 6 00:45:26 karanda -- MARK --
Apr 6 01:05:27 karanda -- MARK --
Apr 6 01:06:02 karanda kernel: [231760.863936] md: data-check of RAID array md0
Apr 6 01:06:02 karanda kernel: [231760.863971] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Apr 6 01:06:02 karanda kernel: [231760.863983] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Apr 6 01:06:02 karanda kernel: [231760.864004] md: using 128k window, over a total of 51199040 blocks.
Apr 6 01:06:02 karanda kernel: [231760.882873] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Apr 6 01:06:02 karanda kernel: [231760.911406] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
Apr 6 01:06:02 karanda kernel: [231760.911443] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Apr 6 19:48:04 karanda syslogd 1.4.1#21ubuntu3: restart.

Revision history for this message
dmb (dbyrne-lineone) wrote :

Just to confirm: I've manually run "/usr/share/mdadm/checkarray --all" on my system; the result is a system lockup at 10.1% of md0 completed.

Revision history for this message
Wladimir Mutel (mwg) wrote :

My systems do not freeze or get stuck. The check proceeds at normal speed (taking about 1 hour for 200 GB, 1.5 hours for 320 GB, and 2.5 hours for 500 GB) and logs its normal completion in the kernel log. It is just these soft lockups that bother me.
Your system may simply be old and showing faults under load. Soon I will set up md RAID1 on 2 IDE disks on a similarly old system (P3TDDE board, 2x PIII-1133 MHz) and tell you how it goes in my case.

Revision history for this message
dmb (dbyrne-lineone) wrote : Re: [Bug 212684] Re: RAID1 data-checks cause CPU soft lockups

Thanks for your mail Wladimir,

I think you may well be right - it's an old system, but it has had a new
lease of life with Linux! I'm also going to try reducing the max
bandwidth of the md devices to 10 Mb/s to see if that helps.

I'll let you know if I find an answer, and I would be very interested to
hear what you find if you do try building the PIII system.

Regards,

David
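
(For reference, the bandwidth cap mentioned above can be applied either globally or per array; a minimal sketch, with the value in KB/s and the device name assumed:)

echo 10000 > /proc/sys/dev/raid/speed_limit_max   # global ceiling for resync/check speed (as root)
echo 10000 > /sys/block/md0/md/sync_speed_max     # per-array ceiling for md0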


Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Hi Wladimir,

Is this still an issue with the most recent 2.6.24-19 Hardy kernel? If so, would you be willing to test and confirm whether it is still an issue with the latest Alpha of the upcoming Intrepid Ibex 8.10, which contains a 2.6.26-based kernel - http://www.ubuntu.com/testing ? Please let us know your results if you do get a chance to test. If the issue still exists, then per the kernel team's bug policy, could you please attach the following information? Please be sure to attach each file as a separate attachment.

* cat /proc/version_signature > version.log
* dmesg > dmesg.log
* sudo lspci -vvnn > lspci-vvnn.log

For more information regarding the kernel team bug policy, please refer to https://wiki.ubuntu.com/KernelTeamBugPolicies . Thanks again and we appreciate your help and feedback.

Changed in linux:
status: New → Incomplete
Revision history for this message
Wladimir Mutel (mwg) wrote :

The lockups are still there with 2.6.24-19.
I would note that this is observed with the ahci driver.
My other system with the ata_piix driver does not show this behaviour (as far as I could observe by running checkarray by hand).
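
(For reference, which driver is bound to a SATA controller can be checked with lspci; a sketch, output format varying by version:)

lspci -nnk | grep -A 3 -i sata    # the "Kernel driver in use:" line shows ahci or ata_piix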

Revision history for this message
Wladimir Mutel (mwg) wrote :

The next ones are from my other system, with the ata_piix driver, which does not lock up on the RAID check:

Revision history for this message
Wladimir Mutel (mwg) wrote :

I have a notebook (Asus Z99Le) where I sometimes test Ubuntu 8.10, but unfortunately it is not easy to create an md RAID1 there as it has only one HDD, and it would be inconvenient to attach one more drive to the same controller. It runs under the ahci driver, by the way.

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here, or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Revision history for this message
jab_celle (jan-ubuntu) wrote :

Hi,

I experienced the same problems on two identical servers at the same time:
Oct 5 01:06:01 server01 kernel: [472687.165519] md: data-check of RAID array md0
Oct 5 01:06:01 server01 kernel: [472687.165525] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Oct 5 01:06:01 server01 kernel: [472687.165528] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Oct 5 01:06:01 server01 kernel: [472687.165533] md: using 128k window, over a total of 48829440 blocks.
Oct 5 01:06:01 server01 kernel: [472687.166340] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Oct 5 01:07:47 server01 syslogd 1.5.0#1ubuntu1: restart.

and:

Oct 5 01:06:01 server02 kernel: [284514.605756] md: data-check of RAID array md0
Oct 5 01:06:01 server02 kernel: [284514.605761] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Oct 5 01:06:01 server02 kernel: [284514.605764] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Oct 5 01:06:01 server02 kernel: [284514.605769] md: using 128k window, over a total of 48829440 blocks.
Oct 5 01:06:01 server02 kernel: [284514.606738] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Oct 5 01:07:34 server02 syslogd 1.5.0#1ubuntu1: restart.

Both servers are Dell PowerEdge R200 with 4 GB of RAM and 2x 400 GB SATA disks. Both servers run DRBD (primary/primary) and OpenVZ. They had no load at this time. The kernel is: Linux serer01.XXXXXXXX 2.6.24-19-openvz #1 SMP Wed Aug 20 22:07:43 UTC 2008 x86_64 GNU/Linux.

Revision history for this message
Wladimir Mutel (mwg) wrote :

By the way, I recently built another system (mobo ASUS M3N-H/HDMI, CPU Athlon X2 4450e (2 cores), chipset GeForce 8300, SATA controller 10de:0ad4 rev. a2 (MCP78S AHCI controller) driven by 'ahci', and 2 Samsung HD753LJ SATA2 disks combined into an mdadm RAID1 array as described initially). Its monthly check (with kernel 2.6.24-19) passed without any soft lockups. So the problem depends on the 'ahci' module driving only certain hardware (Intel so far; who knows what else, but not everything).

Unfortunately, I could not find anyone around me willing to test all this hardware with unstable/prerelease kernels and distributions. Usually everyone wants to run stable software there that they can rely on.

Revision history for this message
biolscedu (higgins) wrote :

This same problem causes my server to crash on the first Sunday of every month. The logs are attached.

Jan 4 01:06:01 grizzly kernel: [2176742.092908] md: data-check of RAID array md0
Jan 4 01:06:01 grizzly kernel: [2176742.092919] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jan 4 01:06:01 grizzly kernel: [2176742.092922] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jan 4 01:06:01 grizzly kernel: [2176742.092930] md: using 128k window, over a total of 16779776 blocks.
Jan 4 01:06:01 grizzly kernel: [2176742.106731] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Jan 4 01:06:01 grizzly kernel: [2176742.114936] md: delaying data-check of md2 until md1 has finished (they share one or more physical units)
Jan 4 01:06:01 grizzly kernel: [2176742.118081] md: delaying data-check of md3 until md2 has finished (they share one or more physical units)
Jan 4 01:06:01 grizzly kernel: [2176742.118331] md: delaying data-check of md1 until md3 has finished (they share one or more physical units)
Jan 4 01:06:01 grizzly kernel: [2176742.118469] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
Jan 4 01:06:01 grizzly kernel: [2176742.118600] md: delaying data-check of md3 until md2 has finished (they share one or more physical units)

Revision history for this message
Wladimir Mutel (mwg) wrote :

And so, on Jan 4th, the check was performed on one system now running Ubuntu Intrepid, with kernel 2.6.27-9-generic and the ahci module driving an Intel Corporation 82801GR/GH (ICH7 Family) SATA AHCI Controller (rev 01) (8086:27c1). Lockups were still reported; the CPU was stuck for 61 s. In the reported stacks, there were always the lines:

[<c037e5eb>] ? _spin_lock_irq+0x1b/0x20
[<f886c8f6>] raid1d+0xb6/0x3e0 [raid1]

Above those, the stack contents varied.

The other two systems I mentioned initially are now managed by another person, so I cannot give further reports from them.
The system with ata_piix mentioned in comment #9 has been disassembled. The system with AMD/NForce/ahci mentioned in comment #15 still runs 2.6.24-22-generic (Hardy) and does not report any lockups on its monthly data-checks. So I feel this is related to the Intel SATA/ahci interaction.

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

If anyone is willing, you may want to test the latest pre-release of Jaunty (currently Alpha3). It contains a 2.6.28 based kernel - http://cdimage.ubuntu.com/releases/jaunty/ . Thanks.

Changed in linux:
importance: Undecided → Medium
status: Incomplete → Triaged
Revision history for this message
Wladimir Mutel (mwg) wrote :

Good news for everyone. As of linux-image-2.6.27-11-generic 2.6.27-11.26, it seems that these lockups are gone.
I installed this kernel package on Jan 29th, then rebooted the system into the new kernel that night.
On Feb 1st, the RAID1 checks passed without lockups on the system where they were usually reported earlier (Intel + ahci).

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Thanks for the feedback. Let's go ahead and mark this Fix Released then.

Changed in linux:
status: Triaged → Fix Released
Revision history for this message
Julius Bloch (jbloch) wrote :

Hi,
this problem still affects Ubuntu 8.04.2 LTS systems.
So I think we should also have a fix for 8.04.2 LTS, because there is no newer LTS version available.

Julius Bloch (jbloch)
Changed in linux (Ubuntu):
status: Fix Released → Confirmed
Revision history for this message
cdiggity (craig-nzenergy) wrote :

Is this fixed in Hardy LTS?

Revision history for this message
magec (magec) wrote :

Hi,

I have Ubuntu Hardy and the problem is still present.

> cat /proc/version_signature
Ubuntu 2.6.24-19.41-generic

The point is that I don't get any of those messages in the system log (BUG: soft lockup - CPU#0 stuck for 11s!), but something similar does seem to be happening, as every process accessing the disk stalls until the check ends. I have tried renicing the md?_resync process and changing its I/O priority without success, and I have also limited the maximum resync speed, but with no luck either. The only symptoms I see are an iowait of 30% and a load average that grows a lot (simply because every process is stuck waiting for the check to finish). I have read that the problem does not seem to reproduce on recent versions; is a backport planned for Hardy? Is there a way to solve or work around this on Hardy? Does it have to do with the kernel version (the md driver)?

Well, thanks in advance; any help will be appreciated.
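
(For reference, the renicing/throttling described above typically amounts to something like the following; the thread name and values are assumptions:)

sudo renice -n 19 -p "$(pgrep -f '_resync')"              # lower the resync thread's CPU priority
sudo ionice -c 3 -p "$(pgrep -f '_resync')"               # idle I/O class (may have no effect on kernel threads)
echo 5000 | sudo tee /proc/sys/dev/raid/speed_limit_max   # cap the check speed, in KB/s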

Revision history for this message
ceg (ceg) wrote :

Intended to mark this as fixed for the newer release (Lucid) in Launchpad, but could not do it.

Revision history for this message
Enrico (bugone) wrote :

Hi there,

Is there any solution yet for this issue?
I do not see any lockups in the logs, but the system still crashes.
I use OpenVZ kernels on 2 machines with the same kernel (2.6.24-27-openvz) and the same hardware, both on Hardy, and both of them are affected.
Every 1st Sunday the systems crash on the array check.

I tried downgrading the kernel on one of the two machines (to 2.6.24-26-openvz).
On the next first Sunday of the month, the machine with the lower kernel version crashed, while the other machine instead had a RAID array sync failure that somehow kept it alive.
I received this email about the array sync failure on the working machine:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md3 : active raid0 md1[0] md2[1]
     898643712 blocks 64k chunks

md2 : active raid1 sdc2[0] sdd2[1]
     449321920 blocks [2/2] [UU]

md1 : active raid1 sda2[2](F) sdb2[1]
     449321920 blocks [2/1] [_U]
     [===================>.] check = 99.9% (448939328/449321920) finish=0.5min speed=10895K/sec

md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
     39061952 blocks [4/4] [UUUU]

unused devices: <none>

And on /var/log/debug (on the machine that is still alive) we have:

May 2 01:06:01 verus /USR/SBIN/CRON[8509]: (root) CMD ([ -x /usr/share/mdadm/checkarray ] && [ $(date +%d) -le 7 ] && /usr/share/mdadm/checkarray --cron --all --quiet)
May 2 01:06:01 verus kernel: [2200235.996209] md: data-check of RAID array md0
May 2 01:06:01 verus kernel: [2200235.996218] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
May 2 01:06:01 verus kernel: [2200235.996224] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
May 2 01:06:01 verus kernel: [2200235.996235] md: using 128k window, over a total of 39061952 blocks.
May 2 01:06:01 verus kernel: [2200235.998255] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
May 2 01:06:01 verus kernel: [2200235.998875] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
May 2 01:06:01 verus mdadm: RebuildStarted event detected on md device /dev/md0
May 2 01:08:01 verus mdadm: Rebuild20 event detected on md device /dev/md0

The same machine, the month before, didn't have any array sync failure. It crashed, and the messages were the same:

Apr 4 01:06:01 verus /USR/SBIN/CRON[7539]: (root) CMD ([ -x /usr/share/mdadm/checkarray ] && [ $(date +%d) -le 7 ] && /usr/share/mdadm/checkarray --cron --all --quiet)
Apr 4 01:06:01 verus kernel: [1248149.935860] md: data-check of RAID array md0
Apr 4 01:06:01 verus kernel: [1248149.935868] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Apr 4 01:06:01 verus kernel: [1248149.935872] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Apr 4 01:06:01 verus kernel: [1248149.935883] md: using 128k window, over a total of 39061952 blocks.
Apr 4 01:06:01 verus kernel: [1248149.939456] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Apr 4 01:06:01 verus mdadm: RebuildStarted event detected on md device /dev/md0
Apr 4 01:06:01 verus kernel: [1248149.93...


Revision history for this message
ryan (ryanobjc) wrote :

Hi guys,

This bug just took out my production databases. The overhead of the RAID check is not trivial, and it ruined my RAID10 performance on EBS, leading to serious MySQL performance problems and thus site performance problems.

Why is this bug still open? I am going to have to go through and neuter this script on all my systems.
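
(For reference, the monthly check is driven by the mdadm cron job on Debian/Ubuntu; a sketch of disabling it, using paths from the stock packaging that should be verified on your release:)

grep checkarray /etc/cron.d/mdadm                                     # the job that starts the monthly check
sudo sed -i 's/^AUTOCHECK=true/AUTOCHECK=false/' /etc/default/mdadm   # checkarray --cron is expected to honour this
# alternatively, comment out the checkarray line in /etc/cron.d/mdadm directly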

Revision history for this message
Yuriy Padlyak (gneeot) wrote :

Any news on this?

Revision history for this message
Kari Tuurihalme (thunderer85) wrote :

Hi.

I just ran into this bug using Ubuntu 12.04 and the 3.2.0-24-generic kernel.

Here is the syslog:

May 6 00:57:02 Vouhaukko2 CRON[10944]: (root) CMD (if [ -x /usr/share/mdadm/checkarray ] && [ $(date +%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi)
May 6 00:57:02 Vouhaukko2 kernel: [738906.402166] md: data-check of RAID array md0
May 6 00:57:02 Vouhaukko2 kernel: [738906.402173] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
May 6 00:57:02 Vouhaukko2 kernel: [738906.402179] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
May 6 00:57:02 Vouhaukko2 kernel: [738906.402187] md: using 128k window, over a total of 488384448k.
May 6 00:57:02 Vouhaukko2 mdadm[1216]: RebuildStarted event detected on md device /dev/md0

and after that it hangs.

Anything I can do to help nail it?

Revision history for this message
Yuriy Padlyak (gneeot) wrote :

Any news? It has caused a problem for another of our servers once again :(

Revision history for this message
penalvch (penalvch) wrote :

Wladimir Mutel, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue. Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command in the development release from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux <replace-with-bug-number>

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please do not test the kernel in the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. As well, please comment on which kernel version specifically you tested.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream', and comment as to why specifically you were unable to test it.

Please let us know your results. Thanks in advance.

tags: added: needs-upstream-testing
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Wladimir Mutel (mwg) wrote :

Now I don't have access to these boxes where this happened a number of years ago.
In other setups, I could not reproduce this behaviour.

Revision history for this message
penalvch (penalvch) wrote :

Wladimir Mutel, this bug report is being closed due to your last comment regarding you no longer having access to the hardware. For future reference you can manage the status of your own bugs by clicking on the current status in the yellow line and then choosing a new status in the revealed drop down box. You can learn more about bug statuses at https://wiki.ubuntu.com/Bugs/Status. Thank you again for taking the time to report this bug and helping to make Ubuntu better. Please submit any future bugs you may find.

Changed in linux (Ubuntu):
status: Incomplete → Invalid
Revision history for this message
Kari Tuurihalme (thunderer85) wrote :

Christopher M. Penalver (penalvch): Please check the comments. There are two comments within 3 months that confirm it is still an issue. I still have access to the system that had this issue just over a month ago. Unfortunately it is headless, and lynx has some issues with Launchpad referrer headers. Please advise.

Revision history for this message
penalvch (penalvch) wrote :

Kari Tuurihalme, I read all comments before taking any actions on reports.

Despite this, please be advised that closing this report is fully justified, as the original reporter noted that he no longer has the original hardware, nor could he reproduce the problem on different hardware:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/212684/comments/36

Hence, this is a hardware-dependent issue, which demands separate bug reports for each hardware type until a developer says otherwise. Further justification is detailed in:
https://help.ubuntu.com/community/ReportingBugs#A3._Make_sure_the_bug_hasn.27t_already_been_reported
https://help.ubuntu.com/community/ReportingBugs#Adding_Apport_Debug_Information_to_an_Existing_Launchpad_Bug

If you, or anyone else, are having a problem in Ubuntu, please file a new report by executing the following in a terminal, and feel free to subscribe me to it:
ubuntu-bug linux

Thanks!

Revision history for this message
Peng Yong (ppyy) wrote :

Same problem on Ubuntu 12.04: all our servers started rebuilding their software RAID this morning.

Why is this bug still not fixed?

Revision history for this message
Peng Yong (ppyy) wrote :

To reproduce the problem, just run the following on any software-RAID server:

/usr/share/mdadm/checkarray --cron --all --idle --quiet;

then:

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[0] sdb1[1]
      248640 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      292587328 blocks super 1.2 [2/2] [UU]
      [>....................] check = 0.0% (74240/292587328) finish=65.6min speed=74240K/sec

Changed in linux (Ubuntu):
status: Invalid → Confirmed
Revision history for this message
penalvch (penalvch) wrote :

Peng Yong, if you have a bug in Ubuntu, could you please file a new report by executing the following in a terminal:
ubuntu-bug linux

For more on this, please see the Ubuntu Bug Control and Ubuntu Bug Squad article:
https://wiki.ubuntu.com/Bugs/BestPractices#X.2BAC8-Reporting.Focus_on_One_Issue

and Ubuntu Community article:
https://help.ubuntu.com/community/ReportingBugs#Bug_reporting_etiquette

When opening up the new report, please feel free to subscribe me to it.

Please note that not filing a new report may delay your problem being addressed as quickly as possible.

Thank you for your understanding.

Changed in linux (Ubuntu):
status: Confirmed → Invalid