RAID1 data-checks cause CPU soft lockups
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Invalid | Medium | Unassigned |
Bug Description
Binary package hint: linux-image-
I track Hardy development packages on some of my systems.
They share some common hardware and configuration features, in particular:
a Pentium 4 CPU with Hyper-Threading turned on
(so that 2 logical cores are visible); 1 or 2 GB of RAM;
an Intel chipset with an ICH5/6/7[R] SATA controller, with RAID turned off;
two SATA disks (Seagate, WD or Samsung) of 200, 320 or 500 GB each.
On each of these disks, two 0xfd partitions (Linux RAID autodetect) are allocated: the first of 100 MB, and the second taking the rest of the disk.
They are assembled by mdadm into RAID1 (mirrored) md arrays:
the first, of 100 MB, holds the /boot filesystem;
and the second, big one holds an LVM2 PV with a volume group containing a couple of LVs with filesystems, including the root FS (all ext3), as well as a swap LV.
The boot loader is LILO. The systems boot and run well, providing appropriate fault tolerance when a disk fails.
However, the mdadm package provides a 'checkarray' script, run by cron on the first Sunday of each month, to check the integrity of the RAID arrays.
What the script effectively does is
'echo check > $i' for each i in /sys/block/
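The effect of checkarray can be sketched as the following shell fragment. This is a minimal sketch, assuming the standard md sysfs layout in which each array exposes a writable `md/sync_action` attribute; the mock directory tree exists only so the sketch runs without real arrays (on a real system `SYSFS` would simply be `/sys` and the setup lines go away).

```shell
# Mock sysfs tree so the sketch is runnable anywhere; on a real
# system SYSFS would be /sys and this setup would not be needed.
SYSFS="$(mktemp -d)"
mkdir -p "$SYSFS/block/md0/md" "$SYSFS/block/md2/md"
echo idle > "$SYSFS/block/md0/md/sync_action"
echo idle > "$SYSFS/block/md2/md/sync_action"

# What checkarray effectively does: write "check" into each
# array's sync_action to start a read-only data-check.  The
# kernel serializes checks of arrays that share physical disks
# ("delaying data-check of md2 until md0 has finished").
for f in "$SYSFS"/block/md*/md/sync_action; do
    [ -w "$f" ] && echo check > "$f"
done
```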
This integrity check produces the following messages in the kernel log:
Apr 6 01:06:02 hostname kernel: [ 9859.807932] md: data-check of RAID array md0
Apr 6 01:06:02 hostname kernel: [ 9859.808090] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Apr 6 01:06:02 hostname kernel: [ 9859.808222] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Apr 6 01:06:02 hostname kernel: [ 9859.808422] md: using 128k window, over a total of 104320 blocks.
Apr 6 01:06:02 hostname kernel: [ 9859.886364] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
Apr 6 01:06:04 hostname kernel: [ 9862.098900] md: md0: data-check done.
Apr 6 01:06:04 hostname kernel: [ 9862.137205] md: data-check of RAID array md2
Apr 6 01:06:04 hostname kernel: [ 9862.137238] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Apr 6 01:06:04 hostname kernel: [ 9862.137272] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Apr 6 01:06:04 hostname kernel: [ 9862.137327] md: using 128k window, over a total of 312464128 blocks.
Apr 6 01:06:04 hostname kernel: [ 9862.189968] RAID1 conf printout:
Apr 6 01:06:04 hostname kernel: [ 9862.190003] --- wd:2 rd:2
Apr 6 01:06:04 hostname kernel: [ 9862.190035] disk 0, wo:0, o:1, dev:sdb1
Apr 6 01:06:04 hostname kernel: [ 9862.190062] disk 1, wo:0, o:1, dev:sda1
... 13 seconds later:
Apr 6 01:06:17 hostname kernel: [ 9875.118427] BUG: soft lockup - CPU#0 stuck for 11s! [md2_raid1:2378]
Apr 6 01:06:17 hostname kernel: [ 9875.118581]
Apr 6 01:06:17 hostname kernel: [ 9875.118671] Pid: 2378, comm: md2_raid1 Not tainted (2.6.24-15-generic #1)
Apr 6 01:06:17 hostname kernel: [ 9875.118811] EIP: 0060:[<f887c9b0>] EFLAGS: 00010282 CPU: 0
Apr 6 01:06:17 hostname kernel: [ 9875.119048] EIP is at raid1d+0x770/0xff0 [raid1]
Apr 6 01:06:17 hostname kernel: [ 9875.119159] EAX: e7ffb000 EBX: c14fff60 ECX: 00000f24 EDX: f4b41800
Apr 6 01:06:17 hostname kernel: [ 9875.119284] ESI: e7ffb0dc EDI: e807e0dc EBP: df92fe40 ESP: f7495e9c
Apr 6 01:06:17 hostname kernel: [ 9875.119448] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Apr 6 01:06:17 hostname kernel: [ 9875.119567] CR0: 8005003b CR2: b7f32480 CR3: 374de000 CR4: 000006d0
Apr 6 01:06:17 hostname kernel: [ 9875.119696] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Apr 6 01:06:17 hostname kernel: [ 9875.119833] DR6: ffff0ff0 DR7: 00000400
Apr 6 01:06:17 hostname kernel: [ 9875.120537] [jbd:schedule+
Apr 6 01:06:17 hostname kernel: [ 9875.121051] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
Apr 6 01:06:17 hostname kernel: [ 9875.121319] [shpchp:
Apr 6 01:06:17 hostname kernel: [ 9875.121494] [apic_timer_
Apr 6 01:06:17 hostname kernel: [ 9875.121751] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
Apr 6 01:06:17 hostname kernel: [ 9875.122050] [<f887007b>] mirror_
Apr 6 01:06:17 hostname kernel: [ 9875.122333] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
Apr 6 01:06:17 hostname kernel: [ 9875.122590] [<f88b5ef3>] md_thread+0x23/0xe0 [md_mod]
Apr 6 01:06:17 hostname kernel: [ 9875.122854] [<c0141b70>] autoremove_
Apr 6 01:06:17 hostname kernel: [ 9875.123122] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
Apr 6 01:06:17 hostname kernel: [ 9875.123356] [kthread+0x42/0x70] kthread+0x42/0x70
Apr 6 01:06:17 hostname kernel: [ 9875.123497] [kthread+0x0/0x70] kthread+0x0/0x70
Apr 6 01:06:17 hostname kernel: [ 9875.123673] [kernel_
Apr 6 01:06:17 hostname kernel: [ 9875.123937] =======
... Then these soft lockups repeat about every 13 to 20 seconds. Their stack traces are not identical, but they share a common spot in the 'raid1d' function/thread. The most frequent offset is raid1d+0x770; sometimes it is +0x17b or nearby values. Here is a sample distribution:
1 raid1d+0x174/0xff0
5 raid1d+0x17b/0xff0
1 raid1d+0x18d/0xff0
1 raid1d+0x669/0xff0
1 raid1d+0x75f/0xff0
86 raid1d+0x770/0xff0
1 raid1d+0x772/0xff0
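A distribution like the one above can be tallied straight from the log with a small pipeline. This is a sketch; a few sample "EIP is at" lines from this report are inlined so the pipeline runs stand-alone, and against a real system the `printf` would be replaced by something like `grep 'EIP is at raid1d' /var/log/kern.log`.

```shell
# Extract the raid1d offset from each soft-lockup backtrace line
# and count how often each occurs, most frequent first.
printf '%s\n' \
  '[ 9875.119159] EIP is at raid1d+0x770/0xff0 [raid1]' \
  '[ 9888.044321] EIP is at raid1d+0x770/0xff0 [raid1]' \
  '[ 9901.512345] EIP is at raid1d+0x17b/0xff0 [raid1]' |
grep -o 'raid1d+0x[0-9a-f]*/0xff0' | sort | uniq -c | sort -rn
```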
During the check of these two RAID1 arrays, 100 MB and 320 GB in total, the lockups happened more than 90 times. In wall-clock time, the check took 1 hour 37 minutes.
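For scale, those numbers imply an average check rate far below the 200000 KB/sec ceiling logged by md. This is a rough back-of-the-envelope calculation using the 312464128 1K blocks reported for md2 and the 97-minute wall-clock time (md0's 104320 blocks are negligible):

```shell
# Average per-disk read rate during the data-check.
blocks=312464128          # md2 size in 1K blocks, from the log
seconds=$((97 * 60))      # 1 h 37 min of wall-clock time
echo "$((blocks / seconds)) KB/sec"   # about 52 MiB/s on average
```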
I don't know whether I should ignore these lockups or ask you, dear maintainers, to research the problem; I just decided to inform you. I will provide any further details on request. Above I have listed the common traits of the three of my systems (with identical sets of Hardy packages) on which these lockups are reproducible.
Changed in linux (Ubuntu):
status: Fix Released → Confirmed
I've been having unexplained lockups on my Ubuntu 7.10 (2.6.22-14-server) installation for some months; the symptom is a complete crash requiring a power cycle to recover the system: the power and reset buttons on the server don't respond, and it needs a switch off/on at the mains. The server is a (don't laugh!) dual-CPU Pentium III 500 MHz with 1 GB RAM and two 300 GB Seagate Barracuda ATA (IDE) disks.
Coincidentally, on checking /var/log/messages today I found the info below, and a subsequent search turned up this bug report. Thought I'd post it in case it's of any use.
...
Apr 5 23:25:25 karanda -- MARK --
Apr 5 23:45:25 karanda -- MARK --
Apr 6 00:05:25 karanda -- MARK --
Apr 6 00:25:26 karanda -- MARK --
Apr 6 00:45:26 karanda -- MARK --
Apr 6 01:05:27 karanda -- MARK --
Apr 6 01:06:02 karanda kernel: [231760.863936] md: data-check of RAID array md0
Apr 6 01:06:02 karanda kernel: [231760.863971] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Apr 6 01:06:02 karanda kernel: [231760.863983] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Apr 6 01:06:02 karanda kernel: [231760.864004] md: using 128k window, over a total of 51199040 blocks.
Apr 6 01:06:02 karanda kernel: [231760.882873] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Apr 6 01:06:02 karanda kernel: [231760.911406] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
Apr 6 01:06:02 karanda kernel: [231760.911443] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Apr 6 19:48:04 karanda syslogd 1.4.1#21ubuntu3: restart.