Mdadm crash on raid5 reshape

Bug #1001019 reported by Alex Sergeyev
This bug affects 4 people
Affects          Status      Importance   Assigned to   Milestone
mdadm (Ubuntu)   Confirmed   Undecided    Unassigned

Bug Description

Ubuntu 12.04 LTS
mdadm 3.2.3-2ubuntu1

I tried to grow an mdadm RAID5 array, changing the number of devices and the chunk size at the same time:
mdadm --grow /dev/md2 --raid-devices=4 --chunk=512 --backup-file=/root/md2_backup.img

After this, /proc/mdstat no longer responds and I can't get any information about the RAID status.
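
A possible workaround (just an idea, not verified to avoid the hang) would be to split the change into two separate reshapes instead of combining --raid-devices and --chunk in a single command; the backup-file paths below are only examples:

# 1. Grow to 4 devices first, then wait for this reshape to finish
mdadm --grow /dev/md2 --raid-devices=4 --backup-file=/root/md2_backup_devices.img
# 2. Change the chunk size in a second, separate reshape
mdadm --grow /dev/md2 --chunk=512 --backup-file=/root/md2_backup_chunk.img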

May 17 23:27:17 protoss kernel: [40835.240145] RAID conf printout:
May 17 23:27:17 protoss kernel: [40835.240153] --- level:5 rd:4 wd:4
May 17 23:27:17 protoss kernel: [40835.240160] disk 0, o:1, dev:sdc1
May 17 23:27:17 protoss kernel: [40835.240166] disk 1, o:1, dev:sdd1
May 17 23:27:17 protoss kernel: [40835.240171] disk 2, o:1, dev:sdf1
May 17 23:27:17 protoss kernel: [40835.240176] disk 3, o:1, dev:sdg1
May 17 23:27:17 protoss kernel: [40835.240332] md: reshape of RAID array md2
May 17 23:27:17 protoss kernel: [40835.240342] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
May 17 23:27:17 protoss kernel: [40835.240348] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
May 17 23:27:17 protoss kernel: [40835.240365] md: using 128k window, over a total of 1953513472k.
May 17 23:27:17 protoss kernel: [40835.507761] md: md_do_sync() got signal ... exiting
May 17 23:30:42 protoss kernel: [41040.652048] INFO: task md2_raid5:259 blocked for more than 120 seconds.
May 17 23:30:42 protoss kernel: [41040.652055] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 17 23:30:42 protoss kernel: [41040.652062] md2_raid5 D ffffffff81806240 0 259 2 0x00000000
May 17 23:30:42 protoss kernel: [41040.652074] ffff8801336a1c40 0000000000000046 ffffea0001d7f180 ffffffff7fffffff
May 17 23:30:42 protoss kernel: [41040.652086] ffff8801336a1fd8 ffff8801336a1fd8 ffff8801336a1fd8 0000000000013780
May 17 23:30:42 protoss kernel: [41040.652097] ffff880138b796f0 ffff880133ba2de0 ffff8801336a1c30 ffff880133ba2de0
May 17 23:30:42 protoss kernel: [41040.652108] Call Trace:
May 17 23:30:42 protoss kernel: [41040.652123] [<ffffffff8165a55f>] schedule+0x3f/0x60
May 17 23:30:42 protoss kernel: [41040.652157] [<ffffffffa00a3d2e>] resize_stripes+0x51e/0x590 [raid456]
May 17 23:30:42 protoss kernel: [41040.652167] [<ffffffff81056c9c>] ? update_shares+0xcc/0x100
May 17 23:30:42 protoss kernel: [41040.652176] [<ffffffff8105f990>] ? try_to_wake_up+0x200/0x200
May 17 23:30:42 protoss kernel: [41040.652192] [<ffffffffa00a3e1f>] check_reshape+0x7f/0xd0 [raid456]
May 17 23:38:42 protoss kernel: [41520.652310] [<ffffffff8108a3a0>] ? flush_kthread_worker+0xa0/0xa0
May 17 23:38:42 protoss kernel: [41520.652319] [<ffffffff81666bf0>] ? gs_change+0x13/0x13
May 17 23:38:42 protoss kernel: [41520.652357] INFO: task lvm:25938 blocked for more than 120 seconds.
May 17 23:38:42 protoss kernel: [41520.652362] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 17 23:38:42 protoss kernel: [41520.652367] lvm D ffffffff81806240 0 25938 25935 0x00000000
May 17 23:38:42 protoss kernel: [41520.652377] ffff88006c6b1ae8 0000000000000086 0000000000000000 ffff88006c6b1aa8
May 17 23:38:42 protoss kernel: [41520.652388] ffff88006c6b1fd8 ffff88006c6b1fd8 ffff88006c6b1fd8 0000000000013780
May 17 23:38:42 protoss kernel: [41520.652398] ffff880138b616f0 ffff88010c02dbc0 ffff88006c6b1ab8 ffff88013fc94040
May 17 23:38:42 protoss kernel: [41520.652409] Call Trace:
May 17 23:38:42 protoss kernel: [41520.652417] [<ffffffff8165a55f>] schedule+0x3f/0x60
May 17 23:38:42 protoss kernel: [41520.652424] [<ffffffff8165a60f>] io_schedule+0x8f/0xd0
May 17 23:38:42 protoss kernel: [41520.652434] [<ffffffff811b0e74>] dio_await_completion+0x54/0xd0
May 17 23:38:42 protoss kernel: [41520.652442] [<ffffffff811b3484>] __blockdev_direct_IO+0x954/0xd90
May 17 23:38:42 protoss kernel: [41520.652451] [<ffffffff811af6f0>] ? blkdev_get_block+0x80/0x80
May 17 23:38:42 protoss kernel: [41520.652461] [<ffffffff811af1b7>] blkdev_direct_IO+0x57/0x60
May 17 23:38:42 protoss kernel: [41520.652468] [<ffffffff811af6f0>] ? blkdev_get_block+0x80/0x80
May 17 23:38:42 protoss kernel: [41520.652478] [<ffffffff811196bb>] generic_file_aio_read+0x24b/0x280
May 17 23:38:42 protoss kernel: [41520.652488] [<ffffffff811874ac>] ? path_openat+0xfc/0x3f0
May 17 23:38:42 protoss kernel: [41520.652496] [<ffffffff81177452>] do_sync_read+0xd2/0x110
May 17 23:38:42 protoss kernel: [41520.652506] [<ffffffff8129cd03>] ? security_file_permission+0x93/0xb0
May 17 23:38:42 protoss kernel: [41520.652514] [<ffffffff811778d1>] ? rw_verify_area+0x61/0xf0
May 17 23:38:42 protoss kernel: [41520.652521] [<ffffffff81177db0>] vfs_read+0xb0/0x180
May 17 23:38:42 protoss kernel: [41520.652529] [<ffffffff81177eca>] sys_read+0x4a/0x90
May 17 23:38:42 protoss kernel: [41520.652537] [<ffffffff81664a82>] system_call_fastpath+0x16/0x1b

Revision history for this message
James Lee (james-lee) wrote :

I've seen this bug too - I was also doing a reshape of a RAID5 array (but without changing chunk size - just adding a new drive).

I'm also only seeing this intermittently. One thing that might be relevant is that I'm reshaping several arrays in succession through a script (running them in series, but with no waiting between reshapes). I've attached my dmesg output - similar error and stack trace to the above.

Let me know if there are any other diagnostics that would help here; I can hopefully repro this without too much trouble.
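
In case it helps, this is roughly what could be captured the next time it hangs (a sketch only; /dev/mdX stands in for the affected array, sysrq has to be enabled, and mdadm --detail may itself block while the array is stuck):

# Allow sysrq, then dump stack traces of blocked (D-state) tasks into the kernel log
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg > /tmp/md_hang_dmesg.txt

# Array state via mdadm and sysfs, avoiding /proc/mdstat since it hangs
mdadm --detail /dev/mdX
cat /sys/block/mdX/md/sync_action
cat /sys/block/mdX/md/reshape_position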

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mdadm (Ubuntu):
status: New → Confirmed
Revision history for this message
ole.tange (n-launchpad-net-tange-dk) wrote :

I have seen this too:

md1 : active raid6 sdg[0] sdi[12](S) sdt[15](S) sdy[17](S) sdx[16](S) sdh[8] sdw[13] sdo[14] sdk[5] sdd[11] sdc[3] sdv[9] sdn[10]
      27349121408 blocks super 1.2 level 6, 128k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      bitmap: 2/2 pages [8KB], 1048576KB chunk

mdadm -v --grow /dev/md1 -b none

mdadm -v --grow /dev/md1 --raid-devices=10 --backup-file=/root/back-md1

cat /proc/mdstat
<<hangs>>
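
As an aside (assuming, though not confirmed here, that the bitmap was removed with -b none so the reshape could start): once the reshape eventually completes, the internal bitmap can be re-added with:

mdadm --grow /dev/md1 --bitmap=internal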
