kernel I/O hangs during mdcheck/resync

Bug #1942935 reported by Chad Wagner
This bug affects 15 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Confirmed
Undecided
Unassigned
linux-signed-hwe-5.11 (Ubuntu)
Confirmed
Undecided
Unassigned
linux-signed-hwe-5.15 (Ubuntu)
Confirmed
Undecided
Unassigned
linux-signed-hwe-5.4 (Ubuntu)
Confirmed
Undecided
Unassigned

Bug Description

It always seems to occur during an mdcheck/resync. If I am logged in via SSH the system is still somewhat responsive and basic utilities like dmesg will work, but it appears any write I/O will hang the terminal, and nothing is written to syslog (presumably because that is blocked too).

Below is the output of dmesg and cat /proc/mdstat. It appears the data check was interrupted yet /proc/mdstat still shows progress, along with a whole slew of hung tasks, including md1_resync itself.

[756484.534293] md: data-check of RAID array md0
[756484.628039] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
[756493.808773] md: md0: data-check done.
[756493.829760] md: data-check of RAID array md1
[778112.446410] md: md1: data-check interrupted.
[810654.608102] md: data-check of RAID array md1
[832291.201064] md: md1: data-check interrupted.
[899745.389485] md: data-check of RAID array md1
[921395.835305] md: md1: data-check interrupted.
[921588.558834] INFO: task systemd-journal:376 blocked for more than 120 seconds.
[921588.558846] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921588.558850] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921588.558854] task:systemd-journal state:D stack: 0 pid: 376 ppid: 1 flags:0x00000220
[921588.558859] Call Trace:
[921588.558864] __schedule+0x44c/0x8a0
[921588.558872] schedule+0x4f/0xc0
[921588.558876] md_write_start+0x150/0x240
[921588.558880] ? wait_woken+0x80/0x80
[921588.558886] raid5_make_request+0x88/0x890 [raid456]
[921588.558898] ? wait_woken+0x80/0x80
[921588.558901] ? mempool_kmalloc+0x17/0x20
[921588.558904] md_handle_request+0x12d/0x1a0
[921588.558907] ? __part_start_io_acct+0x51/0xf0
[921588.558912] md_submit_bio+0xca/0x100
[921588.558915] submit_bio_noacct+0x112/0x4f0
[921588.558918] ? ext4_fc_reserve_space+0x110/0x230
[921588.558922] submit_bio+0x51/0x1a0
[921588.558925] ? _cond_resched+0x19/0x30
[921588.558928] ? kmem_cache_alloc+0x38e/0x440
[921588.558932] ? ext4_init_io_end+0x1f/0x50
[921588.558936] ext4_io_submit+0x4d/0x60
[921588.558940] ext4_writepages+0x2c6/0xcd0
[921588.558944] do_writepages+0x43/0xd0
[921588.558948] ? do_writepages+0x43/0xd0
[921588.558951] ? fault_dirty_shared_page+0xa5/0x110
[921588.558955] __filemap_fdatawrite_range+0xcc/0x110
[921588.558960] file_write_and_wait_range+0x74/0xc0
[921588.558962] ext4_sync_file+0xf5/0x350
[921588.558967] vfs_fsync_range+0x49/0x80
[921588.558970] do_fsync+0x3d/0x70
[921588.558973] __x64_sys_fsync+0x14/0x20
[921588.558976] do_syscall_64+0x38/0x90
[921588.558980] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[921588.558984] RIP: 0033:0x7f4c97ee832b
[921588.558987] RSP: 002b:00007ffdceb29e50 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[921588.558991] RAX: ffffffffffffffda RBX: 000055ced34b0fa0 RCX: 00007f4c97ee832b
[921588.558993] RDX: 00007f4c97fc8000 RSI: 000055ced3487b70 RDI: 0000000000000021
[921588.558995] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007ffdceb29fa8
[921588.558996] R10: 00007f4c97d2c848 R11: 0000000000000293 R12: 00007ffdceb29fa8
[921588.558998] R13: 00007ffdceb29fa0 R14: 000055ced34b0fa0 R15: 000055ced34bcf90
[921588.559014] INFO: task mysqld:1505 blocked for more than 120 seconds.
[921588.559018] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921588.559022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921588.559025] task:mysqld state:D stack: 0 pid: 1505 ppid: 1 flags:0x00000000
[921588.559030] Call Trace:
[921588.559032] __schedule+0x44c/0x8a0
[921588.559036] schedule+0x4f/0xc0
[921588.559040] md_write_start+0x150/0x240
[921588.559044] ? wait_woken+0x80/0x80
[921588.559047] raid5_make_request+0x88/0x890 [raid456]
[921588.559056] ? wait_woken+0x80/0x80
[921588.559059] ? mempool_kmalloc+0x17/0x20
[921588.559062] md_handle_request+0x12d/0x1a0
[921588.559065] ? __part_start_io_acct+0x51/0xf0
[921588.559068] md_submit_bio+0xca/0x100
[921588.559071] submit_bio_noacct+0x112/0x4f0
[921588.559075] submit_bio+0x51/0x1a0
[921588.559077] ? _cond_resched+0x19/0x30
[921588.559081] ? kmem_cache_alloc+0x38e/0x440
[921588.559084] ? ext4_init_io_end+0x1f/0x50
[921588.559088] ext4_io_submit+0x4d/0x60
[921588.559091] ext4_writepages+0x2c6/0xcd0
[921588.559094] ? __schedule+0x454/0x8a0
[921588.559097] ? hrtimer_start_range_ns+0x1aa/0x2f0
[921588.559100] ? timerqueue_del+0x24/0x50
[921588.559105] ? futex_wait+0x1ed/0x270
[921588.559109] do_writepages+0x43/0xd0
[921588.559112] ? do_writepages+0x43/0xd0
[921588.559115] ? futex_wake+0x7c/0x180
[921588.559118] __filemap_fdatawrite_range+0xcc/0x110
[921588.559123] file_write_and_wait_range+0x74/0xc0
[921588.559126] ext4_sync_file+0xf5/0x350
[921588.559129] vfs_fsync_range+0x49/0x80
[921588.559132] ? __fget_light+0x62/0x80
[921588.559136] do_fsync+0x3d/0x70
[921588.559139] __x64_sys_fsync+0x14/0x20
[921588.559142] do_syscall_64+0x38/0x90
[921588.559145] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[921588.559148] RIP: 0033:0x7f36188d48db
[921588.559150] RSP: 002b:00007f35f05487d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[921588.559154] RAX: ffffffffffffffda RBX: 00007f360502bf50 RCX: 00007f36188d48db
[921588.559156] RDX: 000d239200000000 RSI: 0000000000000000 RDI: 000000000000000a
[921588.559158] RBP: 00007f35f0548a00 R08: 0000000000000000 R09: 00007f360502beb0
[921588.559160] R10: 00007f35f0548ccf R11: 0000000000000293 R12: 00007f36042e9610
[921588.559162] R13: 000000000000000a R14: 0000000000000b00 R15: 0000000000000000
[921588.559175] INFO: task mdcheck:40625 blocked for more than 120 seconds.
[921588.559179] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921588.559182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921588.559185] task:mdcheck state:D stack: 0 pid:40625 ppid: 1 flags:0x00000000
[921588.559189] Call Trace:
[921588.559191] __schedule+0x44c/0x8a0
[921588.559196] schedule+0x4f/0xc0
[921588.559200] schedule_timeout+0x202/0x290
[921588.559203] ? do_truncate+0x78/0xd0
[921588.559216] ? try_to_wake_up+0x66/0x5c0
[921588.559220] wait_for_completion+0x94/0x100
[921588.559225] kthread_stop+0x71/0x170
[921588.559229] md_unregister_thread+0x47/0x80
[921588.559232] md_reap_sync_thread+0x24/0x200
[921588.559236] action_store+0x160/0x2e0
[921588.559239] md_attr_store+0x82/0xc0
[921588.559243] sysfs_kf_write+0x3e/0x50
[921588.559246] kernfs_fop_write_iter+0x138/0x1c0
[921588.559249] new_sync_write+0x117/0x1b0
[921588.559254] vfs_write+0x1ca/0x280
[921588.559257] ksys_write+0x67/0xe0
[921588.559260] __x64_sys_write+0x1a/0x20
[921588.559264] do_syscall_64+0x38/0x90
[921588.559267] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[921588.559271] RIP: 0033:0x7f8c6d2791e7
[921588.559273] RSP: 002b:00007ffea5d97178 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[921588.559276] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8c6d2791e7
[921588.559278] RDX: 0000000000000005 RSI: 000055b618a3eb40 RDI: 0000000000000001
[921588.559280] RBP: 000055b618a3eb40 R08: 000000000000000a R09: 0000000000000004
[921588.559282] R10: 000055b616e97017 R11: 0000000000000246 R12: 0000000000000005
[921588.559284] R13: 00007f8c6d3546a0 R14: 00007f8c6d3554a0 R15: 00007f8c6d3548a0
[921588.559287] INFO: task md1_resync:40644 blocked for more than 120 seconds.
[921588.559291] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921588.559294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921588.559297] task:md1_resync state:D stack: 0 pid:40644 ppid: 2 flags:0x00004000
[921588.559301] Call Trace:
[921588.559303] __schedule+0x44c/0x8a0
[921588.559308] schedule+0x4f/0xc0
[921588.559311] md_do_sync.cold+0x8b7/0x953
[921588.559315] ? wait_woken+0x80/0x80
[921588.559319] ? kernel_sigaction+0x3b/0xd0
[921588.559323] md_thread+0xb1/0x170
[921588.559326] kthread+0x114/0x150
[921588.559329] ? md_start_sync+0x60/0x60
[921588.559332] ? kthread_park+0x90/0x90
[921588.559335] ret_from_fork+0x22/0x30
[921588.559341] INFO: task kworker/u16:0:41409 blocked for more than 120 seconds.
[921588.559345] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921588.559348] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921588.559353] task:kworker/u16:0 state:D stack: 0 pid:41409 ppid: 2 flags:0x00004000
[921588.559357] Workqueue: writeback wb_workfn (flush-253:1)
[921588.559361] Call Trace:
[921588.559363] __schedule+0x44c/0x8a0
[921588.559367] schedule+0x4f/0xc0
[921588.559371] md_write_start+0x150/0x240
[921588.559375] ? wait_woken+0x80/0x80
[921588.559378] raid5_make_request+0x88/0x890 [raid456]
[921588.559387] ? wait_woken+0x80/0x80
[921588.559390] ? mempool_kmalloc+0x17/0x20
[921588.559393] md_handle_request+0x12d/0x1a0
[921588.559396] ? __part_start_io_acct+0x51/0xf0
[921588.559399] md_submit_bio+0xca/0x100
[921588.559402] submit_bio_noacct+0x112/0x4f0
[921588.559405] submit_bio+0x51/0x1a0
[921588.559408] ? _cond_resched+0x19/0x30
[921588.559412] ? kmem_cache_alloc+0x38e/0x440
[921588.559415] ? ext4_init_io_end+0x1f/0x50
[921588.559419] ext4_io_submit+0x4d/0x60
[921588.559422] ext4_writepages+0x2c6/0xcd0
[921588.559425] ? update_group_capacity+0x2c/0x1e0
[921588.559430] do_writepages+0x43/0xd0
[921588.559433] ? do_writepages+0x43/0xd0
[921588.559436] __writeback_single_inode+0x44/0x2b0
[921588.559439] writeback_sb_inodes+0x22d/0x4c0
[921588.559443] __writeback_inodes_wb+0x56/0xf0
[921588.559446] wb_writeback+0x1ea/0x2a0
[921588.559450] wb_workfn+0x31b/0x490
[921588.559453] ? __update_idle_core+0x9b/0xb0
[921588.559456] process_one_work+0x220/0x3c0
[921588.559459] worker_thread+0x4d/0x3f0
[921588.559462] kthread+0x114/0x150
[921588.559465] ? process_one_work+0x3c0/0x3c0
[921588.559467] ? kthread_park+0x90/0x90
[921588.559471] ret_from_fork+0x22/0x30
[921709.391056] INFO: task jbd2/dm-1-8:319 blocked for more than 120 seconds.
[921709.391068] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921709.391073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921709.391077] task:jbd2/dm-1-8 state:D stack: 0 pid: 319 ppid: 2 flags:0x00004000
[921709.391083] Call Trace:
[921709.391087] __schedule+0x44c/0x8a0
[921709.391094] schedule+0x4f/0xc0
[921709.391098] md_write_start+0x150/0x240
[921709.391103] ? wait_woken+0x80/0x80
[921709.391108] raid5_make_request+0x88/0x890 [raid456]
[921709.391120] ? wait_woken+0x80/0x80
[921709.391123] ? mempool_kmalloc+0x17/0x20
[921709.391127] md_handle_request+0x12d/0x1a0
[921709.391130] ? __part_start_io_acct+0x51/0xf0
[921709.391134] md_submit_bio+0xca/0x100
[921709.391137] submit_bio_noacct+0x112/0x4f0
[921709.391141] submit_bio+0x51/0x1a0
[921709.391144] ? bio_add_page+0x6a/0x90
[921709.391147] submit_bh_wbc+0x195/0x1c0
[921709.391151] submit_bh+0x13/0x20
[921709.391154] jbd2_journal_commit_transaction+0x74b/0x18f0
[921709.391159] ? try_to_del_timer_sync+0x54/0x80
[921709.391164] kjournald2+0xb6/0x280
[921709.391168] ? wait_woken+0x80/0x80
[921709.391170] kthread+0x114/0x150
[921709.391174] ? commit_timeout+0x20/0x20
[921709.391177] ? kthread_park+0x90/0x90
[921709.391180] ret_from_fork+0x22/0x30
[921709.391186] INFO: task systemd-journal:376 blocked for more than 241 seconds.
[921709.391191] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921709.391195] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921709.391199] task:systemd-journal state:D stack: 0 pid: 376 ppid: 1 flags:0x00000220
[921709.391204] Call Trace:
[921709.391205] __schedule+0x44c/0x8a0
[921709.391210] schedule+0x4f/0xc0
[921709.391214] md_write_start+0x150/0x240
[921709.391218] ? wait_woken+0x80/0x80
[921709.391221] raid5_make_request+0x88/0x890 [raid456]
[921709.391230] ? wait_woken+0x80/0x80
[921709.391233] ? mempool_kmalloc+0x17/0x20
[921709.391236] md_handle_request+0x12d/0x1a0
[921709.391239] ? __part_start_io_acct+0x51/0xf0
[921709.391242] md_submit_bio+0xca/0x100
[921709.391245] submit_bio_noacct+0x112/0x4f0
[921709.391248] ? ext4_fc_reserve_space+0x110/0x230
[921709.391252] submit_bio+0x51/0x1a0
[921709.391255] ? _cond_resched+0x19/0x30
[921709.391259] ? kmem_cache_alloc+0x38e/0x440
[921709.391262] ? ext4_init_io_end+0x1f/0x50
[921709.391266] ext4_io_submit+0x4d/0x60
[921709.391270] ext4_writepages+0x2c6/0xcd0
[921709.391274] do_writepages+0x43/0xd0
[921709.391278] ? do_writepages+0x43/0xd0
[921709.391281] ? fault_dirty_shared_page+0xa5/0x110
[921709.391285] __filemap_fdatawrite_range+0xcc/0x110
[921709.391290] file_write_and_wait_range+0x74/0xc0
[921709.391293] ext4_sync_file+0xf5/0x350
[921709.391297] vfs_fsync_range+0x49/0x80
[921709.391300] do_fsync+0x3d/0x70
[921709.391303] __x64_sys_fsync+0x14/0x20
[921709.391306] do_syscall_64+0x38/0x90
[921709.391310] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[921709.391314] RIP: 0033:0x7f4c97ee832b
[921709.391317] RSP: 002b:00007ffdceb29e50 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[921709.391321] RAX: ffffffffffffffda RBX: 000055ced34b0fa0 RCX: 00007f4c97ee832b
[921709.391323] RDX: 00007f4c97fc8000 RSI: 000055ced3487b70 RDI: 0000000000000021
[921709.391325] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007ffdceb29fa8
[921709.391327] R10: 00007f4c97d2c848 R11: 0000000000000293 R12: 00007ffdceb29fa8
[921709.391329] R13: 00007ffdceb29fa0 R14: 000055ced34b0fa0 R15: 000055ced34bcf90
[921709.391343] INFO: task mysqld:1505 blocked for more than 241 seconds.
[921709.391348] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921709.391351] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921709.391355] task:mysqld state:D stack: 0 pid: 1505 ppid: 1 flags:0x00000000
[921709.391359] Call Trace:
[921709.391361] __schedule+0x44c/0x8a0
[921709.391366] schedule+0x4f/0xc0
[921709.391370] md_write_start+0x150/0x240
[921709.391373] ? wait_woken+0x80/0x80
[921709.391376] raid5_make_request+0x88/0x890 [raid456]
[921709.391386] ? wait_woken+0x80/0x80
[921709.391388] ? mempool_kmalloc+0x17/0x20
[921709.391391] md_handle_request+0x12d/0x1a0
[921709.391395] ? __part_start_io_acct+0x51/0xf0
[921709.391398] md_submit_bio+0xca/0x100
[921709.391401] submit_bio_noacct+0x112/0x4f0
[921709.391404] submit_bio+0x51/0x1a0
[921709.391407] ? _cond_resched+0x19/0x30
[921709.391411] ? kmem_cache_alloc+0x38e/0x440
[921709.391414] ? ext4_init_io_end+0x1f/0x50
[921709.391418] ext4_io_submit+0x4d/0x60
[921709.391421] ext4_writepages+0x2c6/0xcd0
[921709.391423] ? __schedule+0x454/0x8a0
[921709.391427] ? hrtimer_start_range_ns+0x1aa/0x2f0
[921709.391430] ? timerqueue_del+0x24/0x50
[921709.391435] ? futex_wait+0x1ed/0x270
[921709.391439] do_writepages+0x43/0xd0
[921709.391443] ? do_writepages+0x43/0xd0
[921709.391445] ? futex_wake+0x7c/0x180
[921709.391449] __filemap_fdatawrite_range+0xcc/0x110
[921709.391453] file_write_and_wait_range+0x74/0xc0
[921709.391456] ext4_sync_file+0xf5/0x350
[921709.391459] vfs_fsync_range+0x49/0x80
[921709.391463] ? __fget_light+0x62/0x80
[921709.391466] do_fsync+0x3d/0x70
[921709.391469] __x64_sys_fsync+0x14/0x20
[921709.391472] do_syscall_64+0x38/0x90
[921709.391475] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[921709.391478] RIP: 0033:0x7f36188d48db
[921709.391481] RSP: 002b:00007f35f05487d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[921709.391484] RAX: ffffffffffffffda RBX: 00007f360502bf50 RCX: 00007f36188d48db
[921709.391486] RDX: 000d239200000000 RSI: 0000000000000000 RDI: 000000000000000a
[921709.391488] RBP: 00007f35f0548a00 R08: 0000000000000000 R09: 00007f360502beb0
[921709.391490] R10: 00007f35f0548ccf R11: 0000000000000293 R12: 00007f36042e9610
[921709.391491] R13: 000000000000000a R14: 0000000000000b00 R15: 0000000000000000
[921709.391505] INFO: task mdcheck:40625 blocked for more than 241 seconds.
[921709.391510] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921709.391514] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921709.391518] task:mdcheck state:D stack: 0 pid:40625 ppid: 1 flags:0x00000000
[921709.391522] Call Trace:
[921709.391524] __schedule+0x44c/0x8a0
[921709.391528] schedule+0x4f/0xc0
[921709.391532] schedule_timeout+0x202/0x290
[921709.391535] ? do_truncate+0x78/0xd0
[921709.391538] ? try_to_wake_up+0x66/0x5c0
[921709.391542] wait_for_completion+0x94/0x100
[921709.391547] kthread_stop+0x71/0x170
[921709.391550] md_unregister_thread+0x47/0x80
[921709.391554] md_reap_sync_thread+0x24/0x200
[921709.391557] action_store+0x160/0x2e0
[921709.391561] md_attr_store+0x82/0xc0
[921709.391564] sysfs_kf_write+0x3e/0x50
[921709.391567] kernfs_fop_write_iter+0x138/0x1c0
[921709.391571] new_sync_write+0x117/0x1b0
[921709.391575] vfs_write+0x1ca/0x280
[921709.391578] ksys_write+0x67/0xe0
[921709.391581] __x64_sys_write+0x1a/0x20
[921709.391584] do_syscall_64+0x38/0x90
[921709.391588] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[921709.391591] RIP: 0033:0x7f8c6d2791e7
[921709.391593] RSP: 002b:00007ffea5d97178 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[921709.391597] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8c6d2791e7
[921709.391599] RDX: 0000000000000005 RSI: 000055b618a3eb40 RDI: 0000000000000001
[921709.391600] RBP: 000055b618a3eb40 R08: 000000000000000a R09: 0000000000000004
[921709.391602] R10: 000055b616e97017 R11: 0000000000000246 R12: 0000000000000005
[921709.391604] R13: 00007f8c6d3546a0 R14: 00007f8c6d3554a0 R15: 00007f8c6d3548a0
[921709.391607] INFO: task md1_resync:40644 blocked for more than 241 seconds.
[921709.391611] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921709.391615] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921709.391619] task:md1_resync state:D stack: 0 pid:40644 ppid: 2 flags:0x00004000
[921709.391623] Call Trace:
[921709.391625] __schedule+0x44c/0x8a0
[921709.391631] schedule+0x4f/0xc0
[921709.391635] md_do_sync.cold+0x8b7/0x953
[921709.391639] ? wait_woken+0x80/0x80
[921709.391642] ? kernel_sigaction+0x3b/0xd0
[921709.391646] md_thread+0xb1/0x170
[921709.391650] kthread+0x114/0x150
[921709.391652] ? md_start_sync+0x60/0x60
[921709.391655] ? kthread_park+0x90/0x90
[921709.391658] ret_from_fork+0x22/0x30
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sda3[0] sdd3[3] sdb3[1] sde3[5] sdc3[6]
      15625582592 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [================>....] check = 84.2% (3289834208/3906395648) finish=82779.6min speed=124K/sec

md0 : active raid1 sda2[0] sdb2[1] sdc2[2](S)
      488128 blocks super 1.2 [2/2] [UU]

unused devices: <none>

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.11.0-27-generic 5.11.0-27.29~20.04.1
ProcVersionSignature: Ubuntu 5.11.0-27.29~20.04.1-generic 5.11.22
Uname: Linux 5.11.0-27-generic x86_64
ApportVersion: 2.20.11-0ubuntu27.18
Architecture: amd64
CasperMD5CheckResult: skip
Date: Tue Sep 7 16:35:01 2021
InstallationDate: Installed on 2014-03-12 (2736 days ago)
InstallationMedia: Ubuntu-Server 12.04.4 LTS "Precise Pangolin" - Release amd64 (20140204)
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed-hwe-5.11
UpgradeStatus: Upgraded to focal on 2020-12-20 (261 days ago)
---
ProblemType: Bug
AlsaVersion: Advanced Linux Sound Architecture Driver Version k5.13.0-16-generic.
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu69
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: [Errno 2] No such file or directory: 'fuser'
Card0.Amixer.info: Error: [Errno 2] No such file or directory: 'amixer'
Card0.Amixer.values: Error: [Errno 2] No such file or directory: 'amixer'
CasperMD5CheckResult: unknown
DistroRelease: Ubuntu 21.10
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lspci: Error: [Errno 2] No such file or directory: 'lspci'
Lspci-vt: Error: [Errno 2] No such file or directory: 'lspci'
Lsusb: Error: [Errno 2] No such file or directory: 'lsusb'
Lsusb-t: Error: [Errno 2] No such file or directory: 'lsusb'
Lsusb-v: Error: [Errno 2] No such file or directory: 'lsusb'
MachineType: innotek GmbH VirtualBox
Package: linux-signed-hwe-5.11
PciMultimedia:

ProcEnviron:
 TERM=linux
 PATH=(custom, no user)
 SHELL=/bin/bash
ProcFB: 0 svgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.13.0-16-generic root=/dev/mapper/vg1-root ro quiet splash
ProcVersionSignature: Ubuntu 5.13.0-16.16-generic 5.13.13
RelatedPackageVersions:
 linux-restricted-modules-5.13.0-16-generic N/A
 linux-backports-modules-5.13.0-16-generic N/A
 linux-firmware 1.199
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: impish
Uname: Linux 5.13.0-16-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 12/01/2006
dmi.bios.vendor: innotek GmbH
dmi.bios.version: VirtualBox
dmi.board.name: VirtualBox
dmi.board.vendor: Oracle Corporation
dmi.board.version: 1.2
dmi.chassis.type: 1
dmi.chassis.vendor: Oracle Corporation
dmi.modalias: dmi:bvninnotekGmbH:bvrVirtualBox:bd12/01/2006:svninnotekGmbH:pnVirtualBox:pvr1.2:sku:rvnOracleCorporation:rnVirtualBox:rvr1.2:cvnOracleCorporation:ct1:cvr:
dmi.product.family: Virtual Machine
dmi.product.name: VirtualBox
dmi.product.version: 1.2
dmi.sys.vendor: innotek GmbH

Revision history for this message
Chad Wagner (chad-wagner) wrote :
Revision history for this message
Chad Wagner (chad-wagner) wrote :
Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

Hello Chad Wagner,

Thank you for reporting this issue. Could you please try installing the latest 20.04 HWE kernel and check whether the problem persists? The version currently in focal-updates is 5.11.0-34.36~20.04.1.
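If it isn't already installed, the 20.04 HWE kernel can normally be pulled in through the meta package, roughly as follows (a sketch; exact package selection can differ depending on how the system was installed):

# move to the 20.04 HWE kernel stack and reboot into it
sudo apt update
sudo apt install linux-generic-hwe-20.04
sudo reboot
# afterwards, confirm the running version
uname -r
cat /proc/version_signature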

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Hi Kleber,
  I installed it late yesterday, but I won't know until the next resync. This has been a problem since at least the 5.4 kernel that shipped with Ubuntu 20.04. I don't think I had these problems on Ubuntu 18.04 LTS on the same hardware, running the linux-image-generic kernel of that time.

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Similar report here on 5.10.0-rc4:
https://www.spinics.net/lists/raid/msg66654.html

I ended up masking the services introduced with 20.04 LTS, and switched back the crontab.

systemctl mask mdcheck_continue.service mdcheck_continue.timer mdcheck_start.service mdcheck_start.timer
cat > /etc/cron.d/mdadm << 'EOF'
#
# cron.d/mdadm -- schedules periodic redundancy checks of MD devices
#
# Copyright © martin f. krafft <email address hidden>
# distributed under the terms of the Artistic Licence 2.0
#

# By default, run at 00:57 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi
EOF

The pausing and resuming of the integrity check was an annoyance for me anyways.
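
To double-check that the timers really are masked and the cron entry is picked up, something along these lines works:

systemctl is-enabled mdcheck_start.timer mdcheck_continue.timer   # should print "masked"
systemctl list-timers 'mdcheck*' --all                            # should show no scheduled runs
cat /etc/cron.d/mdadm                                             # the replacement schedule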

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Here is the proposed patch; it doesn't appear to have been applied. The last report was with 5.11-rc5.

https://lore.kernel<email address hidden>/

Revision history for this message
Mikko Rantalainen (mira) wrote :

I think I've seen this issue once on each of two different systems, so I think this is a software issue. Does anybody know if the patch in comment #6 is going to be included in Ubuntu 20.04 LTS?

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed-hwe-5.11 (Ubuntu):
status: New → Confirmed
Revision history for this message
Mikko Rantalainen (mira) wrote :

One of the systems was using the linux-generic package, in practice:

Linux pedabackup 5.4.0-80-generic #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/version_signature
Ubuntu 5.4.0-80.90-generic 5.4.124

Revision history for this message
Chad Wagner (chad-wagner) wrote :

The patch hasn't made it into mainline from what I have seen; it looks like it died back in March waiting for feedback from additional kernel developers. From what I have gathered, this is a deadlock scenario directly caused by pausing the resync while the system is under heavy write activity.

Donald Buczek provided a reproducer which is a shell script that generates a lot of write activity and pauses/resumes the raid scrubbing. And he also provided a workaround to get the stuck system running without reboot:

echo active > /sys/block/md1/md/array_state

I haven't tried the patch or any of this. I pretty much eliminated the trigger, which is mdcheck_start & mdcheck_continue, and went back to the 18.04 LTS way of scrubbing arrays (which is basically: don't pause/interrupt it once it starts). I ran a checkarray through to 100% yesterday with no problems. Meanwhile, since upgrading to 20.04 LTS it had hung almost every single time, through the 5.4, 5.8, and 5.11 kernels.
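
For anyone who hits the hang, recovering without a reboot looks roughly like this (md1 is just the device name from my setup; check which md device is stuck on your system, and the stuck array_state has been reported as "write-pending"):

# confirm the array is wedged: a check/resync shown in /proc/mdstat that makes no progress,
# and array_state stuck instead of "active"
cat /proc/mdstat
cat /sys/block/md1/md/sync_action
cat /sys/block/md1/md/array_state

# Donald Buczek's workaround: push the array back to active so the blocked writers wake up
echo active > /sys/block/md1/md/array_state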

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Here is Donald Buczek's reproducer script. I set up an Ubuntu 20.04 VM with the latest linux-image-generic and was able to reproduce it within maybe 10 or 15 minutes. Exactly the same issue.

Filesystem layout built as follows:

# assemble raid devices
mdadm --create /dev/md0 --level=1 --raid-devices=2 --spare-devices=1 /dev/sda2 /dev/sdb2 /dev/sdc2
mdadm --create /dev/md1 --level=5 --raid-devices=5 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3

# create PVs, VGs, LVs
pvcreate /dev/md1
vgcreate vg1 /dev/md1
lvcreate --name root --extents 100%FREE vg1

# create filesystems
mkfs.ext4 /dev/md0
mkfs.ext4 /dev/vg1/root
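
The script itself is attached; the pattern it exercises is sustained write traffic combined with repeatedly starting and interrupting a data check. A rough illustration of that pattern, not the attached script itself (device and mount point names are only examples):

#!/bin/bash
# illustration of the load pattern only -- the real reproducer is the attached mdhang script
MD=/sys/block/md1/md   # the raid5 array
MNT=/mnt/test          # any mount point on a filesystem backed by that array

# background writer: keep writing and fsyncing so md_write_start is constantly exercised
( while :; do dd if=/dev/zero of="$MNT/loadfile" bs=1M count=512 conv=fsync 2>/dev/null; done ) &
writer=$!
trap 'kill "$writer"' EXIT

# repeatedly start a data check and then interrupt it, the way the
# mdcheck_start/mdcheck_continue units pause and resume the scrub
while :; do
    echo check > "$MD/sync_action"
    sleep 60
    echo idle > "$MD/sync_action"
    sleep 10
done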

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1942935

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: hirsute
tags: removed: hirsute
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Chad Wagner (chad-wagner) wrote :
tags: added: apport-collected impish
description: updated
Revision history for this message
Chad Wagner (chad-wagner) wrote : AlsaDevices.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : CRDA.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : Card0.Codecs.codec97.0.ac97.0-0.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : Card0.Codecs.codec97.0.ac97.0-0.regs.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : ProcCpuinfoMinimal.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : ProcModules.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : UdevDb.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : WifiSyslog.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : acpidump.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Same issue on impish 5.13.13 kernel, running in VBox.

tags: added: patch
Revision history for this message
Chad Wagner (chad-wagner) wrote :
Revision history for this message
Robert Lippmann (rob0214) wrote :

I think I'm running into this too.

Raid 6 array with 4 drives.

Linux pkitty 5.13.0-52-generic #59~20.04.1-Ubuntu SMP Thu Jun 16 21:21:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

I have all filesystems unmounted (so there's no other activity that should be occurring on the drives).

Once it hangs, even trying to do

echo idle > /sys/block/md0/md/sync_action

hangs and the entire array locks up.

It's not a matter of if it will happen, but when.

Is there any way to backport the fixes? 5.19 is a long way away.

kern.log snippet:

Jul 5 19:16:36 pkitty kernel: [ 4471.805561] INFO: task scsi_eh_1:276 blocked for more than 120 seconds.
Jul 5 19:16:36 pkitty kernel: [ 4471.805573] Tainted: G OE 5.13.0-52-generic #59~20.04.1-Ubuntu
Jul 5 19:16:36 pkitty kernel: [ 4471.805577] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 5 19:16:36 pkitty kernel: [ 4471.805580] task:scsi_eh_1 state:D stack: 0 pid: 276 ppid: 2 flags:0x00004000
Jul 5 19:16:36 pkitty kernel: [ 4471.805588] Call Trace:
Jul 5 19:16:36 pkitty kernel: [ 4471.805591] <TASK>
Jul 5 19:16:36 pkitty kernel: [ 4471.805596] __schedule+0x2ee/0x900
Jul 5 19:16:36 pkitty kernel: [ 4471.805605] schedule+0x4f/0xc0
Jul 5 19:16:36 pkitty kernel: [ 4471.805609] schedule_preempt_disabled+0xe/0x10
Jul 5 19:16:36 pkitty kernel: [ 4471.805614] __mutex_lock.isra.0+0x183/0x4d0
Jul 5 19:16:36 pkitty kernel: [ 4471.805618] ? dequeue_entity+0xdb/0x410
Jul 5 19:16:36 pkitty kernel: [ 4471.805627] ? blk_mq_find_and_get_req+0x4f/0x90
Jul 5 19:16:36 pkitty kernel: [ 4471.805634] __mutex_lock_slowpath+0x13/0x20
Jul 5 19:16:36 pkitty kernel: [ 4471.805639] mutex_lock+0x32/0x40
Jul 5 19:16:36 pkitty kernel: [ 4471.805645] device_reset+0x22/0x50 [usb_storage]
Jul 5 19:16:36 pkitty kernel: [ 4471.805654] scsi_eh_ready_devs+0x580/0xa50
Jul 5 19:16:36 pkitty kernel: [ 4471.805683] kthread+0x12b/0x150
Jul 5 19:16:36 pkitty kernel: [ 4471.805687] ? set_kthread_struct+0x40/0x40
Jul 5 19:16:36 pkitty kernel: [ 4471.805692] ret_from_fork+0x22/0x30
Jul 5 19:16:36 pkitty kernel: [ 4471.805701] </TASK>
Jul 5 19:16:36 pkitty kernel: [ 4471.805704] INFO: task usb-storage:278 blocked for more than 241 seconds.
Jul 5 19:16:36 pkitty kernel: [ 4471.805708] Tainted: G OE 5.13.0-52-generic #59~20.04.1-Ubuntu
Jul 5 19:16:36 pkitty kernel: [ 4471.805711] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 5 19:16:36 pkitty kernel: [ 4471.805713] task:usb-storage state:D stack: 0 pid: 278 ppid: 2 flags:0x00004000
Jul 5 19:16:36 pkitty kernel: [ 4471.805718] Call Trace:
Jul 5 19:16:36 pkitty kernel: [ 4471.805720] <TASK>
Jul 5 19:16:36 pkitty kernel: [ 4471.805722] __schedule+0x2ee/0x900
Jul 5 19:16:36 pkitty kernel: [ 4471.805727] ? usleep_range_state+0x90/0x90
Jul 5 19:16:36 pkitty kernel: [ 4471.805733] schedule+0x4f/0xc0
Jul 5 19:16:36 pkitty kernel: [ 4471.805736] schedule_timeout+0x202/0x290
Jul 5 19:16:36 pkitty kernel: [ 4471.805742] ? xhci_urb_enqueue+0x1e4/0x520
Jul 5 19:16:36 pkitty kernel: [ 4471.805746] ? usleep_range_state+0x90/0x90
Jul 5 19:16:36 ...

Revision history for this message
Robert Lippmann (rob0214) wrote :

Tried the patch mentioned in comment #27, as well as the one attached to this bug.

Still hangs on check.

I don’t know much about the mdadm code, but why is it grabbing exclusive locks on a check, which is mostly read operations?

Revision history for this message
Chad Wagner (chad-wagner) wrote :

I believe to resolve the deadlock you want to do:
echo active > /sys/block/md1/md/array_state

Not "idle". You should see a hung task for mdcheck in there somewhere as well, and it only occurs when the raid is resyncing (md_resync should be running), at least for me I the workaround in comment 5:
https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe-5.11/+bug/1942935/comments/5

I haven't had a problem since, and I have since upgraded to 22.04 with 5.15 as well. I am pretty sure the problem is still there; it surfaced in 20.04 when they changed the raid consistency check. The new check pauses the check after 8 hours and triggers the deadlock, so I just let it run to completion like it did in 18.04.

I have not checked the patches, but it's possible you have a different problem, because I don't see any calls into md_* code in either of the two hung process traces.

Revision history for this message
Robert Lippmann (rob0214) wrote :

I'm running checkarray manually; I took off all the start and stop stuff like you did.

echo active > /sys/block/md0/md/array_state doesn't fix it.

I must not have captured the full trace last time. I've attached it here.

Revision history for this message
Robert Lippmann (rob0214) wrote :

Digging further, I think I might be running into this bug:

https://<email address hidden>/T/

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Yeah, that's the same issue as this one. The raid is doing a consistency check (mdcheck), gets transitioned to the "idle" state, and hits a deadlock that causes all I/O through the md device to block. The workaround is to change the array state back to active.

I made the changes in #5 almost a year ago and have had no problems. Before that, it hung pretty much every single month when the scheduled consistency check was triggered, ever since upgrading to 20.04.

Revision history for this message
Robert Lippmann (rob0214) wrote :

It won’t let me change the state back to active.

Every time I try, nothing happens and array_state is still idle.

Revision history for this message
Janne Blomqvist (blomqvist-janne) wrote :

The second of the patches mentioned in #27 (with git SHA 1e2677...) has, I believe, been backported to Ubuntu kernels 5.15.0-48 and 5.4.0-126.

We've still hit this with Ubuntu Jammy on 5.15.0-53, so I guess the first commit needs to be backported as well.

Revision history for this message
Robert Lippmann (rob0214) wrote :

Turns out my issue was a faulty drive, and the system would lock up when mdadm hit the bad sectors on resync. The issue seemed to be lower down in the blockdev code, causing a deadlock.

I replaced the drive and the problem went away.
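
For anyone else seeing hangs, it is worth ruling out a failing member drive before blaming the md deadlock; for example (device names are examples, and smartmontools needs to be installed):

# SMART health and the usual bad-sector counters for each array member
for d in /dev/sd[a-d]; do
    echo "== $d =="
    smartctl -H -A "$d" | grep -Ei 'overall-health|Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
done

# kernel-side I/O or medium errors on the members
dmesg | grep -iE 'i/o error|medium error' | tail -n 50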

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Comment #5 (https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe-5.11/+bug/1942935/comments/5) has been a stable workaround for me (basically revert back to a continuous resync like 18.04).

My newer machines are using ZFS with raidz2 pools.

Revision history for this message
Kilian Felder (k.felder) wrote :

I hit this bug with Ubuntu 22.04 (Jammy) on Kernel 5.15.0-56

For testing, I have set up a RAID1 and a RAID5 in a VM. I put the disks for each RAID on a separate controller. Based on the 'mdhang' script (see comment #11), I was able to reproduce the error easily.

I ran 2 'mdhang' scripts at the same time, one for RAID1 and one for RAID5. The RAID5 blocked after a short time, while the RAID1 continued to run without problems. So it probably only affects RAID5.

On my production system I have now activated the workaround (see comment #5).

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed-hwe-5.4 (Ubuntu):
status: New → Confirmed
Revision history for this message
Brian Corriveau (bcorrive) wrote :

I hit this bug as well in Ubuntu 22.04 with kernel 5.15.0-67-generic

We have a single RAID 5 on 3 drives for 28T. I'm switching to the workaround from
comment #5.

Jun 4 12:48:11 server1 kernel: [1622699.548591] INFO: task md0_raid5:406 blocked for more than 120 seconds.
Jun 4 12:48:11 server1 kernel: [1622699.556202] Tainted: G OE 5.15.0-67-generic #74-Ubuntu
Jun 4 12:48:11 server1 kernel: [1622699.564101] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 4 12:48:11 server1 kernel: [1622699.573063] task:md0_raid5 state:D stack: 0 pid: 406 ppid: 2 flags:0x00004000
Jun 4 12:48:11 server1 kernel: [1622699.573077] Call Trace:
Jun 4 12:48:11 server1 kernel: [1622699.573081] <TASK>
Jun 4 12:48:11 server1 kernel: [1622699.573087] __schedule+0x24e/0x590
Jun 4 12:48:11 server1 kernel: [1622699.573103] schedule+0x69/0x110
Jun 4 12:48:11 server1 kernel: [1622699.573115] raid5d+0x3d9/0x5f0 [raid456]
Jun 4 12:48:11 server1 kernel: [1622699.573140] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.573151] md_thread+0xad/0x170
Jun 4 12:48:11 server1 kernel: [1622699.573162] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.573169] ? md_write_inc+0x60/0x60
Jun 4 12:48:11 server1 kernel: [1622699.573176] kthread+0x12a/0x150
Jun 4 12:48:11 server1 kernel: [1622699.573187] ? set_kthread_struct+0x50/0x50
Jun 4 12:48:11 server1 kernel: [1622699.573197] ret_from_fork+0x22/0x30
Jun 4 12:48:11 server1 kernel: [1622699.573212] </TASK>
Jun 4 12:48:11 server1 kernel: [1622699.573231] INFO: task jbd2/dm-0-8:1375 blocked for more than 120 seconds.
Jun 4 12:48:11 server1 kernel: [1622699.581119] Tainted: G OE 5.15.0-67-generic #74-Ubuntu
Jun 4 12:48:11 server1 kernel: [1622699.589004] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 4 12:48:11 server1 kernel: [1622699.597959] task:jbd2/dm-0-8 state:D stack: 0 pid: 1375 ppid: 2 flags:0x00004000
Jun 4 12:48:11 server1 kernel: [1622699.597968] Call Trace:
Jun 4 12:48:11 server1 kernel: [1622699.597970] <TASK>
Jun 4 12:48:11 server1 kernel: [1622699.597973] __schedule+0x24e/0x590
Jun 4 12:48:11 server1 kernel: [1622699.597984] schedule+0x69/0x110
Jun 4 12:48:11 server1 kernel: [1622699.597992] md_write_start.part.0+0x174/0x220
Jun 4 12:48:11 server1 kernel: [1622699.598002] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.598024] md_write_start+0x14/0x30
Jun 4 12:48:11 server1 kernel: [1622699.598032] raid5_make_request+0x77/0x540 [raid456]
Jun 4 12:48:11 server1 kernel: [1622699.598051] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.598058] md_handle_request+0x12d/0x1b0
Jun 4 12:48:11 server1 kernel: [1622699.598065] ? __blk_queue_split+0xfe/0x200
Jun 4 12:48:11 server1 kernel: [1622699.598075] md_submit_bio+0x71/0xc0
Jun 4 12:48:11 server1 kernel: [1622699.598082] __submit_bio+0x1a5/0x220
Jun 4 12:48:11 server1 kernel: [1622699.598091] ? mempool_alloc_slab+0x17/0x20
Jun 4 12:48:11 server1 kernel: [1622699.598102] __submit_bio_noacct+0x85/0x200
Jun 4 12:48:11 server1 kernel: [1622699.598...


Revision history for this message
Filip Hruska (fhr) wrote :

Seeing this bug on Ubuntu 20.04 and Ubuntu 22.04 as well, with both normal and HWE kernels.

To add some more information, this bug seems to randomly appear during the initial RAID 6 creation process as well, where the array is mounted but completely empty and not accessed - so it's likely to originate within the mdadm resync process itself, unrelated to other system I/O operations. Around 1 in 15 arrays will "freeze" during the initial resync process in my experience, so it is not that uncommon unfortunately.

The symptoms are always the same - at some point during resync, the speeds will radically drop to single MB/s levels and continue degrading over time until "echo active > /sys/block/mdX/md/array_state" is issued. A minute or two after running that, the speeds ramp back up to normal.

This bug seems unrelated to hardware configuration, as I've seen it happen across multiple systems with different CPU vendors, HBA models and with different HDD sizes and vendors. Systems which were previously stable under Ubuntu 18.04 started exhibiting freezes after upgrading to 20.04 as well.

It would also seem that disabling mdcheck_start and mdcheck_continue is not necessarily a magic bullet for fixing this; it certainly doesn't seem to help with freezing during the initial resync. I have also seen instances of mdadm scheduled resyncs freezing when triggered using the old cronjob method, with both the systemd services and timers masked off.

Dmesg from a freshly installed system where the initial resync "froze" approximately 15 hours after the array was created:
mdadm --create /dev/md1 --level=6 --raid-devices=6 /dev/sd[cdefgh]
mkfs.ext4 /dev/md1
mount -o errors=remount-ro /dev/md1 /srv

[Jul 4 00:18] md/raid:md1: not clean -- starting background reconstruction
[ +0.000062] md/raid:md1: device sdh operational as raid disk 5
[ +0.000002] md/raid:md1: device sdg operational as raid disk 4
[ +0.000001] md/raid:md1: device sdf operational as raid disk 3
[ +0.000001] md/raid:md1: device sde operational as raid disk 2
[ +0.000001] md/raid:md1: device sdd operational as raid disk 1
[ +0.000001] md/raid:md1: device sdc operational as raid disk 0
[ +0.002104] md/raid:md1: raid level 6 active with 6 out of 6 devices, algorithm 2
[ +0.048241] md1: detected capacity change from 0 to 72000290684928
[ +0.000066] md: resync of RAID array md1
[Jul 4 00:32] EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
[Jul 4 02:39] perf: interrupt took too long (2522 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
[Jul 4 04:36] perf: interrupt took too long (3155 > 3152), lowering kernel.perf_event_max_sample_rate to 63250
[Jul 4 15:22] INFO: task md1_raid6:5688 blocked for more than 120 seconds.
[ +0.000059] Not tainted 5.4.0-153-generic #170-Ubuntu
[ +0.000032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.000045] md1_raid6 D 0 5688 2 0x80004000
[ +0.000004] Call Trace:
[ +0.000011] __schedule+0x2e3/0x740
[ +0.000005] schedule+0x42/0xb0
[ +0.000011] raid5d+0x3e6/0x5f0 [raid456]
[ +0.000005] ? try_to_del_timer_sync+0x54/0x80
[ +0.000005] ? schedule_timeout+0x92/0...

Revision history for this message
brjhaverkamp (bert-bertenselena) wrote :

I can't add much in terms of data, but this is a +1. My symptoms were virtually identical to Chad Wagner's.
This happened on July 1st 2023 on my machine, the day the check started. It got stuck at 98%.
The only cure I could find was to reboot the system.

Revision history for this message
Brian Corriveau (bcorrive) wrote :

Unfortunately I have now seen this bug happen with a RAID data-check kicked off both by systemd and by cron, so I +1 Filip's comment #42.

In my previous post I said I was switching over to a cron-triggered raid check, which I did, but it seems I have run into the exact same problem again. I got ten 120-second delays logged in syslog and then the whole system became unreachable.

The data-check started about an hour before the delays were logged.

This shows the first 120-second delay that was logged.

Jul 31 21:00:01 server1 kernel: [168797.457900] md: data-check of RAID array md0
...
Jul 31 22:20:46 server1 kernel: [173642.734918] INFO: task kswapd0:190 blocked for more than 120 seconds.
Jul 31 22:20:46 server1 kernel: [173642.742246] Tainted: G O 5.15.0-75-generic #82-Ubuntu
Jul 31 22:20:46 server1 kernel: [173642.750039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 31 22:20:46 server1 kernel: [173642.758894] task:kswapd0 state:D stack: 0 pid: 190 ppid: 2 flags:0x00004000
Jul 31 22:20:46 server1 kernel: [173642.758906] Call Trace:
Jul 31 22:20:46 server1 kernel: [173642.758910] <TASK>
Jul 31 22:20:46 server1 kernel: [173642.758916] __schedule+0x24e/0x590
Jul 31 22:20:46 server1 kernel: [173642.758935] schedule+0x69/0x110
Jul 31 22:20:46 server1 kernel: [173642.758945] md_write_start.part.0+0x174/0x220
Jul 31 22:20:46 server1 kernel: [173642.758957] ? wait_woken+0x70/0x70
Jul 31 22:20:46 server1 kernel: [173642.758969] md_write_start+0x14/0x30
Jul 31 22:20:46 server1 kernel: [173642.758980] raid5_make_request+0x77/0x540 [raid456]
Jul 31 22:20:46 server1 kernel: [173642.759004] ? wait_woken+0x70/0x70
Jul 31 22:20:46 server1 kernel: [173642.759012] md_handle_request+0x12d/0x1b0
Jul 31 22:20:46 server1 kernel: [173642.759020] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759032] ? __blk_queue_split+0xfe/0x210
Jul 31 22:20:46 server1 kernel: [173642.759043] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759052] md_submit_bio+0x76/0xc0
Jul 31 22:20:46 server1 kernel: [173642.759060] __submit_bio+0x1a5/0x220
Jul 31 22:20:46 server1 kernel: [173642.759072] ? __mod_memcg_lruvec_state+0x63/0xe0
Jul 31 22:20:46 server1 kernel: [173642.759082] __submit_bio_noacct+0x85/0x200
Jul 31 22:20:46 server1 kernel: [173642.759093] submit_bio_noacct+0x4e/0x120
Jul 31 22:20:46 server1 kernel: [173642.759102] ? radix_tree_lookup+0xd/0x20
Jul 31 22:20:46 server1 kernel: [173642.759116] ? bio_associate_blkg_from_css+0x1b2/0x310
Jul 31 22:20:46 server1 kernel: [173642.759124] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759133] ? bio_add_page+0x68/0x90
Jul 31 22:20:46 server1 kernel: [173642.759139] submit_bio+0x4a/0x130
Jul 31 22:20:46 server1 kernel: [173642.759149] __swap_writepage+0x1aa/0x440
Jul 31 22:20:46 server1 kernel: [173642.759160] ? __frontswap_store+0x75/0x120
Jul 31 22:20:46 server1 kernel: [173642.759167] swap_writepage+0x32/0xa0
Jul 31 22:20:46 server1 kernel: [173642.759177] pageout+0x100/0x330
Jul 31 22:20:46 server1 kernel: [173642.759189] shrink_page_list+0xbd9/0xf00
Jul 31 22:20:46 server1 kernel: [173...


Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed-hwe-5.15 (Ubuntu):
status: New → Confirmed
Revision history for this message
Brian Corriveau (bcorrive) wrote :

Kernel 5.15.0-75 on Ubuntu 22.04, server1 and server2

Just hit this bug with a different server still running systemd to start the RAID check.

The logs indicate there may be more than one RAID check bug out there:
 - a bug with pausing the raid check
 - a bug with the raid check when it hasn't been paused

In the case of my server2, the raid check was started, then 6 hours later it was interrupted as expected by systemd, but about 20 minutes later the RAID locked up. Pause bug?

Aug 6 18:19:42 server2 systemd[1]: Starting MD array scrubbing...
Aug 6 18:19:42 server2 root: mdcheck start checking /dev/md0
Aug 6 18:19:42 server2 kernel: [2619392.036055] md: data-check of RAID array md0
...
Aug 7 00:19:44 server2 kernel: [2640994.200542] md: md0: data-check interrupted.
... (minutes later the problem happens after the check is supposedly interrupted)

Aug 7 00:23:03 server2: [2641193.425737] INFO: task jbd2/dm-0-8:1904 blocked for more than 120 seconds.
Aug 7 00:23:03 server2 kernel: [2641193.433644] Tainted: G OE 5.15.0-75-generic #82-Ubuntu
Aug 7 00:23:03 server2 kernel: [2641193.441533] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 7 00:23:03 server2 kernel: [2641193.450490] task:jbd2/dm-0-8 state:D stack: 0 pid: 1904 ppid: 2 flags:0x00004000
Aug 7 00:23:03 server2 kernel: [2641193.450503] Call Trace:
Aug 7 00:23:03 server2 kernel: [2641193.450507] <TASK>
Aug 7 00:23:03 server2 kernel: [2641193.450513] __schedule+0x24e/0x590
Aug 7 00:23:03 server2 kernel: [2641193.450531] schedule+0x69/0x110
Aug 7 00:23:03 server2 kernel: [2641193.450541] md_write_start.part.0+0x174/0x220
Aug 7 00:23:03 server2 kernel: [2641193.450553] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2641193.450575] md_write_start+0x14/0x30
Aug 7 00:23:03 server2 kernel: [2641193.450585] raid5_make_request+0x77/0x540 [raid456]
Aug 7 00:23:03 server2 kernel: [2641193.450609] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2641193.450616] md_handle_request+0x12d/0x1b0
Aug 7 00:23:03 server2 kernel: [2641193.450625] ? __blk_queue_split+0xfe/0x210
Aug 7 00:23:03 server2 kernel: [2641193.450636] md_submit_bio+0x76/0xc0
Aug 7 00:23:03 server2 kernel: [2641193.450643] __submit_bio+0x1a5/0x220
Aug 7 00:23:03 server2 kernel: [2641193.450654] ? mempool_alloc_slab+0x17/0x20
Aug 7 00:23:03 server2 kernel: [2641193.450665] __submit_bio_noacct+0x85/0x200
Aug 7 00:23:03 server2 kernel: [2641193.450674] ? kmem_cache_alloc+0x1ab/0x2f0
Aug 7 00:23:03 server2 kernel: [2641193.450687] submit_bio_noacct+0x4e/0x120
Aug 7 00:23:03 server2 kernel: [2641193.450698] submit_bio+0x4a/0x130
Aug 7 00:23:03 server2 kernel: [2641193.450707] submit_bh_wbc+0x18d/0x1c0
Aug 7 00:23:03 server2 kernel: [2641193.450713] submit_bh+0x13/0x20
Aug 7 00:23:03 server2 kernel: [2641193.450718] jbd2_journal_commit_transaction+0x861/0x17a0
Aug 7 00:23:03 server2 kernel: [2641193.450729] ? __update_idle_core+0x93/0x120
Aug 7 00:23:03 server2 kernel: [2641193.450744] kjournald2+0xa9/0x280
Aug 7 00:23:03 server2 kernel: [2641193.450750] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2...


Revision history for this message
Silas Horton (shorton5) wrote :

On Ubuntu 22.04, 5.15.0-83-generic #92-Ubuntu, our storage system ran into this bug. mdcheck ran as scheduled on the 1st day of the month and then hung 6 hours later.

Oct 1 06:52:13 server1 systemd[1]: Starting MD array scrubbing...
Oct 1 06:52:13 server1 root: mdcheck start checking /dev/md0
Oct 1 06:52:13 server1 kernel: [2129098.393495] md: data-check of RAID array md0

Oct 1 12:57:49 server1 kernel: [2151034.623372] INFO: task dmcrypt_write/2:1783 blocked for more than 241 seconds.
Oct 1 12:57:49 server1 kernel: [2151034.623446] Tainted: G S 5.15.0-83-generic #92-Ubuntu
Oct 1 12:57:49 server1 kernel: [2151034.623498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 1 12:57:49 server1 kernel: [2151034.623559] task:dmcrypt_write/2 state:D stack: 0 pid: 1783 ppid: 2 flags:0x00004000
Oct 1 12:57:49 server1 kernel: [2151034.623566] Call Trace:
Oct 1 12:57:49 server1 kernel: [2151034.623570] <TASK>
Oct 1 12:57:49 server1 kernel: [2151034.623574] __schedule+0x24e/0x590
Oct 1 12:57:49 server1 kernel: [2151034.623585] ? __schedule+0x256/0x590
Oct 1 12:57:49 server1 kernel: [2151034.623590] schedule+0x69/0x110
Oct 1 12:57:49 server1 kernel: [2151034.623596] md_write_start.part.0+0x174/0x220
Oct 1 12:57:49 server1 kernel: [2151034.623601] ? wait_woken+0x70/0x70
Oct 1 12:57:49 server1 kernel: [2151034.623610] md_write_start+0x14/0x30
Oct 1 12:57:49 server1 kernel: [2151034.623615] raid5_make_request+0x77/0x540 [raid456]
Oct 1 12:57:49 server1 kernel: [2151034.623633] ? cgroup_rstat_updated+0x11c/0x1e0
Oct 1 12:57:49 server1 kernel: [2151034.623642] ? wait_woken+0x70/0x70
Oct 1 12:57:49 server1 kernel: [2151034.623648] md_handle_request+0x12d/0x1b0
Oct 1 12:57:49 server1 kernel: [2151034.623657] ? submit_bio_checks+0x1a5/0x560
Oct 1 12:57:49 server1 kernel: [2151034.623664] md_submit_bio+0x76/0xc0
Oct 1 12:57:49 server1 kernel: [2151034.623670] __submit_bio+0x1a5/0x220
Oct 1 12:57:49 server1 kernel: [2151034.623675] ? psi_task_switch+0xc6/0x220
Oct 1 12:57:49 server1 kernel: [2151034.623682] __submit_bio_noacct+0x85/0x200
Oct 1 12:57:49 server1 kernel: [2151034.623687] submit_bio_noacct+0x4e/0x120
Oct 1 12:57:49 server1 kernel: [2151034.623691] ? schedule+0x69/0x110
Oct 1 12:57:49 server1 kernel: [2151034.623698] dmcrypt_write+0x104/0x130 [dm_crypt]
Oct 1 12:57:49 server1 kernel: [2151034.623708] ? crypt_ctr+0x600/0x600 [dm_crypt]
Oct 1 12:57:49 server1 kernel: [2151034.623715] kthread+0x12a/0x150
Oct 1 12:57:49 server1 kernel: [2151034.623723] ? set_kthread_struct+0x50/0x50
Oct 1 12:57:49 server1 kernel: [2151034.623730] ret_from_fork+0x22/0x30
Oct 1 12:57:49 server1 kernel: [2151034.623739] </TASK>
Oct 1 12:57:49 server1 kernel: [2151034.623778] INFO: task mdcheck:2323903 blocked for more than 241 seconds.
Oct 1 12:57:49 server1 kernel: [2151034.623833] Tainted: G S 5.15.0-83-generic #92-Ubuntu
Oct 1 12:57:49 server1 kernel: [2151034.625482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 1 12:57:49 server1 kernel: [2151034.627098] task:mdcheck state:D stack: 0 pid:232390...

Revision history for this message
Patrick Hampson (patrickhampson) wrote :

Noticed this issue on Ubuntu 20.04 with an md raid device. The system exhibited the same behavior other users have noted: high CPU usage and the terminal locking up until the system is rebooted.

[14715252.569157] INFO: task md1_raid4:1763945 blocked for more than 120 seconds.
[14715252.570228] Not tainted 5.4.0-146-generic #163-Ubuntu
[14715252.571277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[14715252.572347] md1_raid4 D 0 1763945 2 0x80004000
[14715252.572357] Call Trace:
[14715252.572360] __schedule+0x2e3/0x740
[14715252.572363] schedule+0x42/0xb0
[14715252.572369] raid5d+0x3e6/0x5f0 [raid456]
[14715252.572376] ? schedule_timeout+0x10e/0x160
[14715252.572381] ? __wake_up_pollfree+0x40/0x40
[14715252.572384] md_thread+0x97/0x160
[14715252.572392] ? __wake_up_pollfree+0x40/0x40
[14715252.572394] kthread+0x104/0x140
[14715252.572399] ? md_start_sync+0x60/0x60
[14715252.572403] ? kthread_park+0x90/0x90
[14715252.572405] ret_from_fork+0x35/0x40
[14715252.572430] INFO: task kworker/u64:1:3189415 blocked for more than 120 seconds.

Revision history for this message
Jan Kratochvil (jan-kratochvil) wrote :

Linux version 6.4.12-200.fc38.x86_64 (mockbuild@30894952d3244f1ab967aeda9ed417f6) (gcc (GCC) 13.2.1 20230728 (Red Hat 13.2.1-1), GNU ld version 2.39-9.fc38) #1 SMP PREEMPT_DYNAMIC Wed Aug 23 17:46:49 UTC 2023

    230 ? I< 0:00 \_ [md]
   1377 ? S 6:15 \_ [md0_raid1]
   1565 ? D 4955:37 \_ [md4_raid6]
2772538 ? DN 111:11 \_ [md4_resync]

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sde2[0]
      13671170048 blocks super 1.2 [1/1] [U]
      bitmap: 1/102 pages [4KB], 65536KB chunk

md4 : active raid6 sdd1[2] sdb1[5] sde1[4] sdc1[1]
      11720779776 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      [=================>...] check = 89.0% (5217054876/5860389888) finish=852549.6min speed=12K/sec
      bitmap: 9/44 pages [36KB], 65536KB chunk

unused devices: <none>
# cat /proc/2772538/stack # md4_resync
[<0>] raid5_get_active_stripe+0x271/0x540 [raid456]
[<0>] raid5_sync_request+0x3ad/0x3d0 [raid456]
[<0>] md_do_sync+0x7be/0x11c0
[<0>] md_thread+0xae/0x190
[<0>] kthread+0xe8/0x120
[<0>] ret_from_fork+0x2c/0x50
# cat /proc/1565/stack # md4_raid6
[<0>] raid5d+0x524/0x750 [raid456]
[<0>] md_thread+0xae/0x190
[<0>] kthread+0xe8/0x120
[<0>] ret_from_fork+0x2c/0x50

Worked around by:
# cat /sys/block/md4/md/array_state
write-pending
# echo active > /sys/block/md4/md/array_state
# cat /sys/block/md4/md/array_state
active

Revision history for this message
Lucas Magnien (haeiven) wrote :

Hello, I got the same issue with Ubuntu 22.04.3 LTS (GNU/Linux 5.15.0-91-generic x86_64). I also tried the 5.19 kernel and got the same problem.

Jan 7 02:28:26 cache4 systemd[1]: Starting MD array scrubbing...
Jan 7 02:28:26 cache4 root: mdcheck start checking /dev/md0
Jan 7 08:28:44 cache4 kernel: [2914434.326024] md: md0: data-check interrupted.
Jan 7 08:32:08 cache4 kernel: [2914638.397357] INFO: task jbd2/md0-8:1337 blocked for more than 120 seconds.
Jan 7 08:32:08 cache4 kernel: [2914638.397420] Not tainted 5.15.0-91-generic #99-Ubuntu
Jan 7 08:32:08 cache4 kernel: [2914638.397457] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 7 08:32:08 cache4 kernel: [2914638.397505] task:jbd2/md0-8 state:D stack: 0 pid: 1337 ppid: 2 flags:0x00004000
Jan 7 08:32:08 cache4 kernel: [2914638.397512] Call Trace:
Jan 7 08:32:08 cache4 kernel: [2914638.397515] <TASK>
Jan 7 08:32:08 cache4 kernel: [2914638.397520] __schedule+0x24e/0x590
Jan 7 08:32:08 cache4 kernel: [2914638.397530] schedule+0x69/0x110
Jan 7 08:32:08 cache4 kernel: [2914638.397535] md_write_start.part.0+0x174/0x220
Jan 7 08:32:08 cache4 kernel: [2914638.397540] ? wait_woken+0x70/0x70
Jan 7 08:32:08 cache4 kernel: [2914638.397547] md_write_start+0x14/0x30
Jan 7 08:32:08 cache4 kernel: [2914638.397553] raid5_make_request+0x77/0x540 [raid456]
Jan 7 08:32:08 cache4 kernel: [2914638.397566] ? jbd2_transaction_committed+0x1b/0x60
Jan 7 08:32:08 cache4 kernel: [2914638.397573] ? ext4_set_iomap+0x5a/0x1d0
Jan 7 08:32:08 cache4 kernel: [2914638.397579] ? wait_woken+0x70/0x70
Jan 7 08:32:08 cache4 kernel: [2914638.397584] md_handle_request+0x12d/0x1b0
Jan 7 08:32:08 cache4 kernel: [2914638.397589] ? submit_bio_checks+0x1a5/0x560
Jan 7 08:32:08 cache4 kernel: [2914638.397595] md_submit_bio+0x76/0xc0
Jan 7 08:32:08 cache4 kernel: [2914638.397600] __submit_bio+0x1a5/0x220
Jan 7 08:32:08 cache4 kernel: [2914638.397603] ? mempool_alloc_slab+0x17/0x20
Jan 7 08:32:08 cache4 kernel: [2914638.397611] __submit_bio_noacct+0x85/0x200
Jan 7 08:32:08 cache4 kernel: [2914638.397614] ? kmem_cache_alloc+0x1ab/0x2f0
Jan 7 08:32:08 cache4 kernel: [2914638.397619] submit_bio_noacct+0x4e/0x120
Jan 7 08:32:08 cache4 kernel: [2914638.397623] submit_bio+0x4a/0x130
Jan 7 08:32:08 cache4 kernel: [2914638.397627] submit_bh_wbc+0x18d/0x1c0
Jan 7 08:32:08 cache4 kernel: [2914638.397632] submit_bh+0x13/0x20
Jan 7 08:32:08 cache4 kernel: [2914638.397635] jbd2_journal_commit_transaction+0x861/0x17a0
Jan 7 08:32:08 cache4 kernel: [2914638.397640] ? __update_idle_core+0x93/0x120
Jan 7 08:32:08 cache4 kernel: [2914638.397649] kjournald2+0xa9/0x280
Jan 7 08:32:08 cache4 kernel: [2914638.397653] ? wait_woken+0x70/0x70
Jan 7 08:32:08 cache4 kernel: [2914638.397657] ? load_superblock.part.0+0xc0/0xc0
Jan 7 08:32:08 cache4 kernel: [2914638.397662] kthread+0x12a/0x150
Jan 7 08:32:08 cache4 kernel: [2914638.397667] ? set_kthread_struct+0x50/0x50
Jan 7 08:32:08 cache4 kernel: [2914638.397672] ret_from_fork+0x22/0x30
Jan 7 08:32:08 cache4 kernel: [2914638.397680] </TASK>

# cat /sys/block/md0/md/array_state
write-pending

This is happe...


Revision history for this message
alex (iwram) wrote :

Hi. I have the same problem on two raid5 arrays assembled into a stripe.
The total volume of the array is 97 TB, with 16 GB free.
Do other users hitting this bug also have little free space?
I couldn't reproduce this condition on a test bench under synthetic load.
Ubuntu 22.04.3 LTS
kernel 5.15.0-72-generic

Revision history for this message
Brian Corriveau (bcorrive) wrote :

I was able to reproduce the hang problem using the script mdhang.sh (modified for my raid device location) on 22.04 with kernel 5.15 several times, usually within 10-15 minutes.

Using the 22.04 HWE kernel 6.5 I was not able to reproduce the problem with a 25+ hour run of the same script. It seems something has been fixed in the newer HWE kernel.
