kernel I/O hangs during mdcheck/resync

Bug #1942935 reported by Chad Wagner
This bug affects 15 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Confirmed
Undecided
Unassigned
linux-signed-hwe-5.11 (Ubuntu)
Confirmed
Undecided
Unassigned
linux-signed-hwe-5.15 (Ubuntu)
Confirmed
Undecided
Unassigned
linux-signed-hwe-5.4 (Ubuntu)
Confirmed
Undecided
Unassigned

Bug Description

It always seems to occur during an mdcheck/resync. If I am logged in via SSH the system is still somewhat responsive and basic utilities like dmesg will work, but it appears any write I/O will hang the terminal, and nothing is written to syslog (presumably because that is blocked too).

Below is the output of dmesg and cat /proc/mdstat. It appears the data check was interrupted yet /proc/mdstat still shows progress, along with a whole slew of hung tasks, including md1_resync itself.

[756484.534293] md: data-check of RAID array md0
[756484.628039] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
[756493.808773] md: md0: data-check done.
[756493.829760] md: data-check of RAID array md1
[778112.446410] md: md1: data-check interrupted.
[810654.608102] md: data-check of RAID array md1
[832291.201064] md: md1: data-check interrupted.
[899745.389485] md: data-check of RAID array md1
[921395.835305] md: md1: data-check interrupted.
[921588.558834] INFO: task systemd-journal:376 blocked for more than 120 seconds.
[921588.558846] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921588.558850] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921588.558854] task:systemd-journal state:D stack: 0 pid: 376 ppid: 1 flags:0x00000220
[921588.558859] Call Trace:
[921588.558864] __schedule+0x44c/0x8a0
[921588.558872] schedule+0x4f/0xc0
[921588.558876] md_write_start+0x150/0x240
[921588.558880] ? wait_woken+0x80/0x80
[921588.558886] raid5_make_request+0x88/0x890 [raid456]
[921588.558898] ? wait_woken+0x80/0x80
[921588.558901] ? mempool_kmalloc+0x17/0x20
[921588.558904] md_handle_request+0x12d/0x1a0
[921588.558907] ? __part_start_io_acct+0x51/0xf0
[921588.558912] md_submit_bio+0xca/0x100
[921588.558915] submit_bio_noacct+0x112/0x4f0
[921588.558918] ? ext4_fc_reserve_space+0x110/0x230
[921588.558922] submit_bio+0x51/0x1a0
[921588.558925] ? _cond_resched+0x19/0x30
[921588.558928] ? kmem_cache_alloc+0x38e/0x440
[921588.558932] ? ext4_init_io_end+0x1f/0x50
[921588.558936] ext4_io_submit+0x4d/0x60
[921588.558940] ext4_writepages+0x2c6/0xcd0
[921588.558944] do_writepages+0x43/0xd0
[921588.558948] ? do_writepages+0x43/0xd0
[921588.558951] ? fault_dirty_shared_page+0xa5/0x110
[921588.558955] __filemap_fdatawrite_range+0xcc/0x110
[921588.558960] file_write_and_wait_range+0x74/0xc0
[921588.558962] ext4_sync_file+0xf5/0x350
[921588.558967] vfs_fsync_range+0x49/0x80
[921588.558970] do_fsync+0x3d/0x70
[921588.558973] __x64_sys_fsync+0x14/0x20
[921588.558976] do_syscall_64+0x38/0x90
[921588.558980] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[921588.558984] RIP: 0033:0x7f4c97ee832b
[921588.558987] RSP: 002b:00007ffdceb29e50 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[921588.558991] RAX: ffffffffffffffda RBX: 000055ced34b0fa0 RCX: 00007f4c97ee832b
[921588.558993] RDX: 00007f4c97fc8000 RSI: 000055ced3487b70 RDI: 0000000000000021
[921588.558995] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007ffdceb29fa8
[921588.558996] R10: 00007f4c97d2c848 R11: 0000000000000293 R12: 00007ffdceb29fa8
[921588.558998] R13: 00007ffdceb29fa0 R14: 000055ced34b0fa0 R15: 000055ced34bcf90
[921588.559014] INFO: task mysqld:1505 blocked for more than 120 seconds.
[921588.559018] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921588.559022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921588.559025] task:mysqld state:D stack: 0 pid: 1505 ppid: 1 flags:0x00000000
[921588.559030] Call Trace:
[921588.559032] __schedule+0x44c/0x8a0
[921588.559036] schedule+0x4f/0xc0
[921588.559040] md_write_start+0x150/0x240
[921588.559044] ? wait_woken+0x80/0x80
[921588.559047] raid5_make_request+0x88/0x890 [raid456]
[921588.559056] ? wait_woken+0x80/0x80
[921588.559059] ? mempool_kmalloc+0x17/0x20
[921588.559062] md_handle_request+0x12d/0x1a0
[921588.559065] ? __part_start_io_acct+0x51/0xf0
[921588.559068] md_submit_bio+0xca/0x100
[921588.559071] submit_bio_noacct+0x112/0x4f0
[921588.559075] submit_bio+0x51/0x1a0
[921588.559077] ? _cond_resched+0x19/0x30
[921588.559081] ? kmem_cache_alloc+0x38e/0x440
[921588.559084] ? ext4_init_io_end+0x1f/0x50
[921588.559088] ext4_io_submit+0x4d/0x60
[921588.559091] ext4_writepages+0x2c6/0xcd0
[921588.559094] ? __schedule+0x454/0x8a0
[921588.559097] ? hrtimer_start_range_ns+0x1aa/0x2f0
[921588.559100] ? timerqueue_del+0x24/0x50
[921588.559105] ? futex_wait+0x1ed/0x270
[921588.559109] do_writepages+0x43/0xd0
[921588.559112] ? do_writepages+0x43/0xd0
[921588.559115] ? futex_wake+0x7c/0x180
[921588.559118] __filemap_fdatawrite_range+0xcc/0x110
[921588.559123] file_write_and_wait_range+0x74/0xc0
[921588.559126] ext4_sync_file+0xf5/0x350
[921588.559129] vfs_fsync_range+0x49/0x80
[921588.559132] ? __fget_light+0x62/0x80
[921588.559136] do_fsync+0x3d/0x70
[921588.559139] __x64_sys_fsync+0x14/0x20
[921588.559142] do_syscall_64+0x38/0x90
[921588.559145] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[921588.559148] RIP: 0033:0x7f36188d48db
[921588.559150] RSP: 002b:00007f35f05487d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[921588.559154] RAX: ffffffffffffffda RBX: 00007f360502bf50 RCX: 00007f36188d48db
[921588.559156] RDX: 000d239200000000 RSI: 0000000000000000 RDI: 000000000000000a
[921588.559158] RBP: 00007f35f0548a00 R08: 0000000000000000 R09: 00007f360502beb0
[921588.559160] R10: 00007f35f0548ccf R11: 0000000000000293 R12: 00007f36042e9610
[921588.559162] R13: 000000000000000a R14: 0000000000000b00 R15: 0000000000000000
[921588.559175] INFO: task mdcheck:40625 blocked for more than 120 seconds.
[921588.559179] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921588.559182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921588.559185] task:mdcheck state:D stack: 0 pid:40625 ppid: 1 flags:0x00000000
[921588.559189] Call Trace:
[921588.559191] __schedule+0x44c/0x8a0
[921588.559196] schedule+0x4f/0xc0
[921588.559200] schedule_timeout+0x202/0x290
[921588.559203] ? do_truncate+0x78/0xd0
[921588.559216] ? try_to_wake_up+0x66/0x5c0
[921588.559220] wait_for_completion+0x94/0x100
[921588.559225] kthread_stop+0x71/0x170
[921588.559229] md_unregister_thread+0x47/0x80
[921588.559232] md_reap_sync_thread+0x24/0x200
[921588.559236] action_store+0x160/0x2e0
[921588.559239] md_attr_store+0x82/0xc0
[921588.559243] sysfs_kf_write+0x3e/0x50
[921588.559246] kernfs_fop_write_iter+0x138/0x1c0
[921588.559249] new_sync_write+0x117/0x1b0
[921588.559254] vfs_write+0x1ca/0x280
[921588.559257] ksys_write+0x67/0xe0
[921588.559260] __x64_sys_write+0x1a/0x20
[921588.559264] do_syscall_64+0x38/0x90
[921588.559267] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[921588.559271] RIP: 0033:0x7f8c6d2791e7
[921588.559273] RSP: 002b:00007ffea5d97178 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[921588.559276] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8c6d2791e7
[921588.559278] RDX: 0000000000000005 RSI: 000055b618a3eb40 RDI: 0000000000000001
[921588.559280] RBP: 000055b618a3eb40 R08: 000000000000000a R09: 0000000000000004
[921588.559282] R10: 000055b616e97017 R11: 0000000000000246 R12: 0000000000000005
[921588.559284] R13: 00007f8c6d3546a0 R14: 00007f8c6d3554a0 R15: 00007f8c6d3548a0
[921588.559287] INFO: task md1_resync:40644 blocked for more than 120 seconds.
[921588.559291] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921588.559294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921588.559297] task:md1_resync state:D stack: 0 pid:40644 ppid: 2 flags:0x00004000
[921588.559301] Call Trace:
[921588.559303] __schedule+0x44c/0x8a0
[921588.559308] schedule+0x4f/0xc0
[921588.559311] md_do_sync.cold+0x8b7/0x953
[921588.559315] ? wait_woken+0x80/0x80
[921588.559319] ? kernel_sigaction+0x3b/0xd0
[921588.559323] md_thread+0xb1/0x170
[921588.559326] kthread+0x114/0x150
[921588.559329] ? md_start_sync+0x60/0x60
[921588.559332] ? kthread_park+0x90/0x90
[921588.559335] ret_from_fork+0x22/0x30
[921588.559341] INFO: task kworker/u16:0:41409 blocked for more than 120 seconds.
[921588.559345] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921588.559348] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921588.559353] task:kworker/u16:0 state:D stack: 0 pid:41409 ppid: 2 flags:0x00004000
[921588.559357] Workqueue: writeback wb_workfn (flush-253:1)
[921588.559361] Call Trace:
[921588.559363] __schedule+0x44c/0x8a0
[921588.559367] schedule+0x4f/0xc0
[921588.559371] md_write_start+0x150/0x240
[921588.559375] ? wait_woken+0x80/0x80
[921588.559378] raid5_make_request+0x88/0x890 [raid456]
[921588.559387] ? wait_woken+0x80/0x80
[921588.559390] ? mempool_kmalloc+0x17/0x20
[921588.559393] md_handle_request+0x12d/0x1a0
[921588.559396] ? __part_start_io_acct+0x51/0xf0
[921588.559399] md_submit_bio+0xca/0x100
[921588.559402] submit_bio_noacct+0x112/0x4f0
[921588.559405] submit_bio+0x51/0x1a0
[921588.559408] ? _cond_resched+0x19/0x30
[921588.559412] ? kmem_cache_alloc+0x38e/0x440
[921588.559415] ? ext4_init_io_end+0x1f/0x50
[921588.559419] ext4_io_submit+0x4d/0x60
[921588.559422] ext4_writepages+0x2c6/0xcd0
[921588.559425] ? update_group_capacity+0x2c/0x1e0
[921588.559430] do_writepages+0x43/0xd0
[921588.559433] ? do_writepages+0x43/0xd0
[921588.559436] __writeback_single_inode+0x44/0x2b0
[921588.559439] writeback_sb_inodes+0x22d/0x4c0
[921588.559443] __writeback_inodes_wb+0x56/0xf0
[921588.559446] wb_writeback+0x1ea/0x2a0
[921588.559450] wb_workfn+0x31b/0x490
[921588.559453] ? __update_idle_core+0x9b/0xb0
[921588.559456] process_one_work+0x220/0x3c0
[921588.559459] worker_thread+0x4d/0x3f0
[921588.559462] kthread+0x114/0x150
[921588.559465] ? process_one_work+0x3c0/0x3c0
[921588.559467] ? kthread_park+0x90/0x90
[921588.559471] ret_from_fork+0x22/0x30
[921709.391056] INFO: task jbd2/dm-1-8:319 blocked for more than 120 seconds.
[921709.391068] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921709.391073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921709.391077] task:jbd2/dm-1-8 state:D stack: 0 pid: 319 ppid: 2 flags:0x00004000
[921709.391083] Call Trace:
[921709.391087] __schedule+0x44c/0x8a0
[921709.391094] schedule+0x4f/0xc0
[921709.391098] md_write_start+0x150/0x240
[921709.391103] ? wait_woken+0x80/0x80
[921709.391108] raid5_make_request+0x88/0x890 [raid456]
[921709.391120] ? wait_woken+0x80/0x80
[921709.391123] ? mempool_kmalloc+0x17/0x20
[921709.391127] md_handle_request+0x12d/0x1a0
[921709.391130] ? __part_start_io_acct+0x51/0xf0
[921709.391134] md_submit_bio+0xca/0x100
[921709.391137] submit_bio_noacct+0x112/0x4f0
[921709.391141] submit_bio+0x51/0x1a0
[921709.391144] ? bio_add_page+0x6a/0x90
[921709.391147] submit_bh_wbc+0x195/0x1c0
[921709.391151] submit_bh+0x13/0x20
[921709.391154] jbd2_journal_commit_transaction+0x74b/0x18f0
[921709.391159] ? try_to_del_timer_sync+0x54/0x80
[921709.391164] kjournald2+0xb6/0x280
[921709.391168] ? wait_woken+0x80/0x80
[921709.391170] kthread+0x114/0x150
[921709.391174] ? commit_timeout+0x20/0x20
[921709.391177] ? kthread_park+0x90/0x90
[921709.391180] ret_from_fork+0x22/0x30
[921709.391186] INFO: task systemd-journal:376 blocked for more than 241 seconds.
[921709.391191] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921709.391195] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921709.391199] task:systemd-journal state:D stack: 0 pid: 376 ppid: 1 flags:0x00000220
[921709.391204] Call Trace:
[921709.391205] __schedule+0x44c/0x8a0
[921709.391210] schedule+0x4f/0xc0
[921709.391214] md_write_start+0x150/0x240
[921709.391218] ? wait_woken+0x80/0x80
[921709.391221] raid5_make_request+0x88/0x890 [raid456]
[921709.391230] ? wait_woken+0x80/0x80
[921709.391233] ? mempool_kmalloc+0x17/0x20
[921709.391236] md_handle_request+0x12d/0x1a0
[921709.391239] ? __part_start_io_acct+0x51/0xf0
[921709.391242] md_submit_bio+0xca/0x100
[921709.391245] submit_bio_noacct+0x112/0x4f0
[921709.391248] ? ext4_fc_reserve_space+0x110/0x230
[921709.391252] submit_bio+0x51/0x1a0
[921709.391255] ? _cond_resched+0x19/0x30
[921709.391259] ? kmem_cache_alloc+0x38e/0x440
[921709.391262] ? ext4_init_io_end+0x1f/0x50
[921709.391266] ext4_io_submit+0x4d/0x60
[921709.391270] ext4_writepages+0x2c6/0xcd0
[921709.391274] do_writepages+0x43/0xd0
[921709.391278] ? do_writepages+0x43/0xd0
[921709.391281] ? fault_dirty_shared_page+0xa5/0x110
[921709.391285] __filemap_fdatawrite_range+0xcc/0x110
[921709.391290] file_write_and_wait_range+0x74/0xc0
[921709.391293] ext4_sync_file+0xf5/0x350
[921709.391297] vfs_fsync_range+0x49/0x80
[921709.391300] do_fsync+0x3d/0x70
[921709.391303] __x64_sys_fsync+0x14/0x20
[921709.391306] do_syscall_64+0x38/0x90
[921709.391310] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[921709.391314] RIP: 0033:0x7f4c97ee832b
[921709.391317] RSP: 002b:00007ffdceb29e50 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[921709.391321] RAX: ffffffffffffffda RBX: 000055ced34b0fa0 RCX: 00007f4c97ee832b
[921709.391323] RDX: 00007f4c97fc8000 RSI: 000055ced3487b70 RDI: 0000000000000021
[921709.391325] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007ffdceb29fa8
[921709.391327] R10: 00007f4c97d2c848 R11: 0000000000000293 R12: 00007ffdceb29fa8
[921709.391329] R13: 00007ffdceb29fa0 R14: 000055ced34b0fa0 R15: 000055ced34bcf90
[921709.391343] INFO: task mysqld:1505 blocked for more than 241 seconds.
[921709.391348] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921709.391351] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921709.391355] task:mysqld state:D stack: 0 pid: 1505 ppid: 1 flags:0x00000000
[921709.391359] Call Trace:
[921709.391361] __schedule+0x44c/0x8a0
[921709.391366] schedule+0x4f/0xc0
[921709.391370] md_write_start+0x150/0x240
[921709.391373] ? wait_woken+0x80/0x80
[921709.391376] raid5_make_request+0x88/0x890 [raid456]
[921709.391386] ? wait_woken+0x80/0x80
[921709.391388] ? mempool_kmalloc+0x17/0x20
[921709.391391] md_handle_request+0x12d/0x1a0
[921709.391395] ? __part_start_io_acct+0x51/0xf0
[921709.391398] md_submit_bio+0xca/0x100
[921709.391401] submit_bio_noacct+0x112/0x4f0
[921709.391404] submit_bio+0x51/0x1a0
[921709.391407] ? _cond_resched+0x19/0x30
[921709.391411] ? kmem_cache_alloc+0x38e/0x440
[921709.391414] ? ext4_init_io_end+0x1f/0x50
[921709.391418] ext4_io_submit+0x4d/0x60
[921709.391421] ext4_writepages+0x2c6/0xcd0
[921709.391423] ? __schedule+0x454/0x8a0
[921709.391427] ? hrtimer_start_range_ns+0x1aa/0x2f0
[921709.391430] ? timerqueue_del+0x24/0x50
[921709.391435] ? futex_wait+0x1ed/0x270
[921709.391439] do_writepages+0x43/0xd0
[921709.391443] ? do_writepages+0x43/0xd0
[921709.391445] ? futex_wake+0x7c/0x180
[921709.391449] __filemap_fdatawrite_range+0xcc/0x110
[921709.391453] file_write_and_wait_range+0x74/0xc0
[921709.391456] ext4_sync_file+0xf5/0x350
[921709.391459] vfs_fsync_range+0x49/0x80
[921709.391463] ? __fget_light+0x62/0x80
[921709.391466] do_fsync+0x3d/0x70
[921709.391469] __x64_sys_fsync+0x14/0x20
[921709.391472] do_syscall_64+0x38/0x90
[921709.391475] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[921709.391478] RIP: 0033:0x7f36188d48db
[921709.391481] RSP: 002b:00007f35f05487d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[921709.391484] RAX: ffffffffffffffda RBX: 00007f360502bf50 RCX: 00007f36188d48db
[921709.391486] RDX: 000d239200000000 RSI: 0000000000000000 RDI: 000000000000000a
[921709.391488] RBP: 00007f35f0548a00 R08: 0000000000000000 R09: 00007f360502beb0
[921709.391490] R10: 00007f35f0548ccf R11: 0000000000000293 R12: 00007f36042e9610
[921709.391491] R13: 000000000000000a R14: 0000000000000b00 R15: 0000000000000000
[921709.391505] INFO: task mdcheck:40625 blocked for more than 241 seconds.
[921709.391510] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921709.391514] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921709.391518] task:mdcheck state:D stack: 0 pid:40625 ppid: 1 flags:0x00000000
[921709.391522] Call Trace:
[921709.391524] __schedule+0x44c/0x8a0
[921709.391528] schedule+0x4f/0xc0
[921709.391532] schedule_timeout+0x202/0x290
[921709.391535] ? do_truncate+0x78/0xd0
[921709.391538] ? try_to_wake_up+0x66/0x5c0
[921709.391542] wait_for_completion+0x94/0x100
[921709.391547] kthread_stop+0x71/0x170
[921709.391550] md_unregister_thread+0x47/0x80
[921709.391554] md_reap_sync_thread+0x24/0x200
[921709.391557] action_store+0x160/0x2e0
[921709.391561] md_attr_store+0x82/0xc0
[921709.391564] sysfs_kf_write+0x3e/0x50
[921709.391567] kernfs_fop_write_iter+0x138/0x1c0
[921709.391571] new_sync_write+0x117/0x1b0
[921709.391575] vfs_write+0x1ca/0x280
[921709.391578] ksys_write+0x67/0xe0
[921709.391581] __x64_sys_write+0x1a/0x20
[921709.391584] do_syscall_64+0x38/0x90
[921709.391588] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[921709.391591] RIP: 0033:0x7f8c6d2791e7
[921709.391593] RSP: 002b:00007ffea5d97178 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[921709.391597] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8c6d2791e7
[921709.391599] RDX: 0000000000000005 RSI: 000055b618a3eb40 RDI: 0000000000000001
[921709.391600] RBP: 000055b618a3eb40 R08: 000000000000000a R09: 0000000000000004
[921709.391602] R10: 000055b616e97017 R11: 0000000000000246 R12: 0000000000000005
[921709.391604] R13: 00007f8c6d3546a0 R14: 00007f8c6d3554a0 R15: 00007f8c6d3548a0
[921709.391607] INFO: task md1_resync:40644 blocked for more than 241 seconds.
[921709.391611] Not tainted 5.11.0-27-generic #29~20.04.1-Ubuntu
[921709.391615] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[921709.391619] task:md1_resync state:D stack: 0 pid:40644 ppid: 2 flags:0x00004000
[921709.391623] Call Trace:
[921709.391625] __schedule+0x44c/0x8a0
[921709.391631] schedule+0x4f/0xc0
[921709.391635] md_do_sync.cold+0x8b7/0x953
[921709.391639] ? wait_woken+0x80/0x80
[921709.391642] ? kernel_sigaction+0x3b/0xd0
[921709.391646] md_thread+0xb1/0x170
[921709.391650] kthread+0x114/0x150
[921709.391652] ? md_start_sync+0x60/0x60
[921709.391655] ? kthread_park+0x90/0x90
[921709.391658] ret_from_fork+0x22/0x30
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sda3[0] sdd3[3] sdb3[1] sde3[5] sdc3[6]
      15625582592 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
      [================>....] check = 84.2% (3289834208/3906395648) finish=82779.6min speed=124K/sec

md0 : active raid1 sda2[0] sdb2[1] sdc2[2](S)
      488128 blocks super 1.2 [2/2] [UU]

unused devices: <none>

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.11.0-27-generic 5.11.0-27.29~20.04.1
ProcVersionSignature: Ubuntu 5.11.0-27.29~20.04.1-generic 5.11.22
Uname: Linux 5.11.0-27-generic x86_64
ApportVersion: 2.20.11-0ubuntu27.18
Architecture: amd64
CasperMD5CheckResult: skip
Date: Tue Sep 7 16:35:01 2021
InstallationDate: Installed on 2014-03-12 (2736 days ago)
InstallationMedia: Ubuntu-Server 12.04.4 LTS "Precise Pangolin" - Release amd64 (20140204)
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed-hwe-5.11
UpgradeStatus: Upgraded to focal on 2020-12-20 (261 days ago)
---
ProblemType: Bug
AlsaVersion: Advanced Linux Sound Architecture Driver Version k5.13.0-16-generic.
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu69
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: [Errno 2] No such file or directory: 'fuser'
Card0.Amixer.info: Error: [Errno 2] No such file or directory: 'amixer'
Card0.Amixer.values: Error: [Errno 2] No such file or directory: 'amixer'
CasperMD5CheckResult: unknown
DistroRelease: Ubuntu 21.10
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lspci: Error: [Errno 2] No such file or directory: 'lspci'
Lspci-vt: Error: [Errno 2] No such file or directory: 'lspci'
Lsusb: Error: [Errno 2] No such file or directory: 'lsusb'
Lsusb-t: Error: [Errno 2] No such file or directory: 'lsusb'
Lsusb-v: Error: [Errno 2] No such file or directory: 'lsusb'
MachineType: innotek GmbH VirtualBox
Package: linux-signed-hwe-5.11
PciMultimedia:

ProcEnviron:
 TERM=linux
 PATH=(custom, no user)
 SHELL=/bin/bash
ProcFB: 0 svgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.13.0-16-generic root=/dev/mapper/vg1-root ro quiet splash
ProcVersionSignature: Ubuntu 5.13.0-16.16-generic 5.13.13
RelatedPackageVersions:
 linux-restricted-modules-5.13.0-16-generic N/A
 linux-backports-modules-5.13.0-16-generic N/A
 linux-firmware 1.199
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: impish
Uname: Linux 5.13.0-16-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 12/01/2006
dmi.bios.vendor: innotek GmbH
dmi.bios.version: VirtualBox
dmi.board.name: VirtualBox
dmi.board.vendor: Oracle Corporation
dmi.board.version: 1.2
dmi.chassis.type: 1
dmi.chassis.vendor: Oracle Corporation
dmi.modalias: dmi:bvninnotekGmbH:bvrVirtualBox:bd12/01/2006:svninnotekGmbH:pnVirtualBox:pvr1.2:sku:rvnOracleCorporation:rnVirtualBox:rvr1.2:cvnOracleCorporation:ct1:cvr:
dmi.product.family: Virtual Machine
dmi.product.name: VirtualBox
dmi.product.version: 1.2
dmi.sys.vendor: innotek GmbH

Revision history for this message
Chad Wagner (chad-wagner) wrote :
Revision history for this message
Chad Wagner (chad-wagner) wrote :
Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

Hello Chad Wagner,

Thank you for reporting this issue. Could you please try installing the latest 20.04 HWE kernel and check whether the problem persists? The version currently in focal-updates is 5.11.0-34.36~20.04.1.
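If it isn't already installed, the 20.04 HWE kernel can normally be pulled in through the meta package, roughly as follows (a sketch; exact package selection can differ depending on how the system was installed):

# move to the 20.04 HWE kernel stack and reboot into it
sudo apt update
sudo apt install linux-generic-hwe-20.04
sudo reboot
# afterwards, confirm the running version
uname -r
cat /proc/version_signature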

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Hi Kleber,
  I installed it late yesterday, but I won't know until the next resync. This has been a problem since at least the 5.4 kernel that shipped with Ubuntu 20.04. I don't think I had these problems on Ubuntu 18.04 LTS on the same hardware, running the linux-image-generic kernel of that time.

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Similar report here on 5.10.0-rc4:
https://www.spinics.net/lists/raid/msg66654.html

I ended up masking the services introduced with 20.04 LTS, and switched back the crontab.

systemctl mask mdcheck_continue.service mdcheck_continue.timer mdcheck_start.service mdcheck_start.timer
cat > /etc/cron.d/mdadm << 'EOF'
#
# cron.d/mdadm -- schedules periodic redundancy checks of MD devices
#
# Copyright © martin f. krafft <email address hidden>
# distributed under the terms of the Artistic Licence 2.0
#

# By default, run at 00:57 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).
57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi
EOF

The pausing and resuming of the integrity check was an annoyance for me anyways.
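
To double-check that the timers really are masked and the cron entry is picked up, something along these lines works:

systemctl is-enabled mdcheck_start.timer mdcheck_continue.timer   # should print "masked"
systemctl list-timers 'mdcheck*' --all                            # should show no scheduled runs
cat /etc/cron.d/mdadm                                             # the replacement schedule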

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Here is the proposed patch; it doesn't appear to have been applied. The last report was with 5.11-rc5.

https://lore.kernel<email address hidden>/

Revision history for this message
Mikko Rantalainen (mira) wrote :

I think I've seen this issue once on each of two different systems, so I think this is a software issue. Does anybody know if the patch in comment #6 is going to be included in Ubuntu 20.04 LTS?

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed-hwe-5.11 (Ubuntu):
status: New → Confirmed
Revision history for this message
Mikko Rantalainen (mira) wrote :

One of the systems was using the linux-generic package, in practice:

Linux pedabackup 5.4.0-80-generic #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/version_signature
Ubuntu 5.4.0-80.90-generic 5.4.124

Revision history for this message
Chad Wagner (chad-wagner) wrote :

The patch hasn't made it into mainline from what I have seen; it looks like it died back in March waiting for feedback from additional kernel developers. From what I have gathered, this is a deadlock scenario directly caused by pausing the resync while the system is under heavy write activity.

Donald Buczek provided a reproducer which is a shell script that generates a lot of write activity and pauses/resumes the raid scrubbing. And he also provided a workaround to get the stuck system running without reboot:

echo active > /sys/block/md1/md/array_state

I haven't tried the patch or any of this. I pretty much eliminated the trigger, which is mdcheck_start & mdcheck_continue, and went back to the 18.04 LTS way of scrubbing arrays (which is basically: don't pause/interrupt it once it starts). I ran a checkarray through to 100% yesterday with no problems. Meanwhile, since upgrading to 20.04 LTS it had hung almost every single time, through the 5.4, 5.8, and 5.11 kernels.
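
For anyone who hits the hang, recovering without a reboot looks roughly like this (md1 is just the device name from my setup; check which md device is stuck on your system, and the stuck array_state has been reported as "write-pending"):

# confirm the array is wedged: a check/resync shown in /proc/mdstat that makes no progress,
# and array_state stuck instead of "active"
cat /proc/mdstat
cat /sys/block/md1/md/sync_action
cat /sys/block/md1/md/array_state

# Donald Buczek's workaround: push the array back to active so the blocked writers wake up
echo active > /sys/block/md1/md/array_state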

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Here is Donald Buczek's reproducer script. I set up an Ubuntu 20.04 VM with the latest linux-image-generic and was able to reproduce it within maybe 10 or 15 minutes. Exactly the same issue.

Filesystem layout built as follows:

# assemble raid devices
mdadm --create /dev/md0 --level=1 --raid-devices=2 --spare-devices=1 /dev/sda2 /dev/sdb2 /dev/sdc2
mdadm --create /dev/md1 --level=5 --raid-devices=5 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3

# create PVs, VGs, LVs
pvcreate /dev/md1
vgcreate vg1 /dev/md1
lvcreate --name root --extents 100%FREE vg1

# create filesystems
mkfs.ext4 /dev/md0
mkfs.ext4 /dev/vg1/root
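
The script itself is attached; the pattern it exercises is sustained write traffic combined with repeatedly starting and interrupting a data check. A rough illustration of that pattern, not the attached script itself (device and mount point names are only examples):

#!/bin/bash
# illustration of the load pattern only -- the real reproducer is the attached mdhang script
MD=/sys/block/md1/md   # the raid5 array
MNT=/mnt/test          # any mount point on a filesystem backed by that array

# background writer: keep writing and fsyncing so md_write_start is constantly exercised
( while :; do dd if=/dev/zero of="$MNT/loadfile" bs=1M count=512 conv=fsync 2>/dev/null; done ) &
writer=$!
trap 'kill "$writer"' EXIT

# repeatedly start a data check and then interrupt it, the way the
# mdcheck_start/mdcheck_continue units pause and resume the scrub
while :; do
    echo check > "$MD/sync_action"
    sleep 60
    echo idle > "$MD/sync_action"
    sleep 10
done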

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1942935

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: hirsute
tags: removed: hirsute
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Chad Wagner (chad-wagner) wrote :
tags: added: apport-collected impish
description: updated
Revision history for this message
Chad Wagner (chad-wagner) wrote : AlsaDevices.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : CRDA.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : Card0.Codecs.codec97.0.ac97.0-0.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : Card0.Codecs.codec97.0.ac97.0-0.regs.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : ProcCpuinfoMinimal.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : ProcModules.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : UdevDb.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : WifiSyslog.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote : acpidump.txt

apport information

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Same issue on impish 5.13.13 kernel, running in VBox.

tags: added: patch
Revision history for this message
Chad Wagner (chad-wagner) wrote :
Revision history for this message
Robert Lippmann (rob0214) wrote :

I think I'm running into this too.

Raid 6 array with 4 drives.

Linux pkitty 5.13.0-52-generic #59~20.04.1-Ubuntu SMP Thu Jun 16 21:21:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

I have all filesystems unmounted (so there's no other activity that should be occurring on the drives).

Once it hangs, even trying to do

echo idle > /sys/block/md0/md/sync_action

hangs and the entire array locks up.

It's not a matter of if it will happen, but when.

Is there any way to backport the fixes? 5.19 is a long way away.

kern.log snippet:

Jul 5 19:16:36 pkitty kernel: [ 4471.805561] INFO: task scsi_eh_1:276 blocked for more than 120 seconds.
Jul 5 19:16:36 pkitty kernel: [ 4471.805573] Tainted: G OE 5.13.0-52-generic #59~20.04.1-Ubuntu
Jul 5 19:16:36 pkitty kernel: [ 4471.805577] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 5 19:16:36 pkitty kernel: [ 4471.805580] task:scsi_eh_1 state:D stack: 0 pid: 276 ppid: 2 flags:0x00004000
Jul 5 19:16:36 pkitty kernel: [ 4471.805588] Call Trace:
Jul 5 19:16:36 pkitty kernel: [ 4471.805591] <TASK>
Jul 5 19:16:36 pkitty kernel: [ 4471.805596] __schedule+0x2ee/0x900
Jul 5 19:16:36 pkitty kernel: [ 4471.805605] schedule+0x4f/0xc0
Jul 5 19:16:36 pkitty kernel: [ 4471.805609] schedule_preempt_disabled+0xe/0x10
Jul 5 19:16:36 pkitty kernel: [ 4471.805614] __mutex_lock.isra.0+0x183/0x4d0
Jul 5 19:16:36 pkitty kernel: [ 4471.805618] ? dequeue_entity+0xdb/0x410
Jul 5 19:16:36 pkitty kernel: [ 4471.805627] ? blk_mq_find_and_get_req+0x4f/0x90
Jul 5 19:16:36 pkitty kernel: [ 4471.805634] __mutex_lock_slowpath+0x13/0x20
Jul 5 19:16:36 pkitty kernel: [ 4471.805639] mutex_lock+0x32/0x40
Jul 5 19:16:36 pkitty kernel: [ 4471.805645] device_reset+0x22/0x50 [usb_storage]
Jul 5 19:16:36 pkitty kernel: [ 4471.805654] scsi_eh_ready_devs+0x580/0xa50
Jul 5 19:16:36 pkitty kernel: [ 4471.805683] kthread+0x12b/0x150
Jul 5 19:16:36 pkitty kernel: [ 4471.805687] ? set_kthread_struct+0x40/0x40
Jul 5 19:16:36 pkitty kernel: [ 4471.805692] ret_from_fork+0x22/0x30
Jul 5 19:16:36 pkitty kernel: [ 4471.805701] </TASK>
Jul 5 19:16:36 pkitty kernel: [ 4471.805704] INFO: task usb-storage:278 blocked for more than 241 seconds.
Jul 5 19:16:36 pkitty kernel: [ 4471.805708] Tainted: G OE 5.13.0-52-generic #59~20.04.1-Ubuntu
Jul 5 19:16:36 pkitty kernel: [ 4471.805711] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 5 19:16:36 pkitty kernel: [ 4471.805713] task:usb-storage state:D stack: 0 pid: 278 ppid: 2 flags:0x00004000
Jul 5 19:16:36 pkitty kernel: [ 4471.805718] Call Trace:
Jul 5 19:16:36 pkitty kernel: [ 4471.805720] <TASK>
Jul 5 19:16:36 pkitty kernel: [ 4471.805722] __schedule+0x2ee/0x900
Jul 5 19:16:36 pkitty kernel: [ 4471.805727] ? usleep_range_state+0x90/0x90
Jul 5 19:16:36 pkitty kernel: [ 4471.805733] schedule+0x4f/0xc0
Jul 5 19:16:36 pkitty kernel: [ 4471.805736] schedule_timeout+0x202/0x290
Jul 5 19:16:36 pkitty kernel: [ 4471.805742] ? xhci_urb_enqueue+0x1e4/0x520
Jul 5 19:16:36 pkitty kernel: [ 4471.805746] ? usleep_range_state+0x90/0x90
Jul 5 19:16:36 ...

Revision history for this message
Robert Lippmann (rob0214) wrote :

Tried the patch mentioned in comment #27, as well as the one attached to this bug.

Still hangs on check.

I don’t know much about the mdadm code, but why is it grabbing exclusive locks on a check, which is mostly read operations?

Revision history for this message
Chad Wagner (chad-wagner) wrote :

I believe to resolve the deadlock you want to do:
echo active > /sys/block/md1/md/array_state

Not "idle". You should see a hung task for mdcheck in there somewhere as well, and it only occurs when the raid is resyncing (md_resync should be running), at least for me I the workaround in comment 5:
https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe-5.11/+bug/1942935/comments/5

I haven't had a problem since, and I have since upgraded to 22.04 with 5.15 as well. I am pretty sure the problem is still there; it surfaced in 20.04 when they changed the raid consistency check. The new check pauses the check after 8 hours and triggers the deadlock, so I just let it run to completion like it did in 18.04.

I have not checked the patches, but it's possible you have a different problem, because I don't see any calls into md_* code in either of the two hung process traces.

Revision history for this message
Robert Lippmann (rob0214) wrote :

I'm running checkarray manually; I took off all the start and stop stuff like you did.

echo active > /sys/block/md0/md/array_state doesn't fix it.

I must not have captured the full trace last time. I've attached it here.

Revision history for this message
Robert Lippmann (rob0214) wrote :

Digging further, I think I might be running into this bug:

https://<email address hidden>/T/

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Yeah, that's the same issue as this one. The raid is doing a consistency check (mdcheck), gets transitioned to the "idle" state, and hits a deadlock that causes all I/O through the md device to block. The workaround is to change the array state back to active.

I made the changes in #5 almost a year ago and have had no problems. Before that, it hung pretty much every single month when the scheduled consistency check was triggered, ever since upgrading to 20.04.

Revision history for this message
Robert Lippmann (rob0214) wrote :

It won’t let me change the state back to active.

Every time I try, nothing happens and array_state is still idle.

Revision history for this message
Janne Blomqvist (blomqvist-janne) wrote :

The second of the patches mentioned in #27 (with git SHA 1e2677...) has, I believe, been backported to Ubuntu kernels 5.15.0-48 and 5.4.0-126.

We've still hit this with Ubuntu Jammy on 5.15.0-53, so I guess the first commit needs to be backported as well.

Revision history for this message
Robert Lippmann (rob0214) wrote :

Turns out my issue was a faulty drive, and the system would lock up when mdadm hit the bad sectors on resync. The issue seemed to be lower down in the blockdev code, causing a deadlock.

I replaced the drive and the problem went away.
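
For anyone else seeing hangs, it is worth ruling out a failing member drive before blaming the md deadlock; for example (device names are examples, and smartmontools needs to be installed):

# SMART health and the usual bad-sector counters for each array member
for d in /dev/sd[a-d]; do
    echo "== $d =="
    smartctl -H -A "$d" | grep -Ei 'overall-health|Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
done

# kernel-side I/O or medium errors on the members
dmesg | grep -iE 'i/o error|medium error' | tail -n 50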

Revision history for this message
Chad Wagner (chad-wagner) wrote :

Comment #5 (https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe-5.11/+bug/1942935/comments/5) has been a stable workaround for me (basically revert back to a continuous resync like 18.04).

My newer machines are using ZFS with raidz2 pools.

Revision history for this message
Kilian Felder (k.felder) wrote :

I hit this bug with Ubuntu 22.04 (Jammy) on Kernel 5.15.0-56

For testing, I have set up a RAID1 and a RAID5 in a VM. I put the disks for each RAID on a separate controller. Based on the 'mdhang' script (see comment #11), I was able to reproduce the error easily.

I ran 2 'mdhang' scripts at the same time, one for RAID1 and one for RAID5. The RAID5 blocked after a short time, while the RAID1 continued to run without problems. So it probably only affects RAID5.

On my production system I have now activated the workaround (see comment #5).

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed-hwe-5.4 (Ubuntu):
status: New → Confirmed
Revision history for this message
Brian Corriveau (bcorrive) wrote :

I hit this bug as well in Ubuntu 22.04 with kernel 5.15.0-67-generic

We have a single RAID 5 on 3 drives for 28T. I'm switching to the workaround from
comment #5.

Jun 4 12:48:11 server1 kernel: [1622699.548591] INFO: task md0_raid5:406 blocked for more than 120 seconds.
Jun 4 12:48:11 server1 kernel: [1622699.556202] Tainted: G OE 5.15.0-67-generic #74-Ubuntu
Jun 4 12:48:11 server1 kernel: [1622699.564101] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 4 12:48:11 server1 kernel: [1622699.573063] task:md0_raid5 state:D stack: 0 pid: 406 ppid: 2 flags:0x00004000
Jun 4 12:48:11 server1 kernel: [1622699.573077] Call Trace:
Jun 4 12:48:11 server1 kernel: [1622699.573081] <TASK>
Jun 4 12:48:11 server1 kernel: [1622699.573087] __schedule+0x24e/0x590
Jun 4 12:48:11 server1 kernel: [1622699.573103] schedule+0x69/0x110
Jun 4 12:48:11 server1 kernel: [1622699.573115] raid5d+0x3d9/0x5f0 [raid456]
Jun 4 12:48:11 server1 kernel: [1622699.573140] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.573151] md_thread+0xad/0x170
Jun 4 12:48:11 server1 kernel: [1622699.573162] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.573169] ? md_write_inc+0x60/0x60
Jun 4 12:48:11 server1 kernel: [1622699.573176] kthread+0x12a/0x150
Jun 4 12:48:11 server1 kernel: [1622699.573187] ? set_kthread_struct+0x50/0x50
Jun 4 12:48:11 server1 kernel: [1622699.573197] ret_from_fork+0x22/0x30
Jun 4 12:48:11 server1 kernel: [1622699.573212] </TASK>
Jun 4 12:48:11 server1 kernel: [1622699.573231] INFO: task jbd2/dm-0-8:1375 blocked for more than 120 seconds.
Jun 4 12:48:11 server1 kernel: [1622699.581119] Tainted: G OE 5.15.0-67-generic #74-Ubuntu
Jun 4 12:48:11 server1 kernel: [1622699.589004] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 4 12:48:11 server1 kernel: [1622699.597959] task:jbd2/dm-0-8 state:D stack: 0 pid: 1375 ppid: 2 flags:0x00004000
Jun 4 12:48:11 server1 kernel: [1622699.597968] Call Trace:
Jun 4 12:48:11 server1 kernel: [1622699.597970] <TASK>
Jun 4 12:48:11 server1 kernel: [1622699.597973] __schedule+0x24e/0x590
Jun 4 12:48:11 server1 kernel: [1622699.597984] schedule+0x69/0x110
Jun 4 12:48:11 server1 kernel: [1622699.597992] md_write_start.part.0+0x174/0x220
Jun 4 12:48:11 server1 kernel: [1622699.598002] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.598024] md_write_start+0x14/0x30
Jun 4 12:48:11 server1 kernel: [1622699.598032] raid5_make_request+0x77/0x540 [raid456]
Jun 4 12:48:11 server1 kernel: [1622699.598051] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.598058] md_handle_request+0x12d/0x1b0
Jun 4 12:48:11 server1 kernel: [1622699.598065] ? __blk_queue_split+0xfe/0x200
Jun 4 12:48:11 server1 kernel: [1622699.598075] md_submit_bio+0x71/0xc0
Jun 4 12:48:11 server1 kernel: [1622699.598082] __submit_bio+0x1a5/0x220
Jun 4 12:48:11 server1 kernel: [1622699.598091] ? mempool_alloc_slab+0x17/0x20
Jun 4 12:48:11 server1 kernel: [1622699.598102] __submit_bio_noacct+0x85/0x200
Jun 4 12:48:11 server1 kernel: [1622699.598...


Revision history for this message
Filip Hruska (fhr) wrote :

Seeing this bug on Ubuntu 20.04 and Ubuntu 22.04 as well, with both normal and HWE kernels.

To add some more information, this bug seems to randomly appear during the initial RAID 6 creation process as well, where the array is mounted but completely empty and not accessed - so it's likely to originate within the mdadm resync process itself, unrelated to other system I/O operations. Around 1 in 15 arrays will "freeze" during the initial resync process in my experience, so it is not that uncommon unfortunately.

The symptoms are always the same - at some point during resync, the speeds will radically drop to single MB/s levels and continue degrading over time until "echo active > /sys/block/mdX/md/array_state" is issued. A minute or two after running that, the speeds ramp back up to normal.

This bug seems unrelated to hardware configuration, as I've seen it happen across multiple systems with different CPU vendors, HBA models and with different HDD sizes and vendors. Systems which were previously stable under Ubuntu 18.04 started exhibiting freezes after upgrading to 20.04 as well.

It would also seem that disabling mdcheck_start and mdcheck_continue is not necessarily a magic bullet for fixing this; it certainly doesn't seem to help with freezing during the initial resync. I have also seen instances of mdadm scheduled resyncs freezing when triggered using the old cronjob method, with both the systemd services and timers masked off.

Dmesg from a freshly installed system where the initial resync "froze" approximately 15 hours after the array was created:
mdadm --create /dev/md1 --level=6 --raid-devices=6 /dev/sd[cdefgh]
mkfs.ext4 /dev/md1
mount -o errors=remount-ro /dev/md1 /srv

[Jul 4 00:18] md/raid:md1: not clean -- starting background reconstruction
[ +0.000062] md/raid:md1: device sdh operational as raid disk 5
[ +0.000002] md/raid:md1: device sdg operational as raid disk 4
[ +0.000001] md/raid:md1: device sdf operational as raid disk 3
[ +0.000001] md/raid:md1: device sde operational as raid disk 2
[ +0.000001] md/raid:md1: device sdd operational as raid disk 1
[ +0.000001] md/raid:md1: device sdc operational as raid disk 0
[ +0.002104] md/raid:md1: raid level 6 active with 6 out of 6 devices, algorithm 2
[ +0.048241] md1: detected capacity change from 0 to 72000290684928
[ +0.000066] md: resync of RAID array md1
[Jul 4 00:32] EXT4-fs (md1): mounted filesystem with ordered data mode. Opts: errors=remount-ro
[Jul 4 02:39] perf: interrupt took too long (2522 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
[Jul 4 04:36] perf: interrupt took too long (3155 > 3152), lowering kernel.perf_event_max_sample_rate to 63250
[Jul 4 15:22] INFO: task md1_raid6:5688 blocked for more than 120 seconds.
[ +0.000059] Not tainted 5.4.0-153-generic #170-Ubuntu
[ +0.000032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.000045] md1_raid6 D 0 5688 2 0x80004000
[ +0.000004] Call Trace:
[ +0.000011] __schedule+0x2e3/0x740
[ +0.000005] schedule+0x42/0xb0
[ +0.000011] raid5d+0x3e6/0x5f0 [raid456]
[ +0.000005] ? try_to_del_timer_sync+0x54/0x80
[ +0.000005] ? schedule_timeout+0x92/0...

Revision history for this message
brjhaverkamp (bert-bertenselena) wrote :

I can't add much in terms of data, but this is a +1. My symptoms were virtually identical to Chad Wagner's.
This happened on July 1st 2023 on my machine, the day the check started. It got stuck at 98%.
The only cure I could find was to reboot the system.

Revision history for this message
Brian Corriveau (bcorrive) wrote :

Unfortunately I have now seen this bug happen with a RAID data-check kicked off both by systemd and by cron, so I +1 Filip's comment #42.

In my previous post I said I was switching over to a cron-triggered raid check, which I did, but it seems I have run into the exact same problem again. I got ten 120-second delays logged in syslog and then the whole system became unreachable.

The data-check started about an hour before the delays were logged.

This shows the first 120-second delay that was logged.

Jul 31 21:00:01 server1 kernel: [168797.457900] md: data-check of RAID array md0
...
Jul 31 22:20:46 server1 kernel: [173642.734918] INFO: task kswapd0:190 blocked for more than 120 seconds.
Jul 31 22:20:46 server1 kernel: [173642.742246] Tainted: G O 5.15.0-75-generic #82-Ubuntu
Jul 31 22:20:46 server1 kernel: [173642.750039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 31 22:20:46 server1 kernel: [173642.758894] task:kswapd0 state:D stack: 0 pid: 190 ppid: 2 flags:0x00004000
Jul 31 22:20:46 server1 kernel: [173642.758906] Call Trace:
Jul 31 22:20:46 server1 kernel: [173642.758910] <TASK>
Jul 31 22:20:46 server1 kernel: [173642.758916] __schedule+0x24e/0x590
Jul 31 22:20:46 server1 kernel: [173642.758935] schedule+0x69/0x110
Jul 31 22:20:46 server1 kernel: [173642.758945] md_write_start.part.0+0x174/0x220
Jul 31 22:20:46 server1 kernel: [173642.758957] ? wait_woken+0x70/0x70
Jul 31 22:20:46 server1 kernel: [173642.758969] md_write_start+0x14/0x30
Jul 31 22:20:46 server1 kernel: [173642.758980] raid5_make_request+0x77/0x540 [raid456]
Jul 31 22:20:46 server1 kernel: [173642.759004] ? wait_woken+0x70/0x70
Jul 31 22:20:46 server1 kernel: [173642.759012] md_handle_request+0x12d/0x1b0
Jul 31 22:20:46 server1 kernel: [173642.759020] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759032] ? __blk_queue_split+0xfe/0x210
Jul 31 22:20:46 server1 kernel: [173642.759043] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759052] md_submit_bio+0x76/0xc0
Jul 31 22:20:46 server1 kernel: [173642.759060] __submit_bio+0x1a5/0x220
Jul 31 22:20:46 server1 kernel: [173642.759072] ? __mod_memcg_lruvec_state+0x63/0xe0
Jul 31 22:20:46 server1 kernel: [173642.759082] __submit_bio_noacct+0x85/0x200
Jul 31 22:20:46 server1 kernel: [173642.759093] submit_bio_noacct+0x4e/0x120
Jul 31 22:20:46 server1 kernel: [173642.759102] ? radix_tree_lookup+0xd/0x20
Jul 31 22:20:46 server1 kernel: [173642.759116] ? bio_associate_blkg_from_css+0x1b2/0x310
Jul 31 22:20:46 server1 kernel: [173642.759124] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759133] ? bio_add_page+0x68/0x90
Jul 31 22:20:46 server1 kernel: [173642.759139] submit_bio+0x4a/0x130
Jul 31 22:20:46 server1 kernel: [173642.759149] __swap_writepage+0x1aa/0x440
Jul 31 22:20:46 server1 kernel: [173642.759160] ? __frontswap_store+0x75/0x120
Jul 31 22:20:46 server1 kernel: [173642.759167] swap_writepage+0x32/0xa0
Jul 31 22:20:46 server1 kernel: [173642.759177] pageout+0x100/0x330
Jul 31 22:20:46 server1 kernel: [173642.759189] shrink_page_list+0xbd9/0xf00
Jul 31 22:20:46 server1 kernel: [173...


Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed-hwe-5.15 (Ubuntu):
status: New → Confirmed
Revision history for this message
Brian Corriveau (bcorrive) wrote :

Kernel 5.15.0-75 on Ubuntu 22.04, server1 and server2

Just hit this bug with a different server still running systemd to start the RAID check.

The logs indicate there may be more than one RAID check bug out there:
 - a bug with pausing the raid check
 - a bug with the raid check when it hasn't been paused

In the case of my server2, the raid check was started, then 6 hours later it was interrupted as expected by systemd, but about 20 minutes later the RAID locked up. Pause bug?

Aug 6 18:19:42 server2 systemd[1]: Starting MD array scrubbing...
Aug 6 18:19:42 server2 root: mdcheck start checking /dev/md0
Aug 6 18:19:42 server2 kernel: [2619392.036055] md: data-check of RAID array md0
...
Aug 7 00:19:44 server2 kernel: [2640994.200542] md: md0: data-check interrupted.
... (minutes later the problem happens after the check is supposedly interrupted)

Aug 7 00:23:03 server2: [2641193.425737] INFO: task jbd2/dm-0-8:1904 blocked for more than 120 seconds.
Aug 7 00:23:03 server2 kernel: [2641193.433644] Tainted: G OE 5.15.0-75-generic #82-Ubuntu
Aug 7 00:23:03 server2 kernel: [2641193.441533] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 7 00:23:03 server2 kernel: [2641193.450490] task:jbd2/dm-0-8 state:D stack: 0 pid: 1904 ppid: 2 flags:0x00004000
Aug 7 00:23:03 server2 kernel: [2641193.450503] Call Trace:
Aug 7 00:23:03 server2 kernel: [2641193.450507] <TASK>
Aug 7 00:23:03 server2 kernel: [2641193.450513] __schedule+0x24e/0x590
Aug 7 00:23:03 server2 kernel: [2641193.450531] schedule+0x69/0x110
Aug 7 00:23:03 server2 kernel: [2641193.450541] md_write_start.part.0+0x174/0x220
Aug 7 00:23:03 server2 kernel: [2641193.450553] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2641193.450575] md_write_start+0x14/0x30
Aug 7 00:23:03 server2 kernel: [2641193.450585] raid5_make_request+0x77/0x540 [raid456]
Aug 7 00:23:03 server2 kernel: [2641193.450609] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2641193.450616] md_handle_request+0x12d/0x1b0
Aug 7 00:23:03 server2 kernel: [2641193.450625] ? __blk_queue_split+0xfe/0x210
Aug 7 00:23:03 server2 kernel: [2641193.450636] md_submit_bio+0x76/0xc0
Aug 7 00:23:03 server2 kernel: [2641193.450643] __submit_bio+0x1a5/0x220
Aug 7 00:23:03 server2 kernel: [2641193.450654] ? mempool_alloc_slab+0x17/0x20
Aug 7 00:23:03 server2 kernel: [2641193.450665] __submit_bio_noacct+0x85/0x200
Aug 7 00:23:03 server2 kernel: [2641193.450674] ? kmem_cache_alloc+0x1ab/0x2f0
Aug 7 00:23:03 server2 kernel: [2641193.450687] submit_bio_noacct+0x4e/0x120
Aug 7 00:23:03 server2 kernel: [2641193.450698] submit_bio+0x4a/0x130
Aug 7 00:23:03 server2 kernel: [2641193.450707] submit_bh_wbc+0x18d/0x1c0
Aug 7 00:23:03 server2 kernel: [2641193.450713] submit_bh+0x13/0x20
Aug 7 00:23:03 server2 kernel: [2641193.450718] jbd2_journal_commit_transaction+0x861/0x17a0
Aug 7 00:23:03 server2 kernel: [2641193.450729] ? __update_idle_core+0x93/0x120
Aug 7 00:23:03 server2 kernel: [2641193.450744] kjournald2+0xa9/0x280
Aug 7 00:23:03 server2 kernel: [2641193.450750] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2...


Revision history for this message
Silas Horton (shorton5) wrote :

On Ubuntu 22.04, 5.15.0-83-generic #92-Ubuntu, our storage system ran into this bug. mdcheck ran as scheduled on the 1st day of the month and then hung 6 hours later.

Oct 1 06:52:13 server1 systemd[1]: Starting MD array scrubbing...
Oct 1 06:52:13 server1 root: mdcheck start checking /dev/md0
Oct 1 06:52:13 server1 kernel: [2129098.393495] md: data-check of RAID array md0

Oct 1 12:57:49 server1 kernel: [2151034.623372] INFO: task dmcrypt_write/2:1783 blocked for more than 241 seconds.
Oct 1 12:57:49 server1 kernel: [2151034.623446] Tainted: G S 5.15.0-83-generic #92-Ubuntu
Oct 1 12:57:49 server1 kernel: [2151034.623498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 1 12:57:49 server1 kernel: [2151034.623559] task:dmcrypt_write/2 state:D stack: 0 pid: 1783 ppid: 2 flags:0x00004000
Oct 1 12:57:49 server1 kernel: [2151034.623566] Call Trace:
Oct 1 12:57:49 server1 kernel: [2151034.623570] <TASK>
Oct 1 12:57:49 server1 kernel: [2151034.623574] __schedule+0x24e/0x590
Oct 1 12:57:49 server1 kernel: [2151034.623585] ? __schedule+0x256/0x590
Oct 1 12:57:49 server1 kernel: [2151034.623590] schedule+0x69/0x110
Oct 1 12:57:49 server1 kernel: [2151034.623596] md_write_start.part.0+0x174/0x220
Oct 1 12:57:49 server1 kernel: [2151034.623601] ? wait_woken+0x70/0x70
Oct 1 12:57:49 server1 kernel: [2151034.623610] md_write_start+0x14/0x30
Oct 1 12:57:49 server1 kernel: [2151034.623615] raid5_make_request+0x77/0x540 [raid456]
Oct 1 12:57:49 server1 kernel: [2151034.623633] ? cgroup_rstat_updated+0x11c/0x1e0
Oct 1 12:57:49 server1 kernel: [2151034.623642] ? wait_woken+0x70/0x70
Oct 1 12:57:49 server1 kernel: [2151034.623648] md_handle_request+0x12d/0x1b0
Oct 1 12:57:49 server1 kernel: [2151034.623657] ? submit_bio_checks+0x1a5/0x560
Oct 1 12:57:49 server1 kernel: [2151034.623664] md_submit_bio+0x76/0xc0
Oct 1 12:57:49 server1 kernel: [2151034.623670] __submit_bio+0x1a5/0x220
Oct 1 12:57:49 server1 kernel: [2151034.623675] ? psi_task_switch+0xc6/0x220
Oct 1 12:57:49 server1 kernel: [2151034.623682] __submit_bio_noacct+0x85/0x200
Oct 1 12:57:49 server1 kernel: [2151034.623687] submit_bio_noacct+0x4e/0x120
Oct 1 12:57:49 server1 kernel: [2151034.623691] ? schedule+0x69/0x110
Oct 1 12:57:49 server1 kernel: [2151034.623698] dmcrypt_write+0x104/0x130 [dm_crypt]
Oct 1 12:57:49 server1 kernel: [2151034.623708] ? crypt_ctr+0x600/0x600 [dm_crypt]
Oct 1 12:57:49 server1 kernel: [2151034.623715] kthread+0x12a/0x150
Oct 1 12:57:49 server1 kernel: [2151034.623723] ? set_kthread_struct+0x50/0x50
Oct 1 12:57:49 server1 kernel: [2151034.623730] ret_from_fork+0x22/0x30
Oct 1 12:57:49 server1 kernel: [2151034.623739] </TASK>
Oct 1 12:57:49 server1 kernel: [2151034.623778] INFO: task mdcheck:2323903 blocked for more than 241 seconds.
Oct 1 12:57:49 server1 kernel: [2151034.623833] Tainted: G S 5.15.0-83-generic #92-Ubuntu
Oct 1 12:57:49 server1 kernel: [2151034.625482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 1 12:57:49 server1 kernel: [2151034.627098] task:mdcheck state:D stack: 0 pid:232390...

Revision history for this message
Patrick Hampson (patrickhampson) wrote :

Noticed this issue on Ubuntu 20.04 with an md raid device. The system exhibited the same behavior other users have noted: high CPU usage and the terminal locking up until the system is rebooted.

[14715252.569157] INFO: task md1_raid4:1763945 blocked for more than 120 seconds.
[14715252.570228] Not tainted 5.4.0-146-generic #163-Ubuntu
[14715252.571277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[14715252.572347] md1_raid4 D 0 1763945 2 0x80004000
[14715252.572357] Call Trace:
[14715252.572360] __schedule+0x2e3/0x740
[14715252.572363] schedule+0x42/0xb0
[14715252.572369] raid5d+0x3e6/0x5f0 [raid456]
[14715252.572376] ? schedule_timeout+0x10e/0x160
[14715252.572381] ? __wake_up_pollfree+0x40/0x40
[14715252.572384] md_thread+0x97/0x160
[14715252.572392] ? __wake_up_pollfree+0x40/0x40
[14715252.572394] kthread+0x104/0x140
[14715252.572399] ? md_start_sync+0x60/0x60
[14715252.572403] ? kthread_park+0x90/0x90
[14715252.572405] ret_from_fork+0x35/0x40
[14715252.572430] INFO: task kworker/u64:1:3189415 blocked for more than 120 seconds.

Revision history for this message
Jan Kratochvil (jan-kratochvil) wrote :

Linux version 6.4.12-200.fc38.x86_64 (mockbuild@30894952d3244f1ab967aeda9ed417f6) (gcc (GCC) 13.2.1 20230728 (Red Hat 13.2.1-1), GNU ld version 2.39-9.fc38) #1 SMP PREEMPT_DYNAMIC Wed Aug 23 17:46:49 UTC 2023

    230 ? I< 0:00 \_ [md]
   1377 ? S 6:15 \_ [md0_raid1]
   1565 ? D 4955:37 \_ [md4_raid6]
2772538 ? DN 111:11 \_ [md4_resync]

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sde2[0]
      13671170048 blocks super 1.2 [1/1] [U]
      bitmap: 1/102 pages [4KB], 65536KB chunk

md4 : active raid6 sdd1[2] sdb1[5] sde1[4] sdc1[1]
      11720779776 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      [=================>...] check = 89.0% (5217054876/5860389888) finish=852549.6min speed=12K/sec
      bitmap: 9/44 pages [36KB], 65536KB chunk

unused devices: <none>
# cat /proc/2772538/stack # md4_resync
[<0>] raid5_get_active_stripe+0x271/0x540 [raid456]
[<0>] raid5_sync_request+0x3ad/0x3d0 [raid456]
[<0>] md_do_sync+0x7be/0x11c0
[<0>] md_thread+0xae/0x190
[<0>] kthread+0xe8/0x120
[<0>] ret_from_fork+0x2c/0x50
# cat /proc/1565/stack # md4_raid6
[<0>] raid5d+0x524/0x750 [raid456]
[<0>] md_thread+0xae/0x190
[<0>] kthread+0xe8/0x120
[<0>] ret_from_fork+0x2c/0x50

Worked around by:
# cat /sys/block/md4/md/array_state
write-pending
# echo active > /sys/block/md4/md/array_state
# cat /sys/block/md4/md/array_state
active

Revision history for this message
Lucas Magnien (haeiven) wrote :

Hello, I got the same issue with Ubuntu 22.04.3 LTS (GNU/Linux 5.15.0-91-generic x86_64). I also tried the 5.19 kernel and got the same problem.

Jan 7 02:28:26 cache4 systemd[1]: Starting MD array scrubbing...
Jan 7 02:28:26 cache4 root: mdcheck start checking /dev/md0
Jan 7 08:28:44 cache4 kernel: [2914434.326024] md: md0: data-check interrupted.
Jan 7 08:32:08 cache4 kernel: [2914638.397357] INFO: task jbd2/md0-8:1337 blocked for more than 120 seconds.
Jan 7 08:32:08 cache4 kernel: [2914638.397420] Not tainted 5.15.0-91-generic #99-Ubuntu
Jan 7 08:32:08 cache4 kernel: [2914638.397457] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 7 08:32:08 cache4 kernel: [2914638.397505] task:jbd2/md0-8 state:D stack: 0 pid: 1337 ppid: 2 flags:0x00004000
Jan 7 08:32:08 cache4 kernel: [2914638.397512] Call Trace:
Jan 7 08:32:08 cache4 kernel: [2914638.397515] <TASK>
Jan 7 08:32:08 cache4 kernel: [2914638.397520] __schedule+0x24e/0x590
Jan 7 08:32:08 cache4 kernel: [2914638.397530] schedule+0x69/0x110
Jan 7 08:32:08 cache4 kernel: [2914638.397535] md_write_start.part.0+0x174/0x220
Jan 7 08:32:08 cache4 kernel: [2914638.397540] ? wait_woken+0x70/0x70
Jan 7 08:32:08 cache4 kernel: [2914638.397547] md_write_start+0x14/0x30
Jan 7 08:32:08 cache4 kernel: [2914638.397553] raid5_make_request+0x77/0x540 [raid456]
Jan 7 08:32:08 cache4 kernel: [2914638.397566] ? jbd2_transaction_committed+0x1b/0x60
Jan 7 08:32:08 cache4 kernel: [2914638.397573] ? ext4_set_iomap+0x5a/0x1d0
Jan 7 08:32:08 cache4 kernel: [2914638.397579] ? wait_woken+0x70/0x70
Jan 7 08:32:08 cache4 kernel: [2914638.397584] md_handle_request+0x12d/0x1b0
Jan 7 08:32:08 cache4 kernel: [2914638.397589] ? submit_bio_checks+0x1a5/0x560
Jan 7 08:32:08 cache4 kernel: [2914638.397595] md_submit_bio+0x76/0xc0
Jan 7 08:32:08 cache4 kernel: [2914638.397600] __submit_bio+0x1a5/0x220
Jan 7 08:32:08 cache4 kernel: [2914638.397603] ? mempool_alloc_slab+0x17/0x20
Jan 7 08:32:08 cache4 kernel: [2914638.397611] __submit_bio_noacct+0x85/0x200
Jan 7 08:32:08 cache4 kernel: [2914638.397614] ? kmem_cache_alloc+0x1ab/0x2f0
Jan 7 08:32:08 cache4 kernel: [2914638.397619] submit_bio_noacct+0x4e/0x120
Jan 7 08:32:08 cache4 kernel: [2914638.397623] submit_bio+0x4a/0x130
Jan 7 08:32:08 cache4 kernel: [2914638.397627] submit_bh_wbc+0x18d/0x1c0
Jan 7 08:32:08 cache4 kernel: [2914638.397632] submit_bh+0x13/0x20
Jan 7 08:32:08 cache4 kernel: [2914638.397635] jbd2_journal_commit_transaction+0x861/0x17a0
Jan 7 08:32:08 cache4 kernel: [2914638.397640] ? __update_idle_core+0x93/0x120
Jan 7 08:32:08 cache4 kernel: [2914638.397649] kjournald2+0xa9/0x280
Jan 7 08:32:08 cache4 kernel: [2914638.397653] ? wait_woken+0x70/0x70
Jan 7 08:32:08 cache4 kernel: [2914638.397657] ? load_superblock.part.0+0xc0/0xc0
Jan 7 08:32:08 cache4 kernel: [2914638.397662] kthread+0x12a/0x150
Jan 7 08:32:08 cache4 kernel: [2914638.397667] ? set_kthread_struct+0x50/0x50
Jan 7 08:32:08 cache4 kernel: [2914638.397672] ret_from_fork+0x22/0x30
Jan 7 08:32:08 cache4 kernel: [2914638.397680] </TASK>

# cat /sys/block/md0/md/array_state
write-pending

This is happe...


Revision history for this message
alex (iwram) wrote :

Hi. I have the same problem on two raid5 arrays assembled into a stripe.
The total volume of the array is 97 TB, with 16 GB free.
Do other users hitting this bug also have little free space?
I couldn't reproduce this condition on a test bench under synthetic load.
Ubuntu 22.04.3 LTS
kernel 5.15.0-72-generic

Revision history for this message
Brian Corriveau (bcorrive) wrote :

I was able to reproduce the hang problem using the script mdhang.sh (modified for my raid device location) on 22.04 with kernel 5.15 several times, usually within 10-15 minutes.

Using the 22.04 HWE kernel 6.5 I was not able to reproduce the problem with a 25+ hour run of the same script. It seems something has been fixed in the newer HWE kernel.
