mdcheck/checkarray locks up system

Bug #2034450 reported by Brian Corriveau
This bug affects 1 person

Affects: mdadm (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

We hit this bug in Ubuntu 22.04 with kernel 5.15.0-67-generic on a 3-drive RAID5.
We did not see this bug with 18.04.

The system becomes unresponsive and rebooting seems to be the only way to fix it.

There seem to be two bugs (see the sketch after this list):
- a bug with pausing the raid check (check triggered by systemd)
- a bug with the raid check when it hasn't been paused (check triggered by systemd, or cron)
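For reference, both trigger paths end up driving the same kernel interface. This is a minimal sketch of starting, watching, and interrupting a data-check by hand, assuming the array is /dev/md0 as on our systems:

  # start a data-check on md0 (the same action mdcheck/checkarray request)
  echo check > /sys/block/md0/md/sync_action
  # watch progress
  cat /proc/mdstat
  cat /sys/block/md0/md/sync_completed
  # interrupt the check
  echo idle > /sys/block/md0/md/sync_action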

These are the logs from the first time we saw it, shortly after upgrading from 18.04 all the way to 22.04.

Jun 4 12:48:11 server1 kernel: [1622699.548591] INFO: task md0_raid5:406 blocked for more than 120 seconds.
Jun 4 12:48:11 server1 kernel: [1622699.556202] Tainted: G OE 5.15.0-67-generic #74-Ubuntu
Jun 4 12:48:11 server1 kernel: [1622699.564101] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 4 12:48:11 server1 kernel: [1622699.573063] task:md0_raid5 state:D stack: 0 pid: 406 ppid: 2 flags:0x00004000
Jun 4 12:48:11 server1 kernel: [1622699.573077] Call Trace:
Jun 4 12:48:11 server1 kernel: [1622699.573081] <TASK>
Jun 4 12:48:11 server1 kernel: [1622699.573087] __schedule+0x24e/0x590
Jun 4 12:48:11 server1 kernel: [1622699.573103] schedule+0x69/0x110
Jun 4 12:48:11 server1 kernel: [1622699.573115] raid5d+0x3d9/0x5f0 [raid456]
Jun 4 12:48:11 server1 kernel: [1622699.573140] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.573151] md_thread+0xad/0x170
Jun 4 12:48:11 server1 kernel: [1622699.573162] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.573169] ? md_write_inc+0x60/0x60
Jun 4 12:48:11 server1 kernel: [1622699.573176] kthread+0x12a/0x150
Jun 4 12:48:11 server1 kernel: [1622699.573187] ? set_kthread_struct+0x50/0x50
Jun 4 12:48:11 server1 kernel: [1622699.573197] ret_from_fork+0x22/0x30
Jun 4 12:48:11 server1 kernel: [1622699.573212] </TASK>
Jun 4 12:48:11 server1 kernel: [1622699.573231] INFO: task jbd2/dm-0-8:1375 blocked for more than 120 seconds.
Jun 4 12:48:11 server1 kernel: [1622699.581119] Tainted: G OE 5.15.0-67-generic #74-Ubuntu
Jun 4 12:48:11 server1 kernel: [1622699.589004] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 4 12:48:11 server1 kernel: [1622699.597959] task:jbd2/dm-0-8 state:D stack: 0 pid: 1375 ppid: 2 flags:0x00004000
Jun 4 12:48:11 server1 kernel: [1622699.597968] Call Trace:
Jun 4 12:48:11 server1 kernel: [1622699.597970] <TASK>
Jun 4 12:48:11 server1 kernel: [1622699.597973] __schedule+0x24e/0x590
Jun 4 12:48:11 server1 kernel: [1622699.597984] schedule+0x69/0x110
Jun 4 12:48:11 server1 kernel: [1622699.597992] md_write_start.part.0+0x174/0x220
Jun 4 12:48:11 server1 kernel: [1622699.598002] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.598024] md_write_start+0x14/0x30
Jun 4 12:48:11 server1 kernel: [1622699.598032] raid5_make_request+0x77/0x540 [raid456]
Jun 4 12:48:11 server1 kernel: [1622699.598051] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.598058] md_handle_request+0x12d/0x1b0
Jun 4 12:48:11 server1 kernel: [1622699.598065] ? __blk_queue_split+0xfe/0x200
Jun 4 12:48:11 server1 kernel: [1622699.598075] md_submit_bio+0x71/0xc0
Jun 4 12:48:11 server1 kernel: [1622699.598082] __submit_bio+0x1a5/0x220
Jun 4 12:48:11 server1 kernel: [1622699.598091] ? mempool_alloc_slab+0x17/0x20
Jun 4 12:48:11 server1 kernel: [1622699.598102] __submit_bio_noacct+0x85/0x200
Jun 4 12:48:11 server1 kernel: [1622699.598110] ? kmem_cache_alloc+0x1ab/0x2f0
Jun 4 12:48:11 server1 kernel: [1622699.598122] submit_bio_noacct+0x4e/0x120
Jun 4 12:48:11 server1 kernel: [1622699.598131] submit_bio+0x4a/0x130
Jun 4 12:48:11 server1 kernel: [1622699.598139] submit_bh_wbc+0x18d/0x1c0
Jun 4 12:48:11 server1 kernel: [1622699.598151] submit_bh+0x13/0x20
Jun 4 12:48:11 server1 kernel: [1622699.598160] jbd2_journal_commit_transaction+0x861/0x17a0
Jun 4 12:48:11 server1 kernel: [1622699.598170] ? __update_idle_core+0x93/0x120
Jun 4 12:48:11 server1 kernel: [1622699.598184] kjournald2+0xa9/0x280
Jun 4 12:48:11 server1 kernel: [1622699.598190] ? wait_woken+0x70/0x70
Jun 4 12:48:11 server1 kernel: [1622699.598197] ? load_superblock.part.0+0xc0/0xc0
Jun 4 12:48:11 server1 kernel: [1622699.598202] kthread+0x12a/0x150
Jun 4 12:48:11 server1 kernel: [1622699.598210] ? set_kthread_struct+0x50/0x50
Jun 4 12:48:11 server1 kernel: [1622699.598218] ret_from_fork+0x22/0x30
Jun 4 12:48:11 server1 kernel: [1622699.598229] </TASK>

Following some advice from
 https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe-5.11/+bug/1942935

I changed over to having cron kick off the raid check, but we still hit this bug.
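For completeness, this is roughly what the switch looked like. The unit and script names below are from Ubuntu's mdadm packaging as I remember it, so treat them as assumptions that may differ on other releases:

  # stop systemd from scheduling the check
  sudo systemctl disable --now mdcheck_start.timer mdcheck_continue.timer
  # the cron job shipped in /etc/cron.d/mdadm then runs checkarray, roughly:
  sudo /usr/share/mdadm/checkarray --cron --all --idle --quiet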

These are the logs for the problem happening with the raid check started by cron:

Jul 31 21:00:01 server1 kernel: [168797.457900] md: data-check of RAID array md0
...
Jul 31 22:20:46 server1 kernel: [173642.734918] INFO: task kswapd0:190 blocked for more than 120 seconds.
Jul 31 22:20:46 server1 kernel: [173642.742246] Tainted: G O 5.15.0-75-generic #82-Ubuntu
Jul 31 22:20:46 server1 kernel: [173642.750039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 31 22:20:46 server1 kernel: [173642.758894] task:kswapd0 state:D stack: 0 pid: 190 ppid: 2 flags:0x00004000
Jul 31 22:20:46 server1 kernel: [173642.758906] Call Trace:
Jul 31 22:20:46 server1 kernel: [173642.758910] <TASK>
Jul 31 22:20:46 server1 kernel: [173642.758916] __schedule+0x24e/0x590
Jul 31 22:20:46 server1 kernel: [173642.758935] schedule+0x69/0x110
Jul 31 22:20:46 server1 kernel: [173642.758945] md_write_start.part.0+0x174/0x220
Jul 31 22:20:46 server1 kernel: [173642.758957] ? wait_woken+0x70/0x70
Jul 31 22:20:46 server1 kernel: [173642.758969] md_write_start+0x14/0x30
Jul 31 22:20:46 server1 kernel: [173642.758980] raid5_make_request+0x77/0x540 [raid456]
Jul 31 22:20:46 server1 kernel: [173642.759004] ? wait_woken+0x70/0x70
Jul 31 22:20:46 server1 kernel: [173642.759012] md_handle_request+0x12d/0x1b0
Jul 31 22:20:46 server1 kernel: [173642.759020] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759032] ? __blk_queue_split+0xfe/0x210
Jul 31 22:20:46 server1 kernel: [173642.759043] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759052] md_submit_bio+0x76/0xc0
Jul 31 22:20:46 server1 kernel: [173642.759060] __submit_bio+0x1a5/0x220
Jul 31 22:20:46 server1 kernel: [173642.759072] ? __mod_memcg_lruvec_state+0x63/0xe0
Jul 31 22:20:46 server1 kernel: [173642.759082] __submit_bio_noacct+0x85/0x200
Jul 31 22:20:46 server1 kernel: [173642.759093] submit_bio_noacct+0x4e/0x120
Jul 31 22:20:46 server1 kernel: [173642.759102] ? radix_tree_lookup+0xd/0x20
Jul 31 22:20:46 server1 kernel: [173642.759116] ? bio_associate_blkg_from_css+0x1b2/0x310
Jul 31 22:20:46 server1 kernel: [173642.759124] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759133] ? bio_add_page+0x68/0x90
Jul 31 22:20:46 server1 kernel: [173642.759139] submit_bio+0x4a/0x130
Jul 31 22:20:46 server1 kernel: [173642.759149] __swap_writepage+0x1aa/0x440
Jul 31 22:20:46 server1 kernel: [173642.759160] ? __frontswap_store+0x75/0x120
Jul 31 22:20:46 server1 kernel: [173642.759167] swap_writepage+0x32/0xa0
Jul 31 22:20:46 server1 kernel: [173642.759177] pageout+0x100/0x330
Jul 31 22:20:46 server1 kernel: [173642.759189] shrink_page_list+0xbd9/0xf00
Jul 31 22:20:46 server1 kernel: [173642.759202] shrink_inactive_list+0x171/0x4c0
Jul 31 22:20:46 server1 kernel: [173642.759211] ? super_cache_count+0x7c/0x100
Jul 31 22:20:46 server1 kernel: [173642.759222] shrink_lruvec+0x2b6/0x3f0
Jul 31 22:20:46 server1 kernel: [173642.759233] ? shrink_slab+0x10a/0x120
Jul 31 22:20:46 server1 kernel: [173642.759242] ? shrink_slab+0x10a/0x120
Jul 31 22:20:46 server1 kernel: [173642.759251] shrink_node_memcgs+0x172/0x1d0
Jul 31 22:20:46 server1 kernel: [173642.759262] shrink_node+0x16e/0x600
Jul 31 22:20:46 server1 kernel: [173642.759272] balance_pgdat+0x36e/0x810
Jul 31 22:20:46 server1 kernel: [173642.759281] ? finish_task_switch.isra.0+0x7e/0x280
Jul 31 22:20:46 server1 kernel: [173642.759294] kswapd+0x10c/0x1c0
Jul 31 22:20:46 server1 kernel: [173642.759303] ? balance_pgdat+0x810/0x810
Jul 31 22:20:46 server1 kernel: [173642.759313] kthread+0x12a/0x150
Jul 31 22:20:46 server1 kernel: [173642.759322] ? set_kthread_struct+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759332] ret_from_fork+0x22/0x30
Jul 31 22:20:46 server1 kernel: [173642.759344] </TASK>
Jul 31 22:20:46 server1 kernel: [173642.759356] INFO: task md0_raid5:409 blocked for more than 120 seconds.
...

Within the same second there were nine more of these hung-task reports, then silence.

I have seen the problem happen before the systemd mdcheck pause, and also shortly after the pause was supposed to have happened, as shown below:

Aug 6 18:19:42 server2 systemd[1]: Starting MD array scrubbing...
Aug 6 18:19:42 server2 root: mdcheck start checking /dev/md0
Aug 6 18:19:42 server2 kernel: [2619392.036055] md: data-check of RAID array md0
...
Aug 7 00:19:44 server2 kernel: [2640994.200542] md: md0: data-check interrupted.
... (minutes later the problem happens after the check is supposedly interrupted)
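As I understand it, the pause works by mdcheck writing idle to sync_action and recording the position so a later run can resume, so after the "data-check interrupted" message the array should be quiet. A quick way to confirm that by hand (device name assumed, as above):

  # should report "idle" once the check has really stopped
  cat /sys/block/md0/md/sync_action
  # no "check = ...%" progress line should remain for md0
  grep -A 3 '^md0' /proc/mdstat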

Aug 7 00:23:03 server2: [2641193.425737] INFO: task jbd2/dm-0-8:1904 blocked for more than 120 seconds.
Aug 7 00:23:03 server2 kernel: [2641193.433644] Tainted: G OE 5.15.0-75-generic #82-Ubuntu
Aug 7 00:23:03 server2 kernel: [2641193.441533] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 7 00:23:03 server2 kernel: [2641193.450490] task:jbd2/dm-0-8 state:D stack: 0 pid: 1904 ppid: 2 flags:0x00004000
Aug 7 00:23:03 server2 kernel: [2641193.450503] Call Trace:
Aug 7 00:23:03 server2 kernel: [2641193.450507] <TASK>
Aug 7 00:23:03 server2 kernel: [2641193.450513] __schedule+0x24e/0x590
Aug 7 00:23:03 server2 kernel: [2641193.450531] schedule+0x69/0x110
Aug 7 00:23:03 server2 kernel: [2641193.450541] md_write_start.part.0+0x174/0x220
Aug 7 00:23:03 server2 kernel: [2641193.450553] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2641193.450575] md_write_start+0x14/0x30
Aug 7 00:23:03 server2 kernel: [2641193.450585] raid5_make_request+0x77/0x540 [raid456]
Aug 7 00:23:03 server2 kernel: [2641193.450609] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2641193.450616] md_handle_request+0x12d/0x1b0
Aug 7 00:23:03 server2 kernel: [2641193.450625] ? __blk_queue_split+0xfe/0x210
Aug 7 00:23:03 server2 kernel: [2641193.450636] md_submit_bio+0x76/0xc0
Aug 7 00:23:03 server2 kernel: [2641193.450643] __submit_bio+0x1a5/0x220
Aug 7 00:23:03 server2 kernel: [2641193.450654] ? mempool_alloc_slab+0x17/0x20
Aug 7 00:23:03 server2 kernel: [2641193.450665] __submit_bio_noacct+0x85/0x200
Aug 7 00:23:03 server2 kernel: [2641193.450674] ? kmem_cache_alloc+0x1ab/0x2f0
Aug 7 00:23:03 server2 kernel: [2641193.450687] submit_bio_noacct+0x4e/0x120
Aug 7 00:23:03 server2 kernel: [2641193.450698] submit_bio+0x4a/0x130
Aug 7 00:23:03 server2 kernel: [2641193.450707] submit_bh_wbc+0x18d/0x1c0
Aug 7 00:23:03 server2 kernel: [2641193.450713] submit_bh+0x13/0x20
Aug 7 00:23:03 server2 kernel: [2641193.450718] jbd2_journal_commit_transaction+0x861/0x17a0
Aug 7 00:23:03 server2 kernel: [2641193.450729] ? __update_idle_core+0x93/0x120
Aug 7 00:23:03 server2 kernel: [2641193.450744] kjournald2+0xa9/0x280
Aug 7 00:23:03 server2 kernel: [2641193.450750] ? wait_woken+0x70/0x70
Aug 7 00:23:03 server2 kernel: [2641193.450757] ? load_superblock.part.0+0xc0/0xc0
Aug 7 00:23:03 server2 kernel: [2641193.450764] kthread+0x12a/0x150
Aug 7 00:23:03 server2 kernel: [2641193.450774] ? set_kthread_struct+0x50/0x50
Aug 7 00:23:03 server2 kernel: [2641193.450783] ret_from_fork+0x22/0x30
Aug 7 00:23:03 server2 kernel: [2641193.450796] </TASK>

Tags: mdadm raid