Kernel 5.15.0-75 on Ubuntu 22.04, server1 and server2.

Just hit this bug on a different server that is still using systemd to start the RAID check. The logs suggest there may be more than one RAID check bug out there:

- a bug with pausing the RAID check
- a bug with the RAID check when it hasn't been paused

On my server2 the RAID check was started, then 6 hours later it was interrupted as expected by systemd, but a few minutes later the RAID locked up. Pause bug?

Aug  6 18:19:42 server2 systemd[1]: Starting MD array scrubbing...
Aug  6 18:19:42 server2 root: mdcheck start checking /dev/md0
Aug  6 18:19:42 server2 kernel: [2619392.036055] md: data-check of RAID array md0
...
Aug  7 00:19:44 server2 kernel: [2640994.200542] md: md0: data-check interrupted.
...

(minutes later, after the check was supposedly interrupted, the problem happens)

Aug  7 00:23:03 server2: [2641193.425737] INFO: task jbd2/dm-0-8:1904 blocked for more than 120 seconds.
Aug  7 00:23:03 server2 kernel: [2641193.433644]       Tainted: G           OE     5.15.0-75-generic #82-Ubuntu
Aug  7 00:23:03 server2 kernel: [2641193.441533] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  7 00:23:03 server2 kernel: [2641193.450490] task:jbd2/dm-0-8     state:D stack:    0 pid: 1904 ppid:     2 flags:0x00004000
Aug  7 00:23:03 server2 kernel: [2641193.450503] Call Trace:
Aug  7 00:23:03 server2 kernel: [2641193.450507]
Aug  7 00:23:03 server2 kernel: [2641193.450513]  __schedule+0x24e/0x590
Aug  7 00:23:03 server2 kernel: [2641193.450531]  schedule+0x69/0x110
Aug  7 00:23:03 server2 kernel: [2641193.450541]  md_write_start.part.0+0x174/0x220
Aug  7 00:23:03 server2 kernel: [2641193.450553]  ? wait_woken+0x70/0x70
Aug  7 00:23:03 server2 kernel: [2641193.450575]  md_write_start+0x14/0x30
Aug  7 00:23:03 server2 kernel: [2641193.450585]  raid5_make_request+0x77/0x540 [raid456]
Aug  7 00:23:03 server2 kernel: [2641193.450609]  ? wait_woken+0x70/0x70
Aug  7 00:23:03 server2 kernel: [2641193.450616]  md_handle_request+0x12d/0x1b0
Aug  7 00:23:03 server2 kernel: [2641193.450625]  ? __blk_queue_split+0xfe/0x210
Aug  7 00:23:03 server2 kernel: [2641193.450636]  md_submit_bio+0x76/0xc0
Aug  7 00:23:03 server2 kernel: [2641193.450643]  __submit_bio+0x1a5/0x220
Aug  7 00:23:03 server2 kernel: [2641193.450654]  ? mempool_alloc_slab+0x17/0x20
Aug  7 00:23:03 server2 kernel: [2641193.450665]  __submit_bio_noacct+0x85/0x200
Aug  7 00:23:03 server2 kernel: [2641193.450674]  ? kmem_cache_alloc+0x1ab/0x2f0
Aug  7 00:23:03 server2 kernel: [2641193.450687]  submit_bio_noacct+0x4e/0x120
Aug  7 00:23:03 server2 kernel: [2641193.450698]  submit_bio+0x4a/0x130
Aug  7 00:23:03 server2 kernel: [2641193.450707]  submit_bh_wbc+0x18d/0x1c0
Aug  7 00:23:03 server2 kernel: [2641193.450713]  submit_bh+0x13/0x20
Aug  7 00:23:03 server2 kernel: [2641193.450718]  jbd2_journal_commit_transaction+0x861/0x17a0
Aug  7 00:23:03 server2 kernel: [2641193.450729]  ? __update_idle_core+0x93/0x120
Aug  7 00:23:03 server2 kernel: [2641193.450744]  kjournald2+0xa9/0x280
Aug  7 00:23:03 server2 kernel: [2641193.450750]  ? wait_woken+0x70/0x70
Aug  7 00:23:03 server2 kernel: [2641193.450757]  ? load_superblock.part.0+0xc0/0xc0
Aug  7 00:23:03 server2 kernel: [2641193.450764]  kthread+0x12a/0x150
Aug  7 00:23:03 server2 kernel: [2641193.450774]  ? set_kthread_struct+0x50/0x50
Aug  7 00:23:03 server2 kernel: [2641193.450783]  ret_from_fork+0x22/0x30
Aug  7 00:23:03 server2 kernel: [2641193.450796]

In my comment #41 the RAID check started and ran into a problem BEFORE the pause (started by systemd):

Jun  4 09:04:25 server1 kernel: [1609273.391583] md: data-check of RAID array md0

and the problem started here, BEFORE the check had run for 6 hours, which is when it would have been paused:

Jun  4 12:48:11 server1 kernel: [1622699.548591] INFO: task md0_raid5:406 blocked for more than 120 seconds.
Jun  4 12:48:11 server1 kernel: [1622699.556202]       Tainted: G           OE     5.15.0-67-generic #74-Ubuntu
Jun  4 12:48:11 server1 kernel: [1622699.564101] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  4 12:48:11 server1 kernel: [1622699.573063] task:md0_raid5       state:D stack:    0 pid:  406 ppid:     2 flags:0x00004000
Jun  4 12:48:11 server1 kernel: [1622699.573077] Call Trace:
Jun  4 12:48:11 server1 kernel: [1622699.573081]
Jun  4 12:48:11 server1 kernel: [1622699.573087]  __schedule+0x24e/0x590
Jun  4 12:48:11 server1 kernel: [1622699.573103]  schedule+0x69/0x110
Jun  4 12:48:11 server1 kernel: [1622699.573115]  raid5d+0x3d9/0x5f0 [raid456]
Jun  4 12:48:11 server1 kernel: [1622699.573140]  ? wait_woken+0x70/0x70
Jun  4 12:48:11 server1 kernel: [1622699.573151]  md_thread+0xad/0x170

My comment #44 for server1 was for a RAID check started by cron, which of course has no pause. With server2 it looks like the pause left some lock on the RAID, which caused the server to become unreachable. With server1 the problem has happened twice now, once with systemd and once with cron, and neither time was a pause involved. Both servers have RAID5 arrays of about 30T, so relatively large.
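For context on what "interrupted as expected by systemd" means above: as I understand it, the mdcheck units drive the array's md/sync_action sysfs file, writing "check" to start the scrub and "idle" after the time limit to pause it, which is what produces the "md: md0: data-check interrupted." line. A minimal sketch (device name md0 assumed; the commands are printed rather than executed so this is safe to run anywhere):

```shell
#!/bin/sh
# Sketch of the sysfs interface the mdcheck script drives (md0 is an
# assumed device name; adjust for your array).
sync_action_cmd() {
    # Print the shell command that would set the given sync_action state
    # for the given md device, instead of actually writing to sysfs.
    printf 'echo %s > /sys/block/%s/md/sync_action\n' "$1" "$2"
}

# mdcheck starts the data-check...
sync_action_cmd check md0
# ...and when the time limit is reached, pauses it, which the kernel
# logs as "md: md0: data-check interrupted.":
sync_action_cmd idle md0
```

The hang on server2 happened shortly after that "idle" write would have been issued, which is why I suspect the pause path.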