Unfortunately I have now seen this bug happen with a RAID data-check kicked off both by the system and by cron, so I +1 on Filips's comment #42.
In my previous post I said I was switching over to a cron-triggered RAID check, which I did, but it seems I have run into the exact same problem again. I got ten 120-second hung-task delays logged in syslog, and then the whole system became unreachable.
The data-check started about an hour before the delays were logged.
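For context, the cron-triggered check mentioned above amounts to the usual sysfs trigger that the distro's periodic check scripts also use. A minimal sketch, assuming the array is md0 and an illustrative weekly schedule (run as root):

```shell
# Kick off an md data-check manually; this is what the periodic
# check jobs do under the hood.
echo check > /sys/block/md0/md/sync_action

# Watch the check's progress.
cat /proc/mdstat

# Illustrative crontab entry (in /etc/crontab syntax) for a
# weekly check, Sunday at 01:00:
# 0 1 * * 0 root echo check > /sys/block/md0/md/sync_action
```

Either trigger path ends up in the same kernel code, which is consistent with the hang appearing in both cases.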
This shows the first 120-second delay that was logged:
Jul 31 21:00:01 server1 kernel: [168797.457900] md: data-check of RAID array md0
...
Jul 31 22:20:46 server1 kernel: [173642.734918] INFO: task kswapd0:190 blocked for more than 120 seconds.
Jul 31 22:20:46 server1 kernel: [173642.742246] Tainted: G O 5.15.0-75-generic #82-Ubuntu
Jul 31 22:20:46 server1 kernel: [173642.750039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 31 22:20:46 server1 kernel: [173642.758894] task:kswapd0 state:D stack: 0 pid: 190 ppid: 2 flags:0x00004000
Jul 31 22:20:46 server1 kernel: [173642.758906] Call Trace:
Jul 31 22:20:46 server1 kernel: [173642.758910] <TASK>
Jul 31 22:20:46 server1 kernel: [173642.758916] __schedule+0x24e/0x590
Jul 31 22:20:46 server1 kernel: [173642.758935] schedule+0x69/0x110
Jul 31 22:20:46 server1 kernel: [173642.758945] md_write_start.part.0+0x174/0x220
Jul 31 22:20:46 server1 kernel: [173642.758957] ? wait_woken+0x70/0x70
Jul 31 22:20:46 server1 kernel: [173642.758969] md_write_start+0x14/0x30
Jul 31 22:20:46 server1 kernel: [173642.758980] raid5_make_request+0x77/0x540 [raid456]
Jul 31 22:20:46 server1 kernel: [173642.759004] ? wait_woken+0x70/0x70
Jul 31 22:20:46 server1 kernel: [173642.759012] md_handle_request+0x12d/0x1b0
Jul 31 22:20:46 server1 kernel: [173642.759020] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759032] ? __blk_queue_split+0xfe/0x210
Jul 31 22:20:46 server1 kernel: [173642.759043] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759052] md_submit_bio+0x76/0xc0
Jul 31 22:20:46 server1 kernel: [173642.759060] __submit_bio+0x1a5/0x220
Jul 31 22:20:46 server1 kernel: [173642.759072] ? __mod_memcg_lruvec_state+0x63/0xe0
Jul 31 22:20:46 server1 kernel: [173642.759082] __submit_bio_noacct+0x85/0x200
Jul 31 22:20:46 server1 kernel: [173642.759093] submit_bio_noacct+0x4e/0x120
Jul 31 22:20:46 server1 kernel: [173642.759102] ? radix_tree_lookup+0xd/0x20
Jul 31 22:20:46 server1 kernel: [173642.759116] ? bio_associate_blkg_from_css+0x1b2/0x310
Jul 31 22:20:46 server1 kernel: [173642.759124] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759133] ? bio_add_page+0x68/0x90
Jul 31 22:20:46 server1 kernel: [173642.759139] submit_bio+0x4a/0x130
Jul 31 22:20:46 server1 kernel: [173642.759149] __swap_writepage+0x1aa/0x440
Jul 31 22:20:46 server1 kernel: [173642.759160] ? __frontswap_store+0x75/0x120
Jul 31 22:20:46 server1 kernel: [173642.759167] swap_writepage+0x32/0xa0
Jul 31 22:20:46 server1 kernel: [173642.759177] pageout+0x100/0x330
Jul 31 22:20:46 server1 kernel: [173642.759189] shrink_page_list+0xbd9/0xf00
Jul 31 22:20:46 server1 kernel: [173642.759202] shrink_inactive_list+0x171/0x4c0
Jul 31 22:20:46 server1 kernel: [173642.759211] ? super_cache_count+0x7c/0x100
Jul 31 22:20:46 server1 kernel: [173642.759222] shrink_lruvec+0x2b6/0x3f0
Jul 31 22:20:46 server1 kernel: [173642.759233] ? shrink_slab+0x10a/0x120
Jul 31 22:20:46 server1 kernel: [173642.759242] ? shrink_slab+0x10a/0x120
Jul 31 22:20:46 server1 kernel: [173642.759251] shrink_node_memcgs+0x172/0x1d0
Jul 31 22:20:46 server1 kernel: [173642.759262] shrink_node+0x16e/0x600
Jul 31 22:20:46 server1 kernel: [173642.759272] balance_pgdat+0x36e/0x810
Jul 31 22:20:46 server1 kernel: [173642.759281] ? finish_task_switch.isra.0+0x7e/0x280
Jul 31 22:20:46 server1 kernel: [173642.759294] kswapd+0x10c/0x1c0
Jul 31 22:20:46 server1 kernel: [173642.759303] ? balance_pgdat+0x810/0x810
Jul 31 22:20:46 server1 kernel: [173642.759313] kthread+0x12a/0x150
Jul 31 22:20:46 server1 kernel: [173642.759322] ? set_kthread_struct+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759332] ret_from_fork+0x22/0x30
Jul 31 22:20:46 server1 kernel: [173642.759344] </TASK>
Jul 31 22:20:46 server1 kernel: [173642.759356] INFO: task md0_raid5:409 blocked for more than 120 seconds.
...
In the same second, nine more delays of this kind were logged, and then silence.