Comment 44 for bug 1942935

Revision history for this message
Brian Corriveau (bcorrive) wrote :

Unfortunately I have now have seen this bug happen with a data-check of RAID kicked off by system and by cron, so I +1 on Filips's comment #42

In my previous post I said I was switching over to cron triggered raid check, which I did, but it seems I have run into the exact same problem again. I got ten 120 second delays logged in syslog and then the whole system became unreachable.

The data-check started about an hour before the delays were logged.

This shows the first 120 delay that was logged.

Jul 31 21:00:01 server1 kernel: [168797.457900] md: data-check of RAID array md0
...
Jul 31 22:20:46 server1 kernel: [173642.734918] INFO: task kswapd0:190 blocked for more than 120 seconds.
Jul 31 22:20:46 server1 kernel: [173642.742246] Tainted: G O 5.15.0-75-generic #82-Ubuntu
Jul 31 22:20:46 server1 kernel: [173642.750039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 31 22:20:46 server1 kernel: [173642.758894] task:kswapd0 state:D stack: 0 pid: 190 ppid: 2 flags:0x00004000
Jul 31 22:20:46 server1 kernel: [173642.758906] Call Trace:
Jul 31 22:20:46 server1 kernel: [173642.758910] <TASK>
Jul 31 22:20:46 server1 kernel: [173642.758916] __schedule+0x24e/0x590
Jul 31 22:20:46 server1 kernel: [173642.758935] schedule+0x69/0x110
Jul 31 22:20:46 server1 kernel: [173642.758945] md_write_start.part.0+0x174/0x220
Jul 31 22:20:46 server1 kernel: [173642.758957] ? wait_woken+0x70/0x70
Jul 31 22:20:46 server1 kernel: [173642.758969] md_write_start+0x14/0x30
Jul 31 22:20:46 server1 kernel: [173642.758980] raid5_make_request+0x77/0x540 [raid456]
Jul 31 22:20:46 server1 kernel: [173642.759004] ? wait_woken+0x70/0x70
Jul 31 22:20:46 server1 kernel: [173642.759012] md_handle_request+0x12d/0x1b0
Jul 31 22:20:46 server1 kernel: [173642.759020] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759032] ? __blk_queue_split+0xfe/0x210
Jul 31 22:20:46 server1 kernel: [173642.759043] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759052] md_submit_bio+0x76/0xc0
Jul 31 22:20:46 server1 kernel: [173642.759060] __submit_bio+0x1a5/0x220
Jul 31 22:20:46 server1 kernel: [173642.759072] ? __mod_memcg_lruvec_state+0x63/0xe0
Jul 31 22:20:46 server1 kernel: [173642.759082] __submit_bio_noacct+0x85/0x200
Jul 31 22:20:46 server1 kernel: [173642.759093] submit_bio_noacct+0x4e/0x120
Jul 31 22:20:46 server1 kernel: [173642.759102] ? radix_tree_lookup+0xd/0x20
Jul 31 22:20:46 server1 kernel: [173642.759116] ? bio_associate_blkg_from_css+0x1b2/0x310
Jul 31 22:20:46 server1 kernel: [173642.759124] ? page_index+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759133] ? bio_add_page+0x68/0x90
Jul 31 22:20:46 server1 kernel: [173642.759139] submit_bio+0x4a/0x130
Jul 31 22:20:46 server1 kernel: [173642.759149] __swap_writepage+0x1aa/0x440
Jul 31 22:20:46 server1 kernel: [173642.759160] ? __frontswap_store+0x75/0x120
Jul 31 22:20:46 server1 kernel: [173642.759167] swap_writepage+0x32/0xa0
Jul 31 22:20:46 server1 kernel: [173642.759177] pageout+0x100/0x330
Jul 31 22:20:46 server1 kernel: [173642.759189] shrink_page_list+0xbd9/0xf00
Jul 31 22:20:46 server1 kernel: [173642.759202] shrink_inactive_list+0x171/0x4c0
Jul 31 22:20:46 server1 kernel: [173642.759211] ? super_cache_count+0x7c/0x100
Jul 31 22:20:46 server1 kernel: [173642.759222] shrink_lruvec+0x2b6/0x3f0
Jul 31 22:20:46 server1 kernel: [173642.759233] ? shrink_slab+0x10a/0x120
Jul 31 22:20:46 server1 kernel: [173642.759242] ? shrink_slab+0x10a/0x120
Jul 31 22:20:46 server1 kernel: [173642.759251] shrink_node_memcgs+0x172/0x1d0
Jul 31 22:20:46 server1 kernel: [173642.759262] shrink_node+0x16e/0x600
Jul 31 22:20:46 server1 kernel: [173642.759272] balance_pgdat+0x36e/0x810
Jul 31 22:20:46 server1 kernel: [173642.759281] ? finish_task_switch.isra.0+0x7e/0x280
Jul 31 22:20:46 server1 kernel: [173642.759294] kswapd+0x10c/0x1c0
Jul 31 22:20:46 server1 kernel: [173642.759303] ? balance_pgdat+0x810/0x810
Jul 31 22:20:46 server1 kernel: [173642.759313] kthread+0x12a/0x150
Jul 31 22:20:46 server1 kernel: [173642.759322] ? set_kthread_struct+0x50/0x50
Jul 31 22:20:46 server1 kernel: [173642.759332] ret_from_fork+0x22/0x30
Jul 31 22:20:46 server1 kernel: [173642.759344] </TASK>
Jul 31 22:20:46 server1 kernel: [173642.759356] INFO: task md0_raid5:409 blocked for more than 120 seconds.
...

In the same second nine more of these kinds of delays then silence.