Ever since upgrading to a Ryzen (1700X), I have experienced frequent system freezes, which may be related to the problems discussed here.
The freeze mostly happens during a certain heavily threaded task with disk io.
Symptoms:
* Screen completely freezes, including mouse pointer,
* Existing SSH connections die, no new connection can be established,
* System can no longer switch to text console,
* LEDs indicate **unceasing disk activity**,
* System still responds to pings,
* Alt-SysRq keys remain active, but cannot output to screen even if already in text console.
I've succeeded in capturing kernel logging after a freeze using netconsole:
Ever since upgrading to a Ryzen (1700X), I have experienced frequent system freezes, which may be related to the problems discussed here.
The freeze mostly happens during a certain heavily threaded task with disk io.
Symptoms:
* Screen completely freezes, including mouse pointer,
* Existing SSH connections die, no new connection can be established,
* System can no longer switch to text console,
* LEDs indicate **unceasing disk activity**,
* System still responds to pings,
* Alt-SysRq keys remain active, but cannot output to screen even if already in text console.
I've succeeded in capturing kernel logging after a freeze using netconsole:
This timeout message appears:
[35042.581242] INFO: task jbd2/dm-2-8:610 blocked for more than 120 seconds. kernel/ hung_task_ timeout_ secs" disables this message. 0x24e/0x880 0x16/0x40 io+0x11/ 0x60 on_bit+ 0x4c/0x90 line_wait_ on_bit+ 0x90/0xb0 0x40/0x40 on_buffer+ 0x32/0x40 commit_ transaction+ 0xdac/0x1730 to_asm+ 0x41/0x70 0xc8/0x270 0xc8/0x270 0x80/0x80 timeout+ 0x20/0x20 create_ worker_ on_cpu+ 0x70/0x70 fork+0x22/ 0x40 kernel/ hung_task_ timeout_ secs" disables this message. 0x24e/0x880 0x16/0x40 io+0x11/ 0x60 on_bit+ 0x4c/0x90 line_wait_ on_bit+ 0x90/0xb0 0x40/0x40 on_buffer+ 0x32/0x40 commit_ transaction+ 0xdac/0x1730 to_asm+ 0x41/0x70 0xc8/0x270 0xc8/0x270 0x80/0x80 timeout+ 0x20/0x20 create_ worker_ on_cpu+ 0x70/0x70 fork+0x22/ 0x40
[35042.581259] Not tainted 4.15.0-62-generic #69-Ubuntu
[35042.581262] "echo 0 > /proc/sys/
[35042.581273] jbd2/dm-2-8 D 0 610 2 0x80000000
[35042.581278] Call Trace:
[35042.581290] __schedule+
[35042.581295] ? bit_wait+0x60/0x60
[35042.581300] schedule+0x2c/0x80
[35042.581304] io_schedule+
[35042.581308] bit_wait_
[35042.581313] __wait_
[35042.581317] out_of_
[35042.581323] ? bit_waitqueue+
[35042.581328] __wait_
[35042.581333] jbd2_journal_
[35042.581337] ? __switch_
[35042.581343] kjournald2+
[35042.581347] ? kjournald2+
[35042.581351] ? wait_woken+
[35042.581355] kthread+0x121/0x140
[35042.581359] ? commit_
[35042.581363] ? kthread_
[35042.581366] ret_from_
[35042.581242] INFO: task jbd2/dm-2-8:610 blocked for more than 120 seconds.
[35042.581259] Not tainted 4.15.0-62-generic #69-Ubuntu
[35042.581262] "echo 0 > /proc/sys/
[35042.581273] jbd2/dm-2-8 D 0 610 2 0x80000000
[35042.581278] Call Trace:
[35042.581290] __schedule+
[35042.581295] ? bit_wait+0x60/0x60
[35042.581300] schedule+0x2c/0x80
[35042.581304] io_schedule+
[35042.581308] bit_wait_
[35042.581313] __wait_
[35042.581317] out_of_
[35042.581323] ? bit_waitqueue+
[35042.581328] __wait_
[35042.581333] jbd2_journal_
[35042.581337] ? __switch_
[35042.581343] kjournald2+
[35042.581347] ? kjournald2+
[35042.581351] ? wait_woken+
[35042.581355] kthread+0x121/0x140
[35042.581359] ? commit_
[35042.581363] ? kthread_
[35042.581366] ret_from_
Also, I have thousands of lines of output for blocked tasks. Most traces look more or less like this:
[34274.346748] sysrq: SysRq : Show Blocked State 0x24e/0x880 alloc_slab+ 0x15/0x20 timeout+ 0x15d/0x350 timer_interrupt +0xe0/0xe0 0x80/0x80 timeout+ 0x1e/0x50 alloc+0x15d/ 0x190 0x80/0x80 bioset+ 0xa9/0x1e0 and_process_ non_flush+ 0x147/0x2c0 and_process_ bio+0x139/ 0x2a0 request+ 0x7a/0xd0 0x990/0x990 make_request+ 0x124/0x300 bio+0x73/ 0x140 bio+0x73/ 0x140 bio+0xcd/ 0x100 writepage+ 0x323/0x3b0 store+0x73/ 0x100 0x34/0x90 isra.54+ 0x11b/0x350 page_list+ 0x99a/0xbc0 inactive_ list+0x242/ 0x590 node_memcg+ 0x364/0x770 node+0xf7/ 0x300 node+0xf7/ 0x300 to_free_ pages+0xc9/ 0x330 free_pages+ 0xee/0x1b0 pages_slowpath+ 0x3fc/0xe00 to_asm+ 0x35/0x70 to_asm+ 0x35/0x70 to_asm+ 0x35/0x70 to_asm+ 0x35/0x70 to_asm+ 0x35/0x70 to_asm+ 0x41/0x70 pages_nodemask+ 0x29a/0x2c0 current+ 0x6a/0xe0 cache_alloc+ 0x81/0xa0 cache_readahead +0x113/ 0x2c0 lookup_ slot+0x22/ 0x50 entry+0x1e/ 0x110 fault+0x2ad/ 0x6f0 fault+0x2ad/ 0x6f0 file_rmap+ 0x134/0x180 map_pages+ 0x181/0x390 fault+0x31/ 0x44 0x5b/0x115 mm_fault+ 0xdef/0x1290 mm_fault+ 0xb1/0x210 fault+0x281/ 0x4b0 fault+0x2e/ 0xe0 0x2f/0x50 0x45/0x50 b01470 EFLAGS: 00010206
[34274.346766] task PC stack pid father
[34274.346771] systemd D 0 1 0 0x00000000
[34274.346776] Call Trace:
[34274.346786] __schedule+
[34274.346792] ? mempool_
[34274.346795] schedule+0x2c/0x80
[34274.346798] schedule_
[34274.346804] ? __next_
[34274.346808] ? wait_woken+
[34274.346812] io_schedule_
[34274.346815] mempool_
[34274.346820] ? wait_woken+
[34274.346825] bio_alloc_
[34274.346830] __split_
[34274.346834] __split_
[34274.346838] dm_make_
[34274.346843] ? SyS_madvise+
[34274.346847] generic_
[34274.346850] submit_
[34274.346853] ? submit_
[34274.346856] ? get_swap_
[34274.346861] __swap_
[34274.346865] ? __frontswap_
[34274.346869] swap_writepage+
[34274.346872] pageout.
[34274.346878] shrink_
[34274.346883] shrink_
[34274.346887] shrink_
[34274.346892] shrink_
[34274.346896] ? shrink_
[34274.346900] do_try_
[34274.346904] try_to_
[34274.346910] __alloc_
[34274.346914] ? __switch_
[34274.346917] ? __switch_
[34274.346920] ? __switch_
[34274.346924] ? __switch_
[34274.346929] ? __switch_
[34274.346932] ? __switch_
[34274.346936] __alloc_
[34274.346940] alloc_pages_
[34274.346944] __page_
[34274.346948] __do_page_
[34274.346952] ? radix_tree_
[34274.346956] ? find_get_
[34274.346959] filemap_
[34274.346968] ? filemap_
[34274.346971] ? page_add_
[34274.346975] ? filemap_
[34274.346980] ext4_filemap_
[34274.346748] sysrq: SysRq : Show Blocked State
[34274.346984] __do_fault+
[34274.346988] __handle_
[34274.346992] handle_
[34274.346997] __do_page_
[34274.347001] do_page_
[34274.347004] ? page_fault+
[34274.347008] page_fault+
[34274.347011] RIP: 0033:0x7fa9446ee83a
[34274.347015] RSP: 002b:00007ffccc
[34274.347019] RAX: 0000000000000001 RBX: 00005615eff63650 RCX: 00007fa944bcebb7
[34274.347021] RDX: 0000000000000093 RSI: 00007ffcccb01470 RDI: 0000000000000000
[34274.347025] RBP: 00007ffcccb01c60 R08: 0000000000000000 R09: 0000000000000008
[34274.347027] R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000001
[34274.347032] R13: ffffffffffffffff R14: 00007ffcccb01470 R15: 0000000000000001
Another detail that may be relevant it that show-task-states outputs about 5000 lines of this kind:
[34830.962684] in-flight: 3235:kcryptd_crypt [dm_crypt], 3237:kcryptd_crypt [dm_crypt], 6056:kcryptd_crypt [dm_crypt], 6058:kcryptd_crypt [dm_crypt], 6055:kcryptd_crypt [dm_crypt], 6057:kcryptd_crypt [dm_crypt], 3992:kcryptd_crypt [dm_crypt], 4861:kcryptd_crypt [dm_crypt], 32431:kcryptd_crypt [dm_crypt], 2682:kcryptd_crypt [dm_crypt], 4850:kcryptd_crypt [dm_crypt], 1429:kcryptd_crypt [dm_crypt], 6054:kcryptd_crypt [dm_crypt], 6060:kcryptd_crypt [dm_crypt], 4862:kcryptd_crypt [dm_crypt], 1862:kcryptd_crypt [dm_crypt]
[34830.874519] DefaultDispatch 22853 343906.389378 12131 120
[34830.962714] delayed: kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt]
[34830.962761] , kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt]
[34830.881008] .nr_spread_over : 0
[34830.962862] , kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt]
Is there someone who can interpret all this? If it is helpful I can attach the full blocked-tasks output.
(kernel version is 4.15.0-62-generic)