Comment 743 for bug 1690085

Revision history for this message
In , pghjvanblokland (pghjvanblokland-linux-kernel-bugs) wrote :

Ever since upgrading to a Ryzen (1700X), I have experienced frequent system freezes, which may be related to the problems discussed here.

The freeze mostly happens during a certain heavily threaded task with disk io.

Symptoms:

* Screen completely freezes, including mouse pointer,
* Existing SSH connections die, no new connection can be established,
* System can no longer switch to text console,
* LEDs indicate **unceasing disk activity**,
* System still responds to pings,
* Alt-SysRq keys remain active, but cannot output to screen even if already in text console.

I've succeeded in capturing kernel logging after a freeze using netconsole:

This timeout message appears:

    [35042.581242] INFO: task jbd2/dm-2-8:610 blocked for more than 120 seconds.
    [35042.581259] Not tainted 4.15.0-62-generic #69-Ubuntu
    [35042.581262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [35042.581273] jbd2/dm-2-8 D 0 610 2 0x80000000
    [35042.581278] Call Trace:
    [35042.581290] __schedule+0x24e/0x880
    [35042.581295] ? bit_wait+0x60/0x60
    [35042.581300] schedule+0x2c/0x80
    [35042.581304] io_schedule+0x16/0x40
    [35042.581308] bit_wait_io+0x11/0x60
    [35042.581313] __wait_on_bit+0x4c/0x90
    [35042.581317] out_of_line_wait_on_bit+0x90/0xb0
    [35042.581323] ? bit_waitqueue+0x40/0x40
    [35042.581328] __wait_on_buffer+0x32/0x40
    [35042.581333] jbd2_journal_commit_transaction+0xdac/0x1730
    [35042.581337] ? __switch_to_asm+0x41/0x70
    [35042.581343] kjournald2+0xc8/0x270
    [35042.581347] ? kjournald2+0xc8/0x270
    [35042.581351] ? wait_woken+0x80/0x80
    [35042.581355] kthread+0x121/0x140
    [35042.581359] ? commit_timeout+0x20/0x20
    [35042.581363] ? kthread_create_worker_on_cpu+0x70/0x70
    [35042.581366] ret_from_fork+0x22/0x40
    [35042.581242] INFO: task jbd2/dm-2-8:610 blocked for more than 120 seconds.
    [35042.581259] Not tainted 4.15.0-62-generic #69-Ubuntu
    [35042.581262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [35042.581273] jbd2/dm-2-8 D 0 610 2 0x80000000
    [35042.581278] Call Trace:
    [35042.581290] __schedule+0x24e/0x880
    [35042.581295] ? bit_wait+0x60/0x60
    [35042.581300] schedule+0x2c/0x80
    [35042.581304] io_schedule+0x16/0x40
    [35042.581308] bit_wait_io+0x11/0x60
    [35042.581313] __wait_on_bit+0x4c/0x90
    [35042.581317] out_of_line_wait_on_bit+0x90/0xb0
    [35042.581323] ? bit_waitqueue+0x40/0x40
    [35042.581328] __wait_on_buffer+0x32/0x40
    [35042.581333] jbd2_journal_commit_transaction+0xdac/0x1730
    [35042.581337] ? __switch_to_asm+0x41/0x70
    [35042.581343] kjournald2+0xc8/0x270
    [35042.581347] ? kjournald2+0xc8/0x270
    [35042.581351] ? wait_woken+0x80/0x80
    [35042.581355] kthread+0x121/0x140
    [35042.581359] ? commit_timeout+0x20/0x20
    [35042.581363] ? kthread_create_worker_on_cpu+0x70/0x70
    [35042.581366] ret_from_fork+0x22/0x40

Also, I have thousands of lines of output for blocked tasks. Most traces look more or less like this:

    [34274.346748] sysrq: SysRq : Show Blocked State
    [34274.346766] task PC stack pid father
    [34274.346771] systemd D 0 1 0 0x00000000
    [34274.346776] Call Trace:
    [34274.346786] __schedule+0x24e/0x880
    [34274.346792] ? mempool_alloc_slab+0x15/0x20
    [34274.346795] schedule+0x2c/0x80
    [34274.346798] schedule_timeout+0x15d/0x350
    [34274.346804] ? __next_timer_interrupt+0xe0/0xe0
    [34274.346808] ? wait_woken+0x80/0x80
    [34274.346812] io_schedule_timeout+0x1e/0x50
    [34274.346815] mempool_alloc+0x15d/0x190
    [34274.346820] ? wait_woken+0x80/0x80
    [34274.346825] bio_alloc_bioset+0xa9/0x1e0
    [34274.346830] __split_and_process_non_flush+0x147/0x2c0
    [34274.346834] __split_and_process_bio+0x139/0x2a0
    [34274.346838] dm_make_request+0x7a/0xd0
    [34274.346843] ? SyS_madvise+0x990/0x990
    [34274.346847] generic_make_request+0x124/0x300
    [34274.346850] submit_bio+0x73/0x140
    [34274.346853] ? submit_bio+0x73/0x140
    [34274.346856] ? get_swap_bio+0xcd/0x100
    [34274.346861] __swap_writepage+0x323/0x3b0
    [34274.346865] ? __frontswap_store+0x73/0x100
    [34274.346869] swap_writepage+0x34/0x90
    [34274.346872] pageout.isra.54+0x11b/0x350
    [34274.346878] shrink_page_list+0x99a/0xbc0
    [34274.346883] shrink_inactive_list+0x242/0x590
    [34274.346887] shrink_node_memcg+0x364/0x770
    [34274.346892] shrink_node+0xf7/0x300
    [34274.346896] ? shrink_node+0xf7/0x300
    [34274.346900] do_try_to_free_pages+0xc9/0x330
    [34274.346904] try_to_free_pages+0xee/0x1b0
    [34274.346910] __alloc_pages_slowpath+0x3fc/0xe00
    [34274.346914] ? __switch_to_asm+0x35/0x70
    [34274.346917] ? __switch_to_asm+0x35/0x70
    [34274.346920] ? __switch_to_asm+0x35/0x70
    [34274.346924] ? __switch_to_asm+0x35/0x70
    [34274.346929] ? __switch_to_asm+0x35/0x70
    [34274.346932] ? __switch_to_asm+0x41/0x70
    [34274.346936] __alloc_pages_nodemask+0x29a/0x2c0
    [34274.346940] alloc_pages_current+0x6a/0xe0
    [34274.346944] __page_cache_alloc+0x81/0xa0
    [34274.346948] __do_page_cache_readahead+0x113/0x2c0
    [34274.346952] ? radix_tree_lookup_slot+0x22/0x50
    [34274.346956] ? find_get_entry+0x1e/0x110
    [34274.346959] filemap_fault+0x2ad/0x6f0
    [34274.346968] ? filemap_fault+0x2ad/0x6f0
    [34274.346971] ? page_add_file_rmap+0x134/0x180
    [34274.346975] ? filemap_map_pages+0x181/0x390
    [34274.346980] ext4_filemap_fault+0x31/0x44
    [34274.346748] sysrq: SysRq : Show Blocked State
    [34274.346984] __do_fault+0x5b/0x115
    [34274.346988] __handle_mm_fault+0xdef/0x1290
    [34274.346992] handle_mm_fault+0xb1/0x210
    [34274.346997] __do_page_fault+0x281/0x4b0
    [34274.347001] do_page_fault+0x2e/0xe0
    [34274.347004] ? page_fault+0x2f/0x50
    [34274.347008] page_fault+0x45/0x50
    [34274.347011] RIP: 0033:0x7fa9446ee83a
    [34274.347015] RSP: 002b:00007ffcccb01470 EFLAGS: 00010206
    [34274.347019] RAX: 0000000000000001 RBX: 00005615eff63650 RCX: 00007fa944bcebb7
    [34274.347021] RDX: 0000000000000093 RSI: 00007ffcccb01470 RDI: 0000000000000000
    [34274.347025] RBP: 00007ffcccb01c60 R08: 0000000000000000 R09: 0000000000000008
    [34274.347027] R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000001
    [34274.347032] R13: ffffffffffffffff R14: 00007ffcccb01470 R15: 0000000000000001

Another detail that may be relevant it that show-task-states outputs about 5000 lines of this kind:

    [34830.962684] in-flight: 3235:kcryptd_crypt [dm_crypt], 3237:kcryptd_crypt [dm_crypt], 6056:kcryptd_crypt [dm_crypt], 6058:kcryptd_crypt [dm_crypt], 6055:kcryptd_crypt [dm_crypt], 6057:kcryptd_crypt [dm_crypt], 3992:kcryptd_crypt [dm_crypt], 4861:kcryptd_crypt [dm_crypt], 32431:kcryptd_crypt [dm_crypt], 2682:kcryptd_crypt [dm_crypt], 4850:kcryptd_crypt [dm_crypt], 1429:kcryptd_crypt [dm_crypt], 6054:kcryptd_crypt [dm_crypt], 6060:kcryptd_crypt [dm_crypt], 4862:kcryptd_crypt [dm_crypt], 1862:kcryptd_crypt [dm_crypt]
    [34830.874519] DefaultDispatch 22853 343906.389378 12131 120
    [34830.962714] delayed: kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt]
    [34830.962761] , kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt]
    [34830.881008] .nr_spread_over : 0
    [34830.962862] , kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt]

Is there someone who can interpret all this? If it is helpful I can attach the full blocked-tasks output.

(kernel version is 4.15.0-62-generic)