zfs recv PANIC at range_tree.c:304:range_tree_find_impl()

Bug #1861235 reported by Seth Arnold
This bug affects 3 people
Affects              Status         Importance   Assigned to   Milestone
Linux                Fix Released   Unknown
zfs-linux (Ubuntu)   Incomplete     Undecided    Unassigned
Bionic               Confirmed      Undecided    Unassigned

Bug Description

Same as bug 1861228 but with a newer kernel installed.

[ 790.702566] VERIFY(size != 0) failed
[ 790.702590] PANIC at range_tree.c:304:range_tree_find_impl()
[ 790.702611] Showing stack for process 28685
[ 790.702614] CPU: 17 PID: 28685 Comm: receive_writer Tainted: P O 4.15.0-76-generic #86-Ubuntu
[ 790.702615] Hardware name: Supermicro SSG-6038R-E1CR16L/X10DRH-iT, BIOS 2.0 12/17/2015
[ 790.702616] Call Trace:
[ 790.702626] dump_stack+0x6d/0x8e
[ 790.702637] spl_dumpstack+0x42/0x50 [spl]
[ 790.702640] spl_panic+0xc8/0x110 [spl]
[ 790.702645] ? __switch_to_asm+0x41/0x70
[ 790.702714] ? arc_prune_task+0x1a/0x40 [zfs]
[ 790.702740] ? dbuf_dirty+0x43d/0x850 [zfs]
[ 790.702745] ? getrawmonotonic64+0x43/0xd0
[ 790.702746] ? getrawmonotonic64+0x43/0xd0
[ 790.702775] ? dmu_zfetch+0x49a/0x500 [zfs]
[ 790.702778] ? getrawmonotonic64+0x43/0xd0
[ 790.702805] ? dmu_zfetch+0x49a/0x500 [zfs]
[ 790.702807] ? mutex_lock+0x12/0x40
[ 790.702833] ? dbuf_rele_and_unlock+0x1a8/0x4b0 [zfs]
[ 790.702866] range_tree_find_impl+0x88/0x90 [zfs]
[ 790.702870] ? spl_kmem_zalloc+0xdc/0x1a0 [spl]
[ 790.702902] range_tree_clear+0x4f/0x60 [zfs]
[ 790.702930] dnode_free_range+0x11f/0x5a0 [zfs]
[ 790.702957] dmu_object_free+0x53/0x90 [zfs]
[ 790.702983] dmu_free_long_object+0x9f/0xc0 [zfs]
[ 790.703010] receive_freeobjects.isra.12+0x7a/0x100 [zfs]
[ 790.703036] receive_writer_thread+0x6d2/0xa60 [zfs]
[ 790.703040] ? set_curr_task_fair+0x2b/0x60
[ 790.703043] ? spl_kmem_free+0x33/0x40 [spl]
[ 790.703048] ? kfree+0x165/0x180
[ 790.703073] ? receive_free.isra.13+0xc0/0xc0 [zfs]
[ 790.703078] thread_generic_wrapper+0x74/0x90 [spl]
[ 790.703081] kthread+0x121/0x140
[ 790.703084] ? __thread_exit+0x20/0x20 [spl]
[ 790.703085] ? kthread_create_worker_on_cpu+0x70/0x70
[ 790.703088] ret_from_fork+0x35/0x40
[ 967.636923] INFO: task txg_quiesce:14810 blocked for more than 120 seconds.
[ 967.636979] Tainted: P O 4.15.0-76-generic #86-Ubuntu
[ 967.637024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.637076] txg_quiesce D 0 14810 2 0x80000000
[ 967.637080] Call Trace:
[ 967.637089] __schedule+0x24e/0x880
[ 967.637092] schedule+0x2c/0x80
[ 967.637106] cv_wait_common+0x11e/0x140 [spl]
[ 967.637114] ? wait_woken+0x80/0x80
[ 967.637122] __cv_wait+0x15/0x20 [spl]
[ 967.637210] txg_quiesce_thread+0x2cb/0x3d0 [zfs]
[ 967.637278] ? txg_delay+0x1b0/0x1b0 [zfs]
[ 967.637286] thread_generic_wrapper+0x74/0x90 [spl]
[ 967.637291] kthread+0x121/0x140
[ 967.637297] ? __thread_exit+0x20/0x20 [spl]
[ 967.637299] ? kthread_create_worker_on_cpu+0x70/0x70
[ 967.637304] ret_from_fork+0x35/0x40
[ 967.637326] INFO: task zfs:28590 blocked for more than 120 seconds.
[ 967.637371] Tainted: P O 4.15.0-76-generic #86-Ubuntu
[ 967.637416] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.637467] zfs D 0 28590 28587 0x80000080
[ 967.637470] Call Trace:
[ 967.637474] __schedule+0x24e/0x880
[ 967.637477] schedule+0x2c/0x80
[ 967.637486] cv_wait_common+0x11e/0x140 [spl]
[ 967.637491] ? wait_woken+0x80/0x80
[ 967.637498] __cv_wait+0x15/0x20 [spl]
[ 967.637554] dmu_recv_stream+0xa51/0xef0 [zfs]
[ 967.637630] zfs_ioc_recv_impl+0x306/0x1100 [zfs]
[ 967.637679] ? dbuf_read+0x34a/0x920 [zfs]
[ 967.637725] ? dbuf_rele+0x36/0x40 [zfs]
[ 967.637728] ? _cond_resched+0x19/0x40
[ 967.637798] zfs_ioc_recv_new+0x33d/0x410 [zfs]
[ 967.637809] ? spl_kmem_alloc_impl+0xe5/0x1a0 [spl]
[ 967.637816] ? spl_vmem_alloc+0x19/0x20 [spl]
[ 967.637828] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
[ 967.637834] ? nv_mem_zalloc.isra.0+0x2e/0x40 [znvpair]
[ 967.637840] ? nvlist_xalloc.part.2+0x50/0xb0 [znvpair]
[ 967.637905] zfsdev_ioctl+0x451/0x610 [zfs]
[ 967.637913] do_vfs_ioctl+0xa8/0x630
[ 967.637917] ? __audit_syscall_entry+0xbc/0x110
[ 967.637924] ? syscall_trace_enter+0x1da/0x2d0
[ 967.637927] SyS_ioctl+0x79/0x90
[ 967.637930] do_syscall_64+0x73/0x130
[ 967.637935] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 967.637938] RIP: 0033:0x7fc305a905d7
[ 967.637940] RSP: 002b:00007ffc45e39618 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 967.637943] RAX: ffffffffffffffda RBX: 0000000000005a46 RCX: 00007fc305a905d7
[ 967.637945] RDX: 00007ffc45e39630 RSI: 0000000000005a46 RDI: 0000000000000006
[ 967.637946] RBP: 00007ffc45e39630 R08: 00007fc305d65e20 R09: 0000000000000000
[ 967.637948] R10: 00005648313c7010 R11: 0000000000000246 R12: 00007ffc45e3cc60
[ 967.637949] R13: 0000000000000006 R14: 00005648313cef10 R15: 000000000000000c
[ 967.637960] INFO: task receive_writer:28685 blocked for more than 120 seconds.
[ 967.638010] Tainted: P O 4.15.0-76-generic #86-Ubuntu
[ 967.638055] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.638107] receive_writer D 0 28685 2 0x80000080
[ 967.638109] Call Trace:
[ 967.638113] __schedule+0x24e/0x880
[ 967.638117] schedule+0x2c/0x80
[ 967.638126] spl_panic+0xfa/0x110 [spl]
[ 967.638175] ? arc_prune_task+0x1a/0x40 [zfs]
[ 967.638223] ? dbuf_dirty+0x43d/0x850 [zfs]
[ 967.638229] ? getrawmonotonic64+0x43/0xd0
[ 967.638232] ? getrawmonotonic64+0x43/0xd0
[ 967.638289] ? dmu_zfetch+0x49a/0x500 [zfs]
[ 967.638293] ? getrawmonotonic64+0x43/0xd0
[ 967.638344] ? dmu_zfetch+0x49a/0x500 [zfs]
[ 967.638348] ? mutex_lock+0x12/0x40
[ 967.638395] ? dbuf_rele_and_unlock+0x1a8/0x4b0 [zfs]
[ 967.638461] range_tree_find_impl+0x88/0x90 [zfs]
[ 967.638468] ? spl_kmem_zalloc+0xdc/0x1a0 [spl]
[ 967.638530] range_tree_clear+0x4f/0x60 [zfs]
[ 967.638583] dnode_free_range+0x11f/0x5a0 [zfs]
[ 967.638635] dmu_object_free+0x53/0x90 [zfs]
[ 967.638685] dmu_free_long_object+0x9f/0xc0 [zfs]
[ 967.638738] receive_freeobjects.isra.12+0x7a/0x100 [zfs]
[ 967.638787] receive_writer_thread+0x6d2/0xa60 [zfs]
[ 967.638792] ? set_curr_task_fair+0x2b/0x60
[ 967.638800] ? spl_kmem_free+0x33/0x40 [spl]
[ 967.638805] ? kfree+0x165/0x180
[ 967.638852] ? receive_free.isra.13+0xc0/0xc0 [zfs]
[ 967.638860] thread_generic_wrapper+0x74/0x90 [spl]
[ 967.638863] kthread+0x121/0x140
[ 967.638869] ? __thread_exit+0x20/0x20 [spl]
[ 967.638872] ? kthread_create_worker_on_cpu+0x70/0x70
[ 967.638876] ret_from_fork+0x35/0x40
[ 1088.467860] INFO: task txg_quiesce:14810 blocked for more than 120 seconds.
[ 1088.467913] Tainted: P O 4.15.0-76-generic #86-Ubuntu
[ 1088.467958] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1088.468010] txg_quiesce D 0 14810 2 0x80000000
[ 1088.468013] Call Trace:
[ 1088.468020] __schedule+0x24e/0x880
[ 1088.468023] schedule+0x2c/0x80
[ 1088.468035] cv_wait_common+0x11e/0x140 [spl]
[ 1088.468039] ? wait_woken+0x80/0x80
[ 1088.468048] __cv_wait+0x15/0x20 [spl]
[ 1088.468124] txg_quiesce_thread+0x2cb/0x3d0 [zfs]
[ 1088.468193] ? txg_delay+0x1b0/0x1b0 [zfs]
[ 1088.468201] thread_generic_wrapper+0x74/0x90 [spl]
[ 1088.468204] kthread+0x121/0x140
[ 1088.468210] ? __thread_exit+0x20/0x20 [spl]
[ 1088.468213] ? kthread_create_worker_on_cpu+0x70/0x70
[ 1088.468217] ret_from_fork+0x35/0x40
[ 1088.468234] INFO: task zfs:28590 blocked for more than 120 seconds.
[ 1088.468278] Tainted: P O 4.15.0-76-generic #86-Ubuntu
[ 1088.468322] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1088.468373] zfs D 0 28590 28587 0x80000080
[ 1088.468376] Call Trace:
[ 1088.468379] __schedule+0x24e/0x880
[ 1088.468382] schedule+0x2c/0x80
[ 1088.468391] cv_wait_common+0x11e/0x140 [spl]
[ 1088.468396] ? wait_woken+0x80/0x80
[ 1088.468403] __cv_wait+0x15/0x20 [spl]
[ 1088.468459] dmu_recv_stream+0xa51/0xef0 [zfs]
[ 1088.468534] zfs_ioc_recv_impl+0x306/0x1100 [zfs]
[ 1088.468582] ? dbuf_read+0x34a/0x920 [zfs]
[ 1088.468631] ? dbuf_rele+0x36/0x40 [zfs]
[ 1088.468634] ? _cond_resched+0x19/0x40
[ 1088.468704] zfs_ioc_recv_new+0x33d/0x410 [zfs]
[ 1088.468714] ? spl_kmem_alloc_impl+0xe5/0x1a0 [spl]
[ 1088.468721] ? spl_vmem_alloc+0x19/0x20 [spl]
[ 1088.468729] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
[ 1088.468735] ? nv_mem_zalloc.isra.0+0x2e/0x40 [znvpair]
[ 1088.468740] ? nvlist_xalloc.part.2+0x50/0xb0 [znvpair]
[ 1088.468805] zfsdev_ioctl+0x451/0x610 [zfs]
[ 1088.468811] do_vfs_ioctl+0xa8/0x630
[ 1088.468815] ? __audit_syscall_entry+0xbc/0x110
[ 1088.468821] ? syscall_trace_enter+0x1da/0x2d0
[ 1088.468824] SyS_ioctl+0x79/0x90
[ 1088.468827] do_syscall_64+0x73/0x130
[ 1088.468831] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 1088.468834] RIP: 0033:0x7fc305a905d7
[ 1088.468836] RSP: 002b:00007ffc45e39618 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1088.468839] RAX: ffffffffffffffda RBX: 0000000000005a46 RCX: 00007fc305a905d7
[ 1088.468840] RDX: 00007ffc45e39630 RSI: 0000000000005a46 RDI: 0000000000000006
[ 1088.468842] RBP: 00007ffc45e39630 R08: 00007fc305d65e20 R09: 0000000000000000
[ 1088.468843] R10: 00005648313c7010 R11: 0000000000000246 R12: 00007ffc45e3cc60
[ 1088.468845] R13: 0000000000000006 R14: 00005648313cef10 R15: 000000000000000c
[ 1088.468853] INFO: task receive_writer:28685 blocked for more than 120 seconds.
[ 1088.468903] Tainted: P O 4.15.0-76-generic #86-Ubuntu
[ 1088.468948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1088.469000] receive_writer D 0 28685 2 0x80000080
[ 1088.469002] Call Trace:
[ 1088.469006] __schedule+0x24e/0x880
[ 1088.469009] schedule+0x2c/0x80
[ 1088.469018] spl_panic+0xfa/0x110 [spl]
[ 1088.469066] ? arc_prune_task+0x1a/0x40 [zfs]
[ 1088.469116] ? dbuf_dirty+0x43d/0x850 [zfs]
[ 1088.469120] ? getrawmonotonic64+0x43/0xd0
[ 1088.469123] ? getrawmonotonic64+0x43/0xd0
[ 1088.469178] ? dmu_zfetch+0x49a/0x500 [zfs]
[ 1088.469182] ? getrawmonotonic64+0x43/0xd0
[ 1088.469234] ? dmu_zfetch+0x49a/0x500 [zfs]
[ 1088.469237] ? mutex_lock+0x12/0x40
[ 1088.469286] ? dbuf_rele_and_unlock+0x1a8/0x4b0 [zfs]
[ 1088.469352] range_tree_find_impl+0x88/0x90 [zfs]
[ 1088.469360] ? spl_kmem_zalloc+0xdc/0x1a0 [spl]
[ 1088.469421] range_tree_clear+0x4f/0x60 [zfs]
[ 1088.469475] dnode_free_range+0x11f/0x5a0 [zfs]
[ 1088.469526] dmu_object_free+0x53/0x90 [zfs]
[ 1088.469575] dmu_free_long_object+0x9f/0xc0 [zfs]
[ 1088.469627] receive_freeobjects.isra.12+0x7a/0x100 [zfs]
[ 1088.469678] receive_writer_thread+0x6d2/0xa60 [zfs]
[ 1088.469681] ? set_curr_task_fair+0x2b/0x60
[ 1088.469689] ? spl_kmem_free+0x33/0x40 [spl]
[ 1088.469693] ? kfree+0x165/0x180
[ 1088.469740] ? receive_free.isra.13+0xc0/0xc0 [zfs]
[ 1088.469747] thread_generic_wrapper+0x74/0x90 [spl]
[ 1088.469750] kthread+0x121/0x140
[ 1088.469756] ? __thread_exit+0x20/0x20 [spl]
[ 1088.469759] ? kthread_create_worker_on_cpu+0x70/0x70
[ 1088.469763] ret_from_fork+0x35/0x40
[ 1209.298375] INFO: task txg_quiesce:14810 blocked for more than 120 seconds.
[ 1209.298429] Tainted: P O 4.15.0-76-generic #86-Ubuntu
[ 1209.298474] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1209.298526] txg_quiesce D 0 14810 2 0x80000000
[ 1209.298530] Call Trace:
[ 1209.298536] __schedule+0x24e/0x880
[ 1209.298540] schedule+0x2c/0x80
[ 1209.298551] cv_wait_common+0x11e/0x140 [spl]
[ 1209.298556] ? wait_woken+0x80/0x80
[ 1209.298564] __cv_wait+0x15/0x20 [spl]
[ 1209.298644] txg_quiesce_thread+0x2cb/0x3d0 [zfs]
[ 1209.298713] ? txg_delay+0x1b0/0x1b0 [zfs]
[ 1209.298721] thread_generic_wrapper+0x74/0x90 [spl]
[ 1209.298724] kthread+0x121/0x140
[ 1209.298730] ? __thread_exit+0x20/0x20 [spl]
[ 1209.298733] ? kthread_create_worker_on_cpu+0x70/0x70
[ 1209.298738] ret_from_fork+0x35/0x40
[ 1209.298754] INFO: task zfs:28590 blocked for more than 120 seconds.
[ 1209.298799] Tainted: P O 4.15.0-76-generic #86-Ubuntu
[ 1209.298843] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1209.298894] zfs D 0 28590 28587 0x80000080
[ 1209.298897] Call Trace:
[ 1209.298900] __schedule+0x24e/0x880
[ 1209.298903] schedule+0x2c/0x80
[ 1209.298912] cv_wait_common+0x11e/0x140 [spl]
[ 1209.298916] ? wait_woken+0x80/0x80
[ 1209.298924] __cv_wait+0x15/0x20 [spl]
[ 1209.298981] dmu_recv_stream+0xa51/0xef0 [zfs]
[ 1209.299056] zfs_ioc_recv_impl+0x306/0x1100 [zfs]
[ 1209.299104] ? dbuf_read+0x34a/0x920 [zfs]
[ 1209.299151] ? dbuf_rele+0x36/0x40 [zfs]
[ 1209.299154] ? _cond_resched+0x19/0x40
[ 1209.299223] zfs_ioc_recv_new+0x33d/0x410 [zfs]
[ 1209.299232] ? spl_kmem_alloc_impl+0xe5/0x1a0 [spl]
[ 1209.299239] ? spl_vmem_alloc+0x19/0x20 [spl]
[ 1209.299247] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
[ 1209.299255] ? nv_mem_zalloc.isra.0+0x2e/0x40 [znvpair]
[ 1209.299261] ? nvlist_xalloc.part.2+0x50/0xb0 [znvpair]
[ 1209.299325] zfsdev_ioctl+0x451/0x610 [zfs]
[ 1209.299331] do_vfs_ioctl+0xa8/0x630
[ 1209.299334] ? __audit_syscall_entry+0xbc/0x110
[ 1209.299339] ? syscall_trace_enter+0x1da/0x2d0
[ 1209.299341] SyS_ioctl+0x79/0x90
[ 1209.299345] do_syscall_64+0x73/0x130
[ 1209.299349] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 1209.299352] RIP: 0033:0x7fc305a905d7
[ 1209.299353] RSP: 002b:00007ffc45e39618 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1209.299358] RAX: ffffffffffffffda RBX: 0000000000005a46 RCX: 00007fc305a905d7
[ 1209.299361] RDX: 00007ffc45e39630 RSI: 0000000000005a46 RDI: 0000000000000006
[ 1209.299362] RBP: 00007ffc45e39630 R08: 00007fc305d65e20 R09: 0000000000000000
[ 1209.299365] R10: 00005648313c7010 R11: 0000000000000246 R12: 00007ffc45e3cc60
[ 1209.299366] R13: 0000000000000006 R14: 00005648313cef10 R15: 000000000000000c
[ 1209.299375] INFO: task receive_writer:28685 blocked for more than 120 seconds.
[ 1209.299426] Tainted: P O 4.15.0-76-generic #86-Ubuntu
[ 1209.299471] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1209.299523] receive_writer D 0 28685 2 0x80000080
[ 1209.299525] Call Trace:
[ 1209.299529] __schedule+0x24e/0x880
[ 1209.299532] schedule+0x2c/0x80
[ 1209.299541] spl_panic+0xfa/0x110 [spl]
[ 1209.299591] ? arc_prune_task+0x1a/0x40 [zfs]
[ 1209.299638] ? dbuf_dirty+0x43d/0x850 [zfs]
[ 1209.299644] ? getrawmonotonic64+0x43/0xd0
[ 1209.299647] ? getrawmonotonic64+0x43/0xd0
[ 1209.299703] ? dmu_zfetch+0x49a/0x500 [zfs]
[ 1209.299707] ? getrawmonotonic64+0x43/0xd0
[ 1209.299759] ? dmu_zfetch+0x49a/0x500 [zfs]
[ 1209.299762] ? mutex_lock+0x12/0x40
[ 1209.299810] ? dbuf_rele_and_unlock+0x1a8/0x4b0 [zfs]
[ 1209.299875] range_tree_find_impl+0x88/0x90 [zfs]
[ 1209.299883] ? spl_kmem_zalloc+0xdc/0x1a0 [spl]
[ 1209.299944] range_tree_clear+0x4f/0x60 [zfs]
[ 1209.300000] dnode_free_range+0x11f/0x5a0 [zfs]
[ 1209.300052] dmu_object_free+0x53/0x90 [zfs]
[ 1209.300102] dmu_free_long_object+0x9f/0xc0 [zfs]
[ 1209.300153] receive_freeobjects.isra.12+0x7a/0x100 [zfs]
[ 1209.300203] receive_writer_thread+0x6d2/0xa60 [zfs]
[ 1209.300208] ? set_curr_task_fair+0x2b/0x60
[ 1209.300215] ? spl_kmem_free+0x33/0x40 [spl]
[ 1209.300219] ? kfree+0x165/0x180
[ 1209.300266] ? receive_free.isra.13+0xc0/0xc0 [zfs]
[ 1209.300273] thread_generic_wrapper+0x74/0x90 [spl]
[ 1209.300278] kthread+0x121/0x140
[ 1209.300284] ? __thread_exit+0x20/0x20 [spl]
[ 1209.300287] ? kthread_create_worker_on_cpu+0x70/0x70
[ 1209.300291] ret_from_fork+0x35/0x40
[ 1330.129221] INFO: task txg_quiesce:14810 blocked for more than 120 seconds.
[ 1330.129282] Tainted: P O 4.15.0-76-generic #86-Ubuntu
[ 1330.129328] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1330.129382] txg_quiesce D 0 14810 2 0x80000000
[ 1330.129387] Call Trace:
[ 1330.129399] __schedule+0x24e/0x880
[ 1330.129403] schedule+0x2c/0x80
[ 1330.129420] cv_wait_common+0x11e/0x140 [spl]
[ 1330.129429] ? wait_woken+0x80/0x80
[ 1330.129437] __cv_wait+0x15/0x20 [spl]
[ 1330.129547] txg_quiesce_thread+0x2cb/0x3d0 [zfs]
[ 1330.129617] ? txg_delay+0x1b0/0x1b0 [zfs]
[ 1330.129625] thread_generic_wrapper+0x74/0x90 [spl]
[ 1330.129631] kthread+0x121/0x140
[ 1330.129638] ? __thread_exit+0x20/0x20 [spl]
[ 1330.129641] ? kthread_create_worker_on_cpu+0x70/0x70
[ 1330.129646] ret_from_fork+0x35/0x40

Thanks

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-76-generic 4.15.0-76.86
ProcVersionSignature: Ubuntu 4.15.0-76.86-generic 4.15.18
Uname: Linux 4.15.0-76-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jan 28 13:31 seq
 crw-rw---- 1 root audio 116, 33 Jan 28 13:31 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.9
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Tue Jan 28 15:24:10 2020
HibernationDevice: RESUME=UUID=d34df57d-ae32-4002-be2c-a25efa8678e4
InstallationDate: Installed on 2016-04-04 (1394 days ago)
InstallationMedia: Ubuntu-Server 16.04 LTS "Xenial Xerus" - Beta amd64 (20160325)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
MachineType: Supermicro SSG-6038R-E1CR16L
PciMultimedia:

ProcEnviron:
 TERM=rxvt-unicode-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 astdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-76-generic root=UUID=d1ab8bd8-41d5-4142-92d6-d0926539807b ro
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-76-generic N/A
 linux-backports-modules-4.15.0-76-generic N/A
 linux-firmware 1.173.14
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: Upgraded to bionic on 2018-08-16 (530 days ago)
dmi.bios.date: 12/17/2015
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2.0
dmi.board.asset.tag: Default string
dmi.board.name: X10DRH-iT
dmi.board.vendor: Supermicro
dmi.board.version: 1.01
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 1
dmi.chassis.vendor: Supermicro
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr2.0:bd12/17/2015:svnSupermicro:pnSSG-6038R-E1CR16L:pvr123456789:rvnSupermicro:rnX10DRH-iT:rvr1.01:cvnSupermicro:ct1:cvrDefaultstring:
dmi.product.family: SMC X10
dmi.product.name: SSG-6038R-E1CR16L
dmi.product.version: 123456789
dmi.sys.vendor: Supermicro

Revision history for this message
Seth Arnold (seth-arnold) wrote :
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
importance: Undecided → Medium
importance: Medium → High
Revision history for this message
Colin Ian King (colin-king) wrote :

Can you describe the zfs environment and the command that was being actioned when this issue was triggered?

Revision history for this message
Seth Arnold (seth-arnold) wrote :

The two machines involved are:

Receiver, bionic, probably running 4.15.0-76-generic and zfsutils-linux 0.7.5-1ubuntu16.7

Sender, focal, probably running 5.4.0-12-generic and zfsutils-linux 0.8.3-1ubuntu3

I'm using sanoid and syncoid to automate snapshot management and sending and receiving. The combination worked well for about a week before I got this failure.

Revision history for this message
Seth Arnold (seth-arnold) wrote :

These are just the zfs bookmark and zfs send commands from the sender.

Revision history for this message
Seth Arnold (seth-arnold) wrote :

These are all the zpool and zfs commands on the receiver, except the snapshots, renames, and destroys associated with my Ubuntu archive mirror rsync. (7000-ish lines of juggling 30 snapshots. I should do something better here.)

Revision history for this message
Colin Ian King (colin-king) wrote :

Interestingly, the following commit modifies range_tree_clear() so that it performs a zero-size check and returns before calling range_tree_find_impl(). This commit is not in the 18.10 and 19.04 Ubuntu ZFS releases.

commit a1d477c
Author: Matthew Ahrens <email address hidden>
Date: Thu Sep 22 09:30:13 2016 -0700

OpenZFS 7614, 9064 - zfs device evacuation/removal

OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete

The specific change is:

@@ -560,6 +536,9 @@ range_tree_clear(range_tree_t *rt, uint64_t start, uint64_t size)
 {
        range_seg_t *rs;

+ if (size == 0)
+ return;
+
        while ((rs = range_tree_find_impl(rt, start, size)) != NULL) {
                uint64_t free_start = MAX(rs->rs_start, start);
                uint64_t free_end = MIN(rs->rs_end, start + size);

I'm not sure why this check was added, but I guess it handles the cases where zero-sized allocations are allowed, stops these from doing any unnecessary clearing, and avoids the assertion. However, the semantic change is not explained in the commit message.

Revision history for this message
Colin Ian King (colin-king) wrote :

I've uploaded a potential fix to a PPA; do you mind testing this using the zfs-dkms kernel modules as follows:

sudo add-apt-repository ppa:colin-king/zfs-sru-1861235
sudo apt-get update
sudo apt-get install zfs-dkms

and reboot.

Then check that the correct ZFS module is being used by running:

dmesg | grep ZFS

It should be the 0.7.5-1ubuntu16.9~lp1861235 version.

And see if this helps avoid this issue.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Changed in zfs-linux (Ubuntu):
status: New → Incomplete
importance: Undecided → High
assignee: nobody → Colin Ian King (colin-king)
no longer affects: linux (Ubuntu)
no longer affects: linux (Ubuntu Bionic)
Revision history for this message
Colin Ian King (colin-king) wrote :

Please ignore the above. Apparently the issue needs a little more digging and the workaround is insufficient.

Revision history for this message
Colin Ian King (colin-king) wrote :

In this scenario, dmu_object_free has been called, and this calls dnode_free_range as dnode_free_range(dn, 0, DMU_OBJECT_END, tx).

In dnode_free_range, len is therefore DMU_OBJECT_END and off is 0, so:

    if (len == DMU_OBJECT_END) {
            len = UINT64_MAX - off;
            trunc = TRUE;
    }

sets len to UINT64_MAX and trunc to TRUE. Following the code through, we reach this stanza:

    blkid = off >> blkshift;
    nblks = len >> blkshift;
    if (trunc)
            nblks += 1;

and this is the last place nblks gets updated. For the assertion to fire, blkshift must be zero: nblks = UINT64_MAX >> 0 is UINT64_MAX, trunc is TRUE, so nblks += 1 wraps nblks back to 0, hence the assert later on.

For blkshift to be zero, the block size dn->dn_datablksz must be zero, so this looks like some form of corruption.
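
To make the arithmetic concrete, here is a minimal standalone sketch (editorial, not the ZFS source; the names off, len, trunc and blkshift, and the DMU_OBJECT_END value, mirror the analysis above and the len value seen in the debug output later in this bug):

    #include <stdint.h>
    #include <stdio.h>

    #define DMU_OBJECT_END UINT64_MAX    /* matches the len of 18446744073709551615 in the logs */

    int main(void)
    {
        uint64_t off = 0;
        uint64_t len = DMU_OBJECT_END;
        int trunc = 0;
        int blkshift = 0;                /* only possible if dn->dn_datablksz is 0, per the analysis above */

        if (len == DMU_OBJECT_END) {
            len = UINT64_MAX - off;      /* still UINT64_MAX when off == 0 */
            trunc = 1;
        }

        uint64_t blkid = off >> blkshift;
        uint64_t nblks = len >> blkshift;    /* UINT64_MAX >> 0 == UINT64_MAX */
        if (trunc)
            nblks += 1;                  /* unsigned wrap-around: nblks becomes 0 */

        printf("blkid = %llu, nblks = %llu\n",
               (unsigned long long)blkid, (unsigned long long)nblks);
        return 0;                        /* prints nblks = 0, the zero size that later trips the VERIFY */
    }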

Revision history for this message
Colin Ian King (colin-king) wrote :

Incidentally, dnode_free_range hasn't changed much in the focal 0.8.x ZFS, so I'm not sure upgrading to that will make much of a difference.

Revision history for this message
Colin Ian King (colin-king) wrote :

BTW, is this still an open issue?

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Hello Colin, yes, this is still an open issue:

Linux wopr 4.15.0-91-generic #92-Ubuntu SMP Fri Feb 28 11:09:48 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Apr 22 19:10:03 wopr zed[12576]: eid=8352 class=history_event pool_guid=0xB3B099B638F02EEF
Apr 22 19:10:03 wopr kernel: VERIFY(size != 0) failed
Apr 22 19:10:03 wopr kernel: PANIC at range_tree.c:304:range_tree_find_impl()
Apr 22 19:10:03 wopr kernel: Showing stack for process 12577
Apr 22 19:10:03 wopr kernel: CPU: 8 PID: 12577 Comm: receive_writer Tainted: P O 4.15.0-91-generic #92-Ubuntu
Apr 22 19:10:03 wopr kernel: Hardware name: Supermicro SSG-6038R-E1CR16L/X10DRH-iT, BIOS 2.0 12/17/2015
Apr 22 19:10:03 wopr kernel: Call Trace:
Apr 22 19:10:03 wopr kernel: dump_stack+0x6d/0x8e
Apr 22 19:10:03 wopr kernel: spl_dumpstack+0x42/0x50 [spl]
Apr 22 19:10:03 wopr kernel: spl_panic+0xc8/0x110 [spl]
Apr 22 19:10:03 wopr kernel: ? __switch_to_asm+0x41/0x70
Apr 22 19:10:03 wopr kernel: ? abd_iter_map+0xa/0x90 [zfs]
Apr 22 19:10:03 wopr kernel: ? dbuf_dirty+0x43d/0x850 [zfs]
Apr 22 19:10:03 wopr kernel: ? getrawmonotonic64+0x43/0xd0
Apr 22 19:10:03 wopr kernel: ? getrawmonotonic64+0x43/0xd0
Apr 22 19:10:03 wopr kernel: ? dmu_zfetch+0x49a/0x500 [zfs]
Apr 22 19:10:03 wopr kernel: ? getrawmonotonic64+0x43/0xd0
Apr 22 19:10:03 wopr kernel: ? dmu_zfetch+0x49a/0x500 [zfs]
Apr 22 19:10:03 wopr kernel: ? mutex_lock+0x12/0x40
Apr 22 19:10:03 wopr kernel: ? dbuf_rele_and_unlock+0x1a8/0x4b0 [zfs]
Apr 22 19:10:03 wopr kernel: range_tree_find_impl+0x88/0x90 [zfs]
Apr 22 19:10:03 wopr kernel: ? spl_kmem_zalloc+0xdc/0x1a0 [spl]
Apr 22 19:10:03 wopr kernel: range_tree_clear+0x4f/0x60 [zfs]
Apr 22 19:10:03 wopr kernel: dnode_free_range+0x11f/0x5a0 [zfs]
Apr 22 19:10:03 wopr kernel: dmu_object_free+0x53/0x90 [zfs]
Apr 22 19:10:03 wopr kernel: dmu_free_long_object+0x9f/0xc0 [zfs]
Apr 22 19:10:03 wopr kernel: receive_freeobjects.isra.12+0x7a/0x100 [zfs]
Apr 22 19:10:03 wopr kernel: receive_writer_thread+0x6d2/0xa60 [zfs]
Apr 22 19:10:03 wopr kernel: ? set_curr_task_fair+0x2b/0x60
Apr 22 19:10:03 wopr kernel: ? spl_kmem_free+0x33/0x40 [spl]
Apr 22 19:10:03 wopr kernel: ? kfree+0x165/0x180
Apr 22 19:10:03 wopr kernel: ? receive_free.isra.13+0xc0/0xc0 [zfs]
Apr 22 19:10:03 wopr kernel: thread_generic_wrapper+0x74/0x90 [spl]
Apr 22 19:10:03 wopr kernel: kthread+0x121/0x140
Apr 22 19:10:03 wopr kernel: ? __thread_exit+0x20/0x20 [spl]
Apr 22 19:10:03 wopr kernel: ? kthread_create_worker_on_cpu+0x70/0x70
Apr 22 19:10:03 wopr kernel: ret_from_fork+0x35/0x40
Apr 22 19:12:56 wopr kernel: INFO: task txg_quiesce:2265 blocked for more than 120 seconds.
Apr 22 19:12:56 wopr kernel: Tainted: P O 4.15.0-91-generic #92-Ubuntu
Apr 22 19:12:56 wopr kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 22 19:12:56 wopr kernel: txg_quiesce D 0 2265 2 0x80000000
Apr 22 19:12:56 wopr kernel: Call Trace:
Apr 22 19:12:56 wopr kernel: __schedule+0x24e/0x880
Apr 22 19:12:56 wopr kernel: schedule+0x2c/0x80
Apr 22 19:12:56 wopr kernel: cv_wait_common+0x11e/0x140 [spl]
Apr 22 19:12:56 wopr kernel: ? wait_woken+0x80/0x80
Apr 22 19:12:56 w...

Changed in linux:
status: Unknown → New
Revision history for this message
Colin Ian King (colin-king) wrote :

I've uploaded a debug zfs-dkms package to https://launchpad.net/~colin-king/+archive/ubuntu/zfs-sru-1861235 for testing. This will dump out the internal state of the driver to get a better idea of what is happening during this crash.

Please can you test this by doing the following:

sudo add-apt-repository ppa:colin-king/zfs-sru-1861235
sudo apt-get update
sudo apt-get install zfs-dkms

Let this debug dkms ZFS module build, then reboot. When the failure occurs again, please supply as much of the dmesg output as possible.

Thank you

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Here's the part that looks important; the whole dmesg is in the attachment.

Thanks

[ 761.730488] dnone_free_ramge: nblks = 0, trunc = 1, len = 18446744073709551615, blkshift = 0
[ 761.730542] dnode_free_range: nblks == 0, len == 18446744073709551615
               , off=0
[ 761.730543] range_tree_clear: size == 0
[ 761.730605] VERIFY(size != 0) failed
[ 761.730625] PANIC at range_tree.c:307:range_tree_find_impl()
[ 761.730653] Showing stack for process 23913
[ 761.730656] CPU: 31 PID: 23913 Comm: receive_writer Tainted: P OE 4.15.0-99-generic #100-Ubuntu
[ 761.730657] Hardware name: Supermicro SSG-6038R-E1CR16L/X10DRH-iT, BIOS 2.0 12/17/2015
[ 761.730658] Call Trace:
[ 761.730668] dump_stack+0x6d/0x8e
[ 761.730679] spl_dumpstack+0x42/0x50 [spl]
[ 761.730684] spl_panic+0xc8/0x110 [spl]
[ 761.730690] ? fbcon_cursor+0x13e/0x1c0
[ 761.730695] ? atomic_notifier_call_chain+0x1a/0x20
[ 761.730700] ? vt_console_print+0x240/0x410
[ 761.730705] ? up+0x32/0x50
[ 761.730708] ? console_unlock+0x2a1/0x560
[ 761.730767] range_tree_find_impl+0x88/0x90 [zfs]
[ 761.730769] ? printk+0x52/0x6e
[ 761.730810] range_tree_clear+0x54/0x70 [zfs]
[ 761.730843] dnode_free_range+0x15b/0x720 [zfs]
[ 761.730874] dmu_object_free+0x75/0xc0 [zfs]
[ 761.730904] dmu_free_long_object+0x9f/0xc0 [zfs]
[ 761.730934] receive_freeobjects.isra.12+0x7a/0x100 [zfs]
[ 761.730963] receive_writer_thread+0x6d2/0xa60 [zfs]
[ 761.730965] ? set_curr_task_fair+0x2b/0x60
[ 761.730970] ? spl_kmem_free+0x33/0x40 [spl]
[ 761.730973] ? kfree+0x162/0x180
[ 761.731001] ? receive_free.isra.13+0xc0/0xc0 [zfs]
[ 761.731006] thread_generic_wrapper+0x74/0x90 [spl]
[ 761.731008] kthread+0x121/0x140
[ 761.731012] ? __thread_exit+0x20/0x20 [spl]
[ 761.731014] ? kthread_create_worker_on_cpu+0x70/0x70
[ 761.731018] ret_from_fork+0x35/0x40
[ 967.430681] INFO: task txg_quiesce:4049 blocked for more than 120 seconds.
[ 967.430739] Tainted: P OE 4.15.0-99-generic #100-Ubuntu
[ 967.430785] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.430836] txg_quiesce D 0 4049 2 0x80000000
[ 967.430840] Call Trace:
[ 967.430853] __schedule+0x24e/0x880
[ 967.430858] schedule+0x2c/0x80
[ 967.430875] cv_wait_common+0x11e/0x140 [spl]
[ 967.430885] ? wait_woken+0x80/0x80
[ 967.430893] __cv_wait+0x15/0x20 [spl]
[ 967.431010] txg_quiesce_thread+0x2cb/0x3d0 [zfs]
[ 967.431078] ? txg_delay+0x1b0/0x1b0 [zfs]
[ 967.431088] thread_generic_wrapper+0x74/0x90 [spl]
[ 967.431093] kthread+0x121/0x140
[ 967.431099] ? __thread_exit+0x20/0x20 [spl]
[ 967.431102] ? kthread_create_worker_on_cpu+0x70/0x70
[ 967.431107] ret_from_fork+0x35/0x40
[ 967.431140] INFO: task zfs:23834 blocked for more than 120 seconds.
[ 967.431184] Tainted: P OE 4.15.0-99-generic #100-Ubuntu
[ 967.431229] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.431279] zfs D 0 23834 23831 0x80000080
[ 967.431282] Call Trace:
[ 967.431286] __schedule+0x24e/0x880
[ 967.431290] schedule+0x2c/0x80
[ 967.431298] cv_wait_common+0x11e/0x140 [spl]
[ ...

Revision history for this message
Colin Ian King (colin-king) wrote :

Thanks for the data.

I've added a couple more lines of debug now to figure out some earlier missing information once we pop the stack. I've updated the package in the PPA; do you mind updating the zfs-dkms to 0.7.5-1ubuntu16.10~lp1861235.2 and re-testing?

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Thanks Colin!

[ 271.628232] dnone_free_ramge: nblks = 0, trunc = 1, len = 18446744073709551615, blkshift = 0
[ 271.628297] dnode_free_range: nblks == 0, len == 18446744073709551615
               , off=0
[ 271.628298] range_tree_clear: size == 0
[ 271.628375] range_tree_find_impl: size == 0
[ 271.628393] VERIFY(size != 0) failed
[ 271.628409] PANIC at range_tree.c:168:range_tree_add()
[ 271.628429] Showing stack for process 7960
[ 271.628432] CPU: 26 PID: 7960 Comm: receive_writer Tainted: P OE 4.15.0-99-generic #100-Ubuntu
[ 271.628433] Hardware name: Supermicro SSG-6038R-E1CR16L/X10DRH-iT, BIOS 2.0 12/17/2015
[ 271.628433] Call Trace:
[ 271.628442] dump_stack+0x6d/0x8e
[ 271.628458] spl_dumpstack+0x42/0x50 [spl]
[ 271.628462] spl_panic+0xc8/0x110 [spl]
[ 271.628468] ? vt_console_print+0x240/0x410
[ 271.628473] ? console_unlock+0x2a1/0x560
[ 271.628475] ? vprintk_emit+0x104/0x2c0
[ 271.628477] ? vprintk_default+0x29/0x50
[ 271.628479] ? printk+0x52/0x6e
[ 271.628566] range_tree_add+0x29b/0x300 [zfs]
[ 271.628600] ? range_tree_clear+0xa4/0xb0 [zfs]
[ 271.628602] ? printk+0x52/0x6e
[ 271.628635] dnode_free_range+0x178/0x720 [zfs]
[ 271.628673] dmu_object_free+0x6d/0xc0 [zfs]
[ 271.628698] dmu_free_long_object+0x9f/0xc0 [zfs]
[ 271.628723] receive_freeobjects.isra.12+0x7a/0x100 [zfs]
[ 271.628747] receive_writer_thread+0x6d2/0xa60 [zfs]
[ 271.628750] ? set_curr_task_fair+0x2b/0x60
[ 271.628753] ? spl_kmem_free+0x33/0x40 [spl]
[ 271.628757] ? kfree+0x162/0x180
[ 271.628781] ? receive_free.isra.13+0xc0/0xc0 [zfs]
[ 271.628785] thread_generic_wrapper+0x74/0x90 [spl]
[ 271.628788] kthread+0x121/0x140
[ 271.628791] ? __thread_exit+0x20/0x20 [spl]
[ 271.628792] ? kthread_create_worker_on_cpu+0x70/0x70
[ 271.628796] ret_from_fork+0x35/0x40

Revision history for this message
Colin Ian King (colin-king) wrote :

Hi, I was expecting some extra information and a panic message in a different place; are you sure this is running the latest zfs-dkms? You can check by using dmesg | grep ZFS and confirming it is the .2 version of the debug zfs-dkms package.

Revision history for this message
Colin Ian King (colin-king) wrote :

Oh, I see you are using the latest version: v0.7.5-1ubuntu16.10~lp1861235.2

I may need to figure out a .3 version to try next.

Revision history for this message
Colin Ian King (colin-king) wrote :

Thanks for the data for the .2 tests. I've updated the packages with a .3 test build.

I've added a couple more lines of debug to figure out some earlier missing information once we pop the stack. I've updated the package in the PPA; do you mind updating the zfs-dkms to 0.7.5-1ubuntu16.10~lp1861235.3 and re-testing?

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Hello Colin, trying the zfs recv operation on the .3 dkms eventually kills my system dead. There was nothing on the console. The cursor on the console stopped blinking; I couldn't switch VTs. My ssh sessions were hung. ping reported destination host unreachable.

It ran for about two and a half minutes and transferred around 10 datasets or snapshots:
$ journalctl -b -1 | grep -c "COMMAND=/sbin/zfs receive"
10

Sadly, journalctl doesn't have the dmesg from the previous boot:

$ journalctl -b -1 -k
[...]
May 15 17:25:10 wopr kernel: ZFS: Loaded module v0.7.5-1ubuntu16.10~lp1861235.3, ZFS pool version 5000, ZFS filesystem version 5
May 15 17:25:15 wopr kernel: zd32: p1
May 15 17:25:20 wopr kernel: pps pps0: new PPS source ptp0
May 15 17:25:20 wopr kernel: ixgbe 0000:05:00.0: registered PHC device on enp5s0f0
May 15 17:25:20 wopr kernel: IPv6: ADDRCONF(NETDEV_UP): enp5s0f0: link is not ready
May 15 17:25:20 wopr kernel: new mount options do not match the existing superblock, will be ignored
May 15 17:25:21 wopr kernel: Process accounting resumed
May 15 17:25:24 wopr kernel: ixgbe 0000:05:00.0 enp5s0f0: NIC Link is Up 1 Gbps, Flow Control: None
May 15 17:25:24 wopr kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp5s0f0: link becomes ready
May 15 17:25:29 wopr kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
May 15 17:25:29 wopr kernel: NFSD: starting 90-second grace period (net f00000a9)
May 15 17:25:32 wopr kernel: ip6_tables: (C) 2000-2006 Netfilter Core Team
May 15 17:25:32 wopr kernel: Ebtables v2.0 registered
May 15 17:25:33 wopr kernel: bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
May 15 17:25:33 wopr kernel: virbr0: port 1(virbr0-nic) entered blocking state
May 15 17:25:33 wopr kernel: virbr0: port 1(virbr0-nic) entered disabled state
May 15 17:25:33 wopr kernel: device virbr0-nic entered promiscuous mode
May 15 17:25:33 wopr kernel: nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
May 15 17:25:33 wopr kernel: virbr0: port 1(virbr0-nic) entered blocking state
May 15 17:25:33 wopr kernel: virbr0: port 1(virbr0-nic) entered listening state
May 15 17:25:33 wopr kernel: virbr0: port 1(virbr0-nic) entered disabled state

Thanks

Revision history for this message
Seth Arnold (seth-arnold) wrote : Re: [Bug 1861235] Re: zfs recv PANIC at range_tree.c:304:range_tree_find_impl()

On Sat, May 16, 2020 at 01:56:08AM -0000, Seth Arnold wrote:
> Sadly, journalctl doesn't have the dmesg from the previous boot:

I meant to say, journalctl's copy of dmesg from the previous boot doesn't
have the new debug output. Sorry.

Thanks

Revision history for this message
Colin Ian King (colin-king) wrote :

Hi Seth, I've uploaded a .4 - it may add a lot more debug output, but it will allow us to see the object type that is failing before the crash.

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Colin, you're not kidding, there's a lot more debug information in this one. Here's the part that looks important:

[ 289.167264] dmu_object_free: object = 0x8319, dn->dn_type = 196
[ 289.167659] dmu_object_free: object = 0x8318, dn->dn_type = 13
[ 289.167922] dmu_object_free: object = 0x8317, dn->dn_type = 15
[ 289.168197] dmu_object_free: object = 0x8316, dn->dn_type = 12
[ 289.168500] dmu_object_free: object = 0x861b, dn->dn_type = 195
[ 289.351934] dmu_object_free: object = 0x5ef6, dn->dn_type = 50
[ 289.353287] dmu_object_free: object = 0x8553, dn->dn_type = 37
[ 289.353704] dmu_object_free: object = 0x65bb, dn->dn_type = 196
[ 289.917945] dmu_object_free: object = 0x210, dn->dn_type = 19
[ 289.918616] dmu_object_free: object = 0x211, dn->dn_type = 32
[ 289.919025] dnone_free_ramge: nblks = 0, trunc = 1, len = 0xffffffffffffffff, blkshift = 0
[ 289.919442] dnode_free_range: nblks == 0, len == 18446744073709551615
               , off=0
[ 289.919443] range_tree_clear: size == 0
[ 289.920690] range_tree_find_impl: size == 0
[ 289.921123] VERIFY(size != 0) failed
[ 289.921566] PANIC at range_tree.c:319:range_tree_find_impl()
[ 289.922043] Showing stack for process 28938
[ 289.922047] CPU: 15 PID: 28938 Comm: receive_writer Tainted: P OE 4.15.0-101-generic #102-Ubuntu
[ 289.922049] Hardware name: Supermicro SSG-6038R-E1CR16L/X10DRH-iT, BIOS 2.0 12/17/2015
[ 289.922050] Call Trace:
[ 289.922064] dump_stack+0x6d/0x8e
[ 289.922082] spl_dumpstack+0x42/0x50 [spl]
[ 289.922089] spl_panic+0xc8/0x110 [spl]
[ 289.922098] ? console_unlock+0x2a1/0x560
[ 289.922102] ? vprintk_emit+0x104/0x2c0
[ 289.922106] ? vprintk_default+0x29/0x50
[ 289.922111] ? vprintk_func+0x47/0xc0
[ 289.922114] ? printk+0x52/0x6e
[ 289.922201] range_tree_find_impl+0x94/0xa0 [zfs]
[ 289.922205] ? printk+0x52/0x6e
[ 289.922268] range_tree_clear+0x54/0x70 [zfs]
[ 289.922322] dnode_free_range+0x15b/0x720 [zfs]
[ 289.922372] dmu_object_free+0x80/0xd0 [zfs]
[ 289.922420] dmu_free_long_object+0x9f/0xc0 [zfs]
[ 289.922469] receive_freeobjects.isra.12+0x7a/0x100 [zfs]
[ 289.922517] receive_writer_thread+0x6d2/0xa60 [zfs]
[ 289.922521] ? set_curr_task_fair+0x2b/0x60
[ 289.922528] ? spl_kmem_free+0x33/0x40 [spl]
[ 289.922535] ? kfree+0x168/0x180
[ 289.922581] ? receive_free.isra.13+0xc0/0xc0 [zfs]
[ 289.922589] thread_generic_wrapper+0x74/0x90 [spl]
[ 289.922594] kthread+0x121/0x140
[ 289.922600] ? __thread_exit+0x20/0x20 [spl]
[ 289.922603] ? kthread_create_worker_on_cpu+0x70/0x70
[ 289.922610] ret_from_fork+0x35/0x40
[ 484.358156] INFO: task txg_quiesce:3136 blocked for more than 120 seconds.
[ 484.358778] Tainted: P OE 4.15.0-101-generic #102-Ubuntu
[ 484.359320] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 484.359898] txg_quiesce D 0 3136 2 0x80000000
[ 484.359902] Call Trace:
[ 484.359914] __schedule+0x24e/0x880
[ 484.359918] schedule+0x2c/0x80
[ 484.359935] cv_wait_common+0x11e/0x140 [spl]
[ 484.359942] ? wait_woken+0x80/0x80
[ 484.359951] __cv_wait+0x15/0x20 [spl]
[ 484.360056] txg_quiesce_thread+0x2cb/0x3d0 [zfs]
[ 484.3...

Revision history for this message
Colin Ian King (colin-king) wrote :

OK, so object 0x211 looks like the culprit; I just can't see the pool name being dumped - do you know which pool is triggering this issue?

If you can identify the pool, then we can next dump the potentially corrupt object using:

sudo zdb -dddd <pool>/<dataset> 529

Revision history for this message
Seth Arnold (seth-arnold) wrote :

I picked the last dataset given in the command output from an earlier, but not the most recent, comment:

$ sudo zdb -dddd srv/backups/millbarge/rpool/var/log 529
Dataset srv/backups/millbarge/rpool/var/log [ZPL], ID 39694, cr_txg 23197757, 554M, 274 objects, rootbp DVA[0]=<1:1d000d42000:1000> DVA[1]=<2:24416f4b000:1000> [L0 DMU objset] sha256 uncompressed LE contiguous unique double size=800L/800P birth=23437543L/23437543P fill=274 cksum=cf5fcb2f986c768c:7cc8638aeb1a5162:9de9f236f072e16:40c6a08d90b54951

    Object lvl iblk dblk dsize dnsize lsize %full type
       529 0 0 0 0 512 0 -nan DSL permissions (K=UNKNOWN) (Z=UNKNOWN)
                                                 0 bonus UNKNOWN
 dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED SPILL_BLKPTR
 dnode maxblkid: 0
 UNKNOWN OBJECT TYPE
c < (1ULL << 24) >> 9 (0x7fffffffffffff < 0x8000)
ASSERT at ../../module/zfs/zio.c:266:zio_buf_alloc()Aborted

Thanks
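
As an aside on the zdb abort quoted above: the failed comparison looks consistent with the same zero-size pattern, underflowing before the shift. A minimal sketch of that arithmetic (editorial; it assumes zio_buf_alloc indexes its cache as (size - 1) >> 9, which is what the printed expression suggests):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t size = 0;                     /* the corrupt dnode above reports dblk = 0 */
        uint64_t c = (size - 1) >> 9;          /* underflows to 0xffffffffffffffff, then shifts */
        uint64_t limit = (1ULL << 24) >> 9;    /* the bound printed in the assert */
        printf("c = 0x%llx, limit = 0x%llx\n",
               (unsigned long long)c, (unsigned long long)limit);
        return 0;                              /* prints c = 0x7fffffffffffff, limit = 0x8000 */
    }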

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Is it interesting that I'm seeing both 19 and 20 in my dmesg?

[1229822.406130] dmu_object_free: object = 0x266d2904, dn->dn_type = 20
[1229823.980888] dmu_object_free: object = 0x266d0d5a, dn->dn_type = 20
[1229823.994690] dmu_object_free: object = 0x266d0d5b, dn->dn_type = 20
[1229823.998123] dmu_object_free: object = 0x266d00c1, dn->dn_type = 19
[1229823.998191] dmu_object_free: object = 0x266d00c3, dn->dn_type = 19
[1229823.998324] dmu_object_free: object = 0x266d00c5, dn->dn_type = 19
[1229823.998377] dmu_object_free: object = 0x266d00c4, dn->dn_type = 20

Thanks

Revision history for this message
Andrea Righi (arighi) wrote :

Hi @seth-arnold,

adding some details to this bug:

 dn->dn_type = 19 = DMU_OT_PLAIN_FILE_CONTENTS
 dn->dn_type = 20 = DMU_OT_DIRECTORY_CONTENTS

So they look correct, they seem to be just plain files and directories.

However, the specific object that is causing the panic seems to have dn_type=32, which is DMU_OT_DSL_PERMS (that's a bit odd; I don't see any other object types like this in your dmesg). Moreover, the output of "zdb -dddd" shows a lot of UNKNOWN stuff (the checksum and compression type, for example, are UNKNOWN).

So, it seems that we may have some corrupted data here.

Do you see any error if you try to run a simple `zpool status -v`? Do you know exactly which pool / volume / filesystem is causing this problem?
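
For reference, a small sketch mapping the object-type numbers discussed in this thread to their names (editorial; 19, 20 and 32 are the values quoted above, and the names follow the dmu_object_type_t enumeration in dmu.h):

    #include <stdio.h>

    /* Maps the dn_type values seen in the debug output to DMU object type names. */
    static const char *dn_type_name(int t)
    {
        switch (t) {
        case 19: return "DMU_OT_PLAIN_FILE_CONTENTS";   /* plain file */
        case 20: return "DMU_OT_DIRECTORY_CONTENTS";    /* directory */
        case 32: return "DMU_OT_DSL_PERMS";             /* DSL permissions */
        default: return "other (see dmu_object_type_t in dmu.h)";
        }
    }

    int main(void)
    {
        int seen[] = { 19, 20, 32 };
        for (size_t i = 0; i < sizeof(seen) / sizeof(seen[0]); i++)
            printf("dn_type %d = %s\n", seen[i], dn_type_name(seen[i]));
        return 0;
    }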

Revision history for this message
Andrea Righi (arighi) wrote :

I just noticed the bug on GitHub that also mentions potential data corruption. It looks like we need to recompile ZFS with --enable-debug. I'll try to produce a debug package with that.

Revision history for this message
Andrea Righi (arighi) wrote :

I've uploaded the latest version of zfs with debugging enabled (--enable-debug) to this PPA:
https://launchpad.net/~arighi/+archive/ubuntu/zfs-linux

sudo add-apt-repository ppa:arighi/zfs-linux
sudo apt-get update
sudo apt-get install zfs-dkms

It'd be interesting to repeat the test using this version. Make sure the loaded zfs module matches the right version:

$ modinfo zfs | grep ^version
version: 0.7.5-1ubuntu16.10+lp1861235

Moreover, if the data got corrupted, it'd be interesting to know why... can you provide some smartctl information? Did you notice any I/O errors in your logs? Thanks!

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Hello Andrea, thanks for looking into my problem.

# zpool status -v
  pool: fst
 state: ONLINE
  scan: scrub repaired 0B in 2h37m with 0 errors on Sun Nov 8 03:01:53 2020
config:

 NAME STATE READ WRITE CKSUM
 fst ONLINE 0 0 0
   raidz1-0 ONLINE 0 0 0
     wwn-0x5002538e4095da39 ONLINE 0 0 0
     wwn-0x5002538e4095bdd6 ONLINE 0 0 0
     wwn-0x5002538e4093c6fd ONLINE 0 0 0
     wwn-0x5002538e4095da30 ONLINE 0 0 0

errors: No known data errors

  pool: srv
 state: ONLINE
  scan: scrub repaired 0B in 3h44m with 0 errors on Sun Nov 8 04:08:16 2020
config:

 NAME STATE READ WRITE CKSUM
 srv ONLINE 0 0 0
   mirror-0 ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P8KTWJYY ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G620TW ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G66E2U ONLINE 0 0 0
   mirror-1 ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G69TKU ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G69TXU ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G69U2U ONLINE 0 0 0
   mirror-2 ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G6EBUU ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G6ESAU ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G6G70U ONLINE 0 0 0
 logs
   nvme0n1p1 ONLINE 0 0 0
 cache
   nvme0n1p2 ONLINE 0 0 0

errors: No known data errors

I'll swap over to your dkms build and report back.

Thanks

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Picking the previous bad object (which I'm pretty sure should still be intact):

$ modinfo zfs | grep ^version
version: 0.7.5-1ubuntu16.10+lp1861235
$ uname -a
Linux wopr 4.15.0-126-generic #129-Ubuntu SMP Mon Nov 23 18:53:38 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ sudo zdb -dddd srv/backups/millbarge/rpool/var/log 529
Dataset srv/backups/millbarge/rpool/var/log [ZPL], ID 39694, cr_txg 23197757, 554M, 274 objects, rootbp DVA[0]=<1:1d000d42000:1000> DVA[1]=<2:24416f4b000:1000> [L0 DMU objset] sha256 uncompressed LE contiguous unique double size=800L/800P birth=23437543L/23437543P fill=274 cksum=cf5fcb2f986c768c:7cc8638aeb1a5162:9de9f236f072e16:40c6a08d90b54951

    Object lvl iblk dblk dsize dnsize lsize %full type
dn->dn_nblkptr >= 1 (0x0 >= 0x1)
ASSERT at ../../module/zfs/dnode.c:278:dnode_verify()Aborted
$ dmesg | tail
[ 43.017302] Ebtables v2.0 registered
[ 44.977437] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
[ 47.047114] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ 47.047966] virbr0: port 1(virbr0-nic) entered blocking state
[ 47.047968] virbr0: port 1(virbr0-nic) entered disabled state
[ 47.048063] device virbr0-nic entered promiscuous mode
[ 47.433533] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[ 47.971009] virbr0: port 1(virbr0-nic) entered blocking state
[ 47.971013] virbr0: port 1(virbr0-nic) entered listening state
[ 48.321853] virbr0: port 1(virbr0-nic) entered disabled state
$

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Here's smartctl output on the disks in the pool. It's a bit hard to summarize nine drives of smartctl output but nothing stood out as interesting to me.

My journals only go back a few months, so they wouldn't show any I/O errors from the time this problem was introduced, but this currently gives no results:

journalctl | grep -i "i/o error"

so there's nothing recent.

I started a new scrub; it'll finish in a few hours. If I forget to post results, please do remind me. :)

Thanks

Revision history for this message
Seth Arnold (seth-arnold) wrote :

The scrub has finished: no new errors in zpool status -v, and dmesg is filled with debugging messages. I hope something in there is useful. :)

Thanks

$ uname -a
Linux wopr 4.15.0-126-generic #129-Ubuntu SMP Mon Nov 23 18:53:38 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ modinfo zfs | grep -i version
version: 0.7.5-1ubuntu16.10+lp1861235
srcversion: 62D1F1B1A1EE81EB3AA8ABA
$ zpool status -v
  pool: fst
 state: ONLINE
  scan: scrub repaired 0B in 2h37m with 0 errors on Sun Nov 8 03:01:53 2020
config:

 NAME STATE READ WRITE CKSUM
 fst ONLINE 0 0 0
   raidz1-0 ONLINE 0 0 0
     wwn-0x5002538e4095da39 ONLINE 0 0 0
     wwn-0x5002538e4095bdd6 ONLINE 0 0 0
     wwn-0x5002538e4093c6fd ONLINE 0 0 0
     wwn-0x5002538e4095da30 ONLINE 0 0 0

errors: No known data errors

  pool: srv
 state: ONLINE
  scan: scrub repaired 0B in 4h12m with 0 errors on Wed Dec 2 18:53:44 2020
config:

 NAME STATE READ WRITE CKSUM
 srv ONLINE 0 0 0
   mirror-0 ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P8KTWJYY ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G620TW ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G66E2U ONLINE 0 0 0
   mirror-1 ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G69TKU ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G69TXU ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G69U2U ONLINE 0 0 0
   mirror-2 ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G6EBUU ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G6ESAU ONLINE 0 0 0
     ata-HGST_HUS724030ALA640_PN2234P9G6G70U ONLINE 0 0 0
 logs
   nvme0n1p1 ONLINE 0 0 0
 cache
   nvme0n1p2 ONLINE 0 0 0

errors: No known data errors

Revision history for this message
Seth Arnold (seth-arnold) wrote :

My guess is the next step is trying the zfs send | ssh zfs recv commands that were causing the VERIFY(size != 0) failed message with traces. I've always been uneasy with the 'main' error coming only with a recv operation; relying upon two systems to recreate a bug is iffy at best. And since it wedges one of them, I don't like doing it just for fun, but I'd also like to get to the bottom of this.

Is there anything else to try first?

Thanks

Revision history for this message
Andrea Righi (arighi) wrote :

This time I don't see any errors or panics in `journalctl -b`, and apparently everything is fine according to smartctl, so it looks like we are not getting any obvious I/O errors at the moment.

IIUC you have already done a `zpool scrub` and that also didn't report any errors, so apparently the ZFS data integrity checks passed locally.

If that's the case, yes, I think the next step would be to try the `zfs send | zfs recv` commands and see if we get more information with ZFS debugging enabled. Alternatively we could try to stress the local zpool more, maybe running some I/O stress test (fio, for example) on some volumes / filesystems created from the "potentially faulty" zpool. Apart from that, I don't see any other useful tests that we can do...

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in zfs-linux (Ubuntu Bionic):
status: New → Confirmed
Changed in zfs-linux (Ubuntu):
assignee: Colin Ian King (colin-king) → Dimitri John Ledkov (xnox)
Changed in linux:
status: New → Fix Released
Changed in zfs-linux (Ubuntu):
assignee: Dimitri John Ledkov (xnox) → nobody
Changed in zfs-linux (Ubuntu):
importance: High → Undecided