Comment 27 for bug 1654517

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Colin,

I'm afraid this is an ongoing issue. I have faced this in 4.4 and 4.13 kernels:

[ 240.568146] INFO: task txg_sync:2065 blocked for more than 120 seconds.
[ 240.568218] Tainted: P O 4.4.0-116-generic #140-Ubuntu
[ 240.568268] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.568290] txg_sync D ffff880739e63aa8 0 2065 2 0x00000000
[ 240.568294] ffff880739e63aa8 ffff880739e63a88 ffff88081af75400 ffff8800ac635400
[ 240.568296] ffff880739e64000 ffff88083ed972c0 7fffffffffffffff ffff88009fffc968
[ 240.568297] 0000000000000001 ffff880739e63ac0 ffffffff8184ae45 0000000000000000
[ 240.568300] Call Trace:
[ 240.568305] [<ffffffff8184ae45>] schedule+0x35/0x80
[ 240.568307] [<ffffffff8184df98>] schedule_timeout+0x1b8/0x270
[ 240.568310] [<ffffffff810aec12>] ? default_wake_function+0x12/0x20
[ 240.568313] [<ffffffff810c6318>] ? __wake_up_common+0x58/0x90
[ 240.568315] [<ffffffff810f8cce>] ? ktime_get+0x3e/0xb0
[ 240.568317] [<ffffffff8184a5d4>] io_schedule_timeout+0xa4/0x110
[ 240.568325] [<ffffffffc0395c2c>] cv_wait_common+0xbc/0x140 [spl]
[ 240.568327] [<ffffffff810c69d0>] ? wake_atomic_t_function+0x60/0x60
[ 240.568332] [<ffffffffc0395d08>] __cv_wait_io+0x18/0x20 [spl]
[ 240.568373] [<ffffffffc0957634>] zio_wait+0x114/0x200 [zfs]
[ 240.568397] [<ffffffffc08e0d88>] dsl_pool_sync+0xb8/0x430 [zfs]
[ 240.568424] [<ffffffffc08fc7b6>] spa_sync+0x366/0xb30 [zfs]
[ 240.568426] [<ffffffff810aec12>] ? default_wake_function+0x12/0x20
[ 240.568453] [<ffffffffc090dc4a>] txg_sync_thread+0x3ba/0x630 [zfs]
[ 240.568481] [<ffffffffc090d890>] ? txg_delay+0x180/0x180 [zfs]
[ 240.568486] [<ffffffffc0390dc0>] ? __thread_exit+0x20/0x20 [spl]
[ 240.568490] [<ffffffffc0390e33>] thread_generic_wrapper+0x73/0x80 [spl]
[ 240.568492] [<ffffffff810a30f7>] kthread+0xe7/0x100
[ 240.568494] [<ffffffff810a3010>] ? kthread_create_on_node+0x1e0/0x1e0
[ 240.568495] [<ffffffff8184f422>] ret_from_fork+0x42/0x80
[ 240.568497] [<ffffffff810a3010>] ? kthread_create_on_node+0x1e0/0x1e0
[ 240.568505] INFO: task lxd:2562 blocked for more than 120 seconds.
[ 240.568524] Tainted: P O 4.4.0-116-generic #140-Ubuntu
[ 240.568543] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.568565] lxd D ffff880734773ba0 0 2562 1 0x00000000
[ 240.568567] ffff880734773ba0 0000000000000246 ffff88081af71c00 ffff88080d18f000
[ 240.568569] ffff880734774000 ffff8808144dba20 ffff8808144dba48 ffff8808144dbae0
[ 240.568571] 0000000000000000 ffff880734773bb8 ffffffff8184ae45 ffff8808144dbad8
[ 240.568573] Call Trace:
[ 240.568575] [<ffffffff8184ae45>] schedule+0x35/0x80
[ 240.568579] [<ffffffffc0395c7b>] cv_wait_common+0x10b/0x140 [spl]
[ 240.568581] [<ffffffff810c69d0>] ? wake_atomic_t_function+0x60/0x60
[ 240.568586] [<ffffffffc0395cc5>] __cv_wait+0x15/0x20 [spl]
[ 240.568614] [<ffffffffc090d4b5>] txg_wait_synced+0xe5/0x130 [zfs]
[ 240.568643] [<ffffffffc0952e4b>] zil_replay+0xdb/0x120 [zfs]
[ 240.568673] [<ffffffffc093fcef>] zfs_sb_setup+0x16f/0x180 [zfs]
[ 240.568702] [<ffffffffc0940b38>] zfs_domount+0x2c8/0x350 [zfs]
[ 240.568731] [<ffffffffc095ecf0>] ? zpl_kill_sb+0x20/0x20 [zfs]
[ 240.568761] [<ffffffffc095ed1c>] zpl_fill_super+0x2c/0x40 [zfs]
[ 240.568763] [<ffffffff8121714f>] mount_nodev+0x4f/0xa0
[ 240.568792] [<ffffffffc095f18f>] zpl_mount+0x4f/0x80 [zfs]
[ 240.568794] [<ffffffff81217e4d>] mount_fs+0x3d/0x170
[ 240.568796] [<ffffffff81234647>] vfs_kern_mount+0x67/0x110
[ 240.568798] [<ffffffff81236cff>] do_mount+0x25f/0xda0
[ 240.568801] [<ffffffff81237b7f>] SyS_mount+0x9f/0x100
[ 240.568803] [<ffffffff8184efc8>] entry_SYSCALL_64_fastpath+0x1c/0xbb
[ 240.568810] INFO: task sync:7760 blocked for more than 120 seconds.
[ 240.568828] Tainted: P O 4.4.0-116-generic #140-Ubuntu
[ 240.568848] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

And I am still facing this with latest kernel and latest SPL and ZFS code

inaddy@workstation:~$ sudo cat /proc/2832/stack
[<0>] __cv_timedwait_common+0x12c/0x170 [spl]
[<0>] __cv_timedwait_io+0x19/0x20 [spl]
[<0>] zio_wait+0x119/0x270 [zfs]
[<0>] dsl_pool_sync+0xb8/0x420 [zfs]
[<0>] spa_sync+0x43e/0xd30 [zfs]
[<0>] txg_sync_thread+0x271/0x440 [zfs]
[<0>] thread_generic_wrapper+0x74/0x90 [spl]
[<0>] kthread+0x121/0x140
[<0>] ret_from_fork+0x22/0x40
[<0>] 0xffffffffffffffff

Linux workstation 4.16.0-rc7 #1 SMP Wed Mar 28 09:21:33 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

ii kmod-spl-4.16.0-rc7 0.7.0-31 amd64 spl kernel module(s) for 4.16.0-rc7
ii kmod-spl-devel 0.7.0-31 amd64 spl kernel module(s) devel common
ii kmod-spl-devel-4.16.0-rc7 0.7.0-31 amd64 spl kernel module(s) devel for 4.16.0-rc7

ii kmod-zfs-4.16.0-rc7 0.7.0-390 amd64 zfs kernel module(s) for 4.16.0-rc7
ii kmod-zfs-devel 0.7.0-390 amd64 zfs kernel module(s) devel common
ii kmod-zfs-devel-4.16.0-rc7 0.7.0-390 amd64 zfs kernel module(s) devel for 4.16.0-rc7
ii libzfs2 0.7.0-390 amd64 Native ZFS filesystem library for Linux
ii libzfs2-devel 0.7.0-390 amd64 Development headers
ii zfs 0.7.0-390 amd64 Commands to control the kernel modules and libraries

I could map upstream bugs:

https://github.com/zfsonlinux/zfs/issues/5867
https://github.com/zfsonlinux/zfs/pull/6133
https://github.com/zfsonlinux/zfs/issues/6127

And they all look, somehow, related to the issue according to the stack trace. Bug 5867, the closest, indicates there are 2 or 3 races that would have been fixed by some commits already included in the upstream version I'm testing.

I'll try to get a kdump from latest upstream kernel + latest upstream zfs/spl modules to check if I can visualize what other task is racing with "cv_wait_common" in all hung tasks.

With that, I'm re-opening this bug and will try to provide better feedback soon.