Colin,
I'm afraid this is an ongoing issue. I have faced this in 4.4 and 4.13 kernels:
[ 240.568146] INFO: task txg_sync:2065 blocked for more than 120 seconds.
[ 240.568218] Tainted: P O 4.4.0-116-generic #140-Ubuntu
[ 240.568268] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.568290] txg_sync D ffff880739e63aa8 0 2065 2 0x00000000
[ 240.568294] ffff880739e63aa8 ffff880739e63a88 ffff88081af75400 ffff8800ac635400
[ 240.568296] ffff880739e64000 ffff88083ed972c0 7fffffffffffffff ffff88009fffc968
[ 240.568297] 0000000000000001 ffff880739e63ac0 ffffffff8184ae45 0000000000000000
[ 240.568300] Call Trace:
[ 240.568305] [<ffffffff8184ae45>] schedule+0x35/0x80
[ 240.568307] [<ffffffff8184df98>] schedule_timeout+0x1b8/0x270
[ 240.568310] [<ffffffff810aec12>] ? default_wake_function+0x12/0x20
[ 240.568313] [<ffffffff810c6318>] ? __wake_up_common+0x58/0x90
[ 240.568315] [<ffffffff810f8cce>] ? ktime_get+0x3e/0xb0
[ 240.568317] [<ffffffff8184a5d4>] io_schedule_timeout+0xa4/0x110
[ 240.568325] [<ffffffffc0395c2c>] cv_wait_common+0xbc/0x140 [spl]
[ 240.568327] [<ffffffff810c69d0>] ? wake_atomic_t_function+0x60/0x60
[ 240.568332] [<ffffffffc0395d08>] __cv_wait_io+0x18/0x20 [spl]
[ 240.568373] [<ffffffffc0957634>] zio_wait+0x114/0x200 [zfs]
[ 240.568397] [<ffffffffc08e0d88>] dsl_pool_sync+0xb8/0x430 [zfs]
[ 240.568424] [<ffffffffc08fc7b6>] spa_sync+0x366/0xb30 [zfs]
[ 240.568426] [<ffffffff810aec12>] ? default_wake_function+0x12/0x20
[ 240.568453] [<ffffffffc090dc4a>] txg_sync_thread+0x3ba/0x630 [zfs]
[ 240.568481] [<ffffffffc090d890>] ? txg_delay+0x180/0x180 [zfs]
[ 240.568486] [<ffffffffc0390dc0>] ? __thread_exit+0x20/0x20 [spl]
[ 240.568490] [<ffffffffc0390e33>] thread_generic_wrapper+0x73/0x80 [spl]
[ 240.568492] [<ffffffff810a30f7>] kthread+0xe7/0x100
[ 240.568494] [<ffffffff810a3010>] ? kthread_create_on_node+0x1e0/0x1e0
[ 240.568495] [<ffffffff8184f422>] ret_from_fork+0x42/0x80
[ 240.568497] [<ffffffff810a3010>] ? kthread_create_on_node+0x1e0/0x1e0
[ 240.568505] INFO: task lxd:2562 blocked for more than 120 seconds.
[ 240.568524] Tainted: P O 4.4.0-116-generic #140-Ubuntu
[ 240.568543] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.568565] lxd D ffff880734773ba0 0 2562 1 0x00000000
[ 240.568567] ffff880734773ba0 0000000000000246 ffff88081af71c00 ffff88080d18f000
[ 240.568569] ffff880734774000 ffff8808144dba20 ffff8808144dba48 ffff8808144dbae0
[ 240.568571] 0000000000000000 ffff880734773bb8 ffffffff8184ae45 ffff8808144dbad8
[ 240.568573] Call Trace:
[ 240.568575] [<ffffffff8184ae45>] schedule+0x35/0x80
[ 240.568579] [<ffffffffc0395c7b>] cv_wait_common+0x10b/0x140 [spl]
[ 240.568581] [<ffffffff810c69d0>] ? wake_atomic_t_function+0x60/0x60
[ 240.568586] [<ffffffffc0395cc5>] __cv_wait+0x15/0x20 [spl]
[ 240.568614] [<ffffffffc090d4b5>] txg_wait_synced+0xe5/0x130 [zfs]
[ 240.568643] [<ffffffffc0952e4b>] zil_replay+0xdb/0x120 [zfs]
[ 240.568673] [<ffffffffc093fcef>] zfs_sb_setup+0x16f/0x180 [zfs]
[ 240.568702] [<ffffffffc0940b38>] zfs_domount+0x2c8/0x350 [zfs]
[ 240.568731] [<ffffffffc095ecf0>] ? zpl_kill_sb+0x20/0x20 [zfs]
[ 240.568761] [<ffffffffc095ed1c>] zpl_fill_super+0x2c/0x40 [zfs]
[ 240.568763] [<ffffffff8121714f>] mount_nodev+0x4f/0xa0
[ 240.568792] [<ffffffffc095f18f>] zpl_mount+0x4f/0x80 [zfs]
[ 240.568794] [<ffffffff81217e4d>] mount_fs+0x3d/0x170
[ 240.568796] [<ffffffff81234647>] vfs_kern_mount+0x67/0x110
[ 240.568798] [<ffffffff81236cff>] do_mount+0x25f/0xda0
[ 240.568801] [<ffffffff81237b7f>] SyS_mount+0x9f/0x100
[ 240.568803] [<ffffffff8184efc8>] entry_SYSCALL_64_fastpath+0x1c/0xbb
[ 240.568810] INFO: task sync:7760 blocked for more than 120 seconds.
[ 240.568828] Tainted: P O 4.4.0-116-generic #140-Ubuntu
[ 240.568848] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
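To compare the hung tasks side by side, the log above can be reduced to one symbol list per task. The following is only a rough sketch, not a tool used in this report: the function names (`parse_hung_tasks`, `tasks_waiting_in`) and the regular expressions are my own assumptions about the 4.4-era dmesg format shown here, and frames marked with "?" are dropped because the kernel flags them as unreliable. The embedded sample is an abbreviated excerpt of the log above.

```python
import re

# "INFO: task txg_sync:2065 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# "[<ffffffffc0395c2c>] cv_wait_common+0xbc/0x140 [spl]"
# A leading "? " marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?(?P<sym>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+)\])?"
)

# Abbreviated excerpt of the dmesg output quoted in this comment.
SAMPLE = """\
[ 240.568146] INFO: task txg_sync:2065 blocked for more than 120 seconds.
[ 240.568300] Call Trace:
[ 240.568305] [<ffffffff8184ae45>] schedule+0x35/0x80
[ 240.568313] [<ffffffff810c6318>] ? __wake_up_common+0x58/0x90
[ 240.568325] [<ffffffffc0395c2c>] cv_wait_common+0xbc/0x140 [spl]
[ 240.568505] INFO: task lxd:2562 blocked for more than 120 seconds.
[ 240.568575] [<ffffffff8184ae45>] schedule+0x35/0x80
[ 240.568579] [<ffffffffc0395c7b>] cv_wait_common+0x10b/0x140 [spl]
[ 240.568810] INFO: task sync:7760 blocked for more than 120 seconds.
"""


def parse_hung_tasks(log_text):
    """Return one dict per hung-task report: comm, pid, reliable frames."""
    tasks = []
    current = None
    for line in log_text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            current = {
                "comm": header["comm"],
                "pid": int(header["pid"]),
                "frames": [],
            }
            tasks.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None and not frame["unreliable"]:
            # Store (symbol, module); module is None for core kernel code.
            current["frames"].append((frame["sym"], frame["mod"]))
    return tasks


def tasks_waiting_in(tasks, symbol):
    """Filter tasks whose reliable frames include the given symbol."""
    return [t for t in tasks if any(sym == symbol for sym, _ in t["frames"])]


if __name__ == "__main__":
    for task in tasks_waiting_in(parse_hung_tasks(SAMPLE), "cv_wait_common"):
        print(task["comm"], task["pid"], [sym for sym, _ in task["frames"]])
```

Feeding the full `dmesg` output to `parse_hung_tasks` instead of `SAMPLE` should list every reported task that sits in `cv_wait_common`, which makes it easier to see which callers (`zio_wait`, `txg_wait_synced`, ...) are involved.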
And I am still facing this with the latest kernel and the latest SPL and ZFS code:
inaddy@workstation:~$ sudo cat /proc/2832/stack
[<0>] __cv_timedwait_common+0x12c/0x170 [spl]
[<0>] __cv_timedwait_io+0x19/0x20 [spl]
[<0>] zio_wait+0x119/0x270 [zfs]
[<0>] dsl_pool_sync+0xb8/0x420 [zfs]
[<0>] spa_sync+0x43e/0xd30 [zfs]
[<0>] txg_sync_thread+0x271/0x440 [zfs]
[<0>] thread_generic_wrapper+0x74/0x90 [spl]
[<0>] kthread+0x121/0x140
[<0>] ret_from_fork+0x22/0x40
[<0>] 0xffffffffffffffff
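Instead of reading one PID at a time, the same information can be collected for every task in uninterruptible sleep (state D). This is a minimal sketch under a few assumptions: the helper names (`read_task_state`, `blocked_task_stacks`) are hypothetical, it needs root and a kernel that exposes `/proc/<pid>/stack`, and it only walks top-level `/proc/<pid>` entries, not per-thread entries under `/proc/<pid>/task/`. The `proc_root` parameter exists only so the function can be pointed at a copy of `/proc` for testing.

```python
import os


def read_task_state(proc_root, pid):
    """Return (comm, state) parsed from <proc_root>/<pid>/stat."""
    with open(os.path.join(proc_root, pid, "stat")) as fh:
        data = fh.read()
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    start, end = data.index("("), data.rindex(")")
    comm = data[start + 1:end]
    state = data[end + 1:].split()[0]
    return comm, state


def blocked_task_stacks(proc_root="/proc"):
    """Map pid -> (comm, stack lines) for every task in state D."""
    results = {}
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue
        try:
            comm, state = read_task_state(proc_root, pid)
            if state != "D":
                continue
            with open(os.path.join(proc_root, pid, "stack")) as fh:
                stack = fh.read().splitlines()
        except (OSError, ValueError, IndexError):
            # Task exited in the meantime, stat was malformed, or the
            # stack file is not readable (for example, not running as root).
            continue
        results[int(pid)] = (comm, stack)
    return results
```

Run as root, iterating over `blocked_task_stacks().items()` and printing each entry gives one stack per blocked task, which can then be compared against the `txg_sync` stack above.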
Linux workstation 4.16.0-rc7 #1 SMP Wed Mar 28 09:21:33 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
ii kmod-spl-4.16.0-rc7 0.7.0-31 amd64 spl kernel module(s) for 4.16.0-rc7
ii kmod-spl-devel 0.7.0-31 amd64 spl kernel module(s) devel common
ii kmod-spl-devel-4.16.0-rc7 0.7.0-31 amd64 spl kernel module(s) devel for 4.16.0-rc7
ii kmod-zfs-4.16.0-rc7 0.7.0-390 amd64 zfs kernel module(s) for 4.16.0-rc7
ii kmod-zfs-devel 0.7.0-390 amd64 zfs kernel module(s) devel common
ii kmod-zfs-devel-4.16.0-rc7 0.7.0-390 amd64 zfs kernel module(s) devel for 4.16.0-rc7
ii libzfs2 0.7.0-390 amd64 Native ZFS filesystem library for Linux
ii libzfs2-devel 0.7.0-390 amd64 Development headers
ii zfs 0.7.0-390 amd64 Commands to control the kernel modules and libraries
I could map upstream bugs:
https://github.com/zfsonlinux/zfs/issues/5867
https://github.com/zfsonlinux/zfs/pull/6133
https://github.com/zfsonlinux/zfs/issues/6127
They all look related to this issue, judging by the stack traces. Issue 5867, the closest match, indicates there are 2 or 3 races that should have been fixed by commits already included in the upstream version I'm testing.
I'll try to get a kdump from the latest upstream kernel + latest upstream zfs/spl modules to check whether I can see which other task is racing with "cv_wait_common" in all the hung tasks.
With that, I'm re-opening this bug and will try to provide better feedback soon.