Task (usually mongod) blocked more 120 seconds (lock-ups) in juju on lxc/lxd + zfs

Bug #1607355 reported by Rafael David Tinoco
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Expired
Medium
Unassigned

Bug Description

I was able to reproduce this 2 or 3 times last 2 days. I have the following setup:

Containers for Trusty/kilo service machines:

inaddy@workstation:~$ lxc-ls | grep tk
tkcephmon01 RUNNING 0 - 192.168.65.52 -
tkcephmon02 RUNNING 0 - 192.168.65.51 -
tkcephmon03 RUNNING 0 - 192.168.65.48 -
tkcinder RUNNING 0 - 192.168.65.49 -
tkdash RUNNING 0 - 192.168.65.50 -
tkglance RUNNING 0 - 192.168.65.53 -
tkjuju RUNNING 0 - 192.168.65.15 -
tkkeystone RUNNING 0 - 192.168.65.54 -
tkmysql RUNNING 0 - 192.168.65.55 -
tknova RUNNING 0 - 192.168.65.56 -
tkrabbit RUNNING 0 - 192.168.65.57 -
tkswiftproxy RUNNING 0 - 192.168.65.58 -

And compute nodes + neutrongw as kvm guests:

inaddy@workstation:~$ virsh list --all | grep tk
 21 tkcompute01 running
 22 tkcompute02 running
 23 tkcompute03 running
 24 tkneutrongw running

All my LXC containers are on top of ZFS:

Linux workstation 4.4.0-32-generic #51-Ubuntu SMP Tue Jul 19 18:09:07 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

And my KVM guests are on top of ext4 + 1.2 raid0 stripped volume.

I'm getting the lockups bellow (usually for mongod, from tkjuju container, the juju controller). After the first lockup appears (schedule timeout coming from zfs sync logic most likely), JuJu controller starts giving me errors on "update-status". From "juju status":

glance/0 error idle 1.25.6 10 9292/tcp tkglance hook failed: "update-status"
keystone/0 error idle 1.25.6 11 tkkeystone hook failed: "update-status"
mysql/0 error idle 1.25.6 12 tkmysql hook failed: "config-changed"
neutron-api/0 error idle 1.25.6 4 9696/tcp tkneutrongw hook failed: "update-status"
nova-compute/0 error idle 1.25.6 1 tkcompute01 hook failed: "update-status"
nova-compute/1 error idle 1.25.6 2 tkcompute02 hook failed: "update-status"
nova-compute/2 error idle 1.25.6 3 tkcompute03 hook failed: "update-status"

Lockups:

[105601.816578] INFO: task mongod:14480 blocked for more than 120 seconds.
[105601.816583] Tainted: P O 4.4.0-32-generic #51-Ubuntu
[105601.816584] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[105601.816586] mongod D ffff88010ec47ba8 0 14480 16855 0x00000100
[105601.816590] ffff88010ec47ba8 0000000057992eeb ffff880108e5ee00 ffff880108e58dc0
[105601.816592] ffff88010ec48000 ffff88081ecd6d00 7fffffffffffffff ffffffff8182a600
[105601.816594] ffff88010ec47d08 ffff88010ec47bc0 ffffffff81829e05 0000000000000000
[105601.816596] Call Trace:
[105601.816603] [<ffffffff8182a600>] ? bit_wait+0x60/0x60
[105601.816606] [<ffffffff81829e05>] schedule+0x35/0x80
[105601.816608] [<ffffffff8182cf25>] schedule_timeout+0x1b5/0x270
[105601.816612] [<ffffffff8118d939>] ? find_get_pages_tag+0x109/0x190
[105601.816614] [<ffffffff8182a600>] ? bit_wait+0x60/0x60
[105601.816616] [<ffffffff81829334>] io_schedule_timeout+0xa4/0x110
[105601.816618] [<ffffffff8182a61b>] bit_wait_io+0x1b/0x70
[105601.816620] [<ffffffff8182a1ad>] __wait_on_bit+0x5d/0x90
[105601.816622] [<ffffffff8118d04b>] wait_on_page_bit+0xcb/0xf0
[105601.816625] [<ffffffff810c3ce0>] ? autoremove_wake_function+0x40/0x40
[105601.816628] [<ffffffff8118d163>] __filemap_fdatawait_range+0xf3/0x160
[105601.816630] [<ffffffff8118d1e4>] filemap_fdatawait_range+0x14/0x30
[105601.816632] [<ffffffff8118f0df>] filemap_write_and_wait_range+0x3f/0x70
[105601.816682] [<ffffffffc0902ea8>] zpl_fsync+0x38/0x90 [zfs]
[105601.816685] [<ffffffff8124116b>] vfs_fsync_range+0x4b/0xb0
[105601.816687] [<ffffffff811cabee>] SyS_msync+0x17e/0x1f0
[105601.816689] [<ffffffff8182def2>] entry_SYSCALL_64_fastpath+0x16/0x71
[117121.961545] INFO: task txg_sync:4589 blocked for more than 120 seconds.
[117121.961549] Tainted: P O 4.4.0-32-generic #51-Ubuntu
[117121.961550] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[117121.961551] txg_sync D ffff8807e1fbbaa8 0 4589 2 0x00000000
[117121.961558] ffff8807e1fbbaa8 ffff88081ed96d00 ffff8807faf03700 ffff8807e2dbb700
[117121.961560] ffff8807e1fbc000 ffff88081ed16d00 7fffffffffffffff ffff88017a8f2208
[117121.961561] 0000000000000001 ffff8807e1fbbac0 ffffffff81829e05 0000000000000000
[117121.961563] Call Trace:
[117121.961569] [<ffffffff81829e05>] schedule+0x35/0x80
[117121.961571] [<ffffffff8182cf25>] schedule_timeout+0x1b5/0x270
[117121.961574] [<ffffffff810ac0c2>] ? default_wake_function+0x12/0x20
[117121.961576] [<ffffffff810c35e2>] ? __wake_up_common+0x52/0x90
[117121.961578] [<ffffffff81829334>] io_schedule_timeout+0xa4/0x110
[117121.961586] [<ffffffffc0759bec>] cv_wait_common+0xbc/0x140 [spl]
[117121.961589] [<ffffffff810c3ca0>] ? wake_atomic_t_function+0x60/0x60
[117121.961593] [<ffffffffc0759cc8>] __cv_wait_io+0x18/0x20 [spl]
[117121.961633] [<ffffffffc08fd2fe>] zio_wait+0x10e/0x1f0 [zfs]
[117121.961653] [<ffffffffc0886c58>] dsl_pool_sync+0xb8/0x430 [zfs]
[117121.961676] [<ffffffffc08a25b6>] spa_sync+0x366/0xb30 [zfs]
[117121.961677] [<ffffffff810ac0c2>] ? default_wake_function+0x12/0x20
[117121.961701] [<ffffffffc08b3a4a>] txg_sync_thread+0x3ba/0x630 [zfs]
[117121.961725] [<ffffffffc08b3690>] ? txg_delay+0x180/0x180 [zfs]
[117121.961729] [<ffffffffc0754e31>] thread_generic_wrapper+0x71/0x80 [spl]
[117121.961732] [<ffffffffc0754dc0>] ? __thread_exit+0x20/0x20 [spl]
[117121.961734] [<ffffffff810a0808>] kthread+0xd8/0xf0
[117121.961736] [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
[117121.961737] [<ffffffff8182e28f>] ret_from_fork+0x3f/0x70
[117121.961739] [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
[117121.961831] INFO: task mongod:14480 blocked for more than 120 seconds.
[117121.961832] Tainted: P O 4.4.0-32-generic #51-Ubuntu
[117121.961833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[117121.961834] mongod D ffff88010ec47ba8 0 14480 16855 0x00000100
[117121.961836] ffff88010ec47ba8 0000000057995bf6 ffff8807faf01b80 ffff880108e58dc0
[117121.961837] ffff88010ec48000 ffff88081ec96d00 7fffffffffffffff ffffffff8182a600
[117121.961839] ffff88010ec47d08 ffff88010ec47bc0 ffffffff81829e05 0000000000000000
[117121.961841] Call Trace:
[117121.961843] [<ffffffff8182a600>] ? bit_wait+0x60/0x60
[117121.961844] [<ffffffff81829e05>] schedule+0x35/0x80
[117121.961846] [<ffffffff8182cf25>] schedule_timeout+0x1b5/0x270
[117121.961849] [<ffffffff8118d939>] ? find_get_pages_tag+0x109/0x190
[117121.961851] [<ffffffff8182a600>] ? bit_wait+0x60/0x60
[117121.961852] [<ffffffff81829334>] io_schedule_timeout+0xa4/0x110
[117121.961854] [<ffffffff8182a61b>] bit_wait_io+0x1b/0x70
[117121.961856] [<ffffffff8182a1ad>] __wait_on_bit+0x5d/0x90
[117121.961857] [<ffffffff8118d04b>] wait_on_page_bit+0xcb/0xf0
[117121.961859] [<ffffffff810c3ce0>] ? autoremove_wake_function+0x40/0x40
[117121.961861] [<ffffffff8118d163>] __filemap_fdatawait_range+0xf3/0x160
[117121.961863] [<ffffffff8118d1e4>] filemap_fdatawait_range+0x14/0x30
[117121.961864] [<ffffffff8118f0df>] filemap_write_and_wait_range+0x3f/0x70
[117121.961891] [<ffffffffc0902ea8>] zpl_fsync+0x38/0x90 [zfs]
[117121.961893] [<ffffffff8124116b>] vfs_fsync_range+0x4b/0xb0
[117121.961895] [<ffffffff811cabee>] SyS_msync+0x17e/0x1f0
[117121.961897] [<ffffffff8182def2>] entry_SYSCALL_64_fastpath+0x16/0x71
[147242.344176] INFO: task txg_sync:4589 blocked for more than 120 seconds.
[147242.344180] Tainted: P O 4.4.0-32-generic #51-Ubuntu
[147242.344181] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[147242.344182] txg_sync D ffff8807e1fbbaa8 0 4589 2 0x00000000
[147242.344185] ffff8807e1fbbaa8 ffff88081ed56d00 ffff8807faf02940 ffff8807e2dbb700
[147242.344191] ffff8807e1fbc000 ffff88081ecd6d00 7fffffffffffffff ffff8801789e5228
[147242.344192] 0000000000000001 ffff8807e1fbbac0 ffffffff81829e05 0000000000000000
[147242.344194] Call Trace:
[147242.344200] [<ffffffff81829e05>] schedule+0x35/0x80
[147242.344202] [<ffffffff8182cf25>] schedule_timeout+0x1b5/0x270
[147242.344205] [<ffffffff810ac0c2>] ? default_wake_function+0x12/0x20
[147242.344207] [<ffffffff810c35e2>] ? __wake_up_common+0x52/0x90
[147242.344209] [<ffffffff81829334>] io_schedule_timeout+0xa4/0x110
[147242.344217] [<ffffffffc0759bec>] cv_wait_common+0xbc/0x140 [spl]
[147242.344219] [<ffffffff810c3ca0>] ? wake_atomic_t_function+0x60/0x60
[147242.344224] [<ffffffffc0759cc8>] __cv_wait_io+0x18/0x20 [spl]
[147242.344264] [<ffffffffc08fd2fe>] zio_wait+0x10e/0x1f0 [zfs]
[147242.344285] [<ffffffffc0886c58>] dsl_pool_sync+0xb8/0x430 [zfs]
[147242.344307] [<ffffffffc08a25b6>] spa_sync+0x366/0xb30 [zfs]
[147242.344309] [<ffffffff810ac0c2>] ? default_wake_function+0x12/0x20
[147242.344333] [<ffffffffc08b3a4a>] txg_sync_thread+0x3ba/0x630 [zfs]
[147242.344357] [<ffffffffc08b3690>] ? txg_delay+0x180/0x180 [zfs]
[147242.344360] [<ffffffffc0754e31>] thread_generic_wrapper+0x71/0x80 [spl]
[147242.344364] [<ffffffffc0754dc0>] ? __thread_exit+0x20/0x20 [spl]
[147242.344366] [<ffffffff810a0808>] kthread+0xd8/0xf0
[147242.344367] [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
[147242.344369] [<ffffffff8182e28f>] ret_from_fork+0x3f/0x70
[147242.344370] [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
[147242.344461] INFO: task mongod:14480 blocked for more than 120 seconds.
[147242.344462] Tainted: P O 4.4.0-32-generic #51-Ubuntu
[147242.344463] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[147242.344464] mongod D ffff88010ec47ba8 0 14480 16855 0x00000100
[147242.344466] ffff88010ec47ba8 000000005799d18d ffff8807faf03700 ffff880108e58dc0
[147242.344467] ffff88010ec48000 ffff88081ed16d00 7fffffffffffffff ffffffff8182a600
[147242.344469] ffff88010ec47d08 ffff88010ec47bc0 ffffffff81829e05 0000000000000000
[147242.344470] Call Trace:
[147242.344473] [<ffffffff8182a600>] ? bit_wait+0x60/0x60
[147242.344474] [<ffffffff81829e05>] schedule+0x35/0x80
[147242.344476] [<ffffffff8182cf25>] schedule_timeout+0x1b5/0x270
[147242.344478] [<ffffffff8118d939>] ? find_get_pages_tag+0x109/0x190
[147242.344480] [<ffffffff8182a600>] ? bit_wait+0x60/0x60
[147242.344482] [<ffffffff81829334>] io_schedule_timeout+0xa4/0x110
[147242.344483] [<ffffffff8182a61b>] bit_wait_io+0x1b/0x70
[147242.344485] [<ffffffff8182a1ad>] __wait_on_bit+0x5d/0x90
[147242.344487] [<ffffffff8118d04b>] wait_on_page_bit+0xcb/0xf0
[147242.344488] [<ffffffff810c3ce0>] ? autoremove_wake_function+0x40/0x40
[147242.344490] [<ffffffff8118d163>] __filemap_fdatawait_range+0xf3/0x160
[147242.344492] [<ffffffff8118d1e4>] filemap_fdatawait_range+0x14/0x30
[147242.344493] [<ffffffff8118f0df>] filemap_write_and_wait_range+0x3f/0x70
[147242.344520] [<ffffffffc0902ea8>] zpl_fsync+0x38/0x90 [zfs]
[147242.344522] [<ffffffff8124116b>] vfs_fsync_range+0x4b/0xb0
[147242.344525] [<ffffffff811cabee>] SyS_msync+0x17e/0x1f0
[147242.344526] [<ffffffff8182def2>] entry_SYSCALL_64_fastpath+0x16/0x71

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1607355

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
status: Incomplete → In Progress
importance: Undecided → Medium
assignee: nobody → Rafael David Tinoco (inaddy)
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Download full text (19.0 KiB)

Got some more stack traces:

[22201.670647] INFO: task mongod:1917 blocked for more than 120 seconds.
[22201.670650] Tainted: P O 4.4.0-34-generic #53-Ubuntu
[22201.670651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[22201.670653] mongod D ffff8805aecefba8 0 1917 5819 0x00000100
[22201.670655] ffff8805aecefba8 00000000579b1d9a ffff8807faf06040 ffff8806d6238000
[22201.670657] ffff8805aecf0000 ffff88081edd6d00 7fffffffffffffff ffffffff8182a610
[22201.670659] ffff8805aecefd08 ffff8805aecefbc0 ffffffff81829e15 0000000000000000
[22201.670661] Call Trace:
[22201.670666] [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[22201.670668] [<ffffffff81829e15>] schedule+0x35/0x80
[22201.670670] [<ffffffff8182cf35>] schedule_timeout+0x1b5/0x270
[22201.670673] [<ffffffff8118d939>] ? find_get_pages_tag+0x109/0x190
[22201.670675] [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[22201.670676] [<ffffffff81829344>] io_schedule_timeout+0xa4/0x110
[22201.670678] [<ffffffff8182a62b>] bit_wait_io+0x1b/0x70
[22201.670680] [<ffffffff8182a1bd>] __wait_on_bit+0x5d/0x90
[22201.670682] [<ffffffff8118d04b>] wait_on_page_bit+0xcb/0xf0
[22201.670684] [<ffffffff810c3ce0>] ? autoremove_wake_function+0x40/0x40
[22201.670686] [<ffffffff8118d163>] __filemap_fdatawait_range+0xf3/0x160
[22201.670688] [<ffffffff8118ef91>] ? __filemap_fdatawrite_range+0xd1/0x100
[22201.670690] [<ffffffff8118d1e4>] filemap_fdatawait_range+0x14/0x30
[22201.670691] [<ffffffff8118f0df>] filemap_write_and_wait_range+0x3f/0x70
[22201.670732] [<ffffffffc08f7ea8>] zpl_fsync+0x38/0x90 [zfs]
[22201.670734] [<ffffffff8124116b>] vfs_fsync_range+0x4b/0xb0
[22201.670736] [<ffffffff811cabee>] SyS_msync+0x17e/0x1f0
[22201.670738] [<ffffffff8182def2>] entry_SYSCALL_64_fastpath+0x16/0x71
[85684.344389] INFO: task mongod:1917 blocked for more than 120 seconds.
[85684.344393] Tainted: P O 4.4.0-34-generic #53-Ubuntu
[85684.344394] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[85684.344395] mongod D ffff8805aecefba8 0 1917 5819 0x00000100
[85684.344398] ffff8805aecefba8 0000000000000003 ffffffff81e11500 ffff8806d6238000
[85684.344400] ffff8805aecf0000 ffff88081ec16d00 7fffffffffffffff ffffffff8182a610
[85684.344402] ffff8805aecefd08 ffff8805aecefbc0 ffffffff81829e15 0000000000000000
[85684.344403] Call Trace:
[85684.344409] [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[85684.344411] [<ffffffff81829e15>] schedule+0x35/0x80
[85684.344413] [<ffffffff8182cf35>] schedule_timeout+0x1b5/0x270
[85684.344416] [<ffffffff8118d939>] ? find_get_pages_tag+0x109/0x190
[85684.344418] [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[85684.344419] [<ffffffff81829344>] io_schedule_timeout+0xa4/0x110
[85684.344421] [<ffffffff8182a62b>] bit_wait_io+0x1b/0x70
[85684.344423] [<ffffffff8182a1bd>] __wait_on_bit+0x5d/0x90
[85684.344425] [<ffffffff8118d04b>] wait_on_page_bit+0xcb/0xf0
[85684.344427] [<ffffffff810c3ce0>] ? autoremove_wake_function+0x40/0x40
[85684.344429] [<ffffffff8118d163>] __filemap_fdatawait_range+0xf3/0x160
[85684.344431] [<ffffffff8118ef91>] ? __filemap_fdatawrite_range+0xd1/0x100
[85...

Changed in linux (Ubuntu):
assignee: Rafael David Tinoco (inaddy) → nobody
status: In Progress → Incomplete
Revision history for this message
Simos Xenitellis  (simosx) wrote :

This might be related to #1773392.

On the other hand, the issue in #1773392 appears to have been introduced in more recent kernels (16.04 not affected).

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.