Task (usually mongod) blocked more 120 seconds (lock-ups) in juju on lxc/lxd + zfs

Bug #1607355 reported by Rafael David Tinoco on 2016-07-28

This bug report will be marked for expiration in 51 days if no further activity occurs. (find out why)

10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Medium
Unassigned

Bug Description

I was able to reproduce this 2 or 3 times last 2 days. I have the following setup:

Containers for Trusty/kilo service machines:

inaddy@workstation:~$ lxc-ls | grep tk
tkcephmon01 RUNNING 0 - 192.168.65.52 -
tkcephmon02 RUNNING 0 - 192.168.65.51 -
tkcephmon03 RUNNING 0 - 192.168.65.48 -
tkcinder RUNNING 0 - 192.168.65.49 -
tkdash RUNNING 0 - 192.168.65.50 -
tkglance RUNNING 0 - 192.168.65.53 -
tkjuju RUNNING 0 - 192.168.65.15 -
tkkeystone RUNNING 0 - 192.168.65.54 -
tkmysql RUNNING 0 - 192.168.65.55 -
tknova RUNNING 0 - 192.168.65.56 -
tkrabbit RUNNING 0 - 192.168.65.57 -
tkswiftproxy RUNNING 0 - 192.168.65.58 -

And compute nodes + neutrongw as kvm guests:

inaddy@workstation:~$ virsh list --all | grep tk
 21 tkcompute01 running
 22 tkcompute02 running
 23 tkcompute03 running
 24 tkneutrongw running

All my LXC containers are on top of ZFS:

Linux workstation 4.4.0-32-generic #51-Ubuntu SMP Tue Jul 19 18:09:07 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

And my KVM guests are on top of ext4 + 1.2 raid0 stripped volume.

I'm getting the lockups bellow (usually for mongod, from tkjuju container, the juju controller). After the first lockup appears (schedule timeout coming from zfs sync logic most likely), JuJu controller starts giving me errors on "update-status". From "juju status":

glance/0 error idle 1.25.6 10 9292/tcp tkglance hook failed: "update-status"
keystone/0 error idle 1.25.6 11 tkkeystone hook failed: "update-status"
mysql/0 error idle 1.25.6 12 tkmysql hook failed: "config-changed"
neutron-api/0 error idle 1.25.6 4 9696/tcp tkneutrongw hook failed: "update-status"
nova-compute/0 error idle 1.25.6 1 tkcompute01 hook failed: "update-status"
nova-compute/1 error idle 1.25.6 2 tkcompute02 hook failed: "update-status"
nova-compute/2 error idle 1.25.6 3 tkcompute03 hook failed: "update-status"

Lockups:

[105601.816578] INFO: task mongod:14480 blocked for more than 120 seconds.
[105601.816583] Tainted: P O 4.4.0-32-generic #51-Ubuntu
[105601.816584] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[105601.816586] mongod D ffff88010ec47ba8 0 14480 16855 0x00000100
[105601.816590] ffff88010ec47ba8 0000000057992eeb ffff880108e5ee00 ffff880108e58dc0
[105601.816592] ffff88010ec48000 ffff88081ecd6d00 7fffffffffffffff ffffffff8182a600
[105601.816594] ffff88010ec47d08 ffff88010ec47bc0 ffffffff81829e05 0000000000000000
[105601.816596] Call Trace:
[105601.816603] [<ffffffff8182a600>] ? bit_wait+0x60/0x60
[105601.816606] [<ffffffff81829e05>] schedule+0x35/0x80
[105601.816608] [<ffffffff8182cf25>] schedule_timeout+0x1b5/0x270
[105601.816612] [<ffffffff8118d939>] ? find_get_pages_tag+0x109/0x190
[105601.816614] [<ffffffff8182a600>] ? bit_wait+0x60/0x60
[105601.816616] [<ffffffff81829334>] io_schedule_timeout+0xa4/0x110
[105601.816618] [<ffffffff8182a61b>] bit_wait_io+0x1b/0x70
[105601.816620] [<ffffffff8182a1ad>] __wait_on_bit+0x5d/0x90
[105601.816622] [<ffffffff8118d04b>] wait_on_page_bit+0xcb/0xf0
[105601.816625] [<ffffffff810c3ce0>] ? autoremove_wake_function+0x40/0x40
[105601.816628] [<ffffffff8118d163>] __filemap_fdatawait_range+0xf3/0x160
[105601.816630] [<ffffffff8118d1e4>] filemap_fdatawait_range+0x14/0x30
[105601.816632] [<ffffffff8118f0df>] filemap_write_and_wait_range+0x3f/0x70
[105601.816682] [<ffffffffc0902ea8>] zpl_fsync+0x38/0x90 [zfs]
[105601.816685] [<ffffffff8124116b>] vfs_fsync_range+0x4b/0xb0
[105601.816687] [<ffffffff811cabee>] SyS_msync+0x17e/0x1f0
[105601.816689] [<ffffffff8182def2>] entry_SYSCALL_64_fastpath+0x16/0x71
[117121.961545] INFO: task txg_sync:4589 blocked for more than 120 seconds.
[117121.961549] Tainted: P O 4.4.0-32-generic #51-Ubuntu
[117121.961550] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[117121.961551] txg_sync D ffff8807e1fbbaa8 0 4589 2 0x00000000
[117121.961558] ffff8807e1fbbaa8 ffff88081ed96d00 ffff8807faf03700 ffff8807e2dbb700
[117121.961560] ffff8807e1fbc000 ffff88081ed16d00 7fffffffffffffff ffff88017a8f2208
[117121.961561] 0000000000000001 ffff8807e1fbbac0 ffffffff81829e05 0000000000000000
[117121.961563] Call Trace:
[117121.961569] [<ffffffff81829e05>] schedule+0x35/0x80
[117121.961571] [<ffffffff8182cf25>] schedule_timeout+0x1b5/0x270
[117121.961574] [<ffffffff810ac0c2>] ? default_wake_function+0x12/0x20
[117121.961576] [<ffffffff810c35e2>] ? __wake_up_common+0x52/0x90
[117121.961578] [<ffffffff81829334>] io_schedule_timeout+0xa4/0x110
[117121.961586] [<ffffffffc0759bec>] cv_wait_common+0xbc/0x140 [spl]
[117121.961589] [<ffffffff810c3ca0>] ? wake_atomic_t_function+0x60/0x60
[117121.961593] [<ffffffffc0759cc8>] __cv_wait_io+0x18/0x20 [spl]
[117121.961633] [<ffffffffc08fd2fe>] zio_wait+0x10e/0x1f0 [zfs]
[117121.961653] [<ffffffffc0886c58>] dsl_pool_sync+0xb8/0x430 [zfs]
[117121.961676] [<ffffffffc08a25b6>] spa_sync+0x366/0xb30 [zfs]
[117121.961677] [<ffffffff810ac0c2>] ? default_wake_function+0x12/0x20
[117121.961701] [<ffffffffc08b3a4a>] txg_sync_thread+0x3ba/0x630 [zfs]
[117121.961725] [<ffffffffc08b3690>] ? txg_delay+0x180/0x180 [zfs]
[117121.961729] [<ffffffffc0754e31>] thread_generic_wrapper+0x71/0x80 [spl]
[117121.961732] [<ffffffffc0754dc0>] ? __thread_exit+0x20/0x20 [spl]
[117121.961734] [<ffffffff810a0808>] kthread+0xd8/0xf0
[117121.961736] [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
[117121.961737] [<ffffffff8182e28f>] ret_from_fork+0x3f/0x70
[117121.961739] [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
[117121.961831] INFO: task mongod:14480 blocked for more than 120 seconds.
[117121.961832] Tainted: P O 4.4.0-32-generic #51-Ubuntu
[117121.961833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[117121.961834] mongod D ffff88010ec47ba8 0 14480 16855 0x00000100
[117121.961836] ffff88010ec47ba8 0000000057995bf6 ffff8807faf01b80 ffff880108e58dc0
[117121.961837] ffff88010ec48000 ffff88081ec96d00 7fffffffffffffff ffffffff8182a600
[117121.961839] ffff88010ec47d08 ffff88010ec47bc0 ffffffff81829e05 0000000000000000
[117121.961841] Call Trace:
[117121.961843] [<ffffffff8182a600>] ? bit_wait+0x60/0x60
[117121.961844] [<ffffffff81829e05>] schedule+0x35/0x80
[117121.961846] [<ffffffff8182cf25>] schedule_timeout+0x1b5/0x270
[117121.961849] [<ffffffff8118d939>] ? find_get_pages_tag+0x109/0x190
[117121.961851] [<ffffffff8182a600>] ? bit_wait+0x60/0x60
[117121.961852] [<ffffffff81829334>] io_schedule_timeout+0xa4/0x110
[117121.961854] [<ffffffff8182a61b>] bit_wait_io+0x1b/0x70
[117121.961856] [<ffffffff8182a1ad>] __wait_on_bit+0x5d/0x90
[117121.961857] [<ffffffff8118d04b>] wait_on_page_bit+0xcb/0xf0
[117121.961859] [<ffffffff810c3ce0>] ? autoremove_wake_function+0x40/0x40
[117121.961861] [<ffffffff8118d163>] __filemap_fdatawait_range+0xf3/0x160
[117121.961863] [<ffffffff8118d1e4>] filemap_fdatawait_range+0x14/0x30
[117121.961864] [<ffffffff8118f0df>] filemap_write_and_wait_range+0x3f/0x70
[117121.961891] [<ffffffffc0902ea8>] zpl_fsync+0x38/0x90 [zfs]
[117121.961893] [<ffffffff8124116b>] vfs_fsync_range+0x4b/0xb0
[117121.961895] [<ffffffff811cabee>] SyS_msync+0x17e/0x1f0
[117121.961897] [<ffffffff8182def2>] entry_SYSCALL_64_fastpath+0x16/0x71
[147242.344176] INFO: task txg_sync:4589 blocked for more than 120 seconds.
[147242.344180] Tainted: P O 4.4.0-32-generic #51-Ubuntu
[147242.344181] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[147242.344182] txg_sync D ffff8807e1fbbaa8 0 4589 2 0x00000000
[147242.344185] ffff8807e1fbbaa8 ffff88081ed56d00 ffff8807faf02940 ffff8807e2dbb700
[147242.344191] ffff8807e1fbc000 ffff88081ecd6d00 7fffffffffffffff ffff8801789e5228
[147242.344192] 0000000000000001 ffff8807e1fbbac0 ffffffff81829e05 0000000000000000
[147242.344194] Call Trace:
[147242.344200] [<ffffffff81829e05>] schedule+0x35/0x80
[147242.344202] [<ffffffff8182cf25>] schedule_timeout+0x1b5/0x270
[147242.344205] [<ffffffff810ac0c2>] ? default_wake_function+0x12/0x20
[147242.344207] [<ffffffff810c35e2>] ? __wake_up_common+0x52/0x90
[147242.344209] [<ffffffff81829334>] io_schedule_timeout+0xa4/0x110
[147242.344217] [<ffffffffc0759bec>] cv_wait_common+0xbc/0x140 [spl]
[147242.344219] [<ffffffff810c3ca0>] ? wake_atomic_t_function+0x60/0x60
[147242.344224] [<ffffffffc0759cc8>] __cv_wait_io+0x18/0x20 [spl]
[147242.344264] [<ffffffffc08fd2fe>] zio_wait+0x10e/0x1f0 [zfs]
[147242.344285] [<ffffffffc0886c58>] dsl_pool_sync+0xb8/0x430 [zfs]
[147242.344307] [<ffffffffc08a25b6>] spa_sync+0x366/0xb30 [zfs]
[147242.344309] [<ffffffff810ac0c2>] ? default_wake_function+0x12/0x20
[147242.344333] [<ffffffffc08b3a4a>] txg_sync_thread+0x3ba/0x630 [zfs]
[147242.344357] [<ffffffffc08b3690>] ? txg_delay+0x180/0x180 [zfs]
[147242.344360] [<ffffffffc0754e31>] thread_generic_wrapper+0x71/0x80 [spl]
[147242.344364] [<ffffffffc0754dc0>] ? __thread_exit+0x20/0x20 [spl]
[147242.344366] [<ffffffff810a0808>] kthread+0xd8/0xf0
[147242.344367] [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
[147242.344369] [<ffffffff8182e28f>] ret_from_fork+0x3f/0x70
[147242.344370] [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
[147242.344461] INFO: task mongod:14480 blocked for more than 120 seconds.
[147242.344462] Tainted: P O 4.4.0-32-generic #51-Ubuntu
[147242.344463] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[147242.344464] mongod D ffff88010ec47ba8 0 14480 16855 0x00000100
[147242.344466] ffff88010ec47ba8 000000005799d18d ffff8807faf03700 ffff880108e58dc0
[147242.344467] ffff88010ec48000 ffff88081ed16d00 7fffffffffffffff ffffffff8182a600
[147242.344469] ffff88010ec47d08 ffff88010ec47bc0 ffffffff81829e05 0000000000000000
[147242.344470] Call Trace:
[147242.344473] [<ffffffff8182a600>] ? bit_wait+0x60/0x60
[147242.344474] [<ffffffff81829e05>] schedule+0x35/0x80
[147242.344476] [<ffffffff8182cf25>] schedule_timeout+0x1b5/0x270
[147242.344478] [<ffffffff8118d939>] ? find_get_pages_tag+0x109/0x190
[147242.344480] [<ffffffff8182a600>] ? bit_wait+0x60/0x60
[147242.344482] [<ffffffff81829334>] io_schedule_timeout+0xa4/0x110
[147242.344483] [<ffffffff8182a61b>] bit_wait_io+0x1b/0x70
[147242.344485] [<ffffffff8182a1ad>] __wait_on_bit+0x5d/0x90
[147242.344487] [<ffffffff8118d04b>] wait_on_page_bit+0xcb/0xf0
[147242.344488] [<ffffffff810c3ce0>] ? autoremove_wake_function+0x40/0x40
[147242.344490] [<ffffffff8118d163>] __filemap_fdatawait_range+0xf3/0x160
[147242.344492] [<ffffffff8118d1e4>] filemap_fdatawait_range+0x14/0x30
[147242.344493] [<ffffffff8118f0df>] filemap_write_and_wait_range+0x3f/0x70
[147242.344520] [<ffffffffc0902ea8>] zpl_fsync+0x38/0x90 [zfs]
[147242.344522] [<ffffffff8124116b>] vfs_fsync_range+0x4b/0xb0
[147242.344525] [<ffffffff811cabee>] SyS_msync+0x17e/0x1f0
[147242.344526] [<ffffffff8182def2>] entry_SYSCALL_64_fastpath+0x16/0x71

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1607355

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
status: Incomplete → In Progress
importance: Undecided → Medium
assignee: nobody → Rafael David Tinoco (inaddy)
Rafael David Tinoco (inaddy) wrote :
Download full text (19.0 KiB)

Got some more stack traces:

[22201.670647] INFO: task mongod:1917 blocked for more than 120 seconds.
[22201.670650] Tainted: P O 4.4.0-34-generic #53-Ubuntu
[22201.670651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[22201.670653] mongod D ffff8805aecefba8 0 1917 5819 0x00000100
[22201.670655] ffff8805aecefba8 00000000579b1d9a ffff8807faf06040 ffff8806d6238000
[22201.670657] ffff8805aecf0000 ffff88081edd6d00 7fffffffffffffff ffffffff8182a610
[22201.670659] ffff8805aecefd08 ffff8805aecefbc0 ffffffff81829e15 0000000000000000
[22201.670661] Call Trace:
[22201.670666] [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[22201.670668] [<ffffffff81829e15>] schedule+0x35/0x80
[22201.670670] [<ffffffff8182cf35>] schedule_timeout+0x1b5/0x270
[22201.670673] [<ffffffff8118d939>] ? find_get_pages_tag+0x109/0x190
[22201.670675] [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[22201.670676] [<ffffffff81829344>] io_schedule_timeout+0xa4/0x110
[22201.670678] [<ffffffff8182a62b>] bit_wait_io+0x1b/0x70
[22201.670680] [<ffffffff8182a1bd>] __wait_on_bit+0x5d/0x90
[22201.670682] [<ffffffff8118d04b>] wait_on_page_bit+0xcb/0xf0
[22201.670684] [<ffffffff810c3ce0>] ? autoremove_wake_function+0x40/0x40
[22201.670686] [<ffffffff8118d163>] __filemap_fdatawait_range+0xf3/0x160
[22201.670688] [<ffffffff8118ef91>] ? __filemap_fdatawrite_range+0xd1/0x100
[22201.670690] [<ffffffff8118d1e4>] filemap_fdatawait_range+0x14/0x30
[22201.670691] [<ffffffff8118f0df>] filemap_write_and_wait_range+0x3f/0x70
[22201.670732] [<ffffffffc08f7ea8>] zpl_fsync+0x38/0x90 [zfs]
[22201.670734] [<ffffffff8124116b>] vfs_fsync_range+0x4b/0xb0
[22201.670736] [<ffffffff811cabee>] SyS_msync+0x17e/0x1f0
[22201.670738] [<ffffffff8182def2>] entry_SYSCALL_64_fastpath+0x16/0x71
[85684.344389] INFO: task mongod:1917 blocked for more than 120 seconds.
[85684.344393] Tainted: P O 4.4.0-34-generic #53-Ubuntu
[85684.344394] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[85684.344395] mongod D ffff8805aecefba8 0 1917 5819 0x00000100
[85684.344398] ffff8805aecefba8 0000000000000003 ffffffff81e11500 ffff8806d6238000
[85684.344400] ffff8805aecf0000 ffff88081ec16d00 7fffffffffffffff ffffffff8182a610
[85684.344402] ffff8805aecefd08 ffff8805aecefbc0 ffffffff81829e15 0000000000000000
[85684.344403] Call Trace:
[85684.344409] [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[85684.344411] [<ffffffff81829e15>] schedule+0x35/0x80
[85684.344413] [<ffffffff8182cf35>] schedule_timeout+0x1b5/0x270
[85684.344416] [<ffffffff8118d939>] ? find_get_pages_tag+0x109/0x190
[85684.344418] [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[85684.344419] [<ffffffff81829344>] io_schedule_timeout+0xa4/0x110
[85684.344421] [<ffffffff8182a62b>] bit_wait_io+0x1b/0x70
[85684.344423] [<ffffffff8182a1bd>] __wait_on_bit+0x5d/0x90
[85684.344425] [<ffffffff8118d04b>] wait_on_page_bit+0xcb/0xf0
[85684.344427] [<ffffffff810c3ce0>] ? autoremove_wake_function+0x40/0x40
[85684.344429] [<ffffffff8118d163>] __filemap_fdatawait_range+0xf3/0x160
[85684.344431] [<ffffffff8118ef91>] ? __filemap_fdatawrite_range+0xd1/0x100
[85...

Changed in linux (Ubuntu):
assignee: Rafael David Tinoco (inaddy) → nobody
status: In Progress → Incomplete
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers