CI: Undercloud install failed due to timeout while setting up mistral

Bug #1730714 reported by Alex Schultz on 2017-11-07
This bug affects 2 people
Affects Status Importance Assigned to Milestone

Bug Description

While running the undercloud install, the mistral setup timed out. This happened on a stable/pike CI job in the gate.

2017-11-07 16:40:58 | 2017-11-07 16:40:58,959 ERROR: TIMEOUT waiting for execution 3382c596-948b-4a52-82a4-142a1ae47ace to finish. State: RUNNING
2017-11-07 16:40:58 | 2017-11-07 16:40:58,960 DEBUG: An exception occurred
2017-11-07 16:40:58 | Traceback (most recent call last):
2017-11-07 16:40:58 | File "/usr/lib/python2.7/site-packages/instack_undercloud/", line 1780, in install
2017-11-07 16:40:58 | _post_config(instack_env)
2017-11-07 16:40:58 | File "/usr/lib/python2.7/site-packages/instack_undercloud/", line 1697, in _post_config
2017-11-07 16:40:58 | _post_config_mistral(instack_env, mistral, swift)
2017-11-07 16:40:58 | File "/usr/lib/python2.7/site-packages/instack_undercloud/", line 1654, in _post_config_mistral
2017-11-07 16:40:58 | _create_default_plan(mistral, plans)
2017-11-07 16:40:58 | File "/usr/lib/python2.7/site-packages/instack_undercloud/", line 1595, in _create_default_plan
2017-11-07 16:40:58 | fail_on_error=True)
2017-11-07 16:40:58 | File "/usr/lib/python2.7/site-packages/instack_undercloud/", line 1544, in _wait_for_mistral_execution
2017-11-07 16:40:58 | raise RuntimeError(error_message)
2017-11-07 16:40:58 | RuntimeError: TIMEOUT waiting for execution 3382c596-948b-4a52-82a4-142a1ae47ace to finish. State: RUNNING
2017-11-07 16:40:58 | 2017-11-07 16:40:58,960 ERROR:

Dougal Matthews (d0ugal) on 2017-11-07
tags: added: workflows
Thomas Herve (therve) wrote :

To me it sounds like an hypervisor issue of the undercloud host. We can see the following errors in messages:

INFO: task kworker/u16:1:72 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u16:1 D ffff88003657e0f8 0 72 2 0x00000000
kernel: Workqueue: writeback bdi_writeback_workfn (flush-252:0)
kernel: ffff880234def830 0000000000000046 ffff880234dbbf40 ffff880234deffd8
kernel: ffff880234deffd8 ffff880234deffd8 ffff880234dbbf40 ffff88003657e0f0
kernel: ffff88003657e0f4 ffff880234dbbf40 00000000ffffffff ffff88003657e0f8
kernel: Call Trace:
kernel: [<ffffffff816aa4a9>] schedule_preempt_disabled+0x29/0x70
kernel: [<ffffffff816a83d7>] __mutex_lock_slowpath+0xc7/0x1d0
kernel: [<ffffffff816a77ef>] mutex_lock+0x1f/0x2f
kernel: [<ffffffffc01284a8>] __jbd2_log_wait_for_space+0xc8/0x1f0 [jbd2]
kernel: [<ffffffffc01223d3>] add_transaction_credits+0x2d3/0x2f0 [jbd2]
kernel: [<ffffffff81302a1a>] ? __blk_mq_run_hw_queue+0x9a/0xb0
kernel: [<ffffffffc01225e1>] start_this_handle+0x1a1/0x430 [jbd2]
kernel: [<ffffffff8130578b>] ? blk_mq_flush_plug_list+0x13b/0x160
kernel: [<ffffffff811df64a>] ? kmem_cache_alloc+0x1ba/0x1e0
kernel: [<ffffffffc0122a93>] jbd2__journal_start+0xf3/0x1e0 [jbd2]
kernel: [<ffffffffc0149afc>] ? ext4_writepages+0x42c/0xd30 [ext4]
kernel: [<ffffffffc0177ad9>] __ext4_journal_start_sb+0x69/0xe0 [ext4]
kernel: [<ffffffffc0149afc>] ext4_writepages+0x42c/0xd30 [ext4]
kernel: [<ffffffff8118f02e>] do_writepages+0x1e/0x40
kernel: [<ffffffff8122d8e0>] __writeback_single_inode+0x40/0x220
kernel: [<ffffffff8122e524>] writeback_sb_inodes+0x1c4/0x490
kernel: [<ffffffff8122e88f>] __writeback_inodes_wb+0x9f/0xd0
kernel: [<ffffffff8122f0c3>] wb_writeback+0x263/0x2f0
kernel: [<ffffffff8121bd4c>] ? get_nr_inodes+0x4c/0x70
kernel: [<ffffffff8122f56b>] bdi_writeback_workfn+0x2cb/0x460
kernel: [<ffffffff810a882a>] process_one_work+0x17a/0x440
kernel: [<ffffffff810a94f6>] worker_thread+0x126/0x3c0
kernel: [<ffffffff810a93d0>] ? manage_workers.isra.24+0x2a0/0x2a0
kernel: [<ffffffff810b099f>] kthread+0xcf/0xe0
kernel: [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
kernel: [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
kernel: [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40

For a bunch of processes. Looks like the VM hang for about 10 minutes, and then recover. I think mistral handled it fine, but it took 11 minutes to handle the task, and the caller timed out at 10. Not sure there is anything specific to fix.

Alex Schultz (alex-schultz) wrote :

Given this appears to be infra related, I'll close this now but we should reopen if it occurs again.

Changed in tripleo:
status: Triaged → Invalid
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers