Dear Stackers,
I have a timeout issue while doing an openstack overcloud deploy <options> on a containerized Pike.
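For context, the invocation is of this shape (illustrative only: --timeout is in minutes, and the environment files here are placeholders standing in for my real <options>):
(undercloud) [stack@undercloud ~]$ openstack overcloud deploy \
    --templates /usr/share/openstack-tripleo-heat-templates \
    --timeout 240 \
    -e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
    -e ~/my-env.yaml
The deploy ends like this: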
START with options: ['stack', 'failures', 'list', 'overcloud']
command: stack failures list -> heatclient.osc.v1.stack_failures.ListStackFailures (auth=True)
Using auth plugin: password
overcloud.AllNodesDeploySteps:
resource_type: OS::TripleO::PostDeploySteps
physical_resource_id: 754393c3-a37f-47d0-9389-db8ff99c4bd2
status: UPDATE_FAILED
status_reason: |
UPDATE aborted
END return value: 0
Heat Stack update failed.
Heat Stack update failed.
END return value: 1
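FWIW, to find the actual failed resource under AllNodesDeploySteps I drill down through the nested stacks with something like this (flags as I see them in the heatclient plugin on my Pike undercloud):
(undercloud) [stack@undercloud ~]$ openstack stack resource list overcloud --filter status=FAILED -n 5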
If I check the Mistral "engine" log, I can see two "delayed" calls:
(http://paste.openstack.org/show/719890/)
2018-04-25 06:48:46.223 7526 DEBUG mistral.services.scheduler [req-444b8e1f-f6b5-458f-afae-1dc22dd2646c f3cef0caace94cf0af3d7c21b3706583 553ad7544f5a4479b4eb346dc7a76a82 - default default] Scheduler captured 2 delayed calls. _capture_calls /usr/lib/python2.7/site-packages/mistral/services/scheduler.py:193
2018-04-25 06:48:46.224 7526 DEBUG mistral.services.scheduler [req-444b8e1f-f6b5-458f-afae-1dc22dd2646c f3cef0caace94cf0af3d7c21b3706583 553ad7544f5a4479b4eb346dc7a76a82 - default default] Preparing next delayed call. [ID=5da7a0f3-7710-4658-9a40-25c24a65d13f, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_complete, method_arguments={u'wf_ex_id': u'a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a'}] _prepare_calls /usr/lib/python2.7/site-packages/mistral/services/scheduler.py:217
2018-04-25 06:48:46.227 7526 DEBUG mistral.services.scheduler [req-444b8e1f-f6b5-458f-afae-1dc22dd2646c f3cef0caace94cf0af3d7c21b3706583 553ad7544f5a4479b4eb346dc7a76a82 - default default] Preparing next delayed call. [ID=da0f2dfe-560d-4cd7-96a7-4f11fc12e742, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_complete, method_arguments={u'wf_ex_id': u'18ca00db-840b-47ae-8462-27b79578b3b7'}] _prepare_calls /usr/lib/python2.7/site-packages/mistral/services/scheduler.py:217
Those delayed calls are still present, even though the deploy has failed.
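If I understand the scheduler right, those delayed calls are rows in Mistral's database, so I can also watch them there. Treat this as a sketch: the table name (delayed_calls_v2) and columns are what I see on my Pike undercloud:
(undercloud) [stack@undercloud ~]$ sudo mysql mistral -e "SELECT id, target_method_name, execution_time FROM delayed_calls_v2;"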
If I check the tasks linked to one of the wf_ex_id values, I can see a stalled "ceph_install" task:
(http://paste.openstack.org/show/719889/)
(undercloud) [stack@undercloud tripleo-heat-templates]$ mistral task-list a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a
+--------------------------------------+--------------------------+---------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| ID | Name | Workflow name | Execution ID | State | State info | Created at | Updated at |
+--------------------------------------+--------------------------+---------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| 731d9314-c32d-4dce-a245-b2e71454a921 | collect_puppet_hieradata | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:22 | 2018-04-24 17:11:24 |
| 6adec1a4-4d9d-41b0-a595-d97468eaf6d5 | check_hieradata | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:24 | 2018-04-24 17:11:25 |
| 09b92a53-51e9-4c65-a976-8dc880236867 | merge_ip_lists | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:25 | 2018-04-24 17:11:26 |
| 520ac3d0-ee01-476b-8ea3-4a75454be7ec | set_ip_lists | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:25 | 2018-04-24 17:11:25 |
| f998bdc2-90a4-4fc6-8535-c526d3962132 | set_blacklisted_ips | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:25 | 2018-04-24 17:11:25 |
| e090146d-52ce-4040-a177-84ed8c6c3679 | enable_ssh_admin | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:26 | 2018-04-24 17:12:45 |
| 6ba3f614-f957-4e7a-9a75-077e953a690b | get_private_key | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:45 | 2018-04-24 17:12:46 |
| 164b70fc-3bc6-410e-aed5-fb8993b2777e | make_fetch_directory | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:46 | 2018-04-24 17:12:46 |
| 71e66c77-c119-4c1a-8238-7165349df2b3 | collect_nodes_uuid | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:46 | 2018-04-24 17:12:55 |
| bd43d2e9-557e-4cf2-90fc-e1715273dbfe | parse_node_data_lookup | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:55 | 2018-04-24 17:12:56 |
| cd8093d5-d557-4d14-83ff-3fdd46ac906c | set_ip_uuids | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:55 | 2018-04-24 17:12:55 |
| 26205d7e-4369-4da8-a17d-4fa47e20c636 | map_node_data_lookup | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:56 | 2018-04-24 17:12:56 |
| 36effdb7-aa88-417f-ba38-3899513f54fa | set_role_vars | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:56 | 2018-04-24 17:12:56 |
| 6a1c8361-e1c1-4009-a8ef-97c51bb49591 | build_extra_vars | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:56 | 2018-04-24 17:12:57 |
| ff70b0d9-ca05-440f-b820-916f8e962d00 | ceph_install | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | RUNNING | None | 2018-04-24 17:12:57 | <none> |
+--------------------------------------+--------------------------+---------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
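To dig into that stalled task itself I use task-get, and then list its action executions (if I read the client help right, action-execution-list takes the task execution ID as an optional argument). The ceph-ansible run should also be logging to /var/log/mistral/ceph-install-workflow.log:
(undercloud) [stack@undercloud ~]$ mistral task-get ff70b0d9-ca05-440f-b820-916f8e962d00
(undercloud) [stack@undercloud ~]$ mistral action-execution-list ff70b0d9-ca05-440f-b820-916f8e962d00
(undercloud) [stack@undercloud ~]$ tail -f /var/log/mistral/ceph-install-workflow.log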
Is there any way to get the stack back to a working state? Even if I delete the two "delayed" calls, it won't help with the recovery.
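The closest thing to a recovery knob I have found is forcing the stuck workflow executions into a terminal state, something like this (untested on this stack, and I don't know whether Heat then unblocks cleanly, so take it as a guess):
(undercloud) [stack@undercloud ~]$ mistral execution-update -s ERROR a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a
(undercloud) [stack@undercloud ~]$ mistral execution-update -s ERROR 18ca00db-840b-47ae-8462-27b79578b3b7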
If I dig a bit further:
(http://paste.openstack.org/show/719891/)
(undercloud) [stack@undercloud tripleo-heat-templates]$ mistral execution-list | grep -i ceph
| bafc9652-877f-4ea9-a218-441962eaa469 | 9ebefed3-70cd-4125-a203-83aadbdb1de7 | tripleo.storage.v1.ceph-install | sub-workflow execution | b6a2f16b-e1a6-4aea-ae40-de37a0354185 | RUNNING | None | 2018-04-21 15:02:17 | 2018-04-21 15:02:17 |
I don't know whether I can safely delete some of this in Mistral and restart the deploy on a cleaned-up workflow state (i.e. will the deploy recreate them, or are there links that would end up badly broken?).
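Before deleting anything I would at least map the parent/child links: if I read the execution-list columns right, the Task Execution ID above points back at the task in the parent workflow, so something like this (and I assume execution-delete only works once the execution is in a terminal state):
(undercloud) [stack@undercloud ~]$ mistral task-get b6a2f16b-e1a6-4aea-ae40-de37a0354185
(undercloud) [stack@undercloud ~]$ mistral execution-delete bafc9652-877f-4ea9-a218-441962eaa469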
Thank you for your support.
Cheers,
C.
One more piece of info: when I compare with another environment (the lab, which has no issue -.-), I can see the tasks aren't the same. Like… not at all:
(http://paste.openstack.org/show/719892/)
(undercloud) [LAB stack@undercloud tripleo-heat-templates]$ mistral task-list 984f680d-80f5-4e0b-ba71-e8cbf6572418
+--------------------------------------+-----------------------------+----------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| ID | Name | Workflow name | Execution ID | State | State info | Created at | Updated at |
+--------------------------------------+-----------------------------+----------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| ba20b5d8-67de-4b50-8d1a-cfbaa3d74b92 | ceph_base_ansible_workflow | tripleo.overcloud.workflow_tasks.step2 | 984f680d-80f5-4e0b-ba71-e8cbf6572418 | SUCCESS | None | 2018-04-25 06:40:34 | 2018-04-25 07:00:29 |
+--------------------------------------+-----------------------------+----------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
I just don't really understand…