[Pike Containers] Stalled ceph_install step leads to several issues

Bug #1766784 reported by Cédric Jeanneret
Affects: tripleo
Status: Fix Released
Importance: Low
Assigned to: Cédric Jeanneret
Milestone: rocky-2

Bug Description

Dear Stackers,

I hit a timeout issue while running openstack overcloud deploy <options> on a containerized Pike deployment.

START with options: ['stack', 'failures', 'list', 'overcloud']
command: stack failures list -> heatclient.osc.v1.stack_failures.ListStackFailures (auth=True)
Using auth plugin: password
overcloud.AllNodesDeploySteps:
  resource_type: OS::TripleO::PostDeploySteps
  physical_resource_id: 754393c3-a37f-47d0-9389-db8ff99c4bd2
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted
END return value: 0
Heat Stack update failed.
Heat Stack update failed.
END return value: 1

If I check the Mistral engine log, I can see two "delayed" calls:
(http://paste.openstack.org/show/719890/)

2018-04-25 06:48:46.223 7526 DEBUG mistral.services.scheduler [req-444b8e1f-f6b5-458f-afae-1dc22dd2646c f3cef0caace94cf0af3d7c21b3706583 553ad7544f5a4479b4eb346dc7a76a82 - default default] Scheduler captured 2 delayed calls. _capture_calls /usr/lib/python2.7/site-packages/mistral/services/scheduler.py:193
2018-04-25 06:48:46.224 7526 DEBUG mistral.services.scheduler [req-444b8e1f-f6b5-458f-afae-1dc22dd2646c f3cef0caace94cf0af3d7c21b3706583 553ad7544f5a4479b4eb346dc7a76a82 - default default] Preparing next delayed call. [ID=5da7a0f3-7710-4658-9a40-25c24a65d13f, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_complete, method_arguments={u'wf_ex_id': u'a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a'}] _prepare_calls /usr/lib/python2.7/site-packages/mistral/services/scheduler.py:217
2018-04-25 06:48:46.227 7526 DEBUG mistral.services.scheduler [req-444b8e1f-f6b5-458f-afae-1dc22dd2646c f3cef0caace94cf0af3d7c21b3706583 553ad7544f5a4479b4eb346dc7a76a82 - default default] Preparing next delayed call. [ID=da0f2dfe-560d-4cd7-96a7-4f11fc12e742, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_complete, method_arguments={u'wf_ex_id': u'18ca00db-840b-47ae-8462-27b79578b3b7'}] _prepare_calls /usr/lib/python2.7/site-packages/mistral/services/scheduler.py:217
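
For what it's worth, those entries can also be inspected straight in the Mistral database. A hedged sketch, assuming the usual /root/.my.cnf credentials on the undercloud and that this Mistral release keeps its scheduler entries in the delayed_calls_v2 table:

# List the pending scheduler entries; the wf_ex_id values should match the log above.
sudo mysql mistral -e "SELECT id, target_method_name, execution_time FROM delayed_calls_v2;"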

Those delayed calls are still present, even after the deploy has failed.

If I check the tasks linked to one of the wf_ex_id values, I can see a stalled "ceph_install" task:
(http://paste.openstack.org/show/719889/)

(undercloud) [stack@undercloud tripleo-heat-templates]$ mistral task-list a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a
+--------------------------------------+--------------------------+---------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| ID | Name | Workflow name | Execution ID | State | State info | Created at | Updated at |
+--------------------------------------+--------------------------+---------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| 731d9314-c32d-4dce-a245-b2e71454a921 | collect_puppet_hieradata | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:22 | 2018-04-24 17:11:24 |
| 6adec1a4-4d9d-41b0-a595-d97468eaf6d5 | check_hieradata | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:24 | 2018-04-24 17:11:25 |
| 09b92a53-51e9-4c65-a976-8dc880236867 | merge_ip_lists | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:25 | 2018-04-24 17:11:26 |
| 520ac3d0-ee01-476b-8ea3-4a75454be7ec | set_ip_lists | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:25 | 2018-04-24 17:11:25 |
| f998bdc2-90a4-4fc6-8535-c526d3962132 | set_blacklisted_ips | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:25 | 2018-04-24 17:11:25 |
| e090146d-52ce-4040-a177-84ed8c6c3679 | enable_ssh_admin | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:26 | 2018-04-24 17:12:45 |
| 6ba3f614-f957-4e7a-9a75-077e953a690b | get_private_key | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:45 | 2018-04-24 17:12:46 |
| 164b70fc-3bc6-410e-aed5-fb8993b2777e | make_fetch_directory | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:46 | 2018-04-24 17:12:46 |
| 71e66c77-c119-4c1a-8238-7165349df2b3 | collect_nodes_uuid | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:46 | 2018-04-24 17:12:55 |
| bd43d2e9-557e-4cf2-90fc-e1715273dbfe | parse_node_data_lookup | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:55 | 2018-04-24 17:12:56 |
| cd8093d5-d557-4d14-83ff-3fdd46ac906c | set_ip_uuids | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:55 | 2018-04-24 17:12:55 |
| 26205d7e-4369-4da8-a17d-4fa47e20c636 | map_node_data_lookup | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:56 | 2018-04-24 17:12:56 |
| 36effdb7-aa88-417f-ba38-3899513f54fa | set_role_vars | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:56 | 2018-04-24 17:12:56 |
| 6a1c8361-e1c1-4009-a8ef-97c51bb49591 | build_extra_vars | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:56 | 2018-04-24 17:12:57 |
| ff70b0d9-ca05-440f-b820-916f8e962d00 | ceph_install | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | RUNNING | None | 2018-04-24 17:12:57 | <none> |
+--------------------------------------+--------------------------+---------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+

Is there any way to get the stack back to a working state? Even if I delete the two delayed calls, it doesn't help with the recovery.
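
(A hedged sketch of one thing to try, in case it helps: forcing the stalled executions out of RUNNING with the Mistral client, assuming execution-update accepts -s/--state on this release. The UUIDs are the two wf_ex_id values from the scheduler log above; whether Heat then recovers is exactly what I'm not sure about.)

# Mark the stalled workflow executions as failed so Mistral stops waiting on them.
mistral execution-update -s ERROR a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a
mistral execution-update -s ERROR 18ca00db-840b-47ae-8462-27b79578b3b7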

If I dig a bit further:
(http://paste.openstack.org/show/719891/)

(undercloud) [stack@undercloud tripleo-heat-templates]$ mistral execution-list | grep -i ceph
| bafc9652-877f-4ea9-a218-441962eaa469 | 9ebefed3-70cd-4125-a203-83aadbdb1de7 | tripleo.storage.v1.ceph-install | sub-workflow execution | b6a2f16b-e1a6-4aea-ae40-de37a0354185 | RUNNING | None | 2018-04-21 15:02:17 | 2018-04-21 15:02:17 |

I don't know whether I can safely delete things in Mistral and restart the deploy on a cleaned-up workflow state (i.e. will the deploy recreate the executions, or are there links that would end up badly broken?).
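
For the record, a hedged sketch of what such a cleanup could look like, assuming a fresh deploy recreates the workflow executions it needs (which is precisely what I'd like confirmed):

# Spot everything still marked RUNNING, then remove the stuck ceph-install executions.
mistral execution-list | grep RUNNING
mistral execution-delete bafc9652-877f-4ea9-a218-441962eaa469
mistral execution-delete a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a
mistral execution-delete 18ca00db-840b-47ae-8462-27b79578b3b7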

Thank you for your support.

Cheers,

C.

Revision history for this message
Cédric Jeanneret (cjeanneret-c2c-deactivated) wrote:

One more piece of information: when I compare with another environment (the lab, which has no issue), I can see the tasks aren't the same. Not at all:
(http://paste.openstack.org/show/719892/)

(undercloud) [LAB stack@undercloud tripleo-heat-templates]$ mistral task-list 984f680d-80f5-4e0b-ba71-e8cbf6572418
+--------------------------------------+----------------------------+----------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| ID | Name | Workflow name | Execution ID | State | State info | Created at | Updated at |
+--------------------------------------+----------------------------+----------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| ba20b5d8-67de-4b50-8d1a-cfbaa3d74b92 | ceph_base_ansible_workflow | tripleo.overcloud.workflow_tasks.step2 | 984f680d-80f5-4e0b-ba71-e8cbf6572418 | SUCCESS | None | 2018-04-25 06:40:34 | 2018-04-25 07:00:29 |
+--------------------------------------+----------------------------+----------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+

I just don't really understand…

Revision history for this message
Cédric Jeanneret (cjeanneret-c2c-deactivated) wrote:

After much struggling (thank you, gfidente!), the issue has been spotted:

In order to see what was going on, I had set the ceph-ansible verbosity level to 3 (yep, a huge amount of output):

parameter_defaults:
  CephAnsiblePlaybookVerbosity: 3

Apparently, the output is so huge that it prevented Mistral from getting the proper output/state of the running task, so the task stayed "RUNNING" even after the stack update failed due to a timeout.

In order to work around this issue, I had to do the following (a sketch of the commands follows the list):
- set the verbosity to a lower level (1 is good enough)
- clean up all the running Mistral tasks/executions
- restart the deploy
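
Roughly, that gives something like the following. A hedged sketch: ceph-verbosity.yaml is just a name picked for the example, and <options> and <execution-id> stand for the original deploy arguments and the stuck execution ID respectively.

# 1. Lower the ceph-ansible verbosity back to something Mistral can digest.
cat > ~/ceph-verbosity.yaml <<'EOF'
parameter_defaults:
  CephAnsiblePlaybookVerbosity: 1
EOF

# 2. Drop the stalled ceph-install executions so the next run starts clean.
mistral execution-list | grep -i 'ceph-install' | grep RUNNING
mistral execution-delete <execution-id>

# 3. Re-run the deploy with the extra environment file.
openstack overcloud deploy <options> -e ~/ceph-verbosity.yaml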

It would be good to check why Mistral isn't able to get the execution state properly when we need real debug output :).

Cheers,

C.

Changed in tripleo:
status: New → Triaged
importance: Undecided → Medium
milestone: none → rocky-2
importance: Medium → Low
Changed in tripleo:
status: Triaged → Fix Released
assignee: nobody → Cédric Jeanneret (cjeanneret-c2c)