[Pike Containers] Stalled ceph_install step leads to several issues

Bug #1766784 reported by Cédric Jeanneret
Affects: tripleo
Status: Fix Released
Importance: Low
Assigned to: Cédric Jeanneret
Milestone: rocky-2

Bug Description

Dear Stackers,

I hit a timeout issue while running openstack overcloud deploy <options> on a containerized Pike deployment.

START with options: ['stack', 'failures', 'list', 'overcloud']
command: stack failures list -> heatclient.osc.v1.stack_failures.ListStackFailures (auth=True)
Using auth plugin: password
overcloud.AllNodesDeploySteps:
  resource_type: OS::TripleO::PostDeploySteps
  physical_resource_id: 754393c3-a37f-47d0-9389-db8ff99c4bd2
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted
END return value: 0
Heat Stack update failed.
Heat Stack update failed.
END return value: 1

If I check the Mistral engine log, I can see two "delayed" calls:
(http://paste.openstack.org/show/719890/)

2018-04-25 06:48:46.223 7526 DEBUG mistral.services.scheduler [req-444b8e1f-f6b5-458f-afae-1dc22dd2646c f3cef0caace94cf0af3d7c21b3706583 553ad7544f5a4479b4eb346dc7a76a82 - default default] Scheduler captured 2 delayed calls. _capture_calls /usr/lib/python2.7/site-packages/mistral/services/scheduler.py:193
2018-04-25 06:48:46.224 7526 DEBUG mistral.services.scheduler [req-444b8e1f-f6b5-458f-afae-1dc22dd2646c f3cef0caace94cf0af3d7c21b3706583 553ad7544f5a4479b4eb346dc7a76a82 - default default] Preparing next delayed call. [ID=5da7a0f3-7710-4658-9a40-25c24a65d13f, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_complete, method_arguments={u'wf_ex_id': u'a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a'}] _prepare_calls /usr/lib/python2.7/site-packages/mistral/services/scheduler.py:217
2018-04-25 06:48:46.227 7526 DEBUG mistral.services.scheduler [req-444b8e1f-f6b5-458f-afae-1dc22dd2646c f3cef0caace94cf0af3d7c21b3706583 553ad7544f5a4479b4eb346dc7a76a82 - default default] Preparing next delayed call. [ID=da0f2dfe-560d-4cd7-96a7-4f11fc12e742, factory_method_path=None, target_method_name=mistral.engine.workflow_handler._check_and_complete, method_arguments={u'wf_ex_id': u'18ca00db-840b-47ae-8462-27b79578b3b7'}] _prepare_calls /usr/lib/python2.7/site-packages/mistral/services/scheduler.py:217
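
For what it's worth, those entries can also be inspected straight in the Mistral database. A hedged sketch, assuming the usual /root/.my.cnf credentials on the undercloud and that this Mistral release keeps its scheduler entries in the delayed_calls_v2 table:

# List the pending scheduler entries; the wf_ex_id values should match the log above.
sudo mysql mistral -e "SELECT id, target_method_name, execution_time FROM delayed_calls_v2;"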

Those delayed calls are still present, even after the deploy has failed.

If I check the tasks linked to one of the wf_ex_id values, I can see a stalled "ceph_install" task:
(http://paste.openstack.org/show/719889/)

(undercloud) [stack@undercloud tripleo-heat-templates]$ mistral task-list a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a
+--------------------------------------+--------------------------+---------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| ID | Name | Workflow name | Execution ID | State | State info | Created at | Updated at |
+--------------------------------------+--------------------------+---------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| 731d9314-c32d-4dce-a245-b2e71454a921 | collect_puppet_hieradata | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:22 | 2018-04-24 17:11:24 |
| 6adec1a4-4d9d-41b0-a595-d97468eaf6d5 | check_hieradata | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:24 | 2018-04-24 17:11:25 |
| 09b92a53-51e9-4c65-a976-8dc880236867 | merge_ip_lists | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:25 | 2018-04-24 17:11:26 |
| 520ac3d0-ee01-476b-8ea3-4a75454be7ec | set_ip_lists | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:25 | 2018-04-24 17:11:25 |
| f998bdc2-90a4-4fc6-8535-c526d3962132 | set_blacklisted_ips | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:25 | 2018-04-24 17:11:25 |
| e090146d-52ce-4040-a177-84ed8c6c3679 | enable_ssh_admin | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:11:26 | 2018-04-24 17:12:45 |
| 6ba3f614-f957-4e7a-9a75-077e953a690b | get_private_key | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:45 | 2018-04-24 17:12:46 |
| 164b70fc-3bc6-410e-aed5-fb8993b2777e | make_fetch_directory | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:46 | 2018-04-24 17:12:46 |
| 71e66c77-c119-4c1a-8238-7165349df2b3 | collect_nodes_uuid | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:46 | 2018-04-24 17:12:55 |
| bd43d2e9-557e-4cf2-90fc-e1715273dbfe | parse_node_data_lookup | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:55 | 2018-04-24 17:12:56 |
| cd8093d5-d557-4d14-83ff-3fdd46ac906c | set_ip_uuids | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:55 | 2018-04-24 17:12:55 |
| 26205d7e-4369-4da8-a17d-4fa47e20c636 | map_node_data_lookup | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:56 | 2018-04-24 17:12:56 |
| 36effdb7-aa88-417f-ba38-3899513f54fa | set_role_vars | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:56 | 2018-04-24 17:12:56 |
| 6a1c8361-e1c1-4009-a8ef-97c51bb49591 | build_extra_vars | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | SUCCESS | None | 2018-04-24 17:12:56 | 2018-04-24 17:12:57 |
| ff70b0d9-ca05-440f-b820-916f8e962d00 | ceph_install | tripleo.storage.v1.ceph-install | a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a | RUNNING | None | 2018-04-24 17:12:57 | <none> |
+--------------------------------------+--------------------------+---------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+

Is there any way to get the stack back to a working state? Even if I delete the two delayed calls, it doesn't help with the recovery.
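
(A hedged sketch of one thing to try, in case it helps: forcing the stalled executions out of RUNNING with the Mistral client, assuming execution-update accepts -s/--state on this release. The UUIDs are the two wf_ex_id values from the scheduler log above; whether Heat then recovers is exactly what I'm not sure about.)

# Mark the stalled workflow executions as failed so Mistral stops waiting on them.
mistral execution-update -s ERROR a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a
mistral execution-update -s ERROR 18ca00db-840b-47ae-8462-27b79578b3b7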

If I dig a bit further:
(http://paste.openstack.org/show/719891/)

(undercloud) [stack@undercloud tripleo-heat-templates]$ mistral execution-list | grep -i ceph
| bafc9652-877f-4ea9-a218-441962eaa469 | 9ebefed3-70cd-4125-a203-83aadbdb1de7 | tripleo.storage.v1.ceph-install | sub-workflow execution | b6a2f16b-e1a6-4aea-ae40-de37a0354185 | RUNNING | None | 2018-04-21 15:02:17 | 2018-04-21 15:02:17 |

I don't know whether I can safely delete things in Mistral and restart the deploy on a cleaned-up workflow state (i.e. will the deploy recreate the executions, or are there links that would end up badly broken?).
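
For the record, a hedged sketch of what such a cleanup could look like, assuming a fresh deploy recreates the workflow executions it needs (which is precisely what I'd like confirmed):

# Spot everything still marked RUNNING, then remove the stuck ceph-install executions.
mistral execution-list | grep RUNNING
mistral execution-delete bafc9652-877f-4ea9-a218-441962eaa469
mistral execution-delete a0a1ed4d-d0a0-46ef-84ab-17a820e9e42a
mistral execution-delete 18ca00db-840b-47ae-8462-27b79578b3b7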

Thank you for your support.

Cheers,

C.

Revision history for this message
Cédric Jeanneret (cjeanneret-c2c-deactivated) wrote:

One more piece of information: when I compare with another environment (the lab, which has no issue), I can see the tasks aren't the same. Not at all:
(http://paste.openstack.org/show/719892/)

(undercloud) [LAB stack@undercloud tripleo-heat-templates]$ mistral task-list 984f680d-80f5-4e0b-ba71-e8cbf6572418
+--------------------------------------+----------------------------+----------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| ID | Name | Workflow name | Execution ID | State | State info | Created at | Updated at |
+--------------------------------------+----------------------------+----------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+
| ba20b5d8-67de-4b50-8d1a-cfbaa3d74b92 | ceph_base_ansible_workflow | tripleo.overcloud.workflow_tasks.step2 | 984f680d-80f5-4e0b-ba71-e8cbf6572418 | SUCCESS | None | 2018-04-25 06:40:34 | 2018-04-25 07:00:29 |
+--------------------------------------+----------------------------+----------------------------------------+--------------------------------------+---------+------------+---------------------+---------------------+

I just don't really understand…

Revision history for this message
Cédric Jeanneret (cjeanneret-c2c-deactivated) wrote:

After much struggling (thank you, gfidente!), the issue has been spotted:

In order to see what was going on, I had set the ceph-ansible verbosity level to 3 (yep, a huge amount of output):

parameter_defaults:
  CephAnsiblePlaybookVerbosity: 3

Apparently, the output is so huge that it prevented Mistral from getting the proper output/state of the running task, so the task stayed "RUNNING" even after the stack update failed due to a timeout.

In order to work around this issue, I had to do the following (a sketch of the commands follows the list):
- set the verbosity to a lower level (1 is good enough)
- clean up all the running Mistral tasks/executions
- restart the deploy
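
Roughly, that gives something like the following. A hedged sketch: ceph-verbosity.yaml is just a name picked for the example, and <options> and <execution-id> stand for the original deploy arguments and the stuck execution ID respectively.

# 1. Lower the ceph-ansible verbosity back to something Mistral can digest.
cat > ~/ceph-verbosity.yaml <<'EOF'
parameter_defaults:
  CephAnsiblePlaybookVerbosity: 1
EOF

# 2. Drop the stalled ceph-install executions so the next run starts clean.
mistral execution-list | grep -i 'ceph-install' | grep RUNNING
mistral execution-delete <execution-id>

# 3. Re-run the deploy with the extra environment file.
openstack overcloud deploy <options> -e ~/ceph-verbosity.yaml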

It would be good to check why Mistral isn't able to get the execution state properly when we need real debug output :).

Cheers,

C.

Changed in tripleo:
status: New → Triaged
importance: Undecided → Medium
milestone: none → rocky-2
importance: Medium → Low
Changed in tripleo:
status: Triaged → Fix Released
assignee: nobody → Cédric Jeanneret (cjeanneret-c2c)