openstack overcloud deploy failed in heat resource_type OS::Mistral::ExternalResource

Bug #1786434 reported by Diego Abelenda

Affects: tripleo
Status: Invalid
Importance: Medium
Assigned to: Cédric Jeanneret
Milestone: stein-1

Bug Description

After trying to scale up the overcloud, I have a failed state.

The heat state is as follows:

openstack software deployment list | grep -v COMPLETE
+--------------------------------------+--------------------------------------+--------------------------------------+--------+----------+
| id | config_id | server_id | action | status |
+--------------------------------------+--------------------------------------+--------------------------------------+--------+----------+
+--------------------------------------+--------------------------------------+--------------------------------------+--------+----------+

openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| ID | Stack Name | Project | Stack Status | Creation Time | Updated Time |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| 0880522e-5edc-4c30-b992-1cffb953732b | overcloud | 553ad7544f5a4479b4eb346dc7a76a82 | UPDATE_FAILED | 2018-03-29T08:17:10Z | 2018-08-03T06:37:22Z |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+

openstack stack list --nested | grep -vi complete
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------------+----------------------+----------------------+--------------------------------------+
| ID | Stack Name | Project | Stack Status | Creation Time | Updated Time | Parent |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------------+----------------------+----------------------+--------------------------------------+
| 549a1e3f-c906-4ef3-b03b-76611a42f353 | overcloud-AllNodesDeploySteps-kzw3ig62ikjl | 553ad7544f5a4479b4eb346dc7a76a82 | UPDATE_FAILED | 2018-03-29T08:53:05Z | 2018-08-03T06:49:39Z | 0880522e-5edc-4c30-b992-1cffb953732b |
| 0880522e-5edc-4c30-b992-1cffb953732b | overcloud | 553ad7544f5a4479b4eb346dc7a76a82 | UPDATE_FAILED | 2018-03-29T08:17:10Z | 2018-08-03T06:37:22Z | None |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------------+----------------------+----------------------+--------------------------------------+

openstack stack resource list 549a1e3f-c906-4ef3-b03b-76611a42f353 | grep -vi complete
+---------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+-----------------+----------------------+
| resource_name | physical_resource_id | resource_type | resource_status | updated_time |
+---------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+-----------------+----------------------+
| WorkflowTasks_Step2_Execution | b3998930-efda-467c-bda0-058f841d4cdc | OS::Mistral::ExternalResource | CREATE_FAILED | 2018-08-03T06:50:20Z |
+---------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+-----------------+----------------------+

openstack stack resource show 549a1e3f-c906-4ef3-b03b-76611a42f353 WorkflowTasks_Step2_Execution
+------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes | {u'output': u'{}'} |
| creation_time | 2018-08-03T06:50:20Z |
| description | |
| links | [{u'href': u'http://10.27.100.1:8004/v1/553ad7544f5a4479b4eb346dc7a76a82/stacks/overcloud-AllNodesDeploySteps-kzw3ig62ikjl/549a1e3f-c906-4ef3-b03b-76611a42f353/resources/WorkflowTasks_Step2_Execution', u'rel': u'self'}, {u'href': u'http://10.27.100.1:8004/v1/553ad7544f5a4479b4eb346dc7a76a82/stacks/overcloud-AllNodesDeploySteps-kzw3ig62ikjl/549a1e3f-c906-4ef3-b03b-76611a42f353', u'rel': u'stack'}] |
| logical_resource_id | WorkflowTasks_Step2_Execution |
| parent_resource | AllNodesDeploySteps |
| physical_resource_id | b3998930-efda-467c-bda0-058f841d4cdc |
| required_by | [u'ControllerDeployment_Step2', u'BlockStorageDeployment_Step2', u'ObjectStorageDeployment_Step2', u'CephStorageDeployment_Step2', u'ComputeDeployment_Step2'] |
| resource_name | WorkflowTasks_Step2_Execution |
| resource_status | CREATE_FAILED |
| resource_status_reason | resources.WorkflowTasks_Step2_Execution: ERROR |
| resource_type | OS::Mistral::ExternalResource |
| updated_time | 2018-08-03T06:50:20Z |
+------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I can provide more logs, command outputs, etc.; just tell me what you need, since I am blocked right now and no TripleO troubleshooting documentation tells me more than what I already have here.
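
For reference, the failing resource is backed by a Mistral workflow execution, so (assuming the python-mistralclient OSC plugin is installed on the undercloud, as it normally is with TripleO) the failure can be chased down with something like:

openstack workflow execution list | grep -i error
openstack workflow execution output show <execution-id>
openstack task execution list <execution-id>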

Changed in tripleo:
importance: Undecided → Medium
milestone: none → stein-1
status: New → Triaged
Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Hello Diego,

Care to check mistral logs on the undercloud, as well as the ceph-ansible-install logs? Apparently the workflow failed for some reason...
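
(For reference, on a typical TripleO undercloud those logs live under /var/log/mistral; exact paths can vary by release:

sudo less /var/log/mistral/engine.log                 # workflow/task state transitions
sudo less /var/log/mistral/executor.log               # action execution output
sudo less /var/log/mistral/ceph-install-workflow.log  # ceph-ansible playbook log
)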

Cheers,

C.

Changed in tripleo:
assignee: nobody → Cédric Jeanneret (cjeanner)
Revision history for this message
Diego Abelenda (aaj6xu7ugcbx75sq) wrote :

OK, so in the Mistral log on the undercloud I found an error for the ansible-playbook execution:
2018-08-03 08:59:57.306 1718 INFO workflow_trace [req-97fa3f83-a4f9-41c6-8825-dc45daf75b3e f3cef0caace94cf0af3d7c21b3706583 553ad7544f5a4479b4eb346dc7a76a82 - default default] Task 'ceph_install' (c1c42d62-d561-49af-8879-8162f02b4d1e) [RUNNING -> ERROR, msg=Failed to run action [action_ex_id=e8ca8b24-d404-4144-aa30-729abc519a46, action_cls='<class 'mistral.actions.action_factory.AnsiblePlaybookAction'>', attributes='{}', params='{u'remote_user': u'tripleo-admin', u'inventory':

(I am skipping the very long, unreadable sequence of characters in the Mistral output.)

failed: [10.27.100.25 -> 10.27.100.25] (item=10.27.100.25) => {"changed": true, "cmd": ["/tmp/restart_mds_daemon.sh"], "delta": "0:01:16.656719", "end": "2018-08-03 06:59:56.455285", "item": "10.27.100.25", "msg": "non-zero return code", "rc": 1, "start": "2018-08-03 06:58:39.798566", "stderr": "Error response from daemon: No such container: ceph-mds-lab-controller-1", "stderr_lines": ["Error response from daemon: No such container: ceph-mds-lab-controller-1"], "stdout": "Socket file /var/run/ceph/ceph-mds.lab-controller-1.asok could not be found, which means the Metadata Server is not running.", "stdout_lines": ["Socket file /var/run/ceph/ceph-mds.lab-controller-1.asok could not be found, which means the Metadata Server is not running."]}\nskipping: [10.27.100.25] => (item=10.27.100.11) => {"changed": false, "item": "10.27.100.11", "skip_reason": "Conditional result was False"}\n\nRUNNING HANDLER [ceph-defaults : set _mds_handler_called after restart] ********

On the controller pointed out above (10.27.100.25), I have:
# docker ps -a | grep ceph-mds-lab-controller
a54baaac47ef docker.io/ceph/daemon:v3.0.1-stable-3.0-jewel-centos-7-x86_64 "/entrypoint.sh" 7 days ago Up 7 days

To make things clear: the container was started ~20 minutes AFTER the error:
"StartedAt": "2018-08-03T07:18:00.530216222Z"

Revision history for this message
Diego Abelenda (aaj6xu7ugcbx75sq) wrote :

I can see this task a few times in the ansible-playbook log before the error:

TASK [ceph-defaults : check for a mds container] *******************************\nok: [10.27.100.25] => {"changed": false, "cmd": ["docker", "ps", "-q", "--filter=name=ceph-mds-lab-controller-1"], "delta": "0:00:00.057503", "end": "2018-08-03 06:56:43.948391", "failed_when_result": false, "rc": 0, "start": "2018-08-03 06:56:43.890888", "stderr": "", "stderr_lines": [], "stdout": "01502777aa39", "stdout_lines": ["01502777aa39"]}

Note that the container ID changed, so the container was deleted and recreated by something else.

Revision history for this message
Diego Abelenda (aaj6xu7ugcbx75sq) wrote :

After extracting the output and formatting it correctly, I can see this:

TASK [ceph-defaults : check for a mds container] *******************************
ok: [10.27.100.21] => {"changed": false, "cmd": ["docker", "ps", "-q", "--filter=name=ceph-mds-lab-controller-2"], "delta": "0:00:00.047481", "end": "2018-08-03 06:53:29.466869", "failed_when_result": false, "rc": 0, "start": "2018-08-03 06:53:29.419388", "stderr": "", "stderr_lines": [], "stdout": "88301871d4ca", "stdout_lines": ["88301871d4ca"]}

TASK [ceph-docker-common : inspect ceph mds container] *************************
ok: [10.27.100.21] => {"changed": false, "cmd": ["docker", "inspect", "88301871d4ca"], "delta": "0:00:00.037797", "end": "2018-08-03 06:53:46.278203", "rc": 0, "start": "2018-08-03 06:53:46.240406", "stderr": "", "stderr_lines": [], "stdout": "[\
    {\
        \\"Id\\": \\"88301871d4ca267cea34edf54650670141665447dec9912a1802ff029f4bcb07\\",\
[...]
        \\"Image\\": \\"sha256:a57b618c839bf5d988fad360fd5dc8dbaac45e86dd521a4a0bd893eee9f48ca5\\",\
[...]
        \\"Name\\": \\"/ceph-mds-lab-controller-2\\",\
[...]

TASK [ceph-docker-common : inspecting ceph mds container image before pulling] ***
ok: [10.27.100.21] => {"changed": false, "cmd": ["docker", "inspect", "sha256:a57b618c839bf5d988fad360fd5dc8dbaac45e86dd521a4a0bd893eee9f48ca5"], "delta": "0:00:00.047199", "end": "2018-08-03 06:53:50.358497", "failed_when_result": false, "rc": 0, "start": "2018-08-03 06:53:50.311298", "stderr": "", "stderr_lines": [], "stdout": "[\
    {\
        \\"Id\\": \\"sha256:a57b618c839bf5d988fad360fd5dc8dbaac45e86dd521a4a0bd893eee9f48ca5\\",\
        \\"RepoTags\\": [\
           \\"docker.io/ceph/daemon:v3.0.1-stable-3.0-jewel-centos-7-x86_64\\"\
[...]

TASK [ceph-docker-common : set_fact ceph_mds_image_repodigest_before_pulling] ***
ok: [10.27.100.21] => {"ansible_facts": {"ceph_mds_image_repodigest_before_pulling": "sha256:23c0fdc4f571f1a8f1ff8c9ae3bfb6c8d1e073afd03f46d447e272674f790d88"}, "changed": false}

TASK [ceph-docker-common : create bootstrap directories] ***********************
changed: [10.27.100.21] => (item=/var/lib/ceph/bootstrap-mds) => {"changed": true, "gid": 64045, "group": "64045", "item": "/var/lib/ceph/bootstrap-mds", "mode": "0755", "owner": "64045", "path": "/var/lib/ceph/bootstrap-mds", "secontext": "system_u:object_r:svirt_sandbox_file_t:s0", "size": 26, "state": "directory", "uid": 64045}

RUNNING HANDLER [ceph-defaults : set _mds_handler_called before restart] *******
ok: [10.27.100.21] => {"ansible_facts": {"_mds_handler_called": true}, "changed": false}

RUNNING HANDLER [ceph-defaults : copy mds restart script] **********************
changed: [10.27.100.21] => {"changed": true, "checksum": "0cd890eafd8a6ff85eba5513f6de327600d6cdad", "dest": "/tmp/restart_mds_daemon.sh", "gid": 0, "group": "root", "md5sum": "808257142c67e59068780935fc608638", "mode": "0750", "owner": "root", "secontext": "unconfined_u:object_r:user_home_t:s0", "size": 585, "src": "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1533279306.77-211642201726585/source", "state": "file", "uid": 0}

RUNNING HANDLER [ceph-defaults : restart ceph mds daemon(s) - container] *******
changed: [10.27...

Revision history for this message
Diego Abelenda (aaj6xu7ugcbx75sq) wrote :

Hello,

Since this is a test/lab TripleO deployment on virtual machines that I can snapshot and revert, I can retry as many times as needed, in any order, and run as many commands to check things before hitting the error as we want.
I also have a production deployment. I need to add hosts to the production cluster, but as long as I cannot do it on the test/lab environment, I don't want to touch production and risk breaking it.

In the current state, I get this error for any action that updates the stack: removing a node, adding a node, etc.

On top of that, one virtual node is dead; I need to remove it and replace it with a new one (its QCOW2 image is completely corrupt, no idea how that is possible), and I cannot do so because of this problem...
Most likely the unreachable node makes updating the overcloud impossible as well, even apart from this issue.

I can give you as much additional info as you need. I reproduced the error again a few hours ago, after rolling back the VMs.

Revision history for this message
Diego Abelenda (aaj6xu7ugcbx75sq) wrote :

As we discussed on the #tripleo IRC channel, it really was a timeout issue in ceph-ansible's wait for the container restart: 5 retries of 10 seconds each were not enough, since on my setup it took 63 seconds for the MDS to create the socket.
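
For context, the restart handler waits for the daemon's admin socket in a retry loop. A simplified sketch of what ceph-ansible's restart_mds_daemon.sh does (the retry and delay values come from the ceph-ansible variables handler_health_mds_check_retries and handler_health_mds_check_delay, which default to 5 and 10):

SOCKET=/var/run/ceph/ceph-mds.$(hostname -s).asok
RETRIES=5   # handler_health_mds_check_retries (default)
DELAY=10    # handler_health_mds_check_delay (default)
while [ "$RETRIES" -ne 0 ]; do
    # socket present means the MDS is back up
    test -S "$SOCKET" && exit 0
    sleep "$DELAY"
    RETRIES=$((RETRIES - 1))
done
# after ~50 seconds, give up -- this is the message seen in the log above
echo "Socket file $SOCKET could not be found, which means the Metadata Server is not running."
exit 1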

So this is now fixed by overriding the timeout values.
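
For anyone hitting the same timeout, a sketch of the override (the CephAnsibleExtraConfig heat parameter forwards extra variables to ceph-ansible; the file name and the exact values here are illustrative, anything comfortably above the observed 63 seconds should do):

cat > ceph-handler-timeouts.yaml <<'EOF'
parameter_defaults:
  CephAnsibleExtraConfig:
    handler_health_mds_check_retries: 10
    handler_health_mds_check_delay: 15
EOF

openstack overcloud deploy --templates ... -e ceph-handler-timeouts.yaml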

Changed in tripleo:
status: Triaged → Invalid
Revision history for this message
John Fulton (jfulton-org) wrote :

Here's more info on overriding those variables for a similar update:

 https://bugzilla.redhat.com/show_bug.cgi?id=1620699#c5
