openstack overcloud deploy failed in heat resource_type OS::Mistral::ExternalResource

Bug #1786434 reported by Diego Abelenda

Affects: tripleo
Status: Invalid
Importance: Medium
Assigned to: Cédric Jeanneret
Milestone: stein-1

Bug Description

After trying to scale up the overcloud, I have a failed state.

The heat state is as follows:

openstack software deployment list | grep -v COMPLETE
+--------------------------------------+--------------------------------------+--------------------------------------+--------+----------+
| id | config_id | server_id | action | status |
+--------------------------------------+--------------------------------------+--------------------------------------+--------+----------+
+--------------------------------------+--------------------------------------+--------------------------------------+--------+----------+

openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| ID | Stack Name | Project | Stack Status | Creation Time | Updated Time |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| 0880522e-5edc-4c30-b992-1cffb953732b | overcloud | 553ad7544f5a4479b4eb346dc7a76a82 | UPDATE_FAILED | 2018-03-29T08:17:10Z | 2018-08-03T06:37:22Z |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+

openstack stack list --nested | grep -vi complete
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------------+----------------------+----------------------+--------------------------------------+
| ID | Stack Name | Project | Stack Status | Creation Time | Updated Time | Parent |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------------+----------------------+----------------------+--------------------------------------+
| 549a1e3f-c906-4ef3-b03b-76611a42f353 | overcloud-AllNodesDeploySteps-kzw3ig62ikjl | 553ad7544f5a4479b4eb346dc7a76a82 | UPDATE_FAILED | 2018-03-29T08:53:05Z | 2018-08-03T06:49:39Z | 0880522e-5edc-4c30-b992-1cffb953732b |
| 0880522e-5edc-4c30-b992-1cffb953732b | overcloud | 553ad7544f5a4479b4eb346dc7a76a82 | UPDATE_FAILED | 2018-03-29T08:17:10Z | 2018-08-03T06:37:22Z | None |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------------+----------------------+----------------------+--------------------------------------+

openstack stack resource list 549a1e3f-c906-4ef3-b03b-76611a42f353 | grep -vi complete
+---------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+-----------------+----------------------+
| resource_name | physical_resource_id | resource_type | resource_status | updated_time |
+---------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+-----------------+----------------------+
| WorkflowTasks_Step2_Execution | b3998930-efda-467c-bda0-058f841d4cdc | OS::Mistral::ExternalResource | CREATE_FAILED | 2018-08-03T06:50:20Z |
+---------------------------------+---------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+-----------------+----------------------+

openstack stack resource show 549a1e3f-c906-4ef3-b03b-76611a42f353 WorkflowTasks_Step2_Execution
+------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| attributes | {u'output': u'{}'} |
| creation_time | 2018-08-03T06:50:20Z |
| description | |
| links | [{u'href': u'http://10.27.100.1:8004/v1/553ad7544f5a4479b4eb346dc7a76a82/stacks/overcloud-AllNodesDeploySteps-kzw3ig62ikjl/549a1e3f-c906-4ef3-b03b-76611a42f353/resources/WorkflowTasks_Step2_Execution', u'rel': u'self'}, {u'href': u'http://10.27.100.1:8004/v1/553ad7544f5a4479b4eb346dc7a76a82/stacks/overcloud-AllNodesDeploySteps-kzw3ig62ikjl/549a1e3f-c906-4ef3-b03b-76611a42f353', u'rel': u'stack'}] |
| logical_resource_id | WorkflowTasks_Step2_Execution |
| parent_resource | AllNodesDeploySteps |
| physical_resource_id | b3998930-efda-467c-bda0-058f841d4cdc |
| required_by | [u'ControllerDeployment_Step2', u'BlockStorageDeployment_Step2', u'ObjectStorageDeployment_Step2', u'CephStorageDeployment_Step2', u'ComputeDeployment_Step2'] |
| resource_name | WorkflowTasks_Step2_Execution |
| resource_status | CREATE_FAILED |
| resource_status_reason | resources.WorkflowTasks_Step2_Execution: ERROR |
| resource_type | OS::Mistral::ExternalResource |
| updated_time | 2018-08-03T06:50:20Z |
+------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I can provide more logs, command outputs, etc.; just tell me what you need, since I am blocked right now and no TripleO troubleshooting documentation tells me more than what I already have here.
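
For reference, the failing resource is backed by a Mistral workflow execution, so (assuming the python-mistralclient OSC plugin is installed on the undercloud, as it normally is with TripleO) the failure can be chased down with something like:

openstack workflow execution list | grep -i error
openstack workflow execution output show <execution-id>
openstack task execution list <execution-id>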

Changed in tripleo:
importance: Undecided → Medium
milestone: none → stein-1
status: New → Triaged
Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Hello Diego,

Care to check mistral logs on the undercloud, as well as the ceph-ansible-install logs? Apparently the workflow failed for some reason...
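
(For reference, on a typical TripleO undercloud those logs live under /var/log/mistral; exact paths can vary by release:

sudo less /var/log/mistral/engine.log                 # workflow/task state transitions
sudo less /var/log/mistral/executor.log               # action execution output
sudo less /var/log/mistral/ceph-install-workflow.log  # ceph-ansible playbook log
)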

Cheers,

C.

Changed in tripleo:
assignee: nobody → Cédric Jeanneret (cjeanner)
Revision history for this message
Diego Abelenda (aaj6xu7ugcbx75sq) wrote :

OK, so in the Mistral log on the undercloud I found an error for the ansible-playbook execution:
2018-08-03 08:59:57.306 1718 INFO workflow_trace [req-97fa3f83-a4f9-41c6-8825-dc45daf75b3e f3cef0caace94cf0af3d7c21b3706583 553ad7544f5a4479b4eb346dc7a76a82 - default default] Task 'ceph_install' (c1c42d62-d561-49af-8879-8162f02b4d1e) [RUNNING -> ERROR, msg=Failed to run action [action_ex_id=e8ca8b24-d404-4144-aa30-729abc519a46, action_cls='<class 'mistral.actions.action_factory.AnsiblePlaybookAction'>', attributes='{}', params='{u'remote_user': u'tripleo-admin', u'inventory':

(I am skipping the very long, unreadable sequence of characters in the Mistral output.)

failed: [10.27.100.25 -> 10.27.100.25] (item=10.27.100.25) => {"changed": true, "cmd": ["/tmp/restart_mds_daemon.sh"], "delta": "0:01:16.656719", "end": "2018-08-03 06:59:56.455285", "item": "10.27.100.25", "msg": "non-zero return code", "rc": 1, "start": "2018-08-03 06:58:39.798566", "stderr": "Error response from daemon: No such container: ceph-mds-lab-controller-1", "stderr_lines": ["Error response from daemon: No such container: ceph-mds-lab-controller-1"], "stdout": "Socket file /var/run/ceph/ceph-mds.lab-controller-1.asok could not be found, which means the Metadata Server is not running.", "stdout_lines": ["Socket file /var/run/ceph/ceph-mds.lab-controller-1.asok could not be found, which means the Metadata Server is not running."]}\nskipping: [10.27.100.25] => (item=10.27.100.11) => {"changed": false, "item": "10.27.100.11", "skip_reason": "Conditional result was False"}\n\nRUNNING HANDLER [ceph-defaults : set _mds_handler_called after restart] ********

On the controller pointed out above (10.27.100.25), I have:
# docker ps -a | grep ceph-mds-lab-controller
a54baaac47ef docker.io/ceph/daemon:v3.0.1-stable-3.0-jewel-centos-7-x86_64 "/entrypoint.sh" 7 days ago Up 7 days

To make things clear: the container was started ~20 minutes AFTER the error:
"StartedAt": "2018-08-03T07:18:00.530216222Z"

Revision history for this message
Diego Abelenda (aaj6xu7ugcbx75sq) wrote :

I can see this task a few times in the ansible-playbook log before the error:

TASK [ceph-defaults : check for a mds container] *******************************\nok: [10.27.100.25] => {"changed": false, "cmd": ["docker", "ps", "-q", "--filter=name=ceph-mds-lab-controller-1"], "delta": "0:00:00.057503", "end": "2018-08-03 06:56:43.948391", "failed_when_result": false, "rc": 0, "start": "2018-08-03 06:56:43.890888", "stderr": "", "stderr_lines": [], "stdout": "01502777aa39", "stdout_lines": ["01502777aa39"]}

Note that the container ID changed, so the container was deleted and recreated by something else.

Revision history for this message
Diego Abelenda (aaj6xu7ugcbx75sq) wrote :

After extracting the output and formatting it correctly, I can see this:

TASK [ceph-defaults : check for a mds container] *******************************
ok: [10.27.100.21] => {"changed": false, "cmd": ["docker", "ps", "-q", "--filter=name=ceph-mds-lab-controller-2"], "delta": "0:00:00.047481", "end": "2018-08-03 06:53:29.466869", "failed_when_result": false, "rc": 0, "start": "2018-08-03 06:53:29.419388", "stderr": "", "stderr_lines": [], "stdout": "88301871d4ca", "stdout_lines": ["88301871d4ca"]}

TASK [ceph-docker-common : inspect ceph mds container] *************************
ok: [10.27.100.21] => {"changed": false, "cmd": ["docker", "inspect", "88301871d4ca"], "delta": "0:00:00.037797", "end": "2018-08-03 06:53:46.278203", "rc": 0, "start": "2018-08-03 06:53:46.240406", "stderr": "", "stderr_lines": [], "stdout": "[\
    {\
        \\"Id\\": \\"88301871d4ca267cea34edf54650670141665447dec9912a1802ff029f4bcb07\\",\
[...]
        \\"Image\\": \\"sha256:a57b618c839bf5d988fad360fd5dc8dbaac45e86dd521a4a0bd893eee9f48ca5\\",\
[...]
        \\"Name\\": \\"/ceph-mds-lab-controller-2\\",\
[...]

TASK [ceph-docker-common : inspecting ceph mds container image before pulling] ***
ok: [10.27.100.21] => {"changed": false, "cmd": ["docker", "inspect", "sha256:a57b618c839bf5d988fad360fd5dc8dbaac45e86dd521a4a0bd893eee9f48ca5"], "delta": "0:00:00.047199", "end": "2018-08-03 06:53:50.358497", "failed_when_result": false, "rc": 0, "start": "2018-08-03 06:53:50.311298", "stderr": "", "stderr_lines": [], "stdout": "[\
    {\
        \\"Id\\": \\"sha256:a57b618c839bf5d988fad360fd5dc8dbaac45e86dd521a4a0bd893eee9f48ca5\\",\
        \\"RepoTags\\": [\
           \\"docker.io/ceph/daemon:v3.0.1-stable-3.0-jewel-centos-7-x86_64\\"\
[...]

TASK [ceph-docker-common : set_fact ceph_mds_image_repodigest_before_pulling] ***
ok: [10.27.100.21] => {"ansible_facts": {"ceph_mds_image_repodigest_before_pulling": "sha256:23c0fdc4f571f1a8f1ff8c9ae3bfb6c8d1e073afd03f46d447e272674f790d88"}, "changed": false}

TASK [ceph-docker-common : create bootstrap directories] ***********************
changed: [10.27.100.21] => (item=/var/lib/ceph/bootstrap-mds) => {"changed": true, "gid": 64045, "group": "64045", "item": "/var/lib/ceph/bootstrap-mds", "mode": "0755", "owner": "64045", "path": "/var/lib/ceph/bootstrap-mds", "secontext": "system_u:object_r:svirt_sandbox_file_t:s0", "size": 26, "state": "directory", "uid": 64045}

RUNNING HANDLER [ceph-defaults : set _mds_handler_called before restart] *******
ok: [10.27.100.21] => {"ansible_facts": {"_mds_handler_called": true}, "changed": false}

RUNNING HANDLER [ceph-defaults : copy mds restart script] **********************
changed: [10.27.100.21] => {"changed": true, "checksum": "0cd890eafd8a6ff85eba5513f6de327600d6cdad", "dest": "/tmp/restart_mds_daemon.sh", "gid": 0, "group": "root", "md5sum": "808257142c67e59068780935fc608638", "mode": "0750", "owner": "root", "secontext": "unconfined_u:object_r:user_home_t:s0", "size": 585, "src": "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1533279306.77-211642201726585/source", "state": "file", "uid": 0}

RUNNING HANDLER [ceph-defaults : restart ceph mds daemon(s) - container] *******
changed: [10.27...

Revision history for this message
Diego Abelenda (aaj6xu7ugcbx75sq) wrote :

Hello,

Since this is a test/lab TripleO deployment on virtual machines that I can snapshot and revert, I can retry as many times as needed, in any order, and run as many commands to check things before hitting the error as we want.
I also have a production deployment. I need to add hosts to the production cluster, but as long as I cannot do it on the test/lab environment, I don't want to touch production and risk breaking it.

In the current state, I get this error for any action that updates the stack: removing a node, adding a node, etc.

On top of that, one virtual node is dead; I need to remove it and replace it with a new one (its QCOW2 image is completely corrupt, no idea how that is possible), and I cannot do so because of this problem...
Most likely the unreachable node makes updating the overcloud impossible as well, even apart from this issue.

I can give you as much additional info as you need. I reproduced the error again a few hours ago, after rolling back the VMs.

Revision history for this message
Diego Abelenda (aaj6xu7ugcbx75sq) wrote :

As we discussed on the #tripleo IRC channel, it really was a timeout issue in ceph-ansible's wait for the container restart: 5 retries of 10 seconds each were not enough, since on my setup it took 63 seconds for the MDS to create the socket.
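
For context, the restart handler waits for the daemon's admin socket in a retry loop. A simplified sketch of what ceph-ansible's restart_mds_daemon.sh does (the retry and delay values come from the ceph-ansible variables handler_health_mds_check_retries and handler_health_mds_check_delay, which default to 5 and 10):

SOCKET=/var/run/ceph/ceph-mds.$(hostname -s).asok
RETRIES=5   # handler_health_mds_check_retries (default)
DELAY=10    # handler_health_mds_check_delay (default)
while [ "$RETRIES" -ne 0 ]; do
    # socket present means the MDS is back up
    test -S "$SOCKET" && exit 0
    sleep "$DELAY"
    RETRIES=$((RETRIES - 1))
done
# after ~50 seconds, give up -- this is the message seen in the log above
echo "Socket file $SOCKET could not be found, which means the Metadata Server is not running."
exit 1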

So this is now fixed by overriding the timeout values.
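
For anyone hitting the same timeout, a sketch of the override (the CephAnsibleExtraConfig heat parameter forwards extra variables to ceph-ansible; the file name and the exact values here are illustrative, anything comfortably above the observed 63 seconds should do):

cat > ceph-handler-timeouts.yaml <<'EOF'
parameter_defaults:
  CephAnsibleExtraConfig:
    handler_health_mds_check_retries: 10
    handler_health_mds_check_delay: 15
EOF

openstack overcloud deploy --templates ... -e ceph-handler-timeouts.yaml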

Changed in tripleo:
status: Triaged → Invalid
Revision history for this message
John Fulton (jfulton-org) wrote :

Here's more info on overriding those variables for a similar update:

 https://bugzilla.redhat.com/show_bug.cgi?id=1620699#c5
