YAQLEvaluationException in workflow error message for create_admin

Bug #1734747 reported by Julie Pichon
This bug affects 2 people
Affects    Status         Importance    Assigned to    Milestone
tripleo    Fix Released   Critical      Unassigned

Bug Description

I'm currently trying to debug an issue where the stack failures list amounts to nothing more than "ERROR":

(undercloud) [stack@undercloud-0 ~]$ openstack stack failures list --long overcloud
overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::Mistral::ExternalResource
  physical_resource_id: 3a7e9a89-4283-4e8a-842e-2cde7b8b3e23
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR

It seems to be caused by something going wrong while running the workflows at https://github.com/openstack/tripleo-common/blob/master/workbooks/access.yaml . However, due to an error in how the errors or messages are parsed in the workflow, it's difficult to find the root of the problem. There are many pages of YAQL evaluation errors; I'm pasting a few that seem relevant and that ideally should be fixed to ease debugging in the future:

  ceph_base_ansible_workflow [task_ex_id=e3e09097-c89d-4b7d-aef5-5169f7f09c57] -> Failure caused by error in tasks: enable_ssh_admin

  enable_ssh_admin [task_ex_id=fe7e29f1-26ca-4c53-bc35-e72feb19ea58] -> Failure caused by error in tasks: create_admin_via_nova

  create_admin_via_nova [task_ex_id=d8ae0d9c-58f3-4f4e-9fa3-bf3189a61a76] -> Failure caused by error in tasks: create_admin

  create_admin [task_ex_id=22bf7ba4-c11f-48cb-a110-48548576da06] -> One or more actions had failed.
    [wf_ex_id=009ce7eb-0e88-4f08-800a-0716dcaa17ac, idx=0]: Failed to run task [error=Can not evaluate YAQL expression [expression=task(deploy_config).result.deploy_stderr, error=Unknown function "#property#deploy_stderr", data={}], wf=tripleo.deployment.v1.deploy_on_server, task=send_message]:

YaqlEvaluationException: Can not evaluate YAQL expression [expression=task(deploy_config).result.deploy_stderr, error=Unknown function "#property#deploy_stderr", data={}]
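When wading through pages of this output, it can help to follow the "Failure caused by error in tasks" chain down to the innermost task. The following is only a rough sketch in Python, assuming the message layout shown in the excerpt above; the function name innermost_failure is hypothetical and is not part of Mistral or tripleo-common.

```python
import re

# One line of the failure summary, e.g.
#   enable_ssh_admin [task_ex_id=...] -> Failure caused by error in tasks: create_admin_via_nova
LINE_RE = re.compile(r"^\s*(\w+) \[task_ex_id=([0-9a-f-]+)\] -> (.*)$")
CHAIN_PREFIX = "Failure caused by error in tasks: "


def innermost_failure(text):
    """Return (task_name, task_ex_id, message) for the deepest failed task.

    Hypothetical helper: it assumes the log layout quoted in this report.
    Returns None if no matching lines are found.
    """
    tasks = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        # Keep only the first occurrence of each task name.
        if match and match.group(1) not in tasks:
            tasks[match.group(1)] = (match.group(2), match.group(3).strip())
    if not tasks:
        return None

    # Map each task to the child task it blames, where it blames one.
    children = {
        name: msg[len(CHAIN_PREFIX):].split(",")[0].strip()
        for name, (_, msg) in tasks.items()
        if msg.startswith(CHAIN_PREFIX)
    }

    # Start from a task nobody else blames, then walk down the chain.
    blamed = set(children.values())
    name = next((n for n in tasks if n not in blamed), next(iter(tasks)))
    seen = set()
    while name in children and name not in seen:
        seen.add(name)
        name = children[name]

    if name not in tasks:
        # The chain points at a task whose own line was not in the text.
        return (name, None, None)
    task_id, message = tasks[name]
    return (name, task_id, message)
```

Fed the four summary lines quoted above, this walks from ceph_base_ansible_workflow down to create_admin, which is the task whose output is worth reading first.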

- - -

2017-11-27 13:30:34.431 1352 INFO workflow_trace [req-9ad76c03-46ba-455c-b02a-0536bce603d9 2c93e7a436fe4ef8ae0b91c6ec5e921c 48958151e805416f939c5579ae578435 - default default] Workflow 'tripleo.access.v1.create_admin_via_nova' [RUNNING -> ERROR, msg=Failure caused by error in tasks: create_admin
...
  create_admin_via_nova [task_ex_id=d8ae0d9c-58f3-4f4e-9fa3-bf3189a61a76] -> Failure caused by error in tasks: create_admin

  create_admin [task_ex_id=22bf7ba4-c11f-48cb-a110-48548576da06] -> One or more actions had failed.
    [wf_ex_id=009ce7eb-0e88-4f08-800a-0716dcaa17ac, idx=0]: Failed to run task [error=Can not evaluate YAQL expression [expression=task(deploy_config).result.deploy_stderr, error=Unknown function "#property#deploy_stderr", data={}], wf=tripleo.deployment.v1.deploy_on_server, task=send_message]:

  create_admin [task_ex_id=22bf7ba4-c11f-48cb-a110-48548576da06] -> One or more actions had failed.
    [wf_ex_id=009ce7eb-0e88-4f08-800a-0716dcaa17ac, idx=0]: Failed to run task [error=Can not evaluate YAQL expression [expression=task(deploy_config).result.deploy_stderr, error=Unknown function "#property#deploy_stderr", data={}], wf=tripleo.deployment.v1.deploy_on_server, task=send_message]:

[...]
  File "/usr/lib/python2.7/site-packages/mistral/expressions/yaql_expression.py", line 119, in evaluate
    cls).evaluate(trim_expr, data_context)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/yaql_expression.py", line 73, in evaluate
    ", data=%s]" % (expression, str(e), data_context)
YaqlEvaluationException: Can not evaluate YAQL expression [expression=task(deploy_config).result.deploy_stderr, error=Unknown function "#property#deploy_stderr", data={}]

Some information about the deployment: I'm trying to deploy 3 controllers + 2 compute + 3 ceph nodes, on Pike. The following environments are enabled:
- path: overcloud-resource-registry-puppet.yaml
- path: environments/docker.yaml
- path: environments/docker-ha.yaml
- path: environments/containers-default-parameters.yaml
- path: environments/ceph-ansible/ceph-ansible.yaml

Dougal Matthews (d0ugal) wrote :

The only "deploy_stderr" I can see is here. The task name (deploy_config) also matches up with your error.

https://github.com/openstack/tripleo-common/blob/db340be059f69d22cfb9f4edb8f7d3a8663b571a/workbooks/deployment.yaml#L46

The action tripleo.deployment.config is probably failing. Unfortunately, the way this workflow has been written, the error handling is broken. If this action fails, send_message is called and it attempts to access the deploy_stdout and deploy_stderr properties. However, when an action fails its output is only a string, so those properties do not exist and the YAQL expression cannot be evaluated.
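To illustrate the failure mode in plain Python rather than YAQL: the result is a mapping when the action succeeds and a bare string when it fails, so any code that reads properties has to check which one it received. This is only a sketch of the idea, not the actual workflow fix; the function name build_message and the "failed"/"message" keys are hypothetical.

```python
def build_message(result):
    """Build a message payload from an action result of either shape.

    Hypothetical helper: `result` is assumed to be a dict carrying
    deploy_stdout/deploy_stderr on success, and a plain string on failure.
    """
    if isinstance(result, dict):
        # Success path: the properties exist, but read them defensively.
        return {
            "failed": False,
            "stdout": result.get("deploy_stdout", ""),
            "stderr": result.get("deploy_stderr", ""),
        }
    # Failure path: there are no properties to read, only the error text.
    return {"failed": True, "message": str(result)}


print(build_message({"deploy_stdout": "done", "deploy_stderr": ""}))
print(build_message("The action raised an exception"))
```

Branching on the shape of the result means the error text still reaches whoever is reading the message, instead of being replaced by a second, unrelated evaluation error.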

You should be able to find the error by looking at the task output or the mistral executor log.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/523332

Dougal Matthews (d0ugal) wrote :

I created the above patch to hopefully make this issue easier to debug. The workflow should still send a message now if the action fails and hopefully it will contain enough information to explain what happened.

Giulio Fidente (gfidente) wrote :

FWIW, I have seen this happening when using TLS and the undercloud cert had not been placed amongst those considered valid by the overcloud nodes. In that scenario os-collect-config was unable to return completion to heat, because the https tempurl to hit was "unreachable".

Julie Pichon (jpichon) wrote :

I think that was it, thank you for the pointer Giulio!! My undercloud was deployed with SSL and a self-signed cert. The error output is completely impossible to read at the moment and I don't think I would have guessed for another long while.

I believe I was able to get past this particular error in my overcloud deployment by enabling a new environment that includes the contents of /etc/pki/ca-trust/source/anchors/undercloud-cacert.pem:

parameter_defaults:
  CAMap:
    overcloud-ca:
      content: |
       -----BEGIN CERTIFICATE-----
       [...]
       -----END CERTIFICATE-----
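For anyone who would rather generate that environment than paste the certificate by hand, here is a small Python sketch. It assumes the PEM file at the path mentioned above and the same CAMap layout; the function name camap_environment and the output file name inject-undercloud-ca.yaml are made up for illustration.

```python
def camap_environment(pem_text, ca_name="overcloud-ca"):
    """Return environment file text embedding a PEM certificate under CAMap.

    Hypothetical helper: it only indents the PEM lines beneath a YAML
    block scalar and does not validate the certificate in any way.
    """
    lines = [
        "parameter_defaults:",
        "  CAMap:",
        "    %s:" % ca_name,
        "      content: |",
    ]
    for line in pem_text.strip().splitlines():
        # Block scalar content must be indented deeper than "content:".
        lines.append("        " + line.strip())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    import os

    # Path taken from the comment above; the output file name is an assumption.
    source = "/etc/pki/ca-trust/source/anchors/undercloud-cacert.pem"
    if os.path.exists(source):
        with open(source) as pem_file:
            environment = camap_environment(pem_file.read())
        with open("inject-undercloud-ca.yaml", "w") as env_file:
            env_file.write(environment)
```

The resulting file would then be passed to the deploy command with -e like any other environment.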

I'm still getting only an "ERROR" in the stack failure at the moment, but there is more information in /var/log/mistral/ceph-install-workflow.log now to debug it, which is why I think I'm a bit further along.

Attila Darazs (adarazs) wrote :
Changed in tripleo:
importance: High → Critical
tags: added: ci promotion-blocker
Julie Pichon (jpichon) wrote :

Maybe we could add a Depends-On in the promotion job onto the proposed patches to see if it helps to surface the actual error? It seems like that sparse "WorkflowTasks_Step2_Execution: ERROR" can hide a lot of different problems.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/523372
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=5d4cadef9eff3e681e61b57c7bb4a0f975c52d1d
Submitter: Zuul
Branch: master

commit 5d4cadef9eff3e681e61b57c7bb4a0f975c52d1d
Author: Dougal Matthews <email address hidden>
Date: Tue Nov 28 10:05:13 2017 +0000

    Log the error from OrchestrationDeployAction

    If the action fails, it returns an error to be used in the workflow.
    While this is useful it means there is little trace of the error in the
    log files. Adding the error also to the logs will make it easier to
    debug and track down.

    Related-Bug: #1734747
    Change-Id: I5faba1feec7e87bb595edb3448a096caa0812a9b

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/524537

Ronelle Landy (rlandy) wrote :

With regard to the promotion blocker, the master promotion succeeded after https://review.openstack.org/#/c/523945/ merged.

From a CI perspective, we would be ready to close out this bug; however, https://review.openstack.org/524537 has not yet merged. The unmerged change adds debugging but is not critical to the actual issue being fixed.

Ronelle Landy (rlandy)
tags: removed: promotion-blocker
Changed in tripleo:
milestone: queens-2 → queens-3
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/523332
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=d9c8bce4605efc4d9ba16085a4e9f5121c501160
Submitter: Zuul
Branch: master

commit d9c8bce4605efc4d9ba16085a4e9f5121c501160
Author: Dougal Matthews <email address hidden>
Date: Tue Nov 28 08:14:54 2017 +0000

    Handle error in the deploy_on_server workflow

    Related-Bug: #1734747
    Change-Id: I0324644ce0e1b509b7c92c246572a3a8c6dff9b9

Changed in tripleo:
milestone: queens-3 → queens-rc1
Alex Schultz (alex-schultz) wrote :

Closing this out as it seems to be addressed. If there are open issues, feel free to reopen.

Changed in tripleo:
status: Triaged → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/pike)

Reviewed: https://review.openstack.org/524537
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=e2eabaa2fdbc084de0ec3169560c9e579f190986
Submitter: Zuul
Branch: stable/pike

commit e2eabaa2fdbc084de0ec3169560c9e579f190986
Author: Dougal Matthews <email address hidden>
Date: Tue Nov 28 10:05:13 2017 +0000

    Log the error from OrchestrationDeployAction

    If the action fails, it returns an error to be used in the workflow.
    While this is useful it means there is little trace of the error in the
    log files. Adding the error also to the logs will make it easier to
    debug and track down.

    Related-Bug: #1734747
    Change-Id: I5faba1feec7e87bb595edb3448a096caa0812a9b
    (cherry picked from commit 5d4cadef9eff3e681e61b57c7bb4a0f975c52d1d)

tags: added: in-stable-pike
Julie Pichon (jpichon) wrote :

Closing seems fair. In my case it turned out to be a user/configuration error that was just incredibly difficult to debug because of the lack of logging. Hopefully that is also resolved now.
