introspect workflow doesn't report failures to zaqar properly

Bug #1716671 reported by Honza Pokorny
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
High
Dougal Matthews

Bug Description

When running tripleo.baremetal.v1.introspect encounters an error on a node, the error isn't reported correctly to Zaqar. An error message from the ".introspect" workflow never arrives; instead, we get a message from the internal sub-workflow, "._introspect". At first, I thought I had to wait for any retries as the documentation says. It says that it takes up to 20mins. But the parent workflow finishes very quickly after the sub-workflow does.

$ mistral execution-list -c 'Workflow name' -c Description -c 'Updated at'

+-----------------------------------------+------------------------+---------------------+
| Workflow name | Description | Updated at |
+-----------------------------------------+------------------------+---------------------+
| tripleo.baremetal.v1.introspect | | 2017-09-12 11:44:15 |
| tripleo.baremetal.v1._introspect | sub-workflow execution | 2017-09-12 11:44:12 |
+-----------------------------------------+------------------------+---------------------+

The nature of the error is irrelevant in this case.

This bug prevents the GUI from properly displaying an error message.

Honza Pokorny (hpokorny)
Changed in tripleo:
milestone: none → queens-1
tags: added: pike-backport-potential
Revision history for this message
Honza Pokorny (hpokorny) wrote :

Running introspection on the same node via the CLI:

$ openstack overcloud node introspect 9248b98c-67a3-4861-9489-23b977097ba3

Produces an error on the sub-workflow and then waits a very long time. I'm not sure why that is. There is nothing to retry since the child didn't time out.

Waiting for introspection to finish...
Started Mistral Workflow tripleo.baremetal.v1.introspect. Execution ID: 174ef8f5-cb66-47df-9ac1-10e3dba074c5
Waiting for messages on queue '4c78980a-e60c-48c1-8055-877bb271cacd' with no timeout.
Failed to run action [action_ex_id=07d339a3-ba84-43aa-bf5c-261454f9b1cf, action_cls='<class 'mistral.actions.action_factory.BaremetalIntrospectionAction'>', attributes='{u'client_method_name': u'introspect'}', params='{u'uuid': u'9248b98c-67a3-4861-9489-23b977097ba3'}']
 BaremetalIntrospectionAction.introspect failed: Invalid provision state for introspection: "available", valid states are "['enroll', 'manageable', 'inspect failed', 'inspecting']"

Revision history for this message
Honza Pokorny (hpokorny) wrote :

Just to be clear: I know that I can't introspect an available node. That is not the issue here. I'm working on error handling for the GUI so I'm purposely doing something wrong.

The issue is that those error messages aren't sent to Zaqar properly.

Revision history for this message
Dougal Matthews (d0ugal) wrote :

The error you are seeing from the CLI is coming via a zaqar message. This should be available to the GUI also. I agree the formatting of it isn't good - but that is essentially how Mistral formats errors - they are not really intended to be for "end users". We should look into improving that in pike.

Is there a regression here? How was the error formatted previously? What does the GUI need?

Revision history for this message
Honza Pokorny (hpokorny) wrote :

The GUI kicks off the execution, and then it waits for a websocket message to learn about the result of the execution. It receives the following message (with some fields omitted for brevity):

{
    "body": {
        "payload": {
            "execution": { ... },
            "introspected_node": null,
            "message": "Failed to run action ...",
            "status": "FAILED"
        },
        "type": "tripleo.baremetal.v1._introspect"
    },
    "queue_name": "tripleo",
    "ttl": 3600
}

The failure is correctly included in the message (ie "Failed to run..."). The issue is the "type" field. It's saying that some internal workflow that the GUI doesn't care about failed. This error should bubble up to the parent and be reported there. The GUI has a switch statement on the "type" field.

Note the underscore in the "type" field.

Honestly, I don't know what it was doing before, sorry :(

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/503600

Changed in tripleo:
assignee: nobody → Dougal Matthews (d0ugal)
status: Triaged → In Progress
Revision history for this message
Dougal Matthews (d0ugal) wrote :

I think we can make an exception for this workflow as it is a special case (as it is internal/reserved).

However, it is worth noting that if you are only looking for messages with the type of the workflow you initially executed that you may be missing messages in other cases. Workflows hard-code the type. This means that if workflow A calls workflow B and it then sends messages they will have the type "B".

You can see the logic used in tripleoclient for handling this here: https://github.com/openstack/python-tripleoclient/blob/master/tripleoclient/workflows/base.py#L48-L76

(Remember that tripleoclient uses a unique zaqar queue for each execution, so things are a bit different and it might not directly apply)

Revision history for this message
Honza Pokorny (hpokorny) wrote :

In preparing the temporary fix for the GUI, I found the place where it's expecting an error message from the tripleo.baremetal.v1.introspect workflow. In case it fails, it then looks at the "message" value of the result, and expects it to be an array. It filters out objects without the "message" key, and then pulls out the "message" values into a string.

message.filter(m => m.message).map(m => m.message)

If I were to introduce a temporary fix, it would also require an extra if-statement to check if the "message" in the execution result is an array.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ui (master)

Fix proposed to branch: master
Review: https://review.openstack.org/504035

Changed in tripleo:
assignee: Dougal Matthews (d0ugal) → Honza Pokorny (hpokorny)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by Dougal Matthews (<email address hidden>) on branch: master
Review: https://review.openstack.org/503600
Reason: Ditching this for now in favour of patching the GUI.

Revision history for this message
Honza Pokorny (hpokorny) wrote :

The issue that remains is that the parent workflow doesn't notify zaqar when finished. The execution can be seen in "mistral execution-list" as failed almost immediately but the message doesn't arrive. The UI depends on it.

Revision history for this message
Honza Pokorny (hpokorny) wrote :

I think this might be related (from /var/log/mistral/engine.log):

2017-09-14 11:10:04.635 24192 INFO workflow_trace [-] Workflow 'tripleo.baremetal.v1.introspect' [RUNNING -> ERROR, msg=Failed to handle action completion [error=Can not evaluate YAQL expression [expression='Retrying {0} nodes that failed introspection. Attempt {1} of {2} '.format($.failed_introspection.len(), $.introspection_attempt, $.max_retry_attempts + 1), error=u'failed_introspection', data={}], wf=tripleo.baremetal.v1.introspect, task=retry_failed_nodes, action=std.noop]:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/mistral/engine/task_handler.py", line 109, in _on_action_complete
    task.on_action_complete(action_ex)
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 297, in on_action_complete
    self.complete(state, state_info)
  File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/mistral/engine/tasks.py", line 172, in complete
    data_flow.publish_variables(self.task_ex, self.task_spec)
  File "/usr/lib/python2.7/site-packages/mistral/workflow/data_flow.py", line 210, in publish_variables
    task_ex.published = expr.evaluate_recursively(branch_vars, expr_ctx)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/__init__.py", line 100, in evaluate_recursively
    data[key] = _evaluate_item(data[key], context)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/__init__.py", line 79, in _evaluate_item
    return evaluate(item, context)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/__init__.py", line 71, in evaluate
    return evaluator.evaluate(expression, context)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/yaql_expression.py", line 100, in evaluate
    cls).evaluate(trim_expr, data_context)
  File "/usr/lib/python2.7/site-packages/mistral/expressions/yaql_expression.py", line 54, in evaluate
    ", data=%s]" % (expression, str(e), data_context)
YaqlEvaluationException: Can not evaluate YAQL expression [expression='Retrying {0} nodes that failed introspection. Attempt {1} of {2} '.format($.failed_introspection.len(), $.introspection_attempt, $.max_retry_attempts + 1), error=u'failed_introspection', data={}]
] (execution_id=cbccc196-6326-4ffa-9832-7054cfe1afee)

Revision history for this message
Dougal Matthews (d0ugal) wrote :

I tested this on the CLI... I do get a message from the parent workflow when it is finished.

(undercloud) [stack@925-heap--undercloud ~]$ openstack overcloud node introspect c2f1534d-3b85-4e91-9b5b-a116a03805e7
Waiting for introspection to finish...
Started Mistral Workflow tripleo.baremetal.v1.introspect. Execution ID: 1437b39a-99e4-47c2-b862-6ef42a9344c4
Waiting for messages on queue '3f47ab58-0f78-4d24-8965-c08a981b23af' with no timeout.
Failed to run action [action_ex_id=c61adbbc-8bd3-49d3-83d9-6be67ee7bc47, action_cls='<class 'mistral.actions.action_factory.BaremetalIntrospectionAction'>', attributes='{u'client_method_name': u'introspect'}', params='{u'uuid': u'c2f1534d-3b85-4e91-9b5b-a116a03805e7'}']
 BaremetalIntrospectionAction.introspect failed: Invalid provision state for introspection: "active", valid states are "['enroll', 'manageable', 'inspect failed', 'inspecting']"
Successfully introspected nodes.

That last line comes from here in the workflow: https://github.com/openstack/tripleo-common/blob/master/workbooks/baremetal.yaml#L597

(Okay, yes, there is an oddity here. "Successfully" doesn't make much sense given it failed. I was trying to introspect an active node, which fails - the workflow then asks for the introspection status and because it was previously introspected successfully it states that it was successful.)

Revision history for this message
Dougal Matthews (d0ugal) wrote :

I ran "DELETE FROM nodes;" in the inspector database to reset the inspector history.

Okay, so the problem arises if you introspect a node and it fails but there is no previous introspection status. I am now debugging this.

Revision history for this message
Dougal Matthews (d0ugal) wrote :

The problem is that the workflow assumes that start_introspection will succeed. If it fails, then the output from _introspect ( here: https://github.com/openstack/tripleo-common/blob/master/workbooks/baremetal.yaml#L431 ) isn't what we then expect later, and thus it fails.

Revision history for this message
Dougal Matthews (d0ugal) wrote :

Actually, in a mild change of events it turns out I was wrong. If the _introspect workflow completes without errors the output is actually empty.

(undercloud) [stack@925-heap--undercloud ~]$ mistral execution-get-output e598626e-9cd3-4279-b5d3-e327715b1906
{
    "result": null
}

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/504386

Changed in tripleo:
assignee: Honza Pokorny (hpokorny) → Dougal Matthews (d0ugal)
Revision history for this message
Dougal Matthews (d0ugal) wrote :

Okay, so I have figured out the issue. If start introspection fails, then inspector never adds the node to it's database. Then the call to baremetal_introspection.get_status fails. The on-error for the get_introspection_status task is incorrect in this case and doesn't correctly handle this failure - it goes on to retry the node, but fails as it doesn't have the expected information about it.

Since we want the workflow to continue when some nodes start introspection correctly, we probably want to try and filter out these problematic nodes before the get_introspection status.

This might require adding a new output to the sub-workflow, perhaps something like "started_introspection" as a boolean to let us know if that node got far enough. Then we can send a message with something like "X nodes failed to start introspection"

Revision history for this message
Dougal Matthews (d0ugal) wrote :

The result of the get_introspection_status looks like this when one node fails to start and one succeeds.

[
    "Failed to run action [action_ex_id=aaf10ff1-75b8-45f8-92c2-e169325682fa, action_cls='<class 'mistral.actions.action_factory.BaremetalIntrospectionAction'>', attributes='{u'client_method_name': u'get_status'}', params='{u'uuid': u'c2f1534d-3b85-4e91-9b5b-a116a03805e7'}']\n BaremetalIntrospectionAction.get_status failed: Could not find node c2f1534d-3b85-4e91-9b5b-a116a03805e7 in cache",
    {
        "uuid": "4ef82524-d722-4f74-95d9-229740732315",
        "links": [
            {
                "href": "http://192.168.24.1:5050/v1/introspection/4ef82524-d722-4f74-95d9-229740732315",
                "rel": "self"
            }
        ],
        "finished_at": "2017-09-15T15:35:53",
        "state": "finished",
        "finished": true,
        "error": null,
        "started_at": "2017-09-15T15:33:18"
    }
]

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/504907

Changed in tripleo:
assignee: Dougal Matthews (d0ugal) → Honza Pokorny (hpokorny)
Honza Pokorny (hpokorny)
Changed in tripleo:
assignee: Honza Pokorny (hpokorny) → Dougal Matthews (d0ugal)
Changed in tripleo:
assignee: Dougal Matthews (d0ugal) → Honza Pokorny (hpokorny)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ui (master)

Reviewed: https://review.openstack.org/504035
Committed: https://git.openstack.org/cgit/openstack/tripleo-ui/commit/?id=8990d23cf5fc879ec8963f86d526a2f4da66a0be
Submitter: Jenkins
Branch: master

commit 8990d23cf5fc879ec8963f86d526a2f4da66a0be
Author: Honza Pokorny <email address hidden>
Date: Thu Sep 14 08:15:52 2017 -0300

    Listen for messages from internal introspect workflow

    This way we can update the status of each introspected node as its state
    changes, instead of waiting until they all finish.

    Change-Id: If8c2e3819d1a3226109f3f6a312c25c9d0972331
    Partial-Bug: 1716671

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/504386
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=064ae01d31440d458b40678aae6685aeffc5e127
Submitter: Jenkins
Branch: master

commit 064ae01d31440d458b40678aae6685aeffc5e127
Author: Dougal Matthews <email address hidden>
Date: Fri Sep 15 11:04:15 2017 +0100

    Always finish by sending a Zaqar message on error

    Add a task default to send a Zaqar message on an unhandled error. This
    is useful in this workflow due to the complexity and as we have seen
    some errors that cause the workflow to fail but the CLI and GUI never
    recieve notice of this.

    Partial-Bug: #1716671
    Change-Id: I84a0fb8d039e9f1b6b124ee003c5cfb98b7f30db

Changed in tripleo:
assignee: Honza Pokorny (hpokorny) → Dougal Matthews (d0ugal)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/504907
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=2ba36561cbe7d71988d827258a90d55861c7f3e6
Submitter: Jenkins
Branch: master

commit 2ba36561cbe7d71988d827258a90d55861c7f3e6
Author: Dougal Matthews <email address hidden>
Date: Mon Sep 18 11:29:08 2017 +0100

    Handle nodes that failed to start introspection

    This can happen if the node has an incorrect provisioning state and
    hasn't previously been introspected.

    Change-Id: I49fb5d1cffa514b048d1be1e51e60245192f5e45
    Closes-Bug: #1716671

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 8.0.0

This issue was fixed in the openstack/tripleo-common 8.0.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-ui (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/510862

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-ui (stable/pike)

Reviewed: https://review.openstack.org/510862
Committed: https://git.openstack.org/cgit/openstack/tripleo-ui/commit/?id=9f3cfcf42142857e8922f625b503d114a729ae53
Submitter: Zuul
Branch: stable/pike

commit 9f3cfcf42142857e8922f625b503d114a729ae53
Author: Honza Pokorny <email address hidden>
Date: Thu Sep 14 08:15:52 2017 -0300

    Listen for messages from internal introspect workflow

    This way we can update the status of each introspected node as its state
    changes, instead of waiting until they all finish.

    Change-Id: If8c2e3819d1a3226109f3f6a312c25c9d0972331
    Partial-Bug: 1716671
    (cherry picked from commit 8990d23cf5fc879ec8963f86d526a2f4da66a0be)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/521933

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/pike)

Reviewed: https://review.openstack.org/521933
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=98f8097a8cc7eb373d7c41327457663f87d11eac
Submitter: Zuul
Branch: stable/pike

commit 98f8097a8cc7eb373d7c41327457663f87d11eac
Author: Dougal Matthews <email address hidden>
Date: Fri Sep 15 11:04:15 2017 +0100

    Always finish by sending a Zaqar message on error

    Add a task default to send a Zaqar message on an unhandled error. This
    is useful in this workflow due to the complexity and as we have seen
    some errors that cause the workflow to fail but the CLI and GUI never
    recieve notice of this.

    Partial-Bug: #1716671
    Change-Id: I84a0fb8d039e9f1b6b124ee003c5cfb98b7f30db
    (cherry picked from commit 064ae01d31440d458b40678aae6685aeffc5e127)

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.