Mistral does not cancel the hierarchical workflow execution - occasional

Bug #1779071 reported by Istvan Imre
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mistral
High
Unassigned

Bug Description

After the correction of bug (https://bugs.launchpad.net/mistral/+bug/1731418) sometimes it happens that sub-workflow cancel not successful.

And the parent task workflow trace looks very strange:
Jun 26 14:42:14 [workflow_trace] Workflow 'main_workflow' [IDLE -> RUNNING, msg=None] (execution_id=ac4c9c53-6a6f-4153-b194-9d023c1876c3)
Jun 26 14:42:18 [workflow_trace] Workflow 'main_workflow' [RUNNING -> CANCELLED, msg=None] (execution_id=ac4c9c53-6a6f-4153-b194-9d023c1876c3)
Jun 26 14:42:19 [workflow_trace] Workflow 'sub_workflow' [IDLE -> RUNNING, msg=None] (execution_id=2a2632d9-b62c-4f0a-b6e3-1657e09303d2)
Jun 26 14:43:12 [workflow_trace] Workflow 'main_workflow' [RUNNING -> SUCCESS, msg=None] (execution_id=ac4c9c53-6a6f-4153-b194-9d023c1876c3)

The stack trace:

Exception during message handling
Traceback (most recent call last):
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
    res = self.dispatcher.dispatch(message)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/mistral/engine/engine_server.py", line 257, in stop_workflow
    return self.engine.stop_workflow(wf_ex_id, state, message)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/mistral/db/utils.py", line 85, in decorate
    return retry.wraps(_with_auth_context)(auth_ctx, func, *args, **kw)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/tenacity/__init__.py", line 214, in wrapped_f
    return self.call(f, *args, **kw)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/tenacity/__init__.py", line 295, in call
    start_time=start_time)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/tenacity/__init__.py", line 252, in iter
    return fut.result()
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/concurrent/futures/_base.py", line 455, in result
    return self.__get_result()
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/tenacity/__init__.py", line 298, in call
    result = fn(*args, **kwargs)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/mistral/db/utils.py", line 45, in _with_auth_context
    return func(*args, **kw)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/mistral/engine/action_queue.py", line 93, in decorate
    res = func(*args, **kw)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/mistral/engine/default_engine.py", line 187, in stop_workflow
    wf_handler.stop_workflow(wf_ex, state, message)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/mistral/engine/workflow_handler.py", line 80, in stop_workflow
    stop_workflow(sub_wf_ex, state, msg=msg)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/mistral/engine/workflow_handler.py", line 69, in stop_workflow
    wf.stop(state, msg)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/mistral/engine/workflows.py", line 124, in stop
    self._cancel_workflow(msg)
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/mistral/engine/workflows.py", line 459, in _cancel_workflow
    self._send_result_to_parent_workflow()
  File "/opt/nokia/cbam-venv/lib/python2.7/site-packages/mistral/engine/workflows.py", line 480, in _send_result_to_parent_workflow
    "Method _send_result_to_parent_workflow() must never be called"
RuntimeError: Method _send_result_to_parent_workflow() must never be called if a workflow is not in SUCCESS, ERROR or CANCELLED state.

Revision history for this message
Dougal Matthews (d0ugal) wrote :

Can you provide the main and sub workflow YAML you used to reproduce this? It would be useful to help resolving it.

Changed in mistral:
status: New → Incomplete
importance: Undecided → High
milestone: none → rocky-3
Revision history for this message
Istvan Imre (istvan.imre) wrote :

Generally it is hard to reproduce in live system, due to it caused by some race condition.

But I'm able to reproduce similar issue in unit test with very small workflows: (It is not too clean, but I had not better idea to reproduce the symptom)

mistral.tests.unit.engine.test_set_state.TestSetState: (slightly copy of previous test in same test class)

    def test_cancel_set_state(self):
        wf1_text = """
        version: '2.0'
        wf1:
          tasks:
            task1:
              workflow: wf2
        """

        wf2_text = """
        version: '2.0'
        wf2:
          tasks:
            task1:
              action: std.noop
              wait-before: 5
        """

        wf_service.create_workflows(wf1_text)
        wf_service.create_workflows(wf2_text)

        wf_ex = self.engine.start_workflow('wf1')

        def wait_for_subworkflow_start():
            try:
                with db_api.transaction():
                    db_api.get_workflow_execution(wf_ex.id).task_executions[0].executions[0]
                    return True
            except Exception:
                return False

        self._await(
            predicate=wait_for_subworkflow_start,
            delay=0.1,
            timeout=20,
            fail_message="Sub workflow not started in time."
        )
        with db_api.transaction():
             sub_wf_ex = db_api.get_workflow_execution(wf_ex.id).task_executions[0].executions[0]

        self.await_workflow_success(wf_ex.id)

        # The state in db is SUCCESS, but wf_ex still contains outdated info.
        self.assertEqual(states.RUNNING, sub_wf_ex.state)

        wf = workflows.Workflow(sub_wf_ex)

        # Trying to change the status of succeed execution. There is no error,
        # only warning message that state has been changed in db.
        wf.stop(states.CANCELLED)

        with db_api.transaction():
            wf_ex = db_api.get_workflow_execution(sub_wf_ex.id)

        self.assertEqual(states.SUCCESS, wf_ex.state)

Dougal Matthews (d0ugal)
Changed in mistral:
status: Incomplete → Confirmed
Dougal Matthews (d0ugal)
Changed in mistral:
milestone: rocky-3 → stein-1
Revision history for this message
Andras Kovi (akovi) wrote :

The reason for the error seems to be if that mistral/engine/workflows.py:345 is simply returning without updating the wf_ex object if the update failed. As the update happened in a different session, the local object is not affected by the change. The later notification therefore fails.

Dougal Matthews (d0ugal)
Changed in mistral:
milestone: stein-1 → none
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers