Join waits forever on branch error

Bug #1472790 reported by Winson Chan
Affects  Status        Importance  Assigned to  Milestone
Mistral  Fix Released  High        Winson Chan
Liberty  Won't Fix     High        Winson Chan
Mitaka   Fix Released  High        Winson Chan

Bug Description

vagrant@arkham:~/tmp/mistral$ cat join_wait_bug.yaml
version: '2.0'
name: test-join-wait-bug
workflows:
  main:
    type: direct
    tasks:
      task1:
        action: std.noop
        on-success:
          - task3
      task2:
        action: std.fail
        on-success:
          - task3
      task3:
        action: std.noop
        join: all

vagrant@arkham:~$ mistral workbook-create ~/tmp/mistral/join_wait_bug.yaml
Starting new HTTP connection (1): localhost
+------------+----------------------------+
| Field      | Value                      |
+------------+----------------------------+
| Name       | test-join-wait-bug         |
| Tags       | <none>                     |
| Created at | 2015-07-08 21:18:27.685937 |
| Updated at | None                       |
+------------+----------------------------+

vagrant@arkham:~$ mistral execution-create test-join-wait-bug.main
Starting new HTTP connection (1): localhost
+------------+--------------------------------------+
| Field      | Value                                |
+------------+--------------------------------------+
| ID         | 93cb4143-5625-49cd-a480-358381fca6a9 |
| Workflow   | test-join-wait-bug.main              |
| State      | RUNNING                              |
| Created at | 2015-07-08 21:18:48.798204           |
| Updated at | 2015-07-08 21:18:49.060013           |
+------------+--------------------------------------+

vagrant@arkham:~$ mistral execution-list
Starting new HTTP connection (1): localhost
+--------------------------------------+-------------------------+---------+----------------------------+----------------------------+
| ID                                   | Workflow                | State   | Created at                 | Updated at                 |
+--------------------------------------+-------------------------+---------+----------------------------+----------------------------+
| 93cb4143-5625-49cd-a480-358381fca6a9 | test-join-wait-bug.main | RUNNING | 2015-07-08 21:18:48.798204 | 2015-07-08 21:18:49.468916 |
+--------------------------------------+-------------------------+---------+----------------------------+----------------------------+

vagrant@arkham:~$ mistral task-list
Starting new HTTP connection (1): localhost
+--------------------------------------+-------+-------------------------+--------------------------------------+---------+
| ID                                   | Name  | Workflow name           | Execution ID                         | State   |
+--------------------------------------+-------+-------------------------+--------------------------------------+---------+
| 189cde93-2c0f-4226-b8da-39ddf86795c5 | task1 | test-join-wait-bug.main | 93cb4143-5625-49cd-a480-358381fca6a9 | SUCCESS |
| 324f2bb7-7230-48c3-9927-d8cf2c7f9f67 | task2 | test-join-wait-bug.main | 93cb4143-5625-49cd-a480-358381fca6a9 | ERROR   |
| 0f653cd7-858d-42bf-b985-674132f7a9e1 | task3 | test-join-wait-bug.main | 93cb4143-5625-49cd-a480-358381fca6a9 | WAITING |
+--------------------------------------+-------+-------------------------+--------------------------------------+---------+

vagrant@arkham:~$ mistral execution-list
Starting new HTTP connection (1): localhost
+--------------------------------------+-------------------------+---------+----------------------------+----------------------------+
| ID                                   | Workflow                | State   | Created at                 | Updated at                 |
+--------------------------------------+-------------------------+---------+----------------------------+----------------------------+
| 93cb4143-5625-49cd-a480-358381fca6a9 | test-join-wait-bug.main | RUNNING | 2015-07-08 21:18:48.798204 | 2015-07-08 21:18:49.468916 |
+--------------------------------------+-------------------------+---------+----------------------------+----------------------------+

Changed in mistral:
assignee: nobody → Winson Chan (winson-c-chan)
Changed in mistral:
importance: Undecided → Critical
Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

Winson,

Actually, this behavior is intentional; it is valid by the current design. The task just hangs if some of the inbound tasks didn't complete successfully for some reason or didn't complete at all (the workflow went down a different route).

What do you think should be the right behavior?

Note that we also have a BP https://blueprints.launchpad.net/mistral/+spec/mistral-weak-join that is to some extent related to this.

Revision history for this message
Winson Chan (winson-c-chan) wrote :

For the case of join: all, if all branches completed but some completed in error, the WF should fail instead of hanging.

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

Why?

When do you think it should be hanging forever?

Revision history for this message
Winson Chan (winson-c-chan) wrote :

I'm not saying it should hang. I'm saying the WF should FAIL, which is not what actually happens. If the join statement is waiting for all the branches, and say all the branches executed but one had an error in one of its steps, execution will never reach the join task, and right now the join task will just be WAITING forever. I think the correct behavior should be to fail the workflow.

Also, I think it's hard to work out a fix for this bug over email. Currently the WF controller only knows about the inbound tasks and next commands for any task. Take, for example, a case where one of the branches has many steps and one of the steps fails upstream, or where there are multi-level joins and a branch in an upstream join fails: there's currently no way to determine whether the WF has failed, at least not without doing some recursive read on the database to identify the upstream task executions leading to its path. Even in this example, there's no record of an inbound task from that failed branch into the join task that could be used to recursively query upstream to identify the failed step.
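
To illustrate the recursive upstream read described above, here is a minimal self-contained sketch; the function name and dict shapes are illustrative only, not Mistral's API, and in the real engine each state lookup would be a per-task database read, which is where the cost concern comes from:

def branch_has_unhandled_error(task_name, states, inbound, seen=None):
    """Walk upstream of task_name looking for a step that ended in ERROR.

    states:  task name -> state; absent if the task never ran (a failed
             branch records nothing downstream of the failure).
    inbound: task name -> list of direct predecessor task names.
    In the real engine each lookup here would be a database read.
    """
    seen = seen or set()
    if task_name in seen:
        return False
    seen.add(task_name)

    if states.get(task_name) == 'ERROR':
        return True

    return any(
        branch_has_unhandled_error(name, states, inbound, seen)
        for name in inbound.get(task_name, [])
    )

# The reproducer from the bug description: task3 joins task1 and task2.
print(branch_has_unhandled_error(
    'task3',
    states={'task1': 'SUCCESS', 'task2': 'ERROR'},
    inbound={'task3': ['task1', 'task2']},
))  # -> True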

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

* >> I think the correct behavior should be to fail the workflow.

Why do you think it should fail? Any practical reason? You may be right, but please explain with examples.

* >> if one of the branches has many steps and one of the steps fails upstream, or if there are multi-level joins and a branch in an upstream join fails, there's currently no way to determine whether the WF has failed, at least not without doing some recursive read on the database to identify the upstream task executions leading to its path.

I completely disagree with the problem statement itself. 1) Why does it matter what happened upstream before the direct predecessors? We have the notion of an error handler (on-error), which says: "if a task has errored and it doesn't have an error handler, then the whole workflow fails". So if an error was handled somewhere upstream, then downstream tasks are valid to start at some point. They just need to comply with the required conditions (right state of the wf context, completed inbound tasks for a join, etc.).
What you're describing leads to a completely different execution paradigm, which we've never considered. The word "recursive" here is just scary. Anyway, I don't see why we need all this. Examples, please.

* >> there's no record of an inbound task from that failed branch into the join task that could be used to recursively query upstream to identify the failed step.

Incorrect. Look at the direct_workflow.py module. It has all the methods we need to check that. E.g. _find_inbound_task_specs() can find all inbound tasks, and using _triggers_join we can check whether a task triggers a particular join task. Also, once you've found the inbound tasks, it's easy to see whether any of them failed and do whatever we need.
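
For concreteness, a self-contained toy of the check outlined here; _find_inbound_task_specs and _triggers_join are the method names cited above from direct_workflow.py, but these bodies are simplified stand-ins over a plain dict spec, not the real implementations:

class JoinCheckSketch:
    def __init__(self, spec, task_states):
        self.spec = spec                # task name -> task attributes
        self.task_states = task_states  # task name -> state, absent if never ran

    def _find_inbound_task_specs(self, join_name):
        # Simplified stand-in: every task with a transition into the join.
        return [name for name in self.spec
                if self._triggers_join(name, join_name)]

    def _triggers_join(self, task_name, join_name):
        t = self.spec[task_name]
        return join_name in t.get('on-success', []) + t.get('on-error', [])

    def join_has_failed_inbound(self, join_name):
        # Once the inbound tasks are known, checking for a failure is easy.
        return any(self.task_states.get(name) == 'ERROR'
                   for name in self._find_inbound_task_specs(join_name))

spec = {
    'task1': {'on-success': ['task3']},
    'task2': {'on-success': ['task3']},
    'task3': {'join': 'all'},
}
check = JoinCheckSketch(spec, {'task1': 'SUCCESS', 'task2': 'ERROR'})
print(check.join_has_failed_inbound('task3'))  # -> True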

Again, let's not mix things up. Let's first come to an agreement about what we should do with this. Sorry, but I'm still not sure this is a bug. Maybe it's rather a feature that gives us a mechanism for blocking workflow branches on occurring errors?

Please start providing technical details, examples etc.

Revision history for this message
Winson Chan (winson-c-chan) wrote :

I provided an example above in the bug description. It's a very simple and practical WF. Please tell me, what should the user expect? To continue waiting for the workflow to complete?

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

Hm.. I actually tend to agree with you at this point, but this is all rather intuitive thinking, as is your argument "what should the user expect?". Only the user knows what he/she expects, and every user can expect something different. It's not enough to make a decision. The example you provided is made up, not practical at all to me. In general, I'd suggest we avoid referring to what an imagined user might expect in favor of mathematical rigor. I'm just willing to find strong reasoning showing that you're right.

One example that comes to my mind that could show the practical value of failing a "join" task in this case is something like:

build_report:
  tasks:
    get_data_from_storage1:
      ...
      on-success:
        - build_report

    get_data_from_storage2:
      ...
      on-success:
        - build_report

    get_data_from_storage3:
      ...
      on-success:
        - build_report

    build_report:
      join: all

Here it's probably meaningless to have the "build_report" task hanging forever if we couldn't get data from one of the storages. On the other hand, the workflow will fail anyway because of the failing task, so the "build_report" task becomes irrelevant anyway. The same will happen if the "join" task is in just one of the multiple branches of the workflow. Whenever a task fails and it doesn't have a handler, the entire workflow fails, right? If it does have a handler (on-error), then to downstream tasks it looks as if nothing bad happened. In your example the entire workflow will fail as well.

What am I missing here?

What I also mean is: if we change it the way you're asking, will we make some use cases that are possible now impossible? See what I mean? One such use case might be: in a complex workflow I want to block a branch on a task failure (not to be confused with failing the whole workflow).

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

Winson,

I have to take my words back, since I just realized I did not fully understand this bug. Really sorry; it looks like I need to read bug descriptions more carefully.

So, my confusion was about the difference between "failing the workflow" and "failing the join task". I was arguing because I thought that the workflow in this case just fails but the "join" task doesn't (it hangs in the WAITING state). However, I read everything again and finally understood that the workflow doesn't get failed. So no doubts then: it's a bug and it needs to be fixed.

Many apologies for the time we had to waste because of my sloppiness.

Revision history for this message
Limor Stotland (limor-bortman) wrote :

I also think that this is the expected behavior....
Because task3 is in the "WAITING" state, it makes sense that the WF will be "RUNNING"...
Winson, to implement what you want we would need to implement a timeout mechanism for tasks, and I don't think that's the right way to go... how much time would you give a task?

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

The thing is that we have a rule for direct workflows regardless of "join" tasks: if a task has failed and its error hasn't been processed by an "on-error" clause, then the entire workflow fails. So I guess it's a stupid bug, and "join" just influenced the workflow completion algorithm again (one such bug has already been fixed by Lingxian).

Revision history for this message
Winson Chan (winson-c-chan) wrote :

Renat, now that we've come to an agreement that this is a bug, can you advise on how to fix it? I own this right now and can do the work, but I want to get your insight.

Revision history for this message
Lingxian Kong (kong) wrote :

Hi Renat and Winson, I'm so sorry I didn't consider all the logical possibilities of the 'join' scenario.

Renat, thanks for your explanation of the design principle behind 'workflow failure'.

For the bug, there is a workaround I can suggest (a rough sketch follows the list):

* When a task completes (successfully or in error), we get all the tasks that depend on it.
* If a dependent task with 'join' has a condition that can no longer be satisfied, we remove its 'WAITING' task record.
* Then we leave it to the overall workflow processing, which will set the workflow status to 'ERROR'.
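
A minimal self-contained sketch of those three steps, assuming a toy dict-based data model rather than the real engine API:

def on_task_complete(completed, spec, task_execs):
    """completed: name of the task that just finished (in any state).
    spec: task name -> attributes; task_execs: task name -> state."""
    t = spec[completed]
    # Step 1: the tasks that depend on the task that just completed.
    for dep in t.get('on-success', []) + t.get('on-error', []):
        if not spec[dep].get('join'):
            continue
        inbound = [name for name, s in spec.items()
                   if dep in s.get('on-success', []) + s.get('on-error', [])]
        # Step 2: if an inbound branch ended in an unhandled ERROR, the join
        # can never be satisfied, so remove its WAITING record.
        if any(task_execs.get(name) == 'ERROR' for name in inbound):
            task_execs.pop(dep, None)
    # Step 3: with no WAITING record left, the normal workflow completion
    # logic sees only terminal tasks and sets the execution to ERROR.

spec = {
    'task1': {'on-success': ['task3']},
    'task2': {'on-success': ['task3']},
    'task3': {'join': 'all'},
}
task_execs = {'task1': 'SUCCESS', 'task2': 'ERROR', 'task3': 'WAITING'}
on_task_complete('task2', spec, task_execs)
print(task_execs)  # task3's WAITING record is gone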

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

Winson, can we close this bug? Your patch has been merged in.

Revision history for this message
Winson Chan (winson-c-chan) wrote :

We can't close this yet. I haven't finished the bug fix for this.

no longer affects: mistral/liberty
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to mistral (master)

Fix proposed to branch: master
Review: https://review.openstack.org/254525

Changed in mistral:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to mistral (master)

Reviewed: https://review.openstack.org/254525
Committed: https://git.openstack.org/cgit/openstack/mistral/commit/?id=9edd86dba5a8d05a869c99f51d259242cfbfc884
Submitter: Jenkins
Branch: master

commit 9edd86dba5a8d05a869c99f51d259242cfbfc884
Author: Winson Chan <email address hidden>
Date: Tue Dec 8 03:19:59 2015 +0000

    Fix join on branch error

    If a workflow has multiple branches that join on a task and there are
    one or more unhandled errors in upstream tasks on these branches, the
    workflow execution state will remain RUNNING indefinitely. In this
    case the workflow execution will never reach completion, so the
    engine should fail the workflow.

    Change-Id: I237ed0ac481f946de1626e8d7936cfb7bf9081d6
    Closes-Bug: #1472790
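
Conceptually, the completion rule this commit message describes reduces to something like the following self-contained toy; this is not the actual patch, just an illustration of the stated behavior:

COMPLETED = {'SUCCESS', 'ERROR'}

def final_workflow_state(task_states, has_runnable_tasks):
    """task_states: task name -> state of each recorded task execution."""
    if has_runnable_tasks:
        return 'RUNNING'
    if any(state == 'ERROR' for state in task_states.values()):
        # Before the fix a WAITING join kept the execution RUNNING forever;
        # with nothing left to run, an unhandled error fails the workflow.
        return 'ERROR'
    return 'SUCCESS' if all(s in COMPLETED for s in task_states.values()) else 'RUNNING'

# The reproducer from the bug description after task1 and task2 finish:
print(final_workflow_state(
    {'task1': 'SUCCESS', 'task2': 'ERROR', 'task3': 'WAITING'},
    has_runnable_tasks=False,
))  # -> ERROR (previously the execution stayed RUNNING)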

Changed in mistral:
status: In Progress → Fix Released
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/mistral 2.0.0.0b2

This issue was fixed in the openstack/mistral 2.0.0.0b2 development milestone.
