Task with join runs more than once

Bug #1535295 reported by Limor Stotland
This bug affects 3 people
Affects: Mistral
Status: Invalid
Importance: High
Assigned to: Renat Akhmerov

Bug Description

Steps to reproduce:
1. Create a workflow:
p_d:
  tasks:
    list_servers:
      action: std.echo output='1'
      on-success: [start_heal_servers_workflow]
    list_stack_servers:
      action: std.echo output='2'
      on-success: [start_heal_servers_workflow]
    start_heal_servers_workflow:
      action: std.echo output='3'
      join: all
  type: direct
version: '2.0'
2. Run the workflow from step 1.
3. Run task-list.

result:
+--------------------------------------+-----------------------------+---------------+--------------------------------------+---------+------------+
| ID | Name | Workflow name | Execution ID | State | State info |
+--------------------------------------+-----------------------------+---------------+--------------------------------------+---------+------------+
| a7980263-15e9-4298-b8b5-6044eb2c10ae | list_stack_servers | p_d | 9616f763-21b0-4b81-88cb-a8f3c779f7ba | SUCCESS | None |
| c917e57e-a2bb-4cde-a9ac-6e323ab7c92b | list_servers | p_d | 9616f763-21b0-4b81-88cb-a8f3c779f7ba | SUCCESS | None |
| 3f775a58-7bb4-465f-bea5-23f368301bc7 | start_heal_servers_workflow | p_d | 9616f763-21b0-4b81-88cb-a8f3c779f7ba | SUCCESS | None |
| e1cb2afe-845d-49e3-ab93-6368b6a620cc | start_heal_servers_workflow | p_d | 9616f763-21b0-4b81-88cb-a8f3c779f7ba | SUCCESS | None |
+--------------------------------------+-----------------------------+---------------+--------------------------------------+---------+------------+

Expected result:
The task start_heal_servers_workflow is scheduled and runs only once (because of the join).
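For reference, the expected "join: all" semantics can be illustrated with a minimal sketch (hypothetical helper names, not Mistral code): the join task may only start after all of its upstream tasks have succeeded, and it must start exactly once.

```python
# Minimal illustration of 'join: all' semantics (hypothetical, not Mistral code).

def ready_to_join(upstream_states):
    """A 'join: all' task is runnable only when every upstream task succeeded."""
    return all(state == "SUCCESS" for state in upstream_states)

started = set()

def maybe_start(task_name, upstream_states):
    """Start the join task at most once, and only when the join is satisfied."""
    if ready_to_join(upstream_states) and task_name not in started:
        started.add(task_name)
        return True
    return False

# list_servers finishes first: join not yet satisfied.
assert maybe_start("start_heal_servers_workflow", ["SUCCESS", "RUNNING"]) is False
# list_stack_servers finishes: join satisfied, the task starts once.
assert maybe_start("start_heal_servers_workflow", ["SUCCESS", "SUCCESS"]) is True
# A second evaluation must not start it again.
assert maybe_start("start_heal_servers_workflow", ["SUCCESS", "SUCCESS"]) is False
```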

Revision history for this message
Lingxian Kong (kong) wrote :

Limor,

I have tested your workflow and it worked as expected in my deployment. Are you testing it in HA mode?

Changed in mistral:
status: New → Incomplete
Revision history for this message
Moshe Elisha (melisha) wrote :

Lingxian,

Yes - we are running in HA (three of everything).

I verified what you said: by default on DevStack it does not reproduce, but I was able to reproduce it after I started another engine and executor on my DevStack with the following command:

/usr/bin/python /usr/local/bin/mistral-server --server engine,executor --config-file /etc/mistral/mistral.conf

Moshe Elisha (melisha)
Changed in mistral:
status: Incomplete → New
Revision history for this message
Dawid Deja (dawid-deja-0) wrote :

I have just hit it on my DevStack without HA - just one engine and one executor.

Revision history for this message
Guy Paz (guy-paz) wrote :

Thanks Dawid Deja.

Does it happen all the time for you?
Are you using the same WF I attached?

Revision history for this message
Dawid Deja (dawid-deja-0) wrote :

Guy Paz,

It happens all the time when using the attached WF. I think it may be related to this bug: https://bugs.launchpad.net/mistral/+bug/1535722

I have done some debugging: when two tasks end at the same time, we have two parallel threads running code from default_engine.py. In the first thread, calling this[1] line results in two cmds (len(cmds) equals 2); when calling it in the second thread, cmds is an empty list.

[1] https://github.com/openstack/mistral/blob/master/mistral/engine/default_engine.py#L182

Changed in mistral:
assignee: nobody → Dawid Deja (dawid-deja-0)
Revision history for this message
Dawid Deja (dawid-deja-0) wrote :

Unfortunately, I can't work on this bug, but I found the root cause, so someone else can fix it.

Scenario:
Let's assume that we have two tasks, A and B, which both run task C on success. Task C has its 'join' property set to 'all'.

How it was intended to work:
Task A ends. While in the 'on_action_complete' function, it checks in the workflow which tasks have ended[1]. It only finds task A, so it looks up which tasks are intended to run after A. It finds C, but since the join is not satisfied, it sets C to the 'Paused' state. Then task B ends and the transition goes the same way; the only difference is that task C is already in the DB, so it only changes its state to 'RUNNING'. Task C runs once.

How it works in case of race conditions (same as in bug https://bugs.launchpad.net/mistral/+bug/1535722):
Tasks A and B end at nearly the same time. Since setting a task's state to 'SUCCESS' is one transaction and checking which tasks should run next is another transaction, here's what happens:
1. Task A sets its state to 'SUCCESS'.
2. Task B sets its state to 'SUCCESS'.
3. Task A looks in the workflow for tasks that have ended[1]. It finds both A and B. In a for loop[2], it checks which tasks are to be run after tasks A and B, so it finds C two times. Since the join is satisfied, it fires up C two times.
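The race above can be sketched with a simplified model of the two transactions (hypothetical names, not the actual engine code): state commits happen first in separate transactions, and the scheduling pass then sees both tasks as finished.

```python
# Simplified model of the race (hypothetical names, not Mistral's actual code).
# Transaction 1 commits a task's state; transaction 2 reads ALL completed
# tasks and schedules successors.

task_states = {"A": "RUNNING", "B": "RUNNING"}
scheduled = []  # every scheduling of task C is recorded here

def commit_success(task):
    # Transaction 1: mark the task as finished.
    task_states[task] = "SUCCESS"

def find_and_schedule_next():
    # Transaction 2: look at which tasks have ended and schedule successors.
    finished = [t for t, s in task_states.items() if s == "SUCCESS"]
    for t in finished:
        # Both A and B list C as their on-success target; 'join: all'
        # is satisfied once both are finished, so C qualifies on each pass.
        if all(s == "SUCCESS" for s in task_states.values()):
            scheduled.append("C")

# Interleaving that triggers the bug: both state commits land before
# either completion handler runs its scheduling transaction.
commit_success("A")
commit_success("B")
find_and_schedule_next()  # the handler for A's completion sees A AND B done

print(scheduled)  # ['C', 'C'] -- C is scheduled twice in a single pass
```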

Proposed solution:
1) Merge the two transactions into one (not easy), or
2) After getting the list of tasks in the 'SUCCESS' state, check whether some of them want to fire up the same task and, if so, prevent it somehow.

[1] https://github.com/openstack/mistral/blob/master/mistral/workflow/direct_workflow.py#L77
[2] https://github.com/openstack/mistral/blob/master/mistral/workflow/direct_workflow.py#L82
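Option 2 above could look roughly like this (a sketch under the assumption that each scheduling command carries its target task name; this is not the actual Mistral command structure or fix):

```python
def dedupe_commands(cmds):
    """Drop duplicate 'run task' commands targeting the same task,
    keeping the first occurrence (sketch of proposed solution 2)."""
    seen = set()
    unique = []
    for cmd in cmds:
        key = cmd["task_name"]  # assumed field, for illustration only
        if key not in seen:
            seen.add(key)
            unique.append(cmd)
    return unique

# Two completion handlers both produce a command for join task C;
# only one survives deduplication.
cmds = [{"task_name": "C"}, {"task_name": "C"}]
print(dedupe_commands(cmds))  # [{'task_name': 'C'}]
```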

Changed in mistral:
assignee: Dawid Deja (dawid-deja-0) → nobody
Changed in mistral:
importance: Undecided → High
milestone: none → newton-1
assignee: nobody → Renat Akhmerov (rakhmerov)
Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

Dawid, I started investigating this bug and read your comment. It seems like you didn't take locks into account: "nearly at the same time" shouldn't matter because we have a locking mechanism. I recently fixed another similar bug (https://review.openstack.org/#/c/299751/) which was caused by the fact that we had two transactions instead of one in on_action_complete() in the default engine, so the whole transactional logic was broken. I guess this bug could be related to that too. I'm now trying to reproduce it in my environment.

Please tell me I'm missing something on locking.

Thanks

Changed in mistral:
status: New → In Progress
Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

I wasn't able to reproduce it so far. I tried both HA and non-HA mode.

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

I stopped working on it for now because I wasn't able to reproduce it so far. It may have been fixed as a side effect of fixing https://bugs.launchpad.net/mistral/+bug/1524477.

If we stumble on it again, we can get back to it.

Changed in mistral:
status: In Progress → Incomplete
Revision history for this message
Dawid Deja (dawid-deja-0) wrote :

Renat,

I wasn't able to reproduce this either.
In your fix, you merged two transactions into one in 'on_action_complete', and I believe that the fact that the transaction was split into two parts produced this bug. So my guess is this one can be resolved.

Revision history for this message
Renat Akhmerov (rakhmerov) wrote : Re: [Bug 1535295] Re: Task with join runs more than once

Ok, great.

Renat


Revision history for this message
Dougal Matthews (d0ugal) wrote :

Marking this as invalid to clear out the incomplete bugs. The last two comments around 2 years ago said this couldn't be reproduced. Happy to reopen it if there is more info provided.

Changed in mistral:
status: Incomplete → Invalid