After-task logic runs multiple times if tasks run in parallel

Bug #1524477 reported by noa
Affects: Mistral
Status: Fix Released
Importance: High
Assigned to: Renat Akhmerov
Milestone: 2.0.0

Bug Description

When executing a workflow with multiple tasks (let's call them task_a, task_b, and task_c), where some tasks run async actions such as std.mistral_http (task_a) and contain "on-complete" sections, the tasks that should run later (those listed in the "on-complete" section, e.g. task_c) sometimes run more times than intended.

These tasks are run by the task that actually completed (task_a), but if a different task (task_b) completes at the same time, it sometimes runs the follow-up task (task_c) again.

This seems to happen because both tasks (task_a and task_b) reach the _on_task_state_change method in default_engine.py, and when _find_next_commands in direct_workflow.py is called, both start to run task_c.

I reproduced this issue using this workflow:

---
version: '2.0'

test_tasks:
  tasks:
    task1:
      action: std.mistral_http url="http://google.com"
      on-complete:
        - task4
    task2:
      action: std.mistral_http url="http://google.com"
      on-complete:
        - task5
    task3:
      action: std.mistral_http url="http://google.com"
      on-complete:
        - task6
    task4:
      action: std.echo output='TASK4'
    task5:
      action: std.echo output='TASK5'
    task6:
      action: std.echo output='TASK6'

and running execution-update --state SUCCESS for the action-executions of task1, task2, and task3 concurrently.

This resulted in:
task4 running twice: once when task1 finished and once when task5 finished.
task5 running twice: once when task2 finished and once when task3 finished.
task6 running twice: once when task3 finished and once when task1 finished.
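
A minimal way to fire those updates concurrently, assuming the Mitaka-era client's action-execution-update command (the action-execution IDs below are placeholders, not values from this report):

import subprocess
import threading

# Placeholders: substitute the real action-execution IDs of task1..task3.
ACTION_EX_IDS = ["<task1-action-ex-id>", "<task2-action-ex-id>", "<task3-action-ex-id>"]

def mark_success(action_ex_id):
    # Assumes `mistral action-execution-update <id> --state SUCCESS`;
    # adjust if your client version names the command differently.
    subprocess.run(["mistral", "action-execution-update",
                    action_ex_id, "--state", "SUCCESS"], check=True)

threads = [threading.Thread(target=mark_success, args=(i,)) for i in ACTION_EX_IDS]
for t in threads:
    t.start()
for t in threads:
    t.join()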

Revision history for this message
noa (noa-koffman) wrote :

I'm attaching the logs from the bug reproduction.
I have two mistral-engine logs, since we run Mistral in HA.
This is the output of the final task list of the execution:

# mistral task-list 54bfc592-40e9-4466-b115-87344b91ac91
+--------------------------------------+-------+---------------+--------------------------------------+---------+------------+
| ID                                   | Name  | Workflow name | Execution ID                         | State   | State info |
+--------------------------------------+-------+---------------+--------------------------------------+---------+------------+
| 8dca3757-bd20-46f6-a1ca-06d4a5e3ddf7 | task1 | test_tasks    | 54bfc592-40e9-4466-b115-87344b91ac91 | SUCCESS | None       |
| 8286e9db-dba7-4d7e-8811-f9659c6821f2 | task2 | test_tasks    | 54bfc592-40e9-4466-b115-87344b91ac91 | SUCCESS | None       |
| 9dc2cf86-a824-4446-8063-24e7d40c173a | task3 | test_tasks    | 54bfc592-40e9-4466-b115-87344b91ac91 | SUCCESS | None       |
| 30ca8a3f-9ac7-46d3-a2af-6d986773b5af | task5 | test_tasks    | 54bfc592-40e9-4466-b115-87344b91ac91 | SUCCESS | None       |
| a72c14c9-b5de-4ae4-b087-1dee7e93b952 | task5 | test_tasks    | 54bfc592-40e9-4466-b115-87344b91ac91 | SUCCESS | None       |
| c78e2f2f-9c92-4238-840e-082f877fb565 | task6 | test_tasks    | 54bfc592-40e9-4466-b115-87344b91ac91 | SUCCESS | None       |
| b2b3cd2c-22f8-42d1-b504-8fe15e39de34 | task4 | test_tasks    | 54bfc592-40e9-4466-b115-87344b91ac91 | SUCCESS | None       |
| 9530ebc6-79af-4b09-8c64-bc00ddfe36e6 | task6 | test_tasks    | 54bfc592-40e9-4466-b115-87344b91ac91 | SUCCESS | None       |
| 67d49e3f-afc1-44ff-b6de-bdb9ce5edcf4 | task4 | test_tasks    | 54bfc592-40e9-4466-b115-87344b91ac91 | SUCCESS | None       |
+--------------------------------------+-------+---------------+--------------------------------------+---------+------------+

Revision history for this message
noa (noa-koffman) wrote :

Please note that I added two log lines in default_engine.py, in the _on_task_state_change method, before and after the line "cmds = wf_ctrl.continue_workflow()", in order to see which task triggered which:

LOG.info("_on_task_state_change BEFORE NEXT TASK with task = %s", task_ex)
LOG.info("_on_task_state_change AFTER NEXT TASK with task = %s", task_ex)

noa (noa-koffman)
summary: - on-complete tasks are ran multiple times
+ on-complete tasks are ran multiple times on workflow resume
Revision history for this message
Moshe Elisha (melisha) wrote : Re: on-complete tasks are ran multiple times on workflow resume

Hi. Some more details about the code flow:

1. Task1, in method on_action_complete, changes task1's state to RUNNING.
2. Task2, in method on_action_complete, changes task2's state to RUNNING (before task2 calls _on_task_state_change).
3. Task2, in _on_task_state_change > continue_workflow > DirectWorkflowController._find_next_commands, finds task1, because task1 is in SUCCESS with processed = False, and _find_next_commands_for_task(task1) returns task4.
4. Task2 executes a RunTask command for task4.
5. Task1, in _on_task_state_change > continue_workflow > DirectWorkflowController._find_next_commands, also finds task1 (still SUCCESS with processed = False), and _find_next_commands_for_task(task1) again returns task4.
6. Task1 executes a RunTask command for task4.
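
To make the race concrete, here is a runnable sketch of the same check-then-act pattern (it only mimics the logic described above, it is not the actual Mistral code; the sleep just widens the race window):

import threading
import time

# Shared state, standing in for the task1 DB row.
task1 = {"state": "SUCCESS", "processed": False}
scheduled = []  # RunTask commands issued by the two concurrent "engines"

def find_next_commands(caller):
    # Steps 3 and 5 above: both callers see task1 in SUCCESS, processed == False.
    if task1["state"] == "SUCCESS" and not task1["processed"]:
        time.sleep(0.01)  # widen the window between the check and the mark
        scheduled.append("RunTask task4 (triggered by %s)" % caller)
        task1["processed"] = True  # too late: the other caller already checked

t1 = threading.Thread(target=find_next_commands, args=("task1",))
t2 = threading.Thread(target=find_next_commands, args=("task2",))
t1.start(); t2.start()
t1.join(); t2.join()

print(scheduled)  # typically shows task4 scheduled twice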

Revision history for this message
Moshe Elisha (melisha) wrote :

This bug can also be easily reproduced with the following workflow:

---
version: '2.0'

test_on_after_task_parallel_wf:
  tasks:
    task1:
      action: std.noop
      on-complete:
        - task1_a

    task2:
      action: std.noop

    task3:
      action: std.noop

    task1_a:
      action: std.noop

[stack@melisha-devstack mistral(keystone_admin)]$ mistral task-list
+--------------------------------------+---------+--------------------------------+--------------------------------------+---------+------------+
| ID                                   | Name    | Workflow name                  | Execution ID                         | State   | State info |
+--------------------------------------+---------+--------------------------------+--------------------------------------+---------+------------+
| 5992e46c-2310-46a2-b3f8-f3cf1dd4da14 | task1   | test_on_after_task_parallel_wf | 6f206afe-eca2-469e-ab2f-5b02998105bd | SUCCESS | None       |
| ba19ed74-9e60-4278-a349-07b9b4afb49c | task3   | test_on_after_task_parallel_wf | 6f206afe-eca2-469e-ab2f-5b02998105bd | SUCCESS | None       |
| d158b3ef-1f59-4fa3-bfea-7a5f3bd59a4b | task2   | test_on_after_task_parallel_wf | 6f206afe-eca2-469e-ab2f-5b02998105bd | SUCCESS | None       |
| 96ac0ec5-eede-479d-8da4-7ad5a1123f5b | task1_a | test_on_after_task_parallel_wf | 6f206afe-eca2-469e-ab2f-5b02998105bd | SUCCESS | None       |
| f20419a5-8ae9-4aae-b854-51b78d4e680b | task1_a | test_on_after_task_parallel_wf | 6f206afe-eca2-469e-ab2f-5b02998105bd | SUCCESS | None       |
| c0a7c7ee-34e1-4040-aea2-72ab45786d07 | task1_a | test_on_after_task_parallel_wf | 6f206afe-eca2-469e-ab2f-5b02998105bd | RUNNING | None       |
+--------------------------------------+---------+--------------------------------+--------------------------------------+---------+------------+

Moshe Elisha (melisha)
summary: - on-complete tasks are ran multiple times on workflow resume
+ After-task logic runs multiple times if tasks run in parallel
Changed in mistral:
importance: Undecided → High
Changed in mistral:
assignee: nobody → Renat Akhmerov (rakhmerov)
milestone: none → mitaka-2
Changed in mistral:
milestone: mitaka-2 → mitaka-3
Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

I see now where the problem is. It's basically contention between two or more DB transactions that update the state of the same workflow "simultaneously". This should have been solved by acquiring a lock on the workflow execution object, so that all changes are atomic. It seems this simple locking mechanism doesn't work properly. I'll investigate why and keep working on it.
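
For reference, the pessimistic locking described here usually takes the form of a SELECT ... FOR UPDATE on the workflow execution row. A minimal SQLAlchemy sketch, assuming a hypothetical WorkflowExecution model (not Mistral's actual DB layer):

from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class WorkflowExecution(Base):  # hypothetical stand-in model
    __tablename__ = "workflow_executions"
    id = Column(String(36), primary_key=True)
    state = Column(String(20))

engine = create_engine("postgresql:///mistral")  # placeholder URL

def process_task_completion(wf_ex_id):
    with Session(engine) as session, session.begin():
        # SELECT ... FOR UPDATE blocks concurrent transactions on this row,
        # making the read-compute-write sequence below atomic.
        wf_ex = (session.query(WorkflowExecution)
                 .filter_by(id=wf_ex_id)
                 .with_for_update()
                 .one())
        # ... compute next commands and mark tasks processed here ...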

Changed in mistral:
status: New → In Progress
Changed in mistral:
milestone: mitaka-3 → mitaka-rc1
Revision history for this message
Lingxian Kong (kong) wrote :

Hi, Renat,

What if we replace _find_next_commands() here https://github.com/openstack/mistral/blob/master/mistral/workflow/base.py#L85 with _find_next_commands_for_task()? I mean, when a task completes, why don't we just look for the subsequent tasks of that task, rather than for the subsequent tasks of all completed tasks?
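
In simplified form, the difference between the two lookups looks like this (a self-contained sketch with hypothetical names that only mirror those in mistral/workflow/, not the real signatures):

from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskEx:
    name: str
    state: str
    processed: bool = False
    next_task_names: List[str] = field(default_factory=list)

@dataclass
class WorkflowEx:
    task_executions: List[TaskEx] = field(default_factory=list)

def find_next_commands(wf_ex):
    # Current behavior: scan ALL completed-but-unprocessed tasks. Two engines
    # doing this concurrently can both pick up the same completed task before
    # either one marks it processed.
    cmds = []
    for task_ex in wf_ex.task_executions:
        if task_ex.state == "SUCCESS" and not task_ex.processed:
            cmds.extend(find_next_commands_for_task(task_ex))
            task_ex.processed = True
    return cmds

def find_next_commands_for_task(task_ex):
    # The suggestion: follow only the transitions of the one task that just
    # completed (its on-complete / on-success targets).
    return [("RunTask", name) for name in task_ex.next_task_names]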

Revision history for this message
Renat Akhmerov (rakhmerov) wrote : Re: [Bug 1524477] Re: After-task logic runs multiple times if tasks run in parallel

Nope, this is not a solution. I'll explain why later.

Renat

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

Lingxian,

What you suggest would work. However, it should work now as well, if we assume that our transactions and locks work as expected. The workflow controller should have exclusive access to the workflow execution object (actually, to the graph of objects related to it); that's why we have a locking mechanism: to prevent concurrent updates of the same workflow execution state. But the thing is that the locking mechanism does not work properly; I've done some investigation on that. So your solution would just mask a more fundamental problem. That TX/locking problem may also cause other issues, so we need to fix it.

Changed in mistral:
milestone: mitaka-rc1 → mitaka-rc2
Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

Lingxian,

On second thought, your suggestion makes sense to me now for a different reason: we need to make DB transactions as short as possible, so this would be a good optimization because we wouldn't have to do heavy DB querying.

We used to have the calculation model you mention. We moved away from it because, architecturally, it wasn't convenient to have this "cause_task" param all over the place, and there are a number of situations where there isn't any cause task: for example, when we need to resume a workflow from pause.

So my suggestions are the following:
* We need to understand precisely why it doesn't work now, so that we don't have hidden surprises in the future.
* We need to optimize the calculation of next tasks.
* Ideally, we need to move away from using locks and thereby improve parallelism, for example by using optimistic concurrency (a sketch follows below).
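
On that last point, optimistic concurrency usually means a version column plus a compare-and-swap update that is retried on conflict. A minimal sketch with a hypothetical schema (not a Mistral design):

import sqlite3

def complete_task_optimistically(conn, wf_ex_id):
    # Read the current version together with whatever state we need.
    (version,) = conn.execute(
        "SELECT version FROM workflow_executions WHERE id = ?",
        (wf_ex_id,)).fetchone()
    # ... compute next commands from the state we just read ...
    # Compare-and-swap: succeeds only if nobody bumped the version meanwhile.
    updated = conn.execute(
        "UPDATE workflow_executions SET version = version + 1 "
        "WHERE id = ? AND version = ?",
        (wf_ex_id, version)).rowcount
    if updated == 0:
        raise RuntimeError("concurrent update detected; re-read and retry")
    conn.commit()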

Anyway, let's not rush; let's first understand all the pros and cons of all the solutions.

Thanks

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to mistral (master)

Reviewed: https://review.openstack.org/298117
Committed: https://git.openstack.org/cgit/openstack/mistral/commit/?id=ad07ba0d68836ab0409fb51d235def653143cd2c
Submitter: Jenkins
Branch: master

commit ad07ba0d68836ab0409fb51d235def653143cd2c
Author: Renat Akhmerov <email address hidden>
Date: Mon Mar 28 13:44:54 2016 +0600

    Fixing engine transaction model and error handling

    * The transaction in on_action_complete() must not be split into 2 parts;
      the split caused the bug with the after-task-completion logic.
    * Fix executor behavior so that it doesn't send an error back to the engine
      if the error came from the engine itself. It should report back only
      errors that occurred in the action itself.
    * YAQL and other expected Mistral exceptions in transitions should not
      lead to a rollback of the transaction and of the action result. For
      example, if an action result came and it's valid, but we got a YAQL
      exception while evaluating transition conditions, then the action result
      should be stored normally w/o transaction rollback, and the corresponding
      task and workflow should fail with a corresponding state_info.
    * Fixed all tests
    * Minor cosmetic changes

    Closes-Bug: #1524477

    Change-Id: I09086e40a5902bbb6c977bf195cb035e31f21246
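
The first bullet is the crux of this bug: the buggy flow committed the task's state in one transaction and issued the next commands in another, which opened the race window. Schematically, with hypothetical helper names (not the actual patch):

from contextlib import contextmanager

@contextmanager
def db_transaction():  # stand-in for a real DB transaction
    yield

def set_task_state(task, state):
    task["state"] = state

def find_next_commands(task):
    return ["RunTask %s" % name for name in task["next"]]

def run_commands(cmds):
    print(cmds)

def on_action_complete_buggy(task):
    with db_transaction():  # TX 1 commits: the new state is visible ...
        set_task_state(task, "SUCCESS")
    with db_transaction():  # ... before TX 2 issues the next commands,
        run_commands(find_next_commands(task))  # so another engine can act in between

def on_action_complete_fixed(task):
    with db_transaction():  # a single TX: no window between the two steps
        set_task_state(task, "SUCCESS")
        run_commands(find_next_commands(task))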

Changed in mistral:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to mistral (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/299751

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to mistral (stable/mitaka)

Reviewed: https://review.openstack.org/299751
Committed: https://git.openstack.org/cgit/openstack/mistral/commit/?id=a3ff23cfffb14331bb6304f30bb53666e2351943
Submitter: Jenkins
Branch: stable/mitaka

commit a3ff23cfffb14331bb6304f30bb53666e2351943
Author: Renat Akhmerov <email address hidden>
Date: Mon Mar 28 13:44:54 2016 +0600

    Fixing engine transaction model and error handling

    * The transaction in on_action_complete() must not be split into 2 parts;
      the split caused the bug with the after-task-completion logic.
    * Fix executor behavior so that it doesn't send an error back to the engine
      if the error came from the engine itself. It should report back only
      errors that occurred in the action itself.
    * YAQL and other expected Mistral exceptions in transitions should not
      lead to a rollback of the transaction and of the action result. For
      example, if an action result came and it's valid, but we got a YAQL
      exception while evaluating transition conditions, then the action result
      should be stored normally w/o transaction rollback, and the corresponding
      task and workflow should fail with a corresponding state_info.
    * Fixed all tests
    * Minor cosmetic changes

    Closes-Bug: #1524477

    Change-Id: I09086e40a5902bbb6c977bf195cb035e31f21246
    (cherry picked from commit ad07ba0d68836ab0409fb51d235def653143cd2c)

tags: added: in-stable-mitaka
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/mistral 2.0.0.0rc2

This issue was fixed in the openstack/mistral 2.0.0.0rc2 release candidate.

Changed in mistral:
milestone: mitaka-rc2 → 2.0.0
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/mistral 3.0.0.0b1

This issue was fixed in the openstack/mistral 3.0.0.0b1 development milestone.

Revision history for this message
Sonal (sonal-ojha) wrote :

Renat,

I am seeing a similar issue with the Liberty version of Mistral. The setup I have is a 3-node Mistral setup, with each node running the mistral api, server, and engine services against a common RabbitMQ and database. Most of the workflows I have are direct workflows like:

tasks:
    task1:
      action: <some-action>
      on-success:
        - task2
    task2:
      action: <some-action>
      on-success:
        - task3
    task3:
      action: <some-action>
      on-success:
        - task4
    task4:
      action: <some-action>

Can you help me understand what kind of setup you had?

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

Hi, what do you mean by "setup"?

Well, the bug itself was caused by a wrong transaction design, and it was fixed in the Newton cycle (the current dev focus). Any chance you can upgrade to master? Otherwise, it may not be easy to backport the fix to Liberty.

Renat

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

Btw, the workflow you posted is topologically different, so I'm not sure you're facing the same problem. Please provide more info: which DB you use, logs, etc.

Revision history for this message
Sonal (sonal-ojha) wrote :

Renat,

By setup I meant: how was Mistral deployed while reproducing the issue? In my case, I am seeing Mistral nodes executing the same task multiple times (though not necessarily every task is executed multiple times).

We are planning to upgrade to the Mitaka version, but I'm not sure how soon.
The problem is that Mistral nodes are picking up the same task multiple times; I see something like what is shown in mistral_logs_1.txt, attached to this post.

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

The main condition to reproduce this bug is to have multiple engines. Yes, try Mitaka and let me know if it works for you.

Revision history for this message
Sonal (sonal-ojha) wrote :

The three Mistral nodes each had the engine, server, and api services running. This issue isn't seen all the time; as shown in the logs, only some tasks are picked up multiple times.

Revision history for this message
Sonal (sonal-ojha) wrote :

Renat,

Can you help me understand the data migration path from Liberty to Mitaka? During my investigation I found that the Mitaka release includes some schema changes.

I did find the mistral-db-manage CLI, which would help me migrate to the latest schema, but I couldn't find anything relevant to data migration. Can you help me understand how we migrate the data to the latest Mitaka version? Is there any CLI or tool currently available?

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :

mistral-db-manage should be available after installation. In order to update to the latest schema you should run "mistral-db-manage upgrade head".

Revision history for this message
Sonal (sonal-ojha) wrote :

Renat,

So I did the offline DB migration (mistral-db-manage --config-file /path/to/mistral.conf upgrade 009:011 --sql), and it generates SQL only to create and update the schema; I don't see anything that migrates the data to match the latest schema.

============================

mistral-db-manage --config-file /path/to/mistral.conf upgrade 009:011 --sql
INFO [alembic.runtime.migration] Context impl MySQLImpl.
INFO [alembic.runtime.migration] Generating static SQL
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
INFO [alembic.runtime.migration] Running upgrade 009 -> 010, add_resource_members_v2_table
-- Running upgrade 009 -> 010

CREATE TABLE resource_members_v2 (
    id VARCHAR(36) NOT NULL,
    created_at DATETIME,
    updated_at DATETIME,
    project_id VARCHAR(80) NOT NULL,
    member_id VARCHAR(80) NOT NULL,
    resource_id VARCHAR(80) NOT NULL,
    resource_type VARCHAR(50) NOT NULL,
    status VARCHAR(20) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE (resource_id, resource_type, member_id)
);

UPDATE alembic_version SET version_num='010' WHERE alembic_version.version_num = '009';

INFO [alembic.runtime.migration] Running upgrade 010 -> 011, add workflow id for execution
-- Running upgrade 010 -> 011

ALTER TABLE executions_v2 ADD COLUMN workflow_id VARCHAR(80);

UPDATE alembic_version SET version_num='011' WHERE alembic_version.version_num = '010';

Revision history for this message
Sonal (sonal-ojha) wrote :

Renat,

The upgrade to head fails, as it tries to run the MySQL CREATE TABLE command for tables that already exist.
