WF execution stays in RUNNING although task and action executions are in SUCCESS

Bug #1518012 reported by Moshe Elisha
Affects: Mistral
Status: Fix Released
Importance: High
Assigned to: Moshe Elisha
Milestone: 2.0.0

Bug Description

Hi,

I have created a workflow and executed it.
The action executions are all in SUCCESS.
The task executions are all in SUCCESS.
The workflow execution is still in RUNNING.

I see one ERROR line in the logs that might be related (os-mistral-01 mistral-engine.log):

2015-11-19 19:52:23.654 2034 ERROR oslo_messaging._drivers.impl_qpid [-] Unable to connect to AMQP server on {'username': 'guest', 'host': '10.1.20.151:5671', 'password': 'guest'}: [Errno 8] _ssl.c:504: EOF occurred in violation of protocol. Sleeping 1 seconds
2015-11-19 19:52:23.655 2034 DEBUG oslo_messaging._drivers.amqp [-] Pool creating new connection create /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqp.py:103
2015-11-19 19:52:23.668 2034 INFO oslo_messaging._drivers.impl_qpid [-] Connected to AMQP server on 10.1.20.151:5671
2015-11-19 19:52:23.675 2034 DEBUG oslo_messaging._drivers.amqpdriver [-] sending reply msg_id: 11bb94319e374fd792d2fbcdf698e39d reply queue: reply_852b887d8ad54eb39f92b7cc9a669bbb _send_reply /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:79
2015-11-19 19:52:24.673 2034 INFO oslo_messaging._drivers.impl_qpid [-] Connected to AMQP server on 10.1.20.151:5671
2015-11-19 19:52:24.681 2034 DEBUG oslo_messaging._drivers.amqpdriver [-] sending reply msg_id: 125489f3fc3b498d9c828f0d8fa4db36 reply queue: reply_47481d93ae9d46be91133c8632d783d4 _send_reply /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:79

In the attached ZIP you will find:
* The workflow used
* The cmd output showing the scenario
* The Mistral logs (from all three servers)

Revision history for this message
Moshe Elisha (melisha) wrote :

I am able to reproduce this bug. In other occurrences of the same failed scenario I don't see the "ERROR oslo_messaging._drivers.impl_qpid" line, so that ERROR might not be related to the bug.

Revision history for this message
Moshe Elisha (melisha) wrote :

I have added a logging line to default_engine.py:_check_workflow_completion

        if wf_utils.find_incomplete_task_executions(wf_ex):
+           LOG.debug("WF execution %s has incomplete task executions: %s"
+                     % (wf_ex, wf_ex.task_executions))
            return

I can see that all four task executions print that statement, each unaware of the updated status of the other task executions.
Can you confirm? I have attached the relevant log segments.
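
For anyone who wants to reproduce the isolation behavior outside Mistral, here is a minimal standalone sketch (assuming SQLAlchemy 2.x against a MySQL/InnoDB database; the connection URL, table and column names are illustrative, not Mistral's actual schema):

    # Two connections on the same engine; InnoDB defaults to REPEATABLE READ.
    import sqlalchemy as sa

    engine = sa.create_engine(
        "mysql+pymysql://user:pass@localhost/demo",  # placeholder URL
        isolation_level="REPEATABLE READ",
    )

    with engine.connect() as conn_a, engine.connect() as conn_b:
        # conn_a's snapshot is established by its first read statement.
        conn_a.execute(sa.text("SELECT state FROM tasks")).fetchall()

        # conn_b marks a task as finished and commits.
        conn_b.execute(sa.text("UPDATE tasks SET state = 'SUCCESS' WHERE id = 1"))
        conn_b.commit()

        # conn_a still sees the pre-update snapshot until its own transaction
        # ends, so a completion check running here reads a stale state.
        stale = conn_a.execute(
            sa.text("SELECT state FROM tasks WHERE id = 1")).scalar()

This is consistent with the observation above: each task execution checks the others inside its own transaction and never sees their freshly committed SUCCESS states.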

Revision history for this message
Renat Akhmerov (rakhmerov) wrote :
Changed in mistral:
importance: Undecided → High
status: New → Triaged
milestone: none → mitaka-1
tags: added: liberty-backport-potential
Moshe Elisha (melisha)
Changed in mistral:
assignee: nobody → Moshe Elisha (melisha)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to mistral (master)

Fix proposed to branch: master
Review: https://review.openstack.org/253819

Changed in mistral:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to mistral (master)

Reviewed: https://review.openstack.org/253819
Committed: https://git.openstack.org/cgit/openstack/mistral/commit/?id=976667c0a22f604559e19135f30dea803e3080f6
Submitter: Jenkins
Branch: master

commit 976667c0a22f604559e19135f30dea803e3080f6
Author: Moshe Elisha <email address hidden>
Date: Sat Dec 5 15:31:24 2015 +0000

    Fix concurrency issues by using READ_COMMITTED

    The state of a workflow execution was not updated, even when all task
    executions were completed, if some tasks finished at the same time as
    other tasks.
    Because we were using connections with transaction isolation level
    REPEATABLE_READ, each process was working on a snapshot of the DB
    created at the first read statement in that transaction.
    When a task finished and evaluated the state of all the other tasks,
    it did not see their up-to-date state, and so, because not all tasks
    appeared to be completed, it did not change the workflow execution
    state.

    Similar behavior happened with multiple action executions under the
    same task. On completion, each action execution checked the status of
    the other action executions and did not see their up-to-date state,
    causing the task execution to stay in the RUNNING state.

    Change-Id: I12f66134d92b8ed39df9d6128d7de5ee49aa8623
    Closes-Bug: #1518012
    Closes-Bug: #1513456
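
The fix switches the engine's transaction isolation level. A minimal sketch of the idea with plain SQLAlchemy (the actual Mistral change goes through its oslo.db session layer, so the configuration details differ):

    import sqlalchemy as sa

    # Under READ COMMITTED each statement sees the latest committed data, so a
    # completion check that runs after another task commits SUCCESS observes
    # that SUCCESS instead of a stale REPEATABLE READ snapshot.
    engine = sa.create_engine(
        "mysql+pymysql://user:pass@localhost/demo",  # placeholder URL
        isolation_level="READ COMMITTED",
    )

The trade-off is that reads within one transaction are no longer repeatable, which is acceptable here because the completion check only needs the most recent committed task and action states.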

Changed in mistral:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to mistral (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/255718

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to mistral (stable/liberty)

Reviewed: https://review.openstack.org/255718
Committed: https://git.openstack.org/cgit/openstack/mistral/commit/?id=09f206f863396591c8a0b0d1805a72ae8b3966cd
Submitter: Jenkins
Branch: stable/liberty

commit 09f206f863396591c8a0b0d1805a72ae8b3966cd
Author: Moshe Elisha <email address hidden>
Date: Sat Dec 5 15:31:24 2015 +0000

    Fix concurrency issues by using READ_COMMITTED

    The state of a workflow execution was not updated, even when all task
    executions were completed, if some tasks finished at the same time as
    other tasks.
    Because we were using connections with transaction isolation level
    REPEATABLE_READ, each process was working on a snapshot of the DB
    created at the first read statement in that transaction.
    When a task finished and evaluated the state of all the other tasks,
    it did not see their up-to-date state, and so, because not all tasks
    appeared to be completed, it did not change the workflow execution
    state.

    Similar behavior happened with multiple action executions under the
    same task. On completion, each action execution checked the status of
    the other action executions and did not see their up-to-date state,
    causing the task execution to stay in the RUNNING state.

    Change-Id: I12f66134d92b8ed39df9d6128d7de5ee49aa8623
    Closes-Bug: #1518012
    Closes-Bug: #1513456
    (cherry picked from commit 976667c0a22f604559e19135f30dea803e3080f6)

tags: added: in-stable-liberty
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/mistral 2.0.0.0b2

This issue was fixed in the openstack/mistral 2.0.0.0b2 development milestone.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/mistral 1.0.1

This issue was fixed in the openstack/mistral 1.0.1 release.

Changed in mistral:
milestone: mitaka-1 → 2.0.0
