Move away from using periodic jobs to check workflow completion

Bug #1799382 reported by Renat Akhmerov
This bug affects 1 person
Affects: Mistral
Status: Fix Released
Importance: High
Assigned to: Renat Akhmerov

Bug Description

To check whether a workflow is completed, Mistral uses periodic scheduled jobs that poll the DB and decide when the workflow needs to be completed. This approach has proven ineffective in many cases because it's hard to predict when the next iteration of such a job should run. If it runs too soon, the load on the system will be high. If it runs too late, a workflow may stay in the RUNNING state for too long after its tasks are completed.
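
The trade-off described above can be made concrete with a small calculation. This is only an illustrative sketch, not Mistral code: the function name `polling_cost` and the fixed polling interval are assumptions introduced here (the real scheduler tried to predict a variable delay).

```python
import math


def polling_cost(finish_time, interval):
    """Return (number_of_polls, completion_delay) for a fixed polling interval.

    finish_time: seconds after start at which the last task completes.
    interval:    seconds between two runs of the (hypothetical) periodic job.
    """
    # The job runs at interval, 2*interval, ...; the first run at or after
    # finish_time is the one that notices the workflow is done.
    polls = math.ceil(finish_time / interval)
    delay = polls * interval - finish_time
    return polls, delay


# Polling every second: no visible delay, but 95 DB polls.
print(polling_cost(95, 1))   # (95, 0)
# Polling every minute: only 2 polls, but the workflow stays RUNNING 25s too long.
print(polling_cost(95, 60))  # (2, 25)
```

No single interval works well for both short and long workflows, which is the core of the problem reported here.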

Changed in mistral:
assignee: nobody → Renat Akhmerov (rakhmerov)
importance: Undecided → High
status: New → Confirmed
milestone: none → stein-1
Changed in mistral:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to mistral (master)

Reviewed: https://review.openstack.org/607807
Committed: https://git.openstack.org/cgit/openstack/mistral/commit/?id=3d7acd3957a75457da4ca87ae9ebd5cc61d28149
Submitter: Zuul
Branch: master

commit 3d7acd3957a75457da4ca87ae9ebd5cc61d28149
Author: Renat Akhmerov <email address hidden>
Date: Thu Oct 4 11:50:03 2018 +0700

    Improve workflow completion logic by removing periodic jobs

    * The workflow completion algorithm uses periodic scheduled jobs
      to poll the DB and determine when a workflow is finished. The
      problem with this approach is that if Mistral runs the next
      iteration of such a job too soon, these jobs create a heavy
      load on the system. If it runs too late, a workflow may stay
      in the RUNNING state for too long after all its tasks are
      completed. The current implementation tries to predict the
      delay before the next job run based on the number of
      incomplete tasks.
      This approach was initially taken because we switched to a
      non-blocking transactional model (previously we locked the
      entire workflow execution graph in order to change the state
      of anything). In this architecture, when we have parallel
      branches, i.e. parallel DB transactions, we can't make a
      consistent read from the DB in either of these transactions
      to reliably decide whether the workflow is completed. Using
      periodic jobs was a solution. However, this approach has
      proven unreliable because the predicted delay before the
      next job iteration doesn't work well across the variety of
      use cases that we have.
      This patch replaces periodic jobs with a "two transactions"
      approach: in the first transaction we handle the action
      completion event (and task completion if it causes it), and
      in the second transaction, if a task is completed, we check
      whether the workflow is completed. This approach guarantees
      that at least one of the "second" transactions in parallel
      branches will make the needed consistent read from the DB
      (i.e. will see the actual state of all needed objects) to
      make the right decision.

    Closes-Bug: #1799382
    Change-Id: I2333507503b3b8226c184beb0bd783e1dcfa397f

Changed in mistral:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to mistral (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/616658

OpenStack Infra (hudson-openstack) wrote : Fix merged to mistral (stable/rocky)

Reviewed: https://review.openstack.org/616658
Committed: https://git.openstack.org/cgit/openstack/mistral/commit/?id=2129986012761d094552c9e5c0012e89d498e737
Submitter: Zuul
Branch: stable/rocky

commit 2129986012761d094552c9e5c0012e89d498e737
Author: Renat Akhmerov <email address hidden>
Date: Thu Oct 4 11:50:03 2018 +0700

    Improve workflow completion logic by removing periodic jobs

    (cherry picked from commit 3d7acd3957a75457da4ca87ae9ebd5cc61d28149)

    Closes-Bug: #1799382
    Change-Id: I2333507503b3b8226c184beb0bd783e1dcfa397f

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/mistral 8.0.0.0b1

This issue was fixed in the openstack/mistral 8.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/mistral 7.1.0

This issue was fixed in the openstack/mistral 7.1.0 release.
