Running workflows fail with "Failed to find" errors for system actions / workflows if DB sync is running in parallel

Bug #1508379 reported by Moshe Elisha
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Mistral | Fix Released | Medium | Tomer Shtilman |

Bug Description

Workflow executions might fail if the execution is running while a mistral-db-manage populate is running.

On production systems managed by Puppet, the agent runs "puppet agent -t" on the Mistral VMs every X minutes.
This in turn causes "mistral-db-manage populate" to run every X minutes.

Currently, "mistral-db-manage populate" runs the sync_db method on actions and workflows and this method recreates system entities by deleting and creating them.

If an execution tries to use these entities while they are being recreated, it might fail with an error such as "InvalidActionException: Failed to find action [action_name=std.http]" (full stack trace attached).

The proposed solution is to delete and create the entities in a transaction.
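
As a rough illustration of the transactional approach, the sketch below deletes and recreates definitions inside a single transaction, so a concurrent reader sees either the old rows or the new rows and never an empty table. It uses Python's built-in sqlite3 module as a stand-in for the real database. The table name action_definitions, the helper names, and the on_deleted hook are hypothetical and are not Mistral's actual sync_db code.

```python
import sqlite3

SCHEMA = (
    "CREATE TABLE IF NOT EXISTS action_definitions "
    "(name TEXT PRIMARY KEY, description TEXT)"
)


def connect(path):
    """Open the writer connection (hypothetical helper for this sketch)."""
    # isolation_level=None puts the sqlite3 module in autocommit mode, so the
    # explicit BEGIN/COMMIT statements below control transaction boundaries.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def find_action(path, name):
    """Look up an action on a separate connection, as a running execution would."""
    conn = sqlite3.connect(path)
    try:
        row = conn.execute(
            "SELECT description FROM action_definitions WHERE name = ?",
            (name,),
        ).fetchone()
        return row[0] if row else None
    finally:
        conn.close()


def sync_system_actions(conn, actions, on_deleted=None):
    """Delete and recreate all definitions in one transaction.

    on_deleted is an illustrative hook called between the delete and the
    insert so that the intermediate state can be observed from outside.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("DELETE FROM action_definitions")
        if on_deleted is not None:
            on_deleted()
        conn.executemany(
            "INSERT INTO action_definitions (name, description) VALUES (?, ?)",
            sorted(actions.items()),
        )
        conn.execute("COMMIT")
    except Exception:
        # Any failure restores the previous definitions.
        conn.execute("ROLLBACK")
        raise
```

A lookup made through find_action while the sync is between its delete and insert steps still returns the previously committed definition, and a failure part-way through rolls back to the old set. The exact locking behaviour differs between SQLite, MySQL and PostgreSQL, so this only illustrates the idea.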
This is the brainstorming on the solution from IRC:

melisha
Hi, all. We were occasionally encountering a "Failed to find action" error on various system actions like "std.echo" and "std.http" and we were trying to track the reason for that.
We were reviewing the fix for https://review.openstack.org/#/c/223536 and we think we understand the issue
Our setups are running puppet and every X minutes they run "puppet agent -t" to make sure the VM is up to date
This causes mistral-db-manage to run again and reinstall all system actions / workflows
During that time - running workflows are failing
rakhmerov
melisha: hi
you mean this happens because mistral-db-manage deletes standard actions before recreating them?
melisha
rakhmerov: Hi. Yes
rakhmerov
hm.. thinking
so the better way would be not to delete them
melisha
Yes. We could do an update instead of delete / create
Or we can put the whole thing in a transaction and commit only after both delete and create are done
rakhmerov
yes
melisha
I think transaction is simpler, right?
rakhmerov
it will only work though for REPEATABLE READ transactions
right, transaction approach looks simpler
melisha
READ COMMITTED
?
rakhmerov
I thought REPEATABLE READ because we need consistency on a DB table rows level
say during update we understood that part of actions are gone (unlikely)
so that leads to the point that we shouldn't see an inconsistent state of the whole table
when part of actions are up to date already and part are not
rakhmerov
if we need that level of consistency then we need REPEATABLE READ
if we're ok that Mistral sees at some point partially obsolete actions and partially actual actions then READ COMMITTED is enough
melisha
OK. I agree. What is the default isolation level of Mistral?
rakhmerov
makes sense?
there's only a recommendation somewhere in the docs to use either READ_COMMITTED or REPEATABLE_READ
so in fact by default if Mistral config doesn't contain that the default DB settings are used
for every database it may vary but in most cases (at least for mysql and postgres) it is at least READ_COMMITTED
which is enough
melisha
REPEATABLE_READ will theoretically guarantee that a workflow that does "std.echo" as first task and uses "std.echo" again as last task will use the exact same "std.echo" even if the action definition was updated, right? But I say theoretically because every task is running in a different transaction anyway so it will read the new definition because that is the first time it reads that definition in that transaction
Correct?
rakhmerov
melisha: no, currently it may be different
because, as you said, it gets fetched from DB within different transactions
the only way to guarantee that is to make snapshots of all actions (as we do for wf itself) at the beginning of workflow
melisha
So I am saying that REPEATABLE_READ cannot help anyway so READ_COMMITTED is good for now
Until we will do snapshot
rakhmerov
melisha: yes
rakhmerov
melisha: at least for now, I would suggest we fix the current update algorithm
melisha
rakhmerov: To be in transaction like we said?
rakhmerov
yes, it seems to be the best way for now
melisha
OK. Thanks. I will open a bug and will work on a fix. Thanks.
rakhmerov
melisha: ok, thanks. That's a good finding actually
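
The snapshot idea mentioned in the discussion could look roughly like the following. This is a hypothetical sketch and not Mistral code: the WorkflowExecution class, its attributes, and the dict-based registry are assumptions standing in for the real engine and database. It copies every action definition the workflow needs once at start, so the first and last task use the same definition even if the definitions are re-synced in between.

```python
import copy


class WorkflowExecution:
    """Hypothetical execution that freezes its action definitions at start."""

    def __init__(self, task_actions, registry):
        # task_actions: the action name used by each task, in task order.
        # registry: mapping of action name -> definition (stand-in for the DB).
        self.task_actions = list(task_actions)
        missing = sorted(set(self.task_actions) - set(registry))
        if missing:
            raise LookupError("Failed to find actions: %s" % ", ".join(missing))
        # Deep-copy so that later changes to the registry cannot leak in.
        self.action_snapshot = {
            name: copy.deepcopy(registry[name])
            for name in set(self.task_actions)
        }

    def definition_for_task(self, index):
        """Return the frozen definition for the task at the given position."""
        return self.action_snapshot[self.task_actions[index]]
```

With this shape, a missing action is reported once when the execution starts rather than in the middle of a run, at the cost of storing a copy of the definitions per execution.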

Moshe Elisha (melisha) wrote :
Changed in mistral:
assignee: nobody → Tomer Shtilman (tomer-shtilman)
Changed in mistral:
importance: Undecided → Medium
milestone: none → mitaka-1
status: New → Confirmed
status: Confirmed → Triaged
OpenStack Infra (hudson-openstack) wrote : Fix proposed to mistral (master)

Fix proposed to branch: master
Review: https://review.openstack.org/240705

Changed in mistral:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on mistral (master)

Change abandoned by Tomer Shtilman (<email address hidden>) on branch: master
Review: https://review.openstack.org/240705

tags: added: liberty-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to mistral (master)

Reviewed: https://review.openstack.org/240705
Committed: https://git.openstack.org/cgit/openstack/mistral/commit/?id=3b3695efca60f28c63dbc07882aa41bf41133845
Submitter: Jenkins
Branch: master

commit 3b3695efca60f28c63dbc07882aa41bf41133845
Author: Nikolay Mahotkin <email address hidden>
Date: Fri Nov 6 13:29:54 2015 +0300

    Wrap sync_db operations in transactions

      Fix for fail on failed to find system actions

    Change-Id: Ief7cf96eedd201990ca3c169fb0c1509ee55665e
    Closes-Bug: 1508379

Changed in mistral:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to mistral (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/246842

OpenStack Infra (hudson-openstack) wrote : Fix merged to mistral (stable/liberty)

Reviewed: https://review.openstack.org/246842
Committed: https://git.openstack.org/cgit/openstack/mistral/commit/?id=e09ec654370eb62d2194786f900627daf96a9eb2
Submitter: Jenkins
Branch: stable/liberty

commit e09ec654370eb62d2194786f900627daf96a9eb2
Author: Nikolay Mahotkin <email address hidden>
Date: Fri Nov 6 13:29:54 2015 +0300

    Wrap sync_db operations in transactions

      Fix for fail on failed to find system actions

    Change-Id: Ief7cf96eedd201990ca3c169fb0c1509ee55665e
    Closes-Bug: 1508379
    (cherry picked from commit 3b3695efca60f28c63dbc07882aa41bf41133845)

tags: added: in-stable-liberty
Thierry Carrez (ttx) wrote : Fix included in openstack/mistral 2.0.0.0b1

This issue was fixed in the openstack/mistral 2.0.0.0b1 development milestone.

Changed in mistral:
status: Fix Committed → Fix Released
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/mistral 1.0.1

This issue was fixed in the openstack/mistral 1.0.1 release.

Changed in mistral:
milestone: mitaka-1 → 2.0.0