Running WFs fail on failed to find system actions / workflows if DB sync is running in parallel
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Mistral | Fix Released | Medium | Tomer Shtilman |
Bug Description
A workflow execution might fail if it runs while "mistral-db-manage populate" is running.
On production systems managed by Puppet, the agent runs "puppet agent -t" on the Mistral VMs every X minutes.
Each of those runs triggers "mistral-db-manage populate", so it also runs every X minutes.
Currently, "mistral-db-manage populate" runs the sync_db method on actions and workflows and this method recreates system entities by deleting and creating them.
If an execution tries to use these entities while they are being recreated - it might fail on something like "InvalidActionE
The proposed solution is to delete and create the entities in a transaction.
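A minimal sketch of the race and of the proposed fix, using Python's standard sqlite3 module as a stand-in database. The `actions` table and the helpers `open_db`, `count_actions`, `insert_actions` and `resync` are hypothetical names for illustration; this is not Mistral's sync_db code, and SQLite in WAL mode is only assumed to behave like a server database where readers see committed data only.

```python
import os
import sqlite3
import tempfile

SYSTEM_ACTIONS = ["std.echo", "std.http"]


def open_db(path):
    # isolation_level=None disables the module's implicit transactions,
    # so every BEGIN/COMMIT below is explicit.
    conn = sqlite3.connect(path, isolation_level=None)
    # WAL mode lets a reader keep seeing the last committed state
    # while a writer has an open transaction.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def count_actions(conn):
    return conn.execute("SELECT COUNT(*) FROM actions").fetchone()[0]


def insert_actions(conn):
    conn.executemany(
        "INSERT INTO actions (name) VALUES (?)",
        [(name,) for name in SYSTEM_ACTIONS],
    )


def resync(writer, probe, atomic):
    """Delete and recreate all rows; probe() plays a concurrent reader."""
    if atomic:
        writer.execute("BEGIN IMMEDIATE")
    try:
        writer.execute("DELETE FROM actions")
        seen_mid_sync = probe()  # what an execution would see right now
        insert_actions(writer)
        if atomic:
            writer.execute("COMMIT")
    except Exception:
        if atomic:
            writer.execute("ROLLBACK")
        raise
    return seen_mid_sync


path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer, reader = open_db(path), open_db(path)
writer.execute("CREATE TABLE actions (name TEXT PRIMARY KEY)")
insert_actions(writer)

# Without a transaction the delete is committed on its own,
# so the reader briefly finds no actions at all.
seen_without_tx = resync(writer, lambda: count_actions(reader), atomic=False)
# Inside one transaction the reader keeps seeing the old rows until commit.
seen_with_tx = resync(writer, lambda: count_actions(reader), atomic=True)
print(seen_without_tx, seen_with_tx)  # prints: 0 2
```

The only difference between the two calls is whether the delete and the inserts are committed together, which is the change proposed here.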
This is the brainstorming on the solution, taken from IRC:
melisha
Hi, all. We were occasionally encountering a "Failed to find action" error on various system actions like "std.echo" and "std.http" and we were trying to track the reason for that.
We were reviewing the fix for https:/
Our setups are running puppet and every X minutes they run "puppet agent -t" to make sure the VM is up to date
This causes mistral-db-manage to run again and reinstall all system actions / workflows
During that time - running workflows are failing
rakhmerov
melisha: hi
you mean this happens because mistral-db-manage deletes standard actions before recreating them?
melisha
rakhmerov: Hi. Yes
rakhmerov
hm.. thinking
so the better way would be not to delete them
melisha
Yes. We could do an update instead of delete / create
Or we can put the whole thing in a transaction and commit only after both delete and create are done
rakhmerov
yes
melisha
I think transaction is simpler, right?
rakhmerov
it will only work though for REPEATABLE READ transactions
right, transaction approach looks simpler
melisha
READ COMMITTED
?
rakhmerov
I thought REPEATABLE READ because we need consistency on a DB table rows level
say during update we understood that part of actions are gone (unlikely)
so that leads to the point that we shouldn't see an inconsistent state of the whole table
when part of actions are up to date already and part are not
rakhmerov
if we need that level of consistency then we need REPEATABLE READ
if we're ok that Mistral sees at some point partially obsolete actions and partially actual actions then READ COMMITTED is enough
melisha
OK. I agree. What is the default isolation level of Mistral?
rakhmerov
makes sense?
there's only a recommendation somewhere in the docs to use either READ_COMMITTED or REPEATABLE_READ
so in fact by default if Mistral config doesn't contain that the default DB settings are used
for every database it may vary but in most cases (at least for mysql and postgres) it is at least READ_COMMITTED
which is enough
melisha
REPEATABLE_READ will theoretically guarantee that a workflow that does "std.echo" as first task and uses "std.echo" again as last task will use the exact same "std.echo" even if the action definition was updated, right? But I say theoretically because every task is running in a different transaction anyway so it will read the new definition because that is the first time it reads that definition in that transaction
Correct?
rakhmerov
melisha: no, currently it may be different
because, as you said, it gets fetched from DB within different transactions
the only way to guarantee that is to make snapshots of all actions (as we do for wf itself) at the beginning of workflow
melisha
So I am saying that REPEATABLE_READ cannot help anyway so READ_COMMITTED is good for now
Until we do snapshots
rakhmerov
melisha: yes
rakhmerov
melisha: at least for now, I would suggest we fix the current update algorithm
melisha
rakhmerov: To be in transaction like we said?
rakhmerov
yes, it seems to be the best way for now
melisha
OK. Thanks. I will open a bug and will work on a fix. Thanks.
rakhmerov
melisha: ok, thanks. That's a good finding actually
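The snapshot idea from the discussion above can be sketched as follows. This is an illustration under stated assumptions, not Mistral code: `registry` is a hypothetical in-memory stand-in for the action definitions table, and `run_task_live` and `SnapshotExecution` are invented names. It contrasts a per-task lookup (each task reads whatever definition is current, as happens today because every task runs in its own transaction) with copying the definitions once when the workflow starts.

```python
import copy

# Hypothetical stand-in for the stored action definitions.
registry = {"std.echo": {"version": 1}}


def run_task_live(action_name):
    # Per-task lookup: returns whatever definition is current at call time.
    return registry[action_name]["version"]


class SnapshotExecution:
    """Copies the needed definitions once, at workflow start."""

    def __init__(self, action_names):
        self._actions = {
            name: copy.deepcopy(registry[name]) for name in action_names
        }

    def run_task(self, action_name):
        # Always served from the snapshot, never from the live registry.
        return self._actions[action_name]["version"]


execution = SnapshotExecution(["std.echo"])
first_live = run_task_live("std.echo")
first_snapshot = execution.run_task("std.echo")

# The definitions are re-synced while the workflow is still running.
registry["std.echo"] = {"version": 2}

last_live = run_task_live("std.echo")
last_snapshot = execution.run_task("std.echo")
print(first_live, last_live, first_snapshot, last_snapshot)  # prints: 1 2 1 1
```

With the live lookup the first and last task see different definitions regardless of the isolation level, which is why READ_COMMITTED was judged sufficient until snapshots exist.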
Changed in mistral:
assignee: nobody → Tomer Shtilman (tomer-shtilman)
Changed in mistral:
importance: Undecided → Medium
milestone: none → mitaka-1
status: New → Confirmed
status: Confirmed → Triaged
tags: added: liberty-backport-potential
Changed in mistral:
milestone: mitaka-1 → 2.0.0
Fix proposed to branch: master
Review: https://review.openstack.org/240705