Workflow execution stuck in 'RUNNING' state if DB error occurs due to large outputs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Mistral | Won't Fix | High | ali abdelal |
Bug Description
A workflow execution gets stuck in the 'RUNNING' state if a DB error occurs while updating the DB after the outputs are evaluated. Mistral retries the operation 50 times (as per the DB retry policy) and still fails.
Setup details:
MySQL v5.6 with InnoDB engine
Mistral master
[engine]
execution_
Bug description:
The DB errors encountered when saving large outputs (say 16MB in size) can be due to one or more of the following reasons:
1) `max_packet_
2) Large UPDATE queries fail because the redo log is too small (the size of the update data is > 10% of the redo log size); see bug: https:/
The bug on the Mistral side is improper error handling: Mistral keeps retrying the operation, assuming the error is recoverable (a deadlock or similar).
How to reproduce:
Execute a workflow (WF) that produces very large outputs.
Workflow:
---
version: "2.0"
bugs_me:
  output:
    big_dict: <% 2097152 * ["ABCD"] %>
  tasks:
    t:
      action: std.noop
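For a sense of scale, the following Python sketch estimates how large that output becomes once serialized. It assumes the list is stored as JSON with Python's default `json.dumps` separators, which may not match exactly how Mistral serializes outputs. The helper names `serialized_size` and `min_redo_log_bytes` are hypothetical, and the second one simply applies the 10% rule from reason 2 above.

```python
import json


def serialized_size(repeats: int) -> int:
    """Bytes needed for a JSON list holding `repeats` copies of "ABCD".

    Assumption: default json.dumps separators (", " between items).
    """
    return len(json.dumps(repeats * ["ABCD"]).encode("utf-8"))


def min_redo_log_bytes(update_bytes: int) -> int:
    """Smallest redo log for which update_bytes is not above 10% of it."""
    return update_bytes * 10


if __name__ == "__main__":
    size = serialized_size(2097152)
    print(size, "bytes for the serialized list")
    print(min_redo_log_bytes(size), "bytes of redo log under the 10% rule")
```

Each element costs 8 bytes in this encoding (6 for the quoted string plus 2 for the separator or brackets), so 2097152 elements come to 16 MiB, which matches the "say 16MB" figure above.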
Expectation:
Workflow status should be 'ERROR'.
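One way the expected behavior could be approached is a bounded retry that moves the execution to 'ERROR' once the attempts are exhausted, instead of leaving it in 'RUNNING'. The sketch below is not Mistral's code: `TransientDBError`, `update_with_retry`, and the `mark_error` callback are hypothetical names, and it assumes that recording the error state is itself a small write that can succeed.

```python
import time
from typing import Callable


class TransientDBError(Exception):
    """Hypothetical stand-in for a DB error the retry policy treats as recoverable."""


def update_with_retry(
    update: Callable[[], None],
    mark_error: Callable[[str], None],
    max_attempts: int = 50,
    delay: float = 0.0,
) -> bool:
    """Run `update`, retrying on TransientDBError up to `max_attempts` times.

    Returns True on success. If every attempt fails, calls `mark_error`
    with a short message and returns False, so the caller is never left
    with an execution that stays in 'RUNNING'.
    """
    last_exc = None
    for _ in range(max_attempts):
        try:
            update()
            return True
        except TransientDBError as exc:
            last_exc = exc
            time.sleep(delay)
    mark_error(f"DB update failed after {max_attempts} attempts: {last_exc}")
    return False
```

The error message passed to `mark_error` is deliberately short, so that recording the failure does not run into the same size limit as the original update.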
Changed in mistral: | |
status: | New → Confirmed |
importance: | Undecided → Critical |
importance: | Critical → High |
milestone: | none → rocky-2 |
tags: | added: low-hanging-fruit |
Changed in mistral: | |
milestone: | rocky-2 → rocky-3 |
Changed in mistral: | |
assignee: | nobody → Hardik Jasani (hjasani) |
Changed in mistral: | |
milestone: | rocky-3 → stein-1 |
Changed in mistral: | |
milestone: | stein-1 → stein-2 |
Changed in mistral: | |
milestone: | stein-2 → stein-3 |
Changed in mistral: | |
assignee: | Hardik Jasani (hjasani) → nobody |
milestone: | stein-3 → train-1 |
assignee: | nobody → Renat Akhmerov (rakhmerov) |
Changed in mistral: | |
assignee: | Renat Akhmerov (rakhmerov) → ali (alielal) |
Changed in mistral: | |
status: | Confirmed → Won't Fix |
The exception raised in this case is very generic: it is also raised in other cases where we do want Mistral to retry, for example a timeout.
With MySQL, the error raised is 'MySQL server has gone away', which can happen for many reasons: an incorrect packet, a timeout, or a packet that is too big.