WISHLIST: Smart Action Triggers

Bug #1989498 reported by Blake GH
This bug affects 2 people
Affects: Evergreen
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

action_trigger_runner.pl can run into trouble when the number of events exceeds some threshold. On our hardware (8 CPUs, 32 GB memory), the process never finishes if it attempts to run more than ~30k events at a time, even though the hardware is nowhere near maxed out. We've had to switch to "manual mode": reset the action_trigger.event rows with state='collected' back to 'pending', then manually run action_trigger_runner.pl again so that it works on only the events we reset. It finishes, and we repeat that process until the backlog is cleared. 20k rows at a time seemed to be the sweet spot for us.

I propose something a little more sophisticated. I'm not entirely sure what shape it takes but the main idea is:

What if we had something that would read through all of the definitions and spawn the runner process based on each definition's "frequency" (a new column called frequency, of type INTERVAL)?
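A rough sketch of the frequency check, in Python purely for illustration. The Definition class and its last_run field are hypothetical bookkeeping, not existing Evergreen code; only the frequency column comes from the proposal above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Definition:
    # Hypothetical in-memory view of one trigger definition.
    id: int
    frequency: timedelta          # the proposed new INTERVAL column
    last_run: Optional[datetime]  # assumed bookkeeping; None if never run


def due_definitions(defs, now):
    """Return the definitions whose frequency has elapsed since their last run."""
    return [d for d in defs
            if d.last_run is None or now - d.last_run >= d.frequency]
```

The "something" would call this on each pass and spawn a runner only for the definitions it returns.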

What if the "something" also managed the maximum number of events the process is allowed to work on?
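One way the cap could work is to split the pending event ids into runner-sized batches. This is only a sketch: plan_batches is a made-up helper, and the 20000 default simply mirrors the batch size that worked for us.

```python
def plan_batches(pending_event_ids, max_events=20000):
    """Split pending event ids into batches of at most max_events each."""
    if max_events < 1:
        raise ValueError("max_events must be at least 1")
    ids = sorted(pending_event_ids)
    # Slice the sorted ids into consecutive chunks; the last one may be short.
    return [ids[i:i + max_events] for i in range(0, len(ids), max_events)]
```

Each batch would then be handed to one runner, so no single process ever sees more than the configured maximum.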

What if the "something" also played nice with the hardware, making sure the machine's CPU utilization was below a configured threshold before it was allowed to spawn a new runner?
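A minimal sketch of such a gate, assuming a Unix host and using the 1-minute load average per CPU as a stand-in for CPU utilization. The function names and the 0.75 default threshold are assumptions for illustration.

```python
import os


def cpu_has_headroom(load_1min, cpu_count, max_utilization=0.75):
    """True when the 1-minute load average per CPU is below the threshold."""
    return (load_1min / cpu_count) < max_utilization


def may_spawn_runner(max_utilization=0.75):
    # os.getloadavg() is Unix-only, and load average is only a proxy
    # for CPU utilization (it also counts processes waiting on I/O).
    load_1min, _, _ = os.getloadavg()
    return cpu_has_headroom(load_1min, os.cpu_count() or 1, max_utilization)
```

On an 8 CPU machine with the default threshold, a new runner would be held back once the 1-minute load average reaches 6.0.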

This "something" would be the only thing that needs to be administratively running. No more need for the crontab spaghetti of action_trigger_runner.pl.

It could also automatically reset "lost" triggers back to pending (like what I was doing by hand) when the action_trigger.event table contains rows that are "stuck", because it would know that those rows are no longer being operated upon. It would know this because it's in charge of spawning the action_trigger_runner software. This could be tracked with an assigned ID/hash: a new column in action_trigger.event, "runner_hash" of type TEXT (sort of related to update_process, but really just for this "something" to compare against its PID table). Each runner would get a small unique hash so that the "something" can check the local PID table.
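The lost-trigger check might look something like this. It is a sketch under assumptions: the helper names, the 12-character hash format, and the in-memory runners map are all hypothetical, and the events passed in are assumed to be only rows in an in-progress state.

```python
import os
import uuid


def new_runner_hash():
    """Short unique tag for one spawned runner (format is an assumption)."""
    return uuid.uuid4().hex[:12]


def pid_is_alive(pid):
    """Check the local process table without sending a real signal (Unix)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lost_event_ids(events, runners, is_alive=pid_is_alive):
    """events: iterable of (event_id, runner_hash); runners: {runner_hash: pid}.

    An event is lost when its runner_hash is unknown or its runner has exited.
    """
    lost = []
    for event_id, runner_hash in events:
        pid = runners.get(runner_hash)
        if pid is None or not is_alive(pid):
            lost.append(event_id)
    return lost
```

The ids returned would be the ones to reset to 'pending', which replaces the guesswork of waiting for a start_time to get old enough.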

More ideas welcome! The idea is that this needs to be tracked better by software. And if we're doing a better job of keeping track of it, we might as well automate the whole thing to where it fixes itself too, or at least alerts the admins when something needs attention. We use monitoring software to look for stuck triggers, and we supplement that with a cron job that resets the triggers if they are stuck with a start_time more than 3 days old. That's a hack, but it seems to keep the trouble at bay most of the time.

IRC discussion:

http://irc.evergreen-ils.org/evergreen/2022-09-13#i_512512

and our "fix" cron:

BEGIN;

-- Reset events that started between 3 and 14 days ago
-- and never reached a final state.
UPDATE action_trigger.event
SET
    state = 'pending',
    start_time = NULL,
    update_time = NULL,
    complete_time = NULL,
    update_process = NULL,
    template_output = NULL
WHERE id IN (
    SELECT id
    FROM action_trigger.event
    WHERE start_time BETWEEN (now() - '14 days'::INTERVAL) AND (now() - '3 days'::INTERVAL)
      AND state NOT IN ('complete', 'invalid', 'pending', 'error')
    ORDER BY 1
    LIMIT 5000  -- cap each run at 5000 rows
);

COMMIT;

cron frequency:
0 5,10,13,18,23 * * *

tags: added: actiontrigger
Changed in evergreen:
importance: Undecided → Wishlist
status: New → Confirmed
summary: - WISHLSIT: Smart Action Triggers
+ WISHLIST: Smart Action Triggers