Action trigger data grows quickly, would benefit from cleanup

Bug #1672775 reported by Bill Erickson
This bug affects 3 people
Affects: Evergreen
Status: Fix Released
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Evergreen 2.12 / All versions

Email notices, print notices, a variety of print documents, and more are built with Action/Trigger. Their data all ends up in the action_trigger.event and action_trigger.event_output tables. Because these tables serve so many types of events, they grow large, both in row count and physical size.

Another concern, of course, is patron privacy: we are retaining data that links patrons to circulations and holds (via notices) and that need not be retained.

Arguably, much of this data does not need to persist very long. For example, each time a patron clicks the Print option on the record detail page of the catalog, the print output is stored in the database forever. In this example, the data is only needed for a few seconds. In other cases, like overdue notices, we may want the data to persist for weeks or months for debugging purposes, but likely not for years.

I propose a new action_trigger.event_definition.retention_interval column that specifies how long after an event occurs it and its output are kept in the database. A new external process (e.g. run via cron) would regularly delete event data whose retention time has expired.
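
For illustration, the rough shape of the idea (the sample values and the WHERE clause below are hypothetical, not a final patch):

ALTER TABLE action_trigger.event_definition
    ADD COLUMN retention_interval INTERVAL;

-- e.g. discard catalog print events after a day; the name used here
-- is made up and would vary per event definition.
UPDATE action_trigger.event_definition
    SET retention_interval = '1 day'::INTERVAL
    WHERE name = 'Print a record from the catalog';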

One concern with removing event data is that it's used to determine when something has already happened. For example, a patron can't receive a second 7-day overdue email notice for a given circulation, because such an event already exists in the database. This will have to be taken into account when considering reasonable default retention intervals.

Comments, suggestions appreciated.

Revision history for this message
Bill Erickson (berick) wrote :

Some more thoughts...

Taking bug #1672824 into account, and given that non-completed (invalid, error) events never receive a complete_time but should also be purged, the likely candidate for the timestamp to test the retention interval against is the event's update_time. That should be the last time an event was modified, regardless of its outcome. (Or start_time; I think the difference would be negligible in practice.)

When (grouped) events link to the same output, none of them should be deleted until all of them have expired; the events and their shared output then go together. Maybe just check the max(update_time) across the grouped events. A rough sketch follows.
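
Roughly like this, assuming the retention_interval column from the description (untested, illustrative only):

-- Delete expired events, but only when every event sharing the same
-- template_output has also aged past the retention window.
DELETE FROM action_trigger.event
WHERE id IN (
    SELECT evt.id
    FROM action_trigger.event evt
    JOIN action_trigger.event_definition def ON (def.id = evt.event_def)
    WHERE def.retention_interval IS NOT NULL
      AND evt.update_time < NOW() - def.retention_interval
      AND (evt.template_output IS NULL OR NOT EXISTS (
        SELECT 1 FROM action_trigger.event peer
        WHERE peer.template_output = evt.template_output
          AND peer.update_time >= NOW() - def.retention_interval
      ))
);

-- Then remove output rows that nothing references any longer.
-- (error_output and async_output would need the same treatment.)
DELETE FROM action_trigger.event_output
WHERE id NOT IN (
    SELECT template_output FROM action_trigger.event
    WHERE template_output IS NOT NULL
);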

Regarding the final paragraph of the description: I believe that as long as every event definition with a non-NULL retention_interval also has a max_delay value, and the retention_interval exceeds the max_delay, there is no chance of duplicates.
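
Something along these lines could enforce that at the database level (the function and trigger names here are placeholders, not the final patch):

CREATE OR REPLACE FUNCTION action_trigger.check_retention_interval()
RETURNS TRIGGER AS $$
BEGIN
    -- A retention_interval that does not outlast max_delay could let a
    -- purged event be re-created, producing duplicate notices.
    IF NEW.retention_interval IS NOT NULL AND
       (NEW.max_delay IS NULL OR NEW.retention_interval <= NEW.max_delay) THEN
        RAISE EXCEPTION 'retention_interval must be longer than max_delay';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER event_def_retention_check
    AFTER INSERT OR UPDATE ON action_trigger.event_definition
    FOR EACH ROW
    EXECUTE PROCEDURE action_trigger.check_retention_interval();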

Also, to be clear, retention_interval can be null. We may want to keep some data indefinitely.

Revision history for this message
Bill Erickson (berick) wrote :
Changed in evergreen:
assignee: nobody → Bill Erickson (berick)
Revision history for this message
Mike Rylander (mrylander) wrote :

That code looks good! My only nit to pick, and it's minor since you're raising an exception, is that I think the trigger should be a BEFORE INSERT OR UPDATE trigger.
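
That is, keeping the placeholder names from the sketch above:

CREATE TRIGGER event_def_retention_check
    BEFORE INSERT OR UPDATE ON action_trigger.event_definition
    FOR EACH ROW
    EXECUTE PROCEDURE action_trigger.check_retention_interval();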

Thanks, Bill!

Revision history for this message
Bill Erickson (berick) wrote :

Oh yeah, catch it early. Thanks for the eyes, Mike!

Trigger mods and pgtap tests will be next in the pipeline.

Revision history for this message
Bill Erickson (berick) wrote :

Trigger mods applied, pgtap tests and release notes added. Schema synced to baseline. Rebased and squashed down to 4 commits (sql/idl, pgtap, purge script, release notes). Adding pullrequest.
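
Roughly, the pgtap coverage looks like this (a trimmed-down sketch; the ids and intervals are illustrative, not the committed tests verbatim):

BEGIN;
SELECT plan(2);

-- A retention_interval that fails to exceed max_delay must be rejected.
SELECT throws_ok(
    $$ UPDATE action_trigger.event_definition
       SET max_delay = '1 week', retention_interval = '1 day'
       WHERE id = 1 $$
);

-- A retention_interval longer than max_delay is accepted.
SELECT lives_ok(
    $$ UPDATE action_trigger.event_definition
       SET max_delay = '1 day', retention_interval = '1 week'
       WHERE id = 1 $$
);

SELECT * FROM finish();
ROLLBACK;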

tags: added: pullrequest
Changed in evergreen:
milestone: none → 2.next
assignee: Bill Erickson (berick) → nobody
Bill Erickson (berick)
Changed in evergreen:
milestone: 3.next → 3.0-alpha
Revision history for this message
Bill Erickson (berick) wrote :

Rebased to master.

Galen Charlton (gmc)
Changed in evergreen:
assignee: nobody → Galen Charlton (gmc)
Revision history for this message
Galen Charlton (gmc) wrote :

I've tested it and confirmed that it does the thing. I've pushed a signoff branch (that also includes a follow-up adding purge_at_events.srfsh to the example crontab) as user/gmcharlt/lp1672775_signoff

I note that creating some indexes on action_trigger.event could speed up the query that identifies orphan event_output rows. E.g., in a database of ~34 million atev rows:

without index
-------------
EXPLAIN ANALYZE SELECT DISTINCT(template_output) AS id
FROM action_trigger.event WHERE template_output IS NOT NULL;

 Unique (cost=5915127.47..6081788.41 rows=377354 width=8) (actual time=38110.434..48259.902 rows=13427636 loops=1)
   -> Sort (cost=5915127.47..5998457.94 rows=33332188 width=8) (actual time=38110.432..45059.325 rows=33423432 loops=1)
         Sort Key: template_output
         Sort Method: external merge Disk: 588112kB
         -> Seq Scan on event (cost=0.00..838773.80 rows=33332188 width=8) (actual time=0.038..6922.994 rows=33423432 loops=1)
               Filter: (template_output IS NOT NULL)
               Rows Removed by Filter: 1002154
 Planning time: 0.096 ms
 Execution time: 48760.741 ms

with an index
-------------
CREATE INDEX gmc_idx ON action_trigger.event(template_output) WHERE (template_output IS NOT NULL);

 Unique (cost=0.56..1035636.33 rows=387883 width=8) (actual time=0.070..6706.581 rows=13427636 loops=1)
   -> Index Only Scan using gmc_idx on event (cost=0.56..951961.92 rows=33469762 width=8) (actual time=0.067..3706.761 rows=33423432 loops=1)
         Index Cond: (template_output IS NOT NULL)
         Heap Fetches: 0
 Planning time: 0.118 ms
 Execution time: 7104.611 ms

On the other hand... since at the moment we're talking about a cron job that need run at most once a day, it may not yet be worth the space to have such indexes.

tags: added: signedoff
Galen Charlton (gmc)
Changed in evergreen:
importance: Undecided → Wishlist
status: New → Confirmed
assignee: Galen Charlton (gmc) → nobody
Revision history for this message
Bill Erickson (berick) wrote :

Thanks, Galen! I could go either way on the indexes. The first purge will be painful for many sites. After some fairly aggressive purging tests locally, though, I whittled our event data set down to about 10% of its original size, at which point the queries are all fast enough for a daily run.

For now, I'll move forward, sign off the crontab patch, and merge shortly after unless anyone shouts otherwise. We can of course add the indexes later. I also have some SQL for truncating event / event_output that might be useful for large sites; a rough cut below.
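
(Sketch only, for the adventurous. Use with care: this discards all event history, including the rows that prevent duplicate notices.)

BEGIN;
-- Truncating both tables in one statement keeps the FK references
-- between event and event_output happy.
TRUNCATE action_trigger.event, action_trigger.event_output;
COMMIT;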

Revision history for this message
Bill Erickson (berick) wrote :

Merged to master.

Changed in evergreen:
status: Confirmed → Fix Committed
Revision history for this message
Bill Erickson (berick) wrote :

See also bug #1693848 for default retention intervals.

Changed in evergreen:
status: Fix Committed → Fix Released