Autorenewal Can Overwhelm open-ils.trigger Service Drones

Bug #2030915 reported by Jason Stephenson
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Evergreen
Fix Released
Medium
Unassigned
3.10
Fix Released
Medium
Unassigned
3.11
Fix Released
Medium
Unassigned

Bug Description

Evergreen version: 3.10.3+
OpenSRF version: 3.2.3
PostgreSQL version: 10+ (N/A)

The way that the autorenewal code creates the follow up event to notify the patron of the outcome of the autorenewal attempt can overwhelm the open-ils.trigger service leading to the server running out of trigger drones or to drones crashing leading to errors that can cascade and affect multiple autorenewals. The following log messages are indicative of these two occurrences, respectively:

[2023-08-08 04:06:06] open-ils.trigger [WARN:1901:Server.pm:200:] server: no children available, waiting... consider increasing max_children for this application higher than 20 in the OpenSRF configuration if this message occurs frequently

and

[2023-08-09 03:53:04] open-ils.trigger [ERR :1516974:EventGroup.pm:98:] Event reacting failed with Can't call method "request" on an undefined value at /usr/local/share/perl/5.30.0/OpenILS/Application/Trigger/Reactor/Circ/AutoRenew.pm line 88.

Either condition can lead to autorenewal events getting the error state. In many cases, these events can be reset to pending and the action_trigger_runner.pl can be run with the appropriate granularity and the --run-pending option to reprocess the failed events.

These conditions can occur when thousands of renewal events are being processed. The code on line 88 of the AutoRenew.pm module calls code to create the autorenewal notification event and does not wait for it to return before the process continues on to the next autorenewal event. When there are enough events this can lead to the exhaustion of drones and to drones possibly crashing or shutting down without properly notifying the listener.

As discussed in IRC, the solution appears to be to alter the code that creates the autorenewal notification event so that it waits on the event creation to complete before moving on to the next event: http://irc.evergreen-ils.org/evergreen/2023-08-09#i_530787

A patch branch will be forthcoming.

Bill Erickson (berick)
Changed in evergreen:
status: New → Confirmed
Revision history for this message
Blake GH (bmagic) wrote :

Confirmed

Changed in evergreen:
importance: Undecided → Medium
Revision history for this message
Jason Stephenson (jstephenson) wrote :

Branch with a potential fix is in the working repository at user/dyrcona/lp20230915-gather-autorenewnotify

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dyrcona/lp20230915-gather-autorenewnotify

Really testing this requires an environment with a production amount of data, but basically, just running autorenewals and seeing that they still work after the patch is more or less enough.

I am sure that those affected by this bug will want to put the patch through its paces.

tags: added: pullrequest
Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
milestone: none → 3.12-beta
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I pushed the branch to one with the correct name: user/dyrcona/lp2030915-gather-autorenewnotify

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dyrcona/lp2030915-gather-autorenewnotify

The branch with the typo in the name is still there and is identical to this one.

Revision history for this message
Galen Charlton (gmc) wrote :

Upon reading the patch, it occurs to me that method_lookup() would be slightly better. Since open-ils.trigger.event.autocreate is guaranteed to be available to the process running the reactor, and since there's nothing to be gained by enlisting the aid of a second open-ils.trigger drone if we're waiting, the overhead of a full OpenSRF call can be skipped.

Revision history for this message
Galen Charlton (gmc) wrote :

And upon actually trying to use method_lookup() - nice idea, but the reactors don't have access to the underlying OpenILS::Application object, so they have no way to actually do it.

Revision history for this message
Galen Charlton (gmc) wrote :

Tested and signoff pushed to user/gmcharlt/lp2030195-signoff.

The pattern of reactors invoking open-ils.trigger.event.autocreate to create follow-up events is common enough that it may be worth it as a micro-optimization to make it directly available to them to save an OpenSRF call.

tags: added: signedoff
Revision history for this message
Jason Stephenson (jstephenson) wrote :

C/W MARS started using this patch in production today. We got through a batch of 14,375 autoreenwals with no incidents this am.

I'll let this code run for a few more days, and if there are no errors, I'll push it next week as part of the bug squash week.

Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I assigned myself to this bug because we are still having issues with action triggers. I've looked into it, and I think those issues are caused by running too many events at the same time, but I'll have to do some serious log diving to figure that out.

This patch did eliminate the two messages mentioned in the summary so far. We've been running this in production for almost a week now.

I'll leave this for a few days in case someone else wants to have a look. If it doesn't get pushed by the end of the week, I'll add a release note and push it to main.

Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
Changed in evergreen:
status: Confirmed → Fix Committed
assignee: Jason Stephenson (jstephenson) → nobody
Revision history for this message
Jason Stephenson (jstephenson) wrote :

After a week of running this in production, we have had an improvement in events processing. We no longer run out of open-ils.trigger drones and we no longer have the "Event reacting failed with Can't call method "request" on an undefined value at /usr/local/share/perl/5.30.0/OpenILS/Application/Trigger/Reactor/Circ/AutoRenew.pm line 88" error message.

The number of stuck and error'd auto-renewal events appears to have dropped.

We do still have issues processing some events in the wee hours of the morning, but I think those are other bugs and may be alleviated by adjusting our cron schedules.

I have therefore pushed the branch to main, rel_3_11, and rel_3_10, with a short release note to indicate a potential issue is resolved.

Thanks to Galen for signing off, and thanks to Bill for the original code for the fix that needed small modification to work in stock Evergreen.

Changed in evergreen:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.