Fine Generator timeout is different in single/parallel modes

Bug #1206629 reported by Jason Boyer
This bug affects 5 people
Affects: Evergreen
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: None

Bug Description

Evergreen 2.2+

/openils/bin/fine_generator.pl

When the fine generator is run with parallel=1 in opensrf.xml, there is no timeout specified on line 51:
if ($parallel == 1) {

    my $r = OpenSRF::AppSession
            ->create( 'open-ils.storage' )
            ->request( 'open-ils.storage.action.circulation.overdue.generate_fines' );

------> while (!$r->complete) { $r->recv };

} else {

But when using two or more generators at the same time, a 10-minute timeout is specified on line 63:

} else {

    my $multi_generator = OpenSRF::MultiSession->new(
        app => 'open-ils.storage',
        cap => $parallel,
        api_level => 1,
    );

    my $storage = OpenSRF::AppSession->create("open-ils.storage");
    my $r = $storage->request('open-ils.storage.action.circulation.overdue.id_list');
------> while (my $resp = $r->recv(timeout => 600)) {
        my $circ_id = $resp->content;
        $multi_generator->request( 'open-ils.storage.action.circulation.overdue.generate_fines', $circ_id );
    }

This causes parallel fine generation to fail if the database is too slow to respond. I realize the timeout could be increased in that case, but shouldn't both paths behave the same?
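
For comparison, one way to make the two paths behave the same (a sketch to illustrate the idea, not a tested patch) would be to have the parallel loop keep waiting until the id_list request is actually complete, treating a recv() timeout as "wait again" rather than "stop", just as the single path waits on $r->complete:

    # Sketch only: loop until the request reports completion, so one slow
    # response doesn't silently end the loop. This assumes, as the existing
    # code does, that recv() returns undef on timeout and that $r->complete
    # reflects end-of-stream.
    until ($r->complete) {
        my $resp = $r->recv(timeout => 600) or next;  # timed out; keep waiting
        my $circ_id = $resp->content;
        $multi_generator->request(
            'open-ils.storage.action.circulation.overdue.generate_fines',
            $circ_id,
        );
    }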

Tags: circulation
Ben Shum (bshum) wrote:

This sounds like a good question.

Changed in evergreen:
status: New → Triaged
importance: Undecided → Medium
Elaine Hardy (ehardy)
tags: added: circulation needsdiscussion
Changed in evergreen:
status: Triaged → Confirmed
Jason Stephenson (jstephenson) wrote:

It also seems to fail in a very interesting manner. In several cases, I have seen the storage listener die, with drones still running.

osrf_control --diagnostic says: ERR open-ils.storage Has PID file entry [1625], which matches no running open-ils.storage processes

And pgrep -af storage shows 7 drones running:
21046 OpenSRF Drone [open-ils.storage]
21052 OpenSRF Drone [open-ils.storage]
21059 OpenSRF Drone [open-ils.storage]
21063 OpenSRF Drone [open-ils.storage]
21064 OpenSRF Drone [open-ils.storage]
21070 OpenSRF Drone [open-ils.storage]
21077 OpenSRF Drone [open-ils.storage]

The fine generator itself is still running. pgrep -af fine reports:
21026 /usr/bin/perl /openils/bin/fine_generator.pl /openils/conf/opensrf_core.xml

The lock file is still in place.

My latest occurrence of this happened today after the 5:00 pm run. When the 6:00 pm run started, it reported that the fine generator seemed to already be running, so I had a look.

/openils/var/log/open-ils.storage_stderr.log has 15,193 instances of the following message, with the file timestamped Mar 12 17:13:

Caught error from 'run' method: Can't call method "search_where" on an undefined value at /usr/local/share/perl/5.26.1/OpenILS/Application/Storage/Publisher/action.pm line 1014.

I'm grepping syslog to see if I can get more information, but it takes a while to grep a 1.2 GB file, so I'll have to update this bug again later.

I have seen this with OpenSRF and Evergreen 3.0 running on Ubuntu 16.04.

NOTE: I am not 100% certain that I am seeing a timeout yet. That's what I'm searching for in the syslog file, but this looked like the most likely bug to glom onto.

tags: removed: needsdiscussion
Jason Stephenson (jstephenson) wrote:

I am still waiting on the grep to finish, but I can add that the messages from /openils/var/log/open-ils.storage_stderr.log appear to be red herrings. These appear even after a "successful" run.

The code on the line referenced in the error, line 1014 of OpenILS/Application/Storage/Publisher/action.pm, is attempting to find bookings when it fails to find circulations. We don't use the booking module, so we do not run the booking service. However, it does point to a bug in the fine generator code, which I will report separately.
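
For anyone seeing the same stderr noise: the message is Perl's generic complaint about calling a method through an undef value, presumably because whatever action.pm calls search_where on is never set up when the booking service isn't running. A contrived illustration of the failure mode (hypothetical names, not the actual action.pm code):

    # Contrived illustration only (not Evergreen's code): calling a method
    # through an undef handle dies with exactly this kind of error.
    my $circ_id = 42;        # hypothetical example value
    my $booking_handle;      # imagine this is only set when booking is available
    my @hits = $booking_handle->search_where( { circulation => $circ_id } );
    # dies with: Can't call method "search_where" on an undefined value at ...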

Jason Stephenson (jstephenson) wrote:

I finally got the 312,264 lines for the hour of 17:00 out of syslog so I could look at them. (It helps not to typo your command lines or forget the file argument to grep.)

I see no signs of a timeout. I see no signs of a problem at all, to be honest. Everything looks normal right up until the Evergreen log messages stop at 17:15:56. No "NOT CONNECTED TO THE NETWORK" errors, nothing.

So, it looks like I need to open a new bug.
