Fine Generator timeout is different in single/parallel modes

Bug #1206629 reported by Jason Boyer on 2013-07-30
This bug affects 4 people
Affects: Evergreen
Importance: Medium
Assigned to: Unassigned

Bug Description

Evergreen 2.2+

/openils/bin/fine_generator.pl

When the fine generator is run with parallel=1 in opensrf.xml, there is no timeout specified on line 51:
if ($parallel == 1) {

    my $r = OpenSRF::AppSession
            ->create( 'open-ils.storage' )
            ->request( 'open-ils.storage.action.circulation.overdue.generate_fines' );

------> while (!$r->complete) { $r->recv };

} else {

But when running two or more generators at the same time, a 10-minute timeout is specified on line 63:

} else {

    my $multi_generator = OpenSRF::MultiSession->new(
        app => 'open-ils.storage',
        cap => $parallel,
        api_level => 1,
    );

    my $storage = OpenSRF::AppSession->create("open-ils.storage");
    my $r = $storage->request('open-ils.storage.action.circulation.overdue.id_list');
------> while (my $resp = $r->recv(timeout => 600)) {
        my $circ_id = $resp->content;
        $multi_generator->request( 'open-ils.storage.action.circulation.overdue.generate_fines', $circ_id );
    }

This causes parallel fine generation to fail if the database is too slow to respond: when no response arrives within 10 minutes, the recv() loop exits early and the remaining overdue circulations are skipped. I realize the timeout could be increased in that case, but shouldn't both paths behave the same?
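One possible way to make the two paths behave the same, sketched below, is to drive the multi-session branch off the same completion check the parallel=1 branch uses, rather than treating a single recv() timeout as the end of the stream. This is only an untested sketch based on the snippets quoted above, not a reviewed patch; an alternative would be to make the 600-second timeout configurable.

    my $multi_generator = OpenSRF::MultiSession->new(
        app => 'open-ils.storage',
        cap => $parallel,
        api_level => 1,
    );

    my $storage = OpenSRF::AppSession->create("open-ils.storage");
    my $r = $storage->request('open-ils.storage.action.circulation.overdue.id_list');

    # Loop until the id_list request reports completion, as the parallel=1
    # branch does, instead of giving up after one quiet 10-minute window.
    while (!$r->complete) {
        my $resp = $r->recv or next;   # a slow response just retries the wait
        my $circ_id = $resp->content;
        $multi_generator->request(
            'open-ils.storage.action.circulation.overdue.generate_fines',
            $circ_id,
        );
    }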

Ben Shum (bshum) wrote :

This sounds like a good question.

Changed in evergreen:
status: New → Triaged
importance: Undecided → Medium
Elaine Hardy (ehardy) on 2019-03-07
tags: added: circulation needsdiscussion
Changed in evergreen:
status: Triaged → Confirmed
Jason Stephenson (jstephenson) wrote :

It also seems to fail in a very interesting manner. In several cases, I have seen the storage listener die, with drones still running.

osrf_control --diagnostic says: ERR open-ils.storage Has PID file entry [1625], which matches no running open-ils.storage processes

And pgrep -af storage shows 7 drones running:
21046 OpenSRF Drone [open-ils.storage]
21052 OpenSRF Drone [open-ils.storage]
21059 OpenSRF Drone [open-ils.storage]
21063 OpenSRF Drone [open-ils.storage]
21064 OpenSRF Drone [open-ils.storage]
21070 OpenSRF Drone [open-ils.storage]
21077 OpenSRF Drone [open-ils.storage]

The fine generator itself is still running. pgrep -af fine reports:
21026 /usr/bin/perl /openils/bin/fine_generator.pl /openils/conf/opensrf_core.xml

The lock file is still in place.

My latest occurrence of this happened today after the 5:00 pm run. When the 6:00 pm run started, it reported that the fine generator seemed to already be running, so I had a look.

/openils/var/log/open-ils.storage_stderr.log contains 15,193 instances of the following message, with the file timestamped at Mar 12 17:13:

Caught error from 'run' method: Can't call method "search_where" on an undefined value at /usr/local/share/perl/5.26.1/OpenILS/Application/Storage/Publisher/action.pm line 1014.

I'm grepping syslog to see if I can get more information, but it takes a while to grep a 1.2GB file, so I'll have to update this bug again later.

I have seen this with OpenSRF and Evergreen 3.0 running on Ubuntu 16.04.

NOTE: I am not 100% certain that I am seeing a timeout, yet. That's what I'm searching for in the syslog file, but this looked like the most likely bug to glom onto.

tags: removed: needsdiscussion
Jason Stephenson (jstephenson) wrote :

I am still waiting on the grep to finish, but I can add that the messages from /openils/var/log/open-ils.storage_stderr.log appear to be red herrings. These appear even after a "successful" run.

The code on the line referenced in the error, line 1014 of OpenILS/Application/Storage/Publisher/action.pm, is attempting to find bookings when it fails to find circulations. We don't use the booking module, so we do not run the booking service. However, it does point to a bug in the fine generator code, which I will report separately.
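For readers unfamiliar with that error text, the sketch below illustrates the general Perl failure mode being described; the names in it are hypothetical and are not the actual code at line 1014 of action.pm. When a lookup returns undef because the booking module is not in use, any method call made through that undefined value dies with exactly this kind of message.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical illustration only; not the real code from
    # OpenILS/Application/Storage/Publisher/action.pm. A lookup that returns
    # undef (e.g. because no booking service is running) makes the later
    # method call die with:
    #   Can't call method "search_where" on an undefined value

    sub booking_handle_for {
        my ($circ_id) = @_;
        return undef;    # stand-in for "the booking service is not there"
    }

    my $booking = booking_handle_for(42);
    my @reservations = $booking->search_where( target_copy => 42 );  # dies here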

Jason Stephenson (jstephenson) wrote :

I finally got the 312,264 lines for the hour of 17:00 out of syslog so I could look at them. (It helps not to typo your command lines or forget the file argument to grep.)

I see no signs of a timeout. I see no signs of a problem at all, to be honest. Everything looks normal right up until the Evergreen log messages stop at 17:15:56. No "NOT CONNECTED TO THE NETWORK" errors, nothing.

So, it looks like I need to open a new bug.

