Option to force-recycle drones after long-running sessions

Bug #1706147 reported by Bill Erickson
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenSRF
Fix Released
Wishlist
Unassigned

Bug Description

An OpenSRF drone process exits and dies after it serves its configured maximum number of requests. There are times, however, when it would make sense to let a drone die sooner to release resources (e.g. memory), particularly when a single API call or session runs for an extended period of time and does a lot of work. In such cases, the amount of work performed may be comparable to thousands of quick API calls.

Examples from Evergreen land include the hold targeter, potentially the fine generator (bug #1705728), and likely other batch processes.

One option is to allow the API code to specify that a call is long-running and that its processing drone should be recycled once the session completes. I created a simple (Perl) proof of concept branch that does this that I will post for feedback.

Tags: pullrequest
Revision history for this message
Bill Erickson (berick) wrote :

As promised:

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/lp1706147-session-force-recycle-perl

From the commit:

Creates an API-level option to inform the OpenSRF drone management code that the running drone should be recycled upon completion of the current OpenSRF session. This allows for quicker release of resources consumed by the drone.

To use:

sub some_api_method {
    my ($self, $client, ...) = @_;
    $self->session->force_recycle(1);
    ...
}

While I could also see supporting this flag in the API registration chunk, being able to specify it dynamically is still required for multi-purpose (batch vs non-batch) APIs.

Revision history for this message
Mike Rylander (mrylander) wrote :

Perhaps we can use the existing session callback mechanism to do this. There is currently code specifically for "disconnect" and "death" callbacks. I wonder if simply sayingsomething like `$self->session->register_callback( disconnect => sub {die} );` would do what you want? I haven't tested, of course...

Revision history for this message
Bill Erickson (berick) wrote :

Ah, forgot about the callbacks. Two immediate questions are whether a 'disconnect' callback would be invoked for a stateless request and whether this would result in some scary logs for unexpected child deaths.

Revision history for this message
Bill Erickson (berick) wrote :

Hi Mike, I confirmed 'disconnect' and 'death' handlers allow the back-end to die when the session completes. (Using $client->session->register_callback()). This is very useful info.

I have concerns with this approach, though, for day to day use. For starters, it creates ugly error logs. More importantly, it requires the forking server process to clean up the unceremoniously dead child process via signal handler instead of the normal child maintenance pathways. I worry this may open us up to a new class of bugs like bug #1697029. It also means any registered 'child_exit' handlers will not be called and the jabber socket will be ungracefully chopped off.

I still prefer an early-exit / force-recycle mechanism managed by the forking server for daily use.

Revision history for this message
Galen Charlton (gmc) wrote :

FWIW, I also lean towards the force-recycle mechanism that Bill originally proposed in the interests of ensuring that the drones are more cleanly shut down.

Revision history for this message
Mike Rylander (mrylander) wrote :

The code presented isn't invasive or scary, does what it says on the tin, and definitely is nicer than sub{die}. ;)

+1

Revision history for this message
Bill Erickson (berick) wrote :

Thanks Mike, thanks Galen. Slapping a pullrequest on this. Note the milestone may need to change.

To test:

1. Add $self->session->force_recycle(1); to an API handler.
2. Restart services and confirm which PIDs are active for the affected drone type.
3. Send a request to the modified API.
4. Confirm the one of the drone PIDs is no more and has been replaced by a new one.

Alternatively, raise logging to INTERNAL before restarting and look for a new message "server: child exiting early on force_recycle" after making the API call.

tags: added: pullrequest
Changed in opensrf:
assignee: Bill Erickson (berick) → nobody
milestone: none → 3.0.1
Changed in opensrf:
milestone: 3.0.1 → 3.0.2
Galen Charlton (gmc)
Changed in opensrf:
milestone: 3.0.2 → 3.1-beta
Galen Charlton (gmc)
Changed in opensrf:
status: New → Confirmed
assignee: nobody → Galen Charlton (gmc)
Revision history for this message
Galen Charlton (gmc) wrote :

Pushed to master for inclusion in the 3.1 beta. Thanks, Bill!

Changed in opensrf:
assignee: Galen Charlton (gmc) → nobody
status: Confirmed → Fix Committed
Galen Charlton (gmc)
Changed in opensrf:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.