OpenSRF

Option to force-recycle drones after long-running sessions

Bug #1706147 reported by Bill Erickson on 2017-07-24

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	OpenSRF	Fix Released	Wishlist	Unassigned	OpenSRF 3.1-beta

Bug Description

An OpenSRF drone process exits and dies after it serves its configured maximum number of requests. There are times, however, when it would make sense to let a drone die sooner to release resources (e.g. memory), particularly when a single API call or session runs for an extended period of time and does a lot of work. In such cases, the amount of work performed may be comparable to thousands of quick API calls.

Examples from Evergreen land include the hold targeter, potentially the fine generator (bug #1705728), and likely other batch processes.

One option is to allow the API code to specify that a call is long-running and that its processing drone should be recycled once the session completes. I created a simple (Perl) proof of concept branch that does this that I will post for feedback.

Tags:

Revision history for this message

Bill Erickson (berick) wrote on 2017-07-24:

As promised:

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/lp1706147-session-force-recycle-perl

From the commit:

Creates an API-level option to inform the OpenSRF drone management code that the running drone should be recycled upon completion of the current OpenSRF session. This allows for quicker release of resources consumed by the drone.

To use:

sub some_api_method {
    my ($self, $client, ...) = @_;
    $self->session->force_recycle(1);
    ...
}

While I could also see supporting this flag in the API registration chunk, being able to specify it dynamically is still required for multi-purpose (batch vs non-batch) APIs.

Revision history for this message

Mike Rylander (mrylander) wrote on 2017-07-24:

Perhaps we can use the existing session callback mechanism to do this. There is currently code specifically for "disconnect" and "death" callbacks. I wonder if simply sayingsomething like `$self->session->register_callback( disconnect => sub {die} );` would do what you want? I haven't tested, of course...

Revision history for this message

Bill Erickson (berick) wrote on 2017-07-24:

Ah, forgot about the callbacks. Two immediate questions are whether a 'disconnect' callback would be invoked for a stateless request and whether this would result in some scary logs for unexpected child deaths.

Revision history for this message

Bill Erickson (berick) wrote on 2017-09-06:

Hi Mike, I confirmed 'disconnect' and 'death' handlers allow the back-end to die when the session completes. (Using $client->session->register_callback()). This is very useful info.

I have concerns with this approach, though, for day to day use. For starters, it creates ugly error logs. More importantly, it requires the forking server process to clean up the unceremoniously dead child process via signal handler instead of the normal child maintenance pathways. I worry this may open us up to a new class of bugs like bug #1697029. It also means any registered 'child_exit' handlers will not be called and the jabber socket will be ungracefully chopped off.

I still prefer an early-exit / force-recycle mechanism managed by the forking server for daily use.

Revision history for this message

Galen Charlton (gmc) wrote on 2017-09-11:

FWIW, I also lean towards the force-recycle mechanism that Bill originally proposed in the interests of ensuring that the drones are more cleanly shut down.

Revision history for this message

Mike Rylander (mrylander) wrote on 2017-10-04:

The code presented isn't invasive or scary, does what it says on the tin, and definitely is nicer than sub{die}. ;)

Revision history for this message

Bill Erickson (berick) wrote on 2017-10-05:

Thanks Mike, thanks Galen. Slapping a pullrequest on this. Note the milestone may need to change.

To test:

1. Add $self->session->force_recycle(1); to an API handler.
2. Restart services and confirm which PIDs are active for the affected drone type.
3. Send a request to the modified API.
4. Confirm the one of the drone PIDs is no more and has been replaced by a new one.

Alternatively, raise logging to INTERNAL before restarting and look for a new message "server: child exiting early on force_recycle" after making the API call.

tags:	added: pullrequest
Changed in opensrf:
assignee:	Bill Erickson (berick) → nobody
milestone:	none → 3.0.1

Evergreen Bug Maintenance (bugmaster) on 2018-05-29

Changed in opensrf:
milestone:	3.0.1 → 3.0.2

Galen Charlton (gmc) on 2018-11-05

Changed in opensrf:
milestone:	3.0.2 → 3.1-beta

Galen Charlton (gmc) on 2018-11-05

Changed in opensrf:
status:	New → Confirmed
assignee:	nobody → Galen Charlton (gmc)

Revision history for this message

Galen Charlton (gmc) wrote on 2018-11-05:

Pushed to master for inclusion in the 3.1 beta. Thanks, Bill!

Changed in opensrf:
assignee:	Galen Charlton (gmc) → nobody
status:	Confirmed → Fix Committed

Galen Charlton (gmc) on 2019-01-10

Changed in opensrf:
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.