Option to force-recycle drones after long-running sessions

Bug #1706147 reported by Bill Erickson on 2017-07-24
This bug affects 1 person
Affects Status Importance Assigned to Milestone

Bug Description

An OpenSRF drone process exits and dies after it serves its configured maximum number of requests. There are times, however, when it would make sense to let a drone die sooner to release resources (e.g. memory), particularly when a single API call or session runs for an extended period of time and does a lot of work. In such cases, the amount of work performed may be comparable to thousands of quick API calls.

Examples from Evergreen land include the hold targeter, potentially the fine generator (bug #1705728), and likely other batch processes.

One option is to allow the API code to specify that a call is long-running and that its processing drone should be recycled once the session completes. I created a simple (Perl) proof of concept branch that does this that I will post for feedback.

Bill Erickson (berick) wrote :

As promised:


From the commit:

Creates an API-level option to inform the OpenSRF drone management code that the running drone should be recycled upon completion of the current OpenSRF session. This allows for quicker release of resources consumed by the drone.

To use:

sub some_api_method {
    my ($self, $client, ...) = @_;

While I could also see supporting this flag in the API registration chunk, being able to specify it dynamically is still required for multi-purpose (batch vs non-batch) APIs.

Mike Rylander (mrylander) wrote :

Perhaps we can use the existing session callback mechanism to do this. There is currently code specifically for "disconnect" and "death" callbacks. I wonder if simply sayingsomething like `$self->session->register_callback( disconnect => sub {die} );` would do what you want? I haven't tested, of course...

Bill Erickson (berick) wrote :

Ah, forgot about the callbacks. Two immediate questions are whether a 'disconnect' callback would be invoked for a stateless request and whether this would result in some scary logs for unexpected child deaths.

Bill Erickson (berick) wrote :

Hi Mike, I confirmed 'disconnect' and 'death' handlers allow the back-end to die when the session completes. (Using $client->session->register_callback()). This is very useful info.

I have concerns with this approach, though, for day to day use. For starters, it creates ugly error logs. More importantly, it requires the forking server process to clean up the unceremoniously dead child process via signal handler instead of the normal child maintenance pathways. I worry this may open us up to a new class of bugs like bug #1697029. It also means any registered 'child_exit' handlers will not be called and the jabber socket will be ungracefully chopped off.

I still prefer an early-exit / force-recycle mechanism managed by the forking server for daily use.

Galen Charlton (gmc) wrote :

FWIW, I also lean towards the force-recycle mechanism that Bill originally proposed in the interests of ensuring that the drones are more cleanly shut down.

Mike Rylander (mrylander) wrote :

The code presented isn't invasive or scary, does what it says on the tin, and definitely is nicer than sub{die}. ;)


Bill Erickson (berick) wrote :

Thanks Mike, thanks Galen. Slapping a pullrequest on this. Note the milestone may need to change.

To test:

1. Add $self->session->force_recycle(1); to an API handler.
2. Restart services and confirm which PIDs are active for the affected drone type.
3. Send a request to the modified API.
4. Confirm the one of the drone PIDs is no more and has been replaced by a new one.

Alternatively, raise logging to INTERNAL before restarting and look for a new message "server: child exiting early on force_recycle" after making the API call.

tags: added: pullrequest
Changed in opensrf:
assignee: Bill Erickson (berick) → nobody
milestone: none → 3.0.1
Changed in opensrf:
milestone: 3.0.1 → 3.0.2
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers