fork() failure results in Perl service Listeners becoming Drones
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenSRF | Confirmed | Undecided | Unassigned |
Bug Description
When an OpenSRF Perl service Listener process fails to fork a child process for a new Drone, the Listener incorrectly interprets the return value of fork() to mean that it is the child process.
Symptoms: the affected service stops responding to requests, and the process that was formerly the Listener now calls itself a Drone (pid 18205 in the following output):
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
opensrf 18205 0.0 0.1 222536 19612 ? Ss Feb16 0:04 OpenSRF Drone [open-ils.search]
opensrf 19392 0.0 0.1 223284 22412 ? S Feb16 0:02 \_ OpenSRF Drone [open-ils.search]
opensrf 19396 0.0 0.1 223024 22608 ? S Feb16 0:01 \_ OpenSRF Drone [open-ils.search]
osrf_control --diagnostic output will include:
* ERR open-ils.search Has PID file entry [18205], which matches no running open-ils.search processes
(That ERR message is a bit misleading: the process is still running, just no longer as a Listener. It would be more precise to say "matches no running open-ils.search Listener process".)
This entire situation is probably quite rare, but I was able to trigger it by exhausting available memory. It can also likely be triggered by limiting the number of permitted processes (using ulimit or the like).
In OpenSRF::Server::spawn_child:
# from perl docs:
# [fork()] returns the child pid to the parent process, 0 to the
# child process, or "undef" if the fork is unsuccessful.
$child->{pid} = fork();
if($child->{pid}) { # parent process
#[...]
} else { # child process <-- OR unsuccessful fork() !
#[...]
}
I'm thinking that if we fail to fork() when attempting to spawn a child, we should log the error and exit -- similar to how we log "server: child process died" when the eval block indicates trouble.
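To make the distinction concrete, here is a minimal sketch of a three-way check on fork()'s return value. It is not the actual OpenSRF::Server code: classify_fork_result is a hypothetical helper, and the log message text is made up for illustration. The only thing it relies on is the documented behaviour that fork() returns undef on failure and sets $!.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of OpenSRF): classify fork()'s return value.
# undef means the fork failed, 0 means we are the child, and any other
# value is the child's pid as seen by the parent.
sub classify_fork_result {
    my ($pid) = @_;
    return 'failed' unless defined $pid;
    return $pid == 0 ? 'child' : 'parent';
}

my $pid  = fork();
my $role = classify_fork_result($pid);

if ($role eq 'failed') {
    # $! holds the reason for the failure (e.g. out of memory, process limit).
    # The Listener stays a Listener; it does not fall through to Drone code.
    warn "server: unable to fork child process: $!\n";
} elsif ($role eq 'parent') {
    # Parent: here we simply reap the child so the example exits cleanly.
    waitpid($pid, 0);
} else {
    # Child: this is where Drone initialisation would happen.
    exit 0;
}
```

The key point is the defined() test: a plain truthiness check cannot tell undef (failure) apart from 0 (child).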
I'm not yet sure if we should retry the fork() or simply give up.
If we retry the fork, there should be at least a short delay as well as a max retry or timeout -- otherwise I think we'll tie up the Listener indefinitely.
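A bounded retry along those lines might look like the sketch below. Everything here is an assumption for illustration: fork_with_retry, its max_attempts and delay parameters, and the injectable forker code ref (used only so that failure can be simulated without exhausting memory) are hypothetical names, not existing OpenSRF APIs.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);   # core module; allows fractional-second sleeps

# Hypothetical helper: try to fork up to max_attempts times, pausing
# 'delay' seconds between attempts. Returns whatever fork() returned on
# success (child pid in the parent, 0 in the child), or undef if every
# attempt failed.
sub fork_with_retry {
    my (%args) = @_;
    my $max    = $args{max_attempts} // 3;
    my $delay  = $args{delay}        // 0.1;
    my $forker = $args{forker}       // sub { fork() };

    for my $attempt (1 .. $max) {
        my $pid = $forker->();
        return $pid if defined $pid;          # success (parent or child)
        warn "fork attempt $attempt of $max failed: $!\n";
        sleep($delay) if $attempt < $max;     # no pause after the last try
    }
    return;   # undef in scalar context: caller must handle "no child"
}

# Simulated failure: a forker that always returns undef.
my $calls = 0;
my $pid = fork_with_retry(
    max_attempts => 3,
    delay        => 0.01,
    forker       => sub { $calls++; return },
);
print "gave up after $calls attempts\n" unless defined $pid;
```

Because the number of attempts and the delay are both capped, the Listener is blocked for at most roughly (max_attempts - 1) * delay seconds before it gives up and can report the failure.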
If we give up on the fork() attempt, we should probably add some additional handling to OpenSRF::Server::write_child to handle the situation where we have no child to send to.
I'd also like to test to ensure that we're returning an error to the client as quickly as we can, and not relying on an implicit timeout.
There is some existing fork() retry logic in OpenSRF::Utils::safe_fork, but I'm not sure that it's directly applicable.