fork() failure results in Perl service Listeners becoming Drones

Bug #1546683 reported by Jeff Godin on 2016-02-17

Bug Description

When an OpenSRF Perl service Listener process fails to fork a child process for a new Drone, the Listener incorrectly interprets the return value of fork() to mean that it is the child process.

Symptoms include: the service in question will stop responding to requests, and you'll see that what was formerly the Listener is now calling itself a Drone (pid 18205 in the following output):

opensrf 18205 0.0 0.1 222536 19612 ? Ss Feb16 0:04 OpenSRF Drone []
opensrf 19392 0.0 0.1 223284 22412 ? S Feb16 0:02 \_ OpenSRF Drone []
opensrf 19396 0.0 0.1 223024 22608 ? S Feb16 0:01 \_ OpenSRF Drone []

osrf_control --diagnostic output will include:

* ERR Has PID file entry [18205], which matches no running processes

(that ERR message is a bit misleading -- it might be more precise to say "matches no running Listener process")

This entire situation is probably quite rare, but I was able to trigger it by exhausting available memory. It can also likely be triggered by limiting the number of permitted processes (using ulimit or the like).

Jeff Godin (jgodin) wrote :

In OpenSRF::Server::spawn_child:

# from perl docs:
# [fork()] returns the child pid to the parent process, 0 to the
# child process, or "undef" if the fork is unsuccessful.
$child->{pid} = fork();

if($child->{pid}) { # parent process
} else { # child process <-- OR unsuccessful fork() !
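For illustration, here is a minimal standalone sketch (not OpenSRF code) of checking all three fork() outcomes before branching into parent- or child-only logic:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pid = fork();

if (!defined $pid) {
    # fork() failed: no child exists. Falling through to a
    # "child" branch here is exactly the bug described above.
    die "fork failed: $!\n";
}
elsif ($pid == 0) {
    # child process
    exit 0;
}
else {
    # parent process: reap the child
    waitpid($pid, 0);
    print "spawned and reaped child $pid\n";
}
```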

I'm thinking that if we fail to fork() when attempting to spawn a child, we should log the error and exit -- similar to how we log "server: child process died" when the eval block indicates trouble.

I'm not yet sure if we should retry the fork() or simply give up.

If we retry the fork, there should be at least a short delay as well as a max retry or timeout -- otherwise I think we'll tie up the Listener indefinitely.
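A bounded retry along those lines might look like the following sketch; the helper name and parameters are hypothetical, not existing OpenSRF code:

```perl
use strict;
use warnings;

# Hypothetical helper: retry fork() a bounded number of times with a
# short delay, so the Listener is never tied up indefinitely.
sub fork_with_retry {
    my ($max_tries, $delay_secs) = @_;
    for my $try (1 .. $max_tries) {
        my $pid = fork();
        return $pid if defined $pid;   # parent gets the pid, child gets 0
        warn "fork failed (attempt $try of $max_tries): $!\n";
        sleep $delay_secs;             # brief pause so we don't spin
    }
    return undef;  # caller decides: error to client, queue, or exit
}

my $pid = fork_with_retry(3, 1);
exit 0 if defined $pid && $pid == 0;   # child exits immediately
waitpid($pid, 0) if defined $pid;      # parent reaps
```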

If we give up on the fork() attempt, we should probably add some additional handling to OpenSRF::Server::write_child to handle the situation where we have no child to send to.

I'd also like to test to ensure that we're returning an error to the client as quickly as we can, and not relying on an implicit timeout.

There is some existing fork() retry logic in OpenSRF::Utils::safe_fork, but I'm not sure that it's directly applicable.

Bill Erickson (berick) wrote :

Perhaps instead of treating a fork failure like an error (and reporting it to the client), the code should treat this the same as reaching max children. That is, if a fork fails, keep the request queued and wait for a child to become available.
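That idea could be sketched roughly as follows; all names here are illustrative stand-ins, not actual OpenSRF::Server internals:

```perl
use strict;
use warnings;

# Sketch: treat a failed fork() like hitting max children, so the
# request stays queued for an existing drone instead of erroring out.
my @pending;  # stand-in for the Listener's request queue

sub dispatch_request {
    my ($request) = @_;
    my $pid = fork();
    if (!defined $pid) {
        warn "server: cannot fork new drone: $!; leaving request queued\n";
        push @pending, $request;  # an existing child picks it up later
        return undef;
    }
    if ($pid == 0) {
        exit 0;  # a real drone would service $request here
    }
    waitpid($pid, 0);  # real code tracks children rather than blocking
    return $pid;
}

my $pid = dispatch_request('example');
print defined $pid ? "dispatched to child $pid\n" : "request queued\n";
```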

Jason Boyer (jboyer) wrote :

Is it likely that a recovery is possible in this situation? Short of setting your process limit to a ludicrously low value just for testing, what are the chances that the machine isn't already hosed at this point? Throwing everything to the floor isn't always a bad option, provided the circumstances (potential data corruption, just falling down anyway in another X milliseconds, etc.) warrant it.

Jason Stephenson (jstephenson) wrote :

Typically, when your system is so busy/bad that it can't fork, you're about to be completely hosed anyway. I think the process should exit at that point.

If you start queuing messages, they are probably going nowhere and other parts of the system will start falling down, too.

Jeff Godin (jgodin) wrote :

Bill- I like that suggestion better than my original ideas.

Jason, Jason- In a low-memory situation, subsequent fork() calls might succeed once the out-of-memory (OOM) killer has destroyed a large Apache process or something. Also, if you already have at least one active drone, you'll be able to process requests in a degraded fashion, even though you can't fork the optimal number of drones.

I'll submit a working branch incorporating Bill's idea and give it some testing.

Complaining loudly in logs upon failure to fork is probably something we'd all agree with -- yes? :-)

Jason Boyer (jboyer) wrote :

I know that the OOM will eventually free up something, see my caveat about "falling down anyway in another X milliseconds." :) I'd consider any machine touched by the OOM to be untrusted until it's completely restarted; I'm not a fan of allowing it to limp along because the OOM killer took out something that seemed unnecessary at the time. Also, it's extra code (mo' code, mo' bugs) to try to deal with what will almost certainly be a fatal error very soon.

But yes, if it can get a log message off, that is a good idea. (Odds are it may trigger the OOM killer at that point, addressing my concern a certain percentage of the time, heh.)

Mike Rylander (mrylander) wrote :

Did you ever submit a branch to handle failed forks, either with a log message and/or treating that the same as max-kids?
Jeff Davis (jdavis-sitka) wrote :

I've seen this issue a few times on a test server running 2.12.

Changed in opensrf:
status: New → Confirmed

Jeff Davis (jdavis-sitka) wrote :

I'm still seeing this sometimes on a test server with 8GB RAM running OpenSRF 3.0.1 + EG 3.1.
