Service Listeners Crash when Using an Undefined Value for Syscalls

Bug #1697029 reported by Jason Boyer on 2017-06-09
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenSRF
High
Unassigned

Bug Description

OpenSRF 2.5
Evergreen 2.12

Since the upgrade to OSRF2.5 and Eg2.12 I and at least one other system admin have noticed open-ils.search (primarily, though sometimes .auth) service listeners dying with this error message: server: died with error Can't use an undefined value as a symbol reference at /usr/local/share/perl/5.18.2/OpenSRF/Server.pm line 307.

Line 307 is pretty basic: syswrite($child->{pipe_to_child}, $write_size . $xml); Current assumptions are that somehow an undefined value is making it into the idle child list, or that the call to spawn_child on line 185 is returning undef somehow.

There isn't anything unusual in the logs prior. There isn't much additional detail to supply currently, this bug is primarily intended to raise awareness and request more eyes on the issue; additional logging will be added to Server.pm locally in hopes of finding out what's really going on. (This is made more difficult by the fact that it only occurs in production, though not necessarily under high load.)

> admin have noticed open-ils.search (primarily, though sometimes .auth)

Could you clarify what you mean about open-ils.auth? If it's crashing,
it's crashing with a different error message given that it's written
in C, not Perl.

Bill Erickson (berick) wrote :

Here's a branch that checks for various undef scenarios and logs errors (for Perl anyway). The goal was to avoid adding new logs for normal/success scenarios, so it's a little more complex than I had initially imagined. It won't tell us everything, but should provide some useful info. It has the secondary benefit of exiting early from write_child() if needed to avoid listener crashes.

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/lp1697029-server-syswrite-undef

Jason Boyer (jboyer) wrote :

RE: the auth service: that appears to be unrelated. I've seen instances where for some reason it's just not started as part of a --start-all, not that it crashes. Haven't been able to track that one down yet.

Jeff Davis (jdavis-sitka) wrote :

I added some logging to Server.pm after we started seeing this issue (see attached patch).

Last time we had this problem, this extra logging showed the following:

open-ils.search: [ERR :2738:Server.pm:329:] RT46742: child has been reaped!
open-ils.search: [ERR :2738:System.pm:122:] server: died with error Can't use an undefined value as a symbol reference at /usr/local/share/perl/5.18.2/OpenSRF/Server.pm line 331.

So, if I understand correctly, the child process is getting reaped sometime between the start of write_child() and the actual attempt to syswrite to $child->{pipe_to_child}.

I'll apply berick's patch (thanks, Bill!) and report on any additional incidents.

Bill Erickson (berick) wrote :

Thanks, Jeff! Any other ERR/WARN logs in that time frame for PID 2738?

Next obvious question is why the child died. Also, was it previously idle or newly spawned? One theory (from IRC) is the child was newly spawned, then failed to connect to jabber, leading to an asynchronous death.

My logging patch should prevent the service from crashing, but answering those questions will probably require new logging and/or log diving.

Jeff Davis (jdavis-sitka) wrote :

No other ERR or WARN logs for PID 2738 -- it was running smoothly and continuously for at least 30 minutes prior to the error. This is the last log entry before the error:

open-ils.search: [INFO:2738:Server.pm:309:] RT46742: writing to child 29658

Looking at that child PID in the logs, I see this interesting sequence:

open-ils.search: [INFO:2738:Server.pm:309:] RT46742: writing to child 29658
open-ils.search: [INFO:29658:Application.pm:159:149762023328860628] CALL: open-ils.search open-ils.search.facet_cache.retrieve open-ils.search_5ccd5bc2eb51c170cb61d443575c39b0_facets
open-ils.search: [WARN:29658:Application.pm:624:149762023328860628] open-ils.search.facet_cache.retrieve: No such signal: SIGALARM at /usr/local/share/perl/5.18.2/OpenILS/Application/Search/Biblio.pm line 1411.
open-ils.search: [INFO:29658:Transport.pm:163:149762023328860628] Message processing duration: 0.035
open-ils.search: [INFO:2738:Server.pm:309:] RT46742: writing to child 29658

The "child has been reaped!" error message immediately follows these messages. So perhaps the "No such signal: SIGALARM" error points to the underlying cause? I haven't applied Jason Boyer's fix for bug 1689576 on our system yet - maybe I should try that?

Galen Charlton (gmc) wrote :

This suggests that an uncaught SIGALRM would result in an unexpected reaping of the child process: http://www.perlmonks.org/?node_id=692349, so I suspect that applying the bug 1689576 patch would be productive.

Jason Stephenson (jstephenson) wrote :

I have not dug into this at all, but the proper terminology is SIGALRM (no second "A"). It's a fairly common typo when dealing with signals. I suggest fixing that and then trying Jason Boyer's patch.

Galen Charlton (gmc) wrote :

And if Jason Boyer's patch does indeed solve the immediate issue, then that suggests the following bit of defensive coding:

diff --git a/src/perl/lib/OpenSRF/Server.pm b/src/perl/lib/OpenSRF/Server.pm
index dcf44fe..5170dd8 100644
--- a/src/perl/lib/OpenSRF/Server.pm
+++ b/src/perl/lib/OpenSRF/Server.pm
@@ -143,6 +143,7 @@ sub run {

     $SIG{$_} = sub { $self->cleanup; } for (qw/INT QUIT/);
     $SIG{TERM} = sub { $self->cleanup(0, 1); };
+ $SIG{ALRM} = sub { $self->cleanup(0, 1); };
     $SIG{CHLD} = sub { $self->reap_children(); };
     $SIG{HUP} = sub { $self->handle_sighup(); };
     $SIG{USR1} = sub { $self->unregister_routers; };

Jason Boyer (jboyer) wrote :

I apologize for the lack of updates. I have some additional information on this bug that I hope will be helpful, though.

First re: that SIGALRM patch; I've been running that since I reported it since the log noise was frustrating and the search service was still crashing from time to time. (That's not to say that Galen's suggestion from comment 9 won't help address that.)

Ben Shum and I were talking in IRC and he reminded me of 2 old links about ejabberd: one about excessive file descriptor usage: https://metajack.im/2008/09/23/file-descriptors-are-yummy-or-common-pitfalls-of-ejabberd/ and removing the message queue / rate limiting features: http://amilascrapbook.blogspot.com/2013/06/configure-ejabber-for-higher-data-rates.html (this link is a busted Blogger site, the actual story is at the very bottom of the page.)

Since doing both of these things I haven't seen the search service crash again. (We didn't have much need to do these steps prior to the chunking changes in OSRF 2.5, but now that they've been made everything is rather smooth.)

Jason Boyer (jboyer) wrote :

None of that is to say, though, that there aren't changes needed in Server.pm to make these types of surprises less likely. Galen's ALRM handler and Bill's syscall protection patches may go a long way to making these kinds of surprises a thing of the past even on a non-optimally configured system. (I have a note down to report both of these links as missing docs bugs because there's no reason to not make them part of standard Evergreen installation.)

> I apologize for the lack of updates. I have some additional information
> on this bug that I hope will be helpful, though.
>
> First re: that SIGALRM patch; I've been running that since I reported it
> since the log noise was frustrating and the search service was still
> crashing from time to time. (That's not to say that Galen's suggestion
> from comment 9 won't help address that.)

I think it is especially important with this one to be clear whether
we're dealing with one bug here, or multiple ones. When you say
open-ils.search was continuing to crash occasionally after you applied
the SIGALRM patch, was it specifically with the "Can't use an
undefined value as a symbol reference" error, or something else?

Jason Boyer (jboyer) wrote :

Yes, it was still the same undefined symbol reference crash, not an additional issue.

Martha Driscoll (mjdriscoll) wrote :

I have been seeing this problem as well since upgrading to Opensrf 2.5.0 and Evergreen 2.12.2 last week. This morning one of my bricks was not returning search results. The log file showed these errors:

2017-06-22 05:24:29 brick2 open-ils.search: [ERR :1821:System.pm:122:] server: d
ied with error Can't use an undefined value as a symbol reference at /usr/local/
share/perl/5.20.2/OpenSRF/Server.pm line 307.

2017-06-22 05:24:36 brick2 gateway: [ERR :24153:EX.pm:66:14980738542415364] Exce
ption: OpenSRF::EX::Session 2017-06-22T05:24:36 OpenSRF::Transport /usr/local/sh
are/perl/5.20.2/OpenSRF/Transport.pm:83 Session Error: router@brick2-private/ope
n-ils.search IS NOT CONNECTED TO THE NETWORK!!!

Regarding the patch to fix the spelling of SIGALRM, that is already part of 2.12.2 and my system has that. I have also noticed a lot (50 or so) messages in the error logs per day for the various bricks with the NOT CONNECTED TO THE NETWORK message. The bricks are otherwise still online.

For what it's worth, I ran debian Jessie updates just prior to the upgrade. There were a number of erlang updates: erlang-asn1 erlang-base erlang-crypto erlang-eunit erlang-inets erlang-mnesia erlang-odbc erlang-public-key erlang-runtime-tools erlang-ssl erlang-syntax-tools erlang-tools erlang-webtool erlang-xmerl.

Bill Erickson (berick) wrote :

Here's a cleaned up version of my previous patch. I boiled it down to the essential proof-of-life test just prior to calling syswrite:

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/user/berick/lp1697029-avoid-dead-child-write

Galen Charlton (gmc) on 2017-09-06
tags: added: pullrequest
Changed in opensrf:
milestone: none → 2.5.1
Galen Charlton (gmc) wrote :

I've gone ahead and pushed this to master for inclusion in 2.5.1. Thanks, Bill!

Changed in opensrf:
status: New → Fix Committed
Changed in opensrf:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers