Perl service drone can fail to read entire message from listener

Bug #883155 reported by Galen Charlton
This bug affects 2 people
Affects: OpenSRF
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

When passing a large message from a listener process to a child drone via the Unix-domain socket that is established for that purpose, the child can occasionally fail to read the entire message due to unhandled SIGPIPE signals.

This bug has been observed on VMware guests, but could occur sporadically on any platform. In the context of Evergreen, a common consequence of this bug is a failure to save large MARC records from the staff client.

A patch to fix this is available in working/OpenSRF.git, branch working/user/berick/sysread-sigpipe-protection

Tags: pullrequest
Galen Charlton (gmc) wrote :

Fix committed to master and rel_2_0

Changed in opensrf:
milestone: none → 2.0.2
importance: Undecided → Medium
tags: added: pullrequest
Changed in opensrf:
assignee: nobody → Galen Charlton (gmc)
status: New → In Progress
status: In Progress → Fix Committed
Galen Charlton (gmc) wrote :

Opening back up: the patch, while harmless, didn't actually solve the problem (also, it looks like SIGPIPE is raised only when writing, not when reading). There appears to be a timing issue: on the Evergreen system under examination, applying the patch and attaching strace to the open-ils.cat drone alters the timing enough to avoid the problem.
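
The point about SIGPIPE being a write-side condition can be checked with a small standalone sketch. This is a Python illustration, not OpenSRF's Perl code; the function name is made up for the example. Python ignores SIGPIPE by default, so the write-side failure surfaces as an EPIPE error rather than a signal, while a read from a closed peer simply returns end-of-file.

```python
import errno
import socket


def demo_sigpipe_direction():
    """Show that a closed peer affects writes (EPIPE) but not reads (EOF).

    Hypothetical helper for illustration only.
    """
    # Write side: the peer is closed before we send anything.
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    b.close()
    try:
        a.sendall(b"x" * 1024)
        write_result = "ok"
    except OSError as exc:
        # With SIGPIPE ignored, the kernel reports the condition as an errno.
        write_result = errno.errorcode.get(exc.errno, "UNKNOWN")
    finally:
        a.close()

    # Read side: the peer is closed, nothing was written.
    c, d = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    d.close()
    data = c.recv(1024)  # no signal, no error: just an empty read
    read_result = "EOF" if data == b"" else "data"
    c.close()

    return write_result, read_result
```

A reader that comes up short therefore has to be explained by something other than SIGPIPE, which is consistent with the timing issue described above.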

Changed in opensrf:
status: Fix Committed → In Progress
Galen Charlton (gmc) wrote :

Another fix now available in working/OpenSRF.git, branch collab/berick/perl-server-read-write-ipc-lock

Bill Erickson (berick) wrote :

I added a follow-up patch to collab/berick/perl-server-read-write-ipc-lock to address an issue seen by Dan Scott, where in some cases IPC::ShareLite->new will return undef. The patch adds some logging to better detect the situation and it also avoids killing the process when this happens, opting instead to carry forth without the lock (until we determine the source of the undef).

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=commitdiff;h=3066c9734713ae64350e991a8028cef274658358

Bill Erickson (berick) wrote :

For reference, the error seen looked like this in the opensrf / syslog logs:

 [ERR :19849:System.pm:108:] server: died with error Can't call method "lock" on an undefined value at /usr/local/share/perl/5.10.0/OpenSRF/Server.pm line 246"

Bill Erickson (berick) wrote :

The problem mentioned above seems to have something to do with the latest changes leaving shared memory segments on the server. With simple tests, the shared memory is cleaned up, but when run in opensrf, they tend to stick around (marked as destroyed for future cleanup) until the process goes away. (not sure yet if parent or child or both processes have to exit). I'll research some alternative locking approaches. Options include semaphores and flock. More to follow.

Bill Erickson (berick) wrote :

I've pushed an alternate solution to the problem using flock() instead of shared memory segments to remove the possibility of leaking resources. It creates the new lock file (per Perl process) which resides in the configured pid dir. Note, this patch makes a necessary change to opensrf-perl.pl (which is easy to miss if you're manually patching your system instead of installing from the branch).

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/collab/berick/perl-server-read-write-flock

Needs testing in an environment that suffered from the original issue...
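
For readers unfamiliar with the general idea, here is a minimal sketch of serializing a socket write and read with flock(). It is written in Python for illustration and is not the code in the branch: the function names, chunk size, temporary lock file, and use of threads in place of listener and drone processes are all assumptions. The writer holds an exclusive lock for the whole write; the reader waits for the first bytes, then takes the same lock, which blocks until the writer has finished.

```python
import fcntl
import os
import select
import socket
import tempfile
import threading
import time

CHUNK = 1024  # hypothetical chunk size, kept small so the message fits in the socket buffer


def send_locked(sock, lock_path, chunks, delay=0.02):
    """Listener side: hold an exclusive flock for the entire write."""
    with open(lock_path, "w") as lock_fh:
        fcntl.flock(lock_fh, fcntl.LOCK_EX)
        try:
            for chunk in chunks:
                sock.sendall(chunk)
                time.sleep(delay)  # simulate a slow, multi-part write
        finally:
            fcntl.flock(lock_fh, fcntl.LOCK_UN)


def recv_locked(sock, lock_path):
    """Drone side: wait for data, then take the lock before draining the socket."""
    select.select([sock], [], [])  # block until the first bytes arrive
    with open(lock_path, "w") as lock_fh:
        fcntl.flock(lock_fh, fcntl.LOCK_EX)  # blocks while the writer holds the lock
        try:
            sock.setblocking(False)
            parts = []
            while True:
                try:
                    data = sock.recv(CHUNK)
                except BlockingIOError:
                    break  # nothing left to read
                if not data:
                    break  # peer closed the socket
                parts.append(data)
            return b"".join(parts)
        finally:
            sock.setblocking(True)
            fcntl.flock(lock_fh, fcntl.LOCK_UN)


def demo():
    """Return True if the reader received the complete message."""
    listener_end, drone_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    fd, lock_path = tempfile.mkstemp(suffix=".lock")
    os.close(fd)
    chunks = [bytes([65 + i]) * CHUNK for i in range(3)]
    writer = threading.Thread(target=send_locked, args=(listener_end, lock_path, chunks))
    writer.start()
    received = recv_locked(drone_end, lock_path)
    writer.join()
    listener_end.close()
    drone_end.close()
    os.unlink(lock_path)
    return received == b"".join(chunks)
```

One limitation of this simplified sketch: the whole message must fit in the socket buffer, because a writer that blocks mid-write while holding the lock would leave the reader waiting on the lock indefinitely.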

Bill Erickson (berick) wrote :

In the comment above, that should read "It creates a new lock file per Perl /service/" not "Perl process".

Bill Erickson (berick) wrote :

I've had a chance to test collab/berick/perl-server-read-write-flock on a server that suffered from the race condition and can confirm the patch solves the original problem. Will continue to solicit testers.

Bill Erickson (berick) wrote :

We have confirmation that this same problem occurs and is resolved by collab/berick/perl-server-read-write-flock in a Xen image. Requesting additional testing and merging for opensrf 2.0.2.

Dan Scott (denials) wrote :

Pushed to master. Thanks Bill!

Changed in opensrf:
status: In Progress → Fix Committed
assignee: Galen Charlton (gmc) → nobody
Dan Scott (denials)
Changed in opensrf:
status: Fix Committed → Fix Released