OpenSRF

Perl service drone can fail to read entire message from listener

Bug #883155 reported by Galen Charlton on 2011-10-28

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	OpenSRF	Fix Released	Medium	Unassigned	OpenSRF 2.0.2

Bug Description

When passing a large message from a listener process to a child drone via the Unix-domain socket that is established for that purpose, the child can occasionally fail to read the entire message due to unhandled SIGPIPE signals.
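
For illustration only: a single read on a Unix-domain stream socket can return fewer bytes than the sender wrote, so a reader has to keep reading until the whole message is in. The Python sketch below is not the OpenSRF Perl code and does not reproduce the race discussed in this report. The 4-byte length prefix and the names `send_message`, `recv_exact`, `recv_message` and `demo` are assumptions made for the example.

```python
import socket
import struct
import threading

# Hypothetical framing: a 4-byte big-endian length prefix (not OpenSRF's wire format).
HEADER = struct.Struct("!I")


def send_message(sock, payload):
    """Send a length-prefixed message; sendall() loops until every byte is written."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, count):
    """Keep calling recv() until exactly `count` bytes have arrived."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:  # peer closed before the full message arrived
            raise ConnectionError(f"socket closed with {remaining} bytes outstanding")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Read the length prefix, then read exactly that many payload bytes."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


def demo(size=1_000_000):
    """Pass `size` bytes from a 'listener' thread to a 'drone' reader; return bytes read."""
    listener, drone = socket.socketpair()  # AF_UNIX stream sockets on POSIX systems
    payload = b"x" * size
    writer = threading.Thread(target=send_message, args=(listener, payload))
    writer.start()
    try:
        received = recv_message(drone)
    finally:
        drone.close()
        writer.join()
        listener.close()
    assert received == payload
    return len(received)


if __name__ == "__main__":
    print(demo())
```

A reader that called `recv()` once and assumed it had the whole message would only see the first chunk whenever the payload is larger than what the kernel hands over in one call.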

This bug has been observed to manifest itself on VMware guests, but could occur sporadically on any platform. In the context of Evergreen, a common consequence of this bug is a failure to save large MARC records from the staff client.

A patch to fix this is available in working/OpenSRF.git, branch working/user/berick/sysread-sigpipe-protection

Galen Charlton (gmc) wrote on 2011-10-28:

#1

Fix committed to master and rel_2_0

Changed in opensrf:
milestone:	none → 2.0.2
importance:	Undecided → Medium
tags:	added: pullrequest
Changed in opensrf:
assignee:	nobody → Galen Charlton (gmc)
status:	New → In Progress
status:	In Progress → Fix Committed

Galen Charlton (gmc) wrote on 2011-10-28:

#2

Opening back up - the patch, while harmless, didn't actually solve the problem (and also, looks like SIGPIPE is raised only when writing, not when reading). There appears to be a timing issue -- on the Evergreen system under examination, the patch plus attaching strace to the open-ils.cat drone alters the timing enough to avoid the problem.
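
The point about SIGPIPE applying only to writes can be checked with a small Python sketch (again, not the OpenSRF code; the function name `closed_peer_behaviour` is made up for the example). Once the peer end of a socket is closed, a read simply reports end-of-file, while a write is what fails. CPython ignores SIGPIPE at startup, so the failed write shows up as an EPIPE error rather than a signal killing the process.

```python
import socket


def closed_peer_behaviour():
    """Return what a read and a write do once the peer end has been closed."""
    ours, theirs = socket.socketpair()
    theirs.close()

    # Reading from a socket whose peer is gone is not an error: recv() reports EOF.
    read_result = ours.recv(1024)

    # Writing is the operation that fails. With SIGPIPE ignored (CPython's default),
    # the failure surfaces as EPIPE, which Python raises as BrokenPipeError.
    try:
        ours.sendall(b"hello")
        write_result = "ok"
    except BrokenPipeError:
        write_result = "EPIPE"
    finally:
        ours.close()
    return read_result, write_result


if __name__ == "__main__":
    print(closed_peer_behaviour())
```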

Changed in opensrf:
status:	Fix Committed → In Progress

Galen Charlton (gmc) wrote on 2011-10-28:

#3

Another fix now available in working/OpenSRF.git, branch collab/berick/perl-server-read-write-ipc-lock

Bill Erickson (berick) wrote on 2011-11-04:

#4

I added a follow-up patch to collab/berick/perl-server-read-write-ipc-lock to address an issue seen by Dan Scott, where in some cases IPC::ShareLite->new will return undef. The patch adds some logging to better detect the situation and it also avoids killing the process when this happens, opting instead to carry forth without the lock (until we determine the source of the undef).

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=commitdiff;h=3066c9734713ae64350e991a8028cef274658358
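
The behaviour described in this comment (log the failure, then carry on without the lock instead of dying) can be sketched generically in Python. This is not the patch itself. `make_lock` and `maybe_locked` are hypothetical helpers, and `factory` stands in for whatever call creates the lock.

```python
import logging
import threading
from contextlib import contextmanager

log = logging.getLogger("ipc-lock-sketch")


def make_lock(factory):
    """Call a lock factory; log and return None if it yields nothing."""
    lock = factory()
    if lock is None:
        log.error("lock creation returned nothing; continuing without a lock")
    return lock


@contextmanager
def maybe_locked(lock):
    """Hold `lock` for the duration of the block if there is one, else run unlocked."""
    if lock is None:
        yield False  # degraded mode: no protection, but the process keeps going
        return
    with lock:
        yield True


if __name__ == "__main__":
    for factory in (threading.Lock, lambda: None):
        with maybe_locked(make_lock(factory)) as held:
            print("lock held:", held)
```

The trade-off is the one noted in the comment: the process stays up, but the unlocked path is exposed to the original race until the cause of the missing lock is found.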

Bill Erickson (berick) wrote on 2011-11-04:

#5

For reference, the error seen looked like this in the opensrf / syslog logs.

[ERR :19849:System.pm:108:] server: died with error Can't call method "lock" on an undefined value at /usr/local/share/perl/5.10.0/OpenSRF/Server.pm line 246"

Bill Erickson (berick) wrote on 2011-11-07:

#6

The problem mentioned above seems to have something to do with the latest changes leaving shared memory segments on the server. With simple tests, the shared memory is cleaned up, but when run in opensrf, the segments tend to stick around (marked as destroyed for future cleanup) until the process goes away. (I'm not sure yet whether the parent, the child, or both processes have to exit.) I'll research some alternative locking approaches. Options include semaphores and flock. More to follow.

Bill Erickson (berick) wrote on 2011-11-08:

#7

I've pushed an alternate solution to the problem using flock() instead of shared memory segments to remove the possibility of leaking resources. It creates the new lock file (per Perl process) which resides in the configured pid dir. Note, this patch makes a necessary change to opensrf-perl.pl (which is easy to miss if you're manually patching your system instead of installing from the branch).

http://git.evergreen-ils.org/?p=working/OpenSRF.git;a=shortlog;h=refs/heads/collab/berick/perl-server-read-write-flock

Needs testing in an environment that suffered from the original issue...
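
As a rough illustration of the flock() approach, here is a Python sketch of advisory locking on a lock file kept in a pid directory. It is not the code from the branch. The `<service>.lock` naming scheme, the temporary directory standing in for the pid dir, and the helper names `lock_path_for`, `try_lock`, `unlock` and `demo` are all assumptions. It relies on the fact that two separate open() calls on the same file produce independent open file descriptions, so their flock() requests conflict even inside one process.

```python
import fcntl
import os
import tempfile


def lock_path_for(pid_dir, service):
    # Hypothetical naming scheme: one lock file per service inside the pid directory.
    return os.path.join(pid_dir, f"{service}.lock")


def try_lock(path):
    """Try to take an exclusive flock() without blocking.

    Returns the open file object on success (keep it open to hold the lock),
    or None if another open file description already holds the lock.
    """
    handle = open(path, "a")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def unlock(handle):
    """Release the lock and close the lock file."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()


def demo():
    """Return (second_attempt_was_refused, lock_available_after_release)."""
    with tempfile.TemporaryDirectory() as pid_dir:
        path = lock_path_for(pid_dir, "open-ils.cat")
        writer = try_lock(path)      # first party takes the lock before writing
        refused = try_lock(path)     # second party cannot take it meanwhile
        unlock(writer)               # write finished, lock released
        reader = try_lock(path)      # second party can now proceed
        available = reader is not None
        if reader is not None:
            unlock(reader)
        return refused is None, available


if __name__ == "__main__":
    print(demo())
```

Because the kernel drops a flock() lock when the file description holding it is closed, including when the owning process exits, a crashed process cannot leave a lock behind the way a leaked shared memory segment can.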

Bill Erickson (berick) wrote on 2011-11-08:

#8

In the comment above, that should read "It creates a new lock file per Perl /service/" not "Perl process".

Bill Erickson (berick) wrote on 2011-11-09:

#9

I've had a chance to test collab/berick/perl-server-read-write-flock on a server that suffered from the race condition and can confirm the patch solves the original problem. Will continue to solicit testers.

Bill Erickson (berick) wrote on 2012-01-03:

#10

We have confirmation that this same problem occurs and is resolved by collab/berick/perl-server-read-write-flock in a Xen image. Requesting additional testing and merging for opensrf 2.0.2.

Dan Scott (denials) wrote on 2012-01-04:

#11

Pushed to master. Thanks Bill!

Changed in opensrf:
status:	In Progress → Fix Committed
assignee:	Galen Charlton (gmc) → nobody

Dan Scott (denials) on 2012-11-15

Changed in opensrf:
status:	Fix Committed → Fix Released
