Browse Hold Shelf crashes with "large" number of copies

Bug #701208 reported by Jason Stephenson
This bug affects 6 people
Affects: Evergreen
Status: Won't Fix
Importance: Low
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Evergreen version: rel_2_0 & trunk
OpenSRF version: 1.6.2
PostgreSQL version: 8.4.6

When a user chooses Browse Hold Shelf from the Circulation menu, if the library has a moderately large number of items on the hold shelf, many crash dialogs appear on the screen repeatedly. By moderately large, I mean somewhere around 100 and up. I can reproduce this reliably with 100+ copies on the hold shelf, but it never happens for a location with 25 copies on the hold shelf.

The first and largest of the error dialogs contains information like the following:

Network or server failure. Please check your Internet connection to theory.biblio.org and choose Retry Network. If you need to enter Offline Mode, choose Ignore Errors in this and subsequent dialogs. If you believe this error is due to a bug in Evergreen and not network problems, please contact your help desk or friendly Evergreen administrators, and give them this information:
method=open-ils.circ.hold.details.retrieve.authoritative
params=["42031130d32f573b13e69f61d8f205f6",6077]
THROWN:
{"payload":[],"debug":"osrfMethodException : *** Call to [open-ils.circ.hold.details.retrieve.authoritative] failed for session [1294678238.834972.12946782381651], thread trace [1]:\nCan't use an undefined value as an ARRAY reference at /openils/lib/perl5/OpenILS/Application/Circ/Holds.pm line 1085.\n\n","status":500}
STATUS:

The smaller dialogs contain messages like these:

!! This software has encountered an error. Please tell your friendly system administrator or software developer the following:
fancy_prompt.xul
ReferenceError: js2JSON is not defined

Error adjusting the font size: ReferenceError: js2JSON is not defined

TypeError: Strings is null

Clicking through to the debug information eventually reveals something like:

Please open a helpdesk ticket and include the following text:

Mon Jan 10 2011 11:41:04 GMT-0500 (EST)

Error retrieving details for hold #7093

{
    "ilsevent": "2002",
    "textcode": "DATABASE_QUERY_FAILED",
    "servertime": "Mon Jan 10 11:40:32 2011",
    "stacktrace": "/openils/lib/perl5/OpenILS/Utils/CStoreEditor.pm:745 (eval 1774):1 /openils/lib/perl5/OpenILS/Application/Circ/Holds.pm:2658",
    "pid": "24672",
    "desc": "The attempt to query to the DB failed",
    "debug": "Exception: OpenSRF::EX::ERROR 2011-01-10T11:40:32 OpenILS::Utils::CStoreEditor /openils/lib/perl5/OpenILS/Utils/CStoreEditor.pm:745 System ERROR: CStore connection timed out - transaction cannot continue\n",
    "payload": [
        7093,
        {
            "flesh_fields": {
                "ahr": [
                    "current_copy",
                    "usr",
                    "notes"
                ]
            },
            "flesh": 1
        }
    ]
}

Occasionally, the crashiness is so severe that I have to terminate the staff client using the operating system utilities before I can initiate any other work with it.

Jason Stephenson (jstephenson) wrote :

With Dan Scott's help in IRC, we were able to make this problem go away by adjusting open-ils.cstore settings in /openils/conf/opensrf.xml.

By raising the max_requests to 2000 and the max_children to 45, the problem disappeared. It seems the real problem is that we were running out of cstore processes, so the max_children is probably the key setting here.

While doing Browse Hold Shelf for our largest library, 32 cstore child processes were created. This library has 943 copies on the hold shelf as of this writing. If you are a larger consortium with several large members, you'll want to experiment with the cstore max_children setting to find one that works for you. I expect ours will go larger than 45 as multiple libraries browse the hold shelf simultaneously.
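For anyone checking their own values, here is a minimal sketch that reads the drone limits for a service out of an opensrf.xml-style fragment. The embedded XML is a hypothetical sample, and its element layout (a `unix_config` block holding `max_requests` and `max_children` under the service name) is an assumption based on the stock configuration, so compare it against your actual /openils/conf/opensrf.xml. The helper name `drone_limits` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the layout is an assumption, not a copy of a real file.
SAMPLE = """
<opensrf>
  <default>
    <apps>
      <open-ils.cstore>
        <unix_config>
          <max_requests>2000</max_requests>
          <min_children>15</min_children>
          <max_children>45</max_children>
        </unix_config>
      </open-ils.cstore>
    </apps>
  </default>
</opensrf>
"""


def drone_limits(xml_text, service):
    """Return the numeric unix_config settings for one service as a dict."""
    root = ET.fromstring(xml_text)
    # iter() matches on the exact tag name, so dots and hyphens are fine.
    for app in root.iter(service):
        cfg = app.find("unix_config")
        if cfg is not None:
            return {
                child.tag: int(child.text)
                for child in cfg
                # Skip non-numeric entries such as log or socket file names.
                if (child.text or "").strip().isdigit()
            }
    raise KeyError(service)


limits = drone_limits(SAMPLE, "open-ils.cstore")
print(limits["max_children"], limits["max_requests"])
```

To inspect a real installation, the same function could be handed the contents of the actual file instead of `SAMPLE`.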

Jason Stephenson (jstephenson) wrote :

Dan Scott suggested that this be tagged "documentation" so that the documentation team can document the need to adjust these settings.

tags: added: documentation
Justin Hopkins (hopkinsju) wrote :

I made the changes Jason suggested (I actually used 1000/45), and the problem has gone away. It doesn't seem unreasonable to have a fair number of holds (I've got 116), so I think the client should be smarter about how it loads them. Choking out the cstore drones seems avoidable. Perhaps we could do paging? Maybe we could handle timeouts by throttling the request rate rather than exploding into flaming error messages? I wonder what a significantly higher max_children would cost in system resources... If I'm seeing this on a server running a smallish single library, then I'd guess it's actually fairly common.
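To make the paging and throttling idea concrete, here is a rough sketch in Python. It is not how the staff client works; `fetch_hold_details` is a hypothetical stand-in for a per-hold request such as open-ils.circ.hold.details.retrieve, and the page size and in-flight limit are arbitrary illustrative numbers.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_hold_details(hold_id):
    # Hypothetical stub: a real client would issue one network request here.
    return {"id": hold_id}


def fetch_in_pages(hold_ids, page_size=25, max_in_flight=5):
    """Fetch details page by page, with at most max_in_flight requests at once."""
    results = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for start in range(0, len(hold_ids), page_size):
            page = hold_ids[start:start + page_size]
            # extend() drains the map iterator, so each page finishes
            # before the next one is submitted.
            results.extend(pool.map(fetch_hold_details, page))
    return results


details = fetch_in_pages(list(range(7000, 7116)))
print(len(details))
```

The point of the design is that the number of simultaneous requests, and so the number of busy drones, depends on `max_in_flight` rather than on how many copies are on the hold shelf.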

Galen Charlton (gmc) wrote :

Additional notes for documentation purposes:

Each cstore (and storage, and IIRC reporter-store and qstore) backend uses a Postgres connection, so one thing to note is that the peak number of database connections Evergreen can open (found by summing the max_children values for cstore, storage, etc.) must not exceed the max_connections value you set in PostgreSQL's configuration file.

In turn, each connection consumes some shared memory and system semaphores on the database server, so if you bump up max_connections, you may need to bump up OS parameters like kernel.shmmax.
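As a quick sanity check of that arithmetic, the following sketch sums the max_children values and compares the total with max_connections. The function name `connection_headroom`, the per-service numbers other than cstore's 45, and the max_connections value of 100 are all illustrative assumptions, not recommendations. The optional `reserved` argument is there for connections you want to keep free, for example those held back by PostgreSQL's superuser_reserved_connections setting.

```python
def connection_headroom(max_children_by_service, pg_max_connections, reserved=0):
    """Return spare connections left if every drone held one at the same time."""
    peak = sum(max_children_by_service.values())
    return pg_max_connections - reserved - peak


# Illustrative values only; read the real ones from opensrf.xml and postgresql.conf.
limits = {
    "open-ils.cstore": 45,
    "open-ils.storage": 20,
    "open-ils.reporter-store": 10,
}

headroom = connection_headroom(limits, pg_max_connections=100)
if headroom < 0:
    print(f"Over budget by {-headroom} connections: raise max_connections")
else:
    print(f"{headroom} connections to spare")
```

A negative result means the drones could collectively ask for more connections than PostgreSQL will allow, which is the situation to avoid when raising max_children.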

no longer affects: evergreen/2.0
Ben Shum (bshum)
no longer affects: evergreen/master
Blake GH (bmagic) wrote :

Should this be closed?

Kathy Lussier (klussier) wrote :

If the documentation has been created, we could close it. Otherwise, I would suggest leaving it open.

Andrea Neiman (aneiman) wrote :

Can this be considered "fixed in webby" / Won't Fix, or is this still an issue?

Jason Stephenson (jstephenson) wrote :

It's really more of a question than a bug and could be converted to a question. There was some sentiment for this being a documentation bug, that we should improve the documentation surrounding the OpenSRF settings for Evergreen drones.

I'm willing to go either way: this being a question or this being a documentation bug.

I'm not able to write the documentation right now owing to time constraints, and I may not be the best person to write it, either.

Changed in evergreen:
importance: Undecided → Low
Chris Sharp (chrissharp123) wrote :

Marking Won't Fix, since this is a PostgreSQL/OpenSRF configuration issue, and those settings necessarily differ from site to site. Not sure documentation would help here.

Changed in evergreen:
status: Confirmed → Won't Fix