memory leak when doing fleshed Fieldmapper searches from C DB services

Bug #1974195 reported by Galen Charlton
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Evergreen
Fix Committed
High
Unassigned
3.7
Fix Committed
High
Unassigned
3.8
Fix Committed
High
Unassigned

Bug Description

When open-ils.cstore, open-ils.pcrud, and open-ils.rstore do a fleshed Fieldmapper-based search of an IDL class, there is a point where it serializes the fleshed response for the sake of a debug-level log entry. However, the resulting JSON string gets leaked whether or not the log entry is actually sent.

This leak has been observed to significantly waste memory in PCRUD drones that are either long-lived or were tasked with processing a query that includes fleshing and returns a big response.

Evergreen 3.7+

Galen Charlton (gmc)
Changed in opensrf:
importance: Undecided → High
Revision history for this message
Galen Charlton (gmc) wrote :

A patch is available in user/gmcharlt/lp1974195_fix_flesh_fm_search_leak / https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/gmcharlt/lp1974195_fix_flesh_fm_search_leak

The patch simply removes the debug log entry in question, since under normal circumstances the fleshed response is immediately returned anyway. However, if folks feel strongly that it needs to stay in, somethi8ng like this would fix the leak:

char* debug_str = jsonObjectToJSON( cur );
osrfLogDebug( OSRF_LOG_MARK, "%s", debug_str );
free( debug_str );

Marking this as high priority since this has been observed to really chew through memory.

tags: added: leaks pullrequest
Galen Charlton (gmc)
affects: opensrf → evergreen
description: updated
Revision history for this message
Evergreen Bug Maintenance (bugmaster) wrote :

For testers, an example of a request that can exercise the memory leak is

open-ils.pcrdu open-ils.pcrud.search.bre "AUTHTOKEN" {"id":{">=":-1}},{"flesh":2,"flesh_fields":{"bre":["call_numbers"],"acn":["copies"]}}

This should NOT be run in a production database as is, but it's effective for testing of Concerto-sized datasets.

Changed in evergreen:
assignee: nobody → Jason Stephenson (jstephenson)
Changed in evergreen:
status: New → Confirmed
Revision history for this message
Jason Stephenson (jstephenson) wrote :

Confirmed on master and 3.7.

Changed in evergreen:
milestone: none → 3.9.1
Revision history for this message
Jason Stephenson (jstephenson) wrote (last edit ):

I found a variation of the query from comment #2 useful for testing on production:

request open-ils.pcrud open-ils.pcrud.search.bre "AUTHTOKEN" {"id":{"<":1000},"deleted":"f"}, {"flesh":2, "flesh_fields":{"bre":["call_numbers"], "acn":["copies"]}}

Depending on how many bib records you have with low id numbers, it should return enough records to demonstrate the bug. You might want to play with the id number, etc.

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I tested the branch on this bug using the methodology described in https://bugs.launchpad.net/opensrf/+bug/1974193/comments/3.

I want to add that I only tested this patch with the patch from the other bug also applied, though they can be test separately.

With both patches applied, I saw even better results.

Doing the srfsh script to request a single bib record with fleshed call numbers did not increase the drone size in any appreciable way.

The query from comment #2 on this bug only increased the drone size by a few hundred bytes at a time, suggesting that there is still a small memory leak somewhere. However, I think this patch makes a significant improvement, so I have pushed a signoff branch to user/dyrcona/lp1974195_fix_flesh_fm_search_leak-signoff:

https://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dyrcona/lp1974195_fix_flesh_fm_search_leak-signoff

NOTE: My signoff has been cherry-picked on top of current master as that is how I tested it.

tags: added: signedoff
Changed in evergreen:
assignee: Jason Stephenson (jstephenson) → nobody
Changed in evergreen:
milestone: 3.9.1 → none
Changed in evergreen:
milestone: none → 3.9.1
Revision history for this message
Jason Stephenson (jstephenson) wrote :

I've been running this on the CWMARS training server as well as my main development/test VM for the past week with no issues.

I plan to push this on Tuesday (5/31) unless someone objects or beats me to it.

I had hoped to get more eyes on it, but I'm confident this works and saves a lot of RAM,

Revision history for this message
Jason Stephenson (jstephenson) wrote :

I pushed the change to master, rel_3_9, rel_3_8, and rel_3_7.

I think this patch would warrant a 3.7.4 release as a pseudo-security update. This bug could lead to crashing of services, though it seems to be rare in practice.

Thanks, Galen!

Changed in evergreen:
status: Confirmed → Fix Committed
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers