Staff client memory leaks in 2.3 and later

Bug #1086458 reported by Lebbeous Fogle-Weekley on 2012-12-04
This bug affects 12 people
Affects        Importance  Assigned to
Evergreen      High        Unassigned
Evergreen 2.3  High        Unassigned
Evergreen 2.4  High        Unassigned

Bug Description

Evergreen 2.3.1
Staff client, on Windows, maybe especially 32 bit XP

There's talk about problems with the staff client using up too much memory, especially since 2.3.0. To be clear, it seems to be a problem for some systems and not others.

I don't know of anybody who's sure whether there are actually memory leaks at this point, or whether xulrunner allocates memory too aggressively, or what. Efforts to figure out this problem are still in the early phase.

I'm linking to this recent email thread, but I have a vague impression the topic has come up once or twice before, at least on IRC.

http://libmail.georgialibraries.org/pipermail/open-ils-general/2012-December/007800.html

Inherent memory management problems in Mozilla code may or may not be a factor here. Over the past 24 months or so, Mozilla has done work to address similar complaints about Firefox holding onto memory. The issues are complex, and I don't claim to be able to evaluate whether they might apply in this situation. The 2.3.1 staff client uses Mozilla xulrunner 14.0.1. Using a newer version of xulrunner at this point may or may not help with Evergreen problems.

*** Taking a different tack, Jason Etheridge suggests asking circ staff who seem to be hitting this problem the worst this question: by chance are you keeping your entire morning check-in history in the on-screen list (not using Trim List)? ***

I don't particularly have specialized knowledge of the staff client, and I'm not personally spear-heading this. I'm just collecting information here because there's been too much talk about this problem not to have a discoverable focus point for ongoing discussion.

Ben Shum (bshum) wrote :

Marking triaged as discussion and investigation continues.

It is a problem for us, I'll add more details as we work to uncover them. One thing we're specifically testing now is to see if there's any marked improvement using a newer xulrunner version.

Changed in evergreen:
status: New → Triaged
importance: Undecided → High
Anoop Atre (anoop-atre) wrote :

Evergreen: 2.3.1
Client OS: Windows7
RAM: 4GB

One of our libraries reported that they are experiencing this issue. The information they provided suggests it's the same memory issue (?); nothing really to add to what we know.

Event Viewer Entry:

Date: 12/28/2012 1:58:44 PM
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: NHA01
Description:
Faulting application name: evergreen.exe, version: 1.9.2.4079, time stamp: 0x4d6f886d
Faulting module name: mozalloc.dll, version: 14.0.1.4577, time stamp: 0x50008d68
Exception code: 0x80000003
Fault offset: 0x00001122
Faulting process id: 0x8dc
Faulting application start time: 0x01cde508fe10618a
Faulting application path: C:\Program Files (x86)\Evergreen Staff Client 2.3\evergreen.exe
Faulting module path: C:\Program Files (x86)\Evergreen Staff Client 2.3\xulrunner\mozalloc.dll
Report Id: 99f6116b-5120-11e2-9115-7845c40a702f
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Application Error" />
<EventID Qualifiers="0">1000</EventID>
<Level>2</Level>
<Task>100</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2012-12-28T18:58:44.000000000Z" />
<EventRecordID>7907</EventRecordID>
<Channel>Application</Channel>
<Computer>NHA01</Computer>
<Security />
</System>
<EventData>
<Data>evergreen.exe</Data>
<Data>1.9.2.4079</Data>
<Data>4d6f886d</Data>
<Data>mozalloc.dll</Data>
<Data>14.0.1.4577</Data>
<Data>50008d68</Data>
<Data>80000003</Data>
<Data>00001122</Data>
<Data>8dc</Data>
<Data>01cde508fe10618a</Data>
<Data>C:\Program Files (x86)\Evergreen Staff Client 2.3\evergreen.exe</Data>
<Data>C:\Program Files (x86)\Evergreen Staff Client 2.3\xulrunner\mozalloc.dll</Data>
<Data>99f6116b-5120-11e2-9115-7845c40a702f</Data>
</EventData>
</Event>

Susan Hopkins (shopkins-c) wrote :

Here is some more information regarding PCs crashing after the 2.3.1 upgrade. The PCs that produced the following error messages had had Evergreen completely uninstalled two days prior, with all OpenILS folders deleted, before the 2.3.1 client was re-installed. Since the library was very busy, the only information I can get is the following:

                XUL Runner stopped working - memory usage of evergreen.exe at the time of the error was 1.5 GB
                "script: resource://gre/modules/ConsoleAPIStorage.jsm:176" - memory usage of evergreen.exe at the time of the error was 1.7 GB

Ben Shum (bshum) wrote :

Noting for others. We have tested xulrunner 17 and had reports from libraries complaining of worsened symptoms: Evergreen locked up more quickly than before and even crashed out completely for no apparent reason. We are now moving to test xulrunner 15 as an intermediate solution.

Galen Charlton (gmc) wrote :

One of our customers will be testing with XULRunner 3.6.25 this evening and/or tomorrow.

Dan Wells (dbw2) wrote :

I can't reproduce this problem at my library, so this is little more than a shot in the dark, but is it possible a misbehaving Plugin is contributing to this problem? I was a little surprised to see that the staff client had automatically inherited all the installed Plugins from Firefox. It certainly can't hurt to try disabling them and see if it helps. To do so:

1. In the staff client menus, click 'Admin'->'For developers'->'Extension Manager'
2. On the left side of the screen, click Plugins.
3. On the right side of the screen, click 'Disable' for each Plugin.

I suppose there is an off chance you might want or need one of these Plugins enabled, but at least for testing's sake I would disable them all, then restart the client and see if anything improves.

Dan

Tim Spindler (tspindler-cwmars) wrote :
  • Memory Test Data Edit (12.7 KiB, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)

Dan Pearl has done some testing for us and here are his results:

Dan performed a short script of some typical operations on the client with XULrunner 3.6.25, 12.0, and 14.0.1. The results suggest:

· 3.6.25 and 12.0 curves of memory usage closely track each other with a constant offset.
· 14.0.1 results roughly track the usage of the others, but as the tasks are done, the offset increases, indicating a potential memory mismanagement problem in 14.0.1.

Dan did not test 15+, as anecdotes suggest that they perform even worse than 14.0.1.

Galen Charlton (gmc) wrote :

Tim: thanks for the numbers. What platform were the tests done on (i.e., Windows or Linux)?

Windows 7


Tim Spindler (tspindler-cwmars) wrote :

Dan Pearl's testing was done on a Windows 7 computer.

Kathy Lussier (klussier) wrote :

Hi Tim,

As far as I know, nobody has reported any results from xulrunner 15 testing yet, so it may be worthwhile to test that version as well. The only anecdotes I have heard have been from Ben's testing of xulrunner 17 that were mentioned above.

Kathy

Dan did another test with xulrunner 15 and found no improvement; it was actually worse. See the attached spreadsheet for details.

Dan Scott (denials) wrote :

Tim: 250 MB (per XUL 15) doesn't seem anywhere near the figures in the problem reports, which were 1.5 GB and up. Does the pattern persist over a longer period of time?

There's a possibility that later XUL versions simply use available RAM to cache more aggressively, for example, and that about:config entries like browser.cache.memory.enable and browser.cache.memory.max_entry_size might come into play here too.

I started working on a list of information of interest here: http://www.evergreen-ils.org/dokuwiki/doku.php?id=dev:xul_memory_consumption - there's more of interest there. If you go to Admin -> For Developers -> JavaScript Console and type "open('about:support')", there's an interesting "about:memory" option, with an even more interesting "Minimize memory usage" button at the bottom of that screen; it would be nice to see if that makes any difference.

Dan, I should explain how Dan Pearl was testing. He was doing this on his development server, not in our production environment, to see if one version of xulrunner was more memory-intensive than another. I think we can pull together more details and data from our member libraries, but we have a number of libraries with very active circulation desks where the client crashes as frequently as every 60-90 minutes, or memory usage exceeds 2 GB in that time. This, of course, is anecdotal, and we can work on getting specifics.

Here is some data from one of our libraries. They had to shut down the client at each of the times indicated below because evergreen slowed to a crawl or crashed. See the memory usage at these points. As a practice, some of our libraries are restarting the client every hour or 2 hours to prevent crashes or slowdowns at inopportune times when there are long lines at the circ desk. Both of the computers below are on circ desk at a heavily used library.

OS: Windows XP
Firefox Installed

Monday, 1/7/13
11:30 - 1,069,624
2:30 - 1,691,700 -- 98 circulation transactions, 1 patron record created
4:30 - 1,062,092 -- 87 circulation transactions, 4 patron records created
5:30 - 1,009,248 -- 49 circulation transactions, 2 patron records created
8:55 - 988,256

Tuesday, 1/8/13
2:24 - 1,112,800
4:03 - 1,020,980
5:49 - 1,205,308
8:15 - 1,466,756

Wednesday, 1/9/13
1:15 - 1,768,204
3:36 - 1,140,104
5:30 - 887,132
7:45 - 868,004

OS: Win7 (64-bit)
RAM: 4 GB
Firefox Installed

Monday, 1/7/13
12:05 - 1,150,804
4:40 - 1,144,601 -- 264 circulation transactions
6:23 - 1,011,692 -- 58 circulation transactions

Tuesday, 1/8/13
11:40 - 885,116 -- 77 circulation transactions
1:40 - 1,036,020 -- 111 circulation transactions
4:00 - 1,732,620
7:00 - 1,040,308

Wednesday, 1/9/13
11:15 - 1,868,212
2:00 - 1,035,268
4:00 - 1,460,028
5:41 - 1,167,000
7:45 - 1,215,480

I had one of my staff do some testing checking out a number of transactions to see what may come up.

He logged into Evergreen and watched the memory usage level out at just under 100,000k once Evergreen had loaded entirely. He then followed three different procedures to try to get the memory to increase past the stable running load:

· Checking 3 items out to one patron, then immediately checking them in from the items out screen. He did not refresh the patron record but simply went back and forth between the checkout and items out screens; memory usage only went up to 150,000k. He did this for 30 transactions.

· Next, he started exiting the patron record after each transaction: he would check the items out, hit F8 to refresh the patron, then check them in and hit F8 again. Memory usage increased to 570,000k after 36 transactions.

· Finally, he brought in another patron record and 2 more items, alternately checking out the set of 3 to one patron and the 2 to the other, refreshing constantly; memory usage dropped to 466,000k after the final 35 transactions.

The memory usage may have dropped, but it is now holding at 450,000k and showing no signs of dropping, though I have closed all of the tabs.

Michele Morgan (mmorgan) wrote :

We've done some testing of this issue on our training server and have been able to see high memory usage in the client just from using the Patron Search screen to search for patrons. We have seen the mem usage spike to over 1,000m in a client that has not performed a single circulation transaction.

We're testing on our training server running release 2.3.2 using the stock 2.3.2 client.

Our workstations have various characteristics:

Windows XP
3 gigs ram

Windows 7 32 bit
1 gig ram

Windows xp
1.73 gigs ram

We monitored the mem usage while doing the following:

Open the patron search screen and perform a fairly general search on last name, one that retrieves several hits.
Click on a hit to highlight it and allow the patron information to fill in.
Use the arrow keys to navigate the list of hits and let the patron info fill in on the top of the screen. Our training server is set to the horizontal patron display.

We noted that using the arrow keys to navigate seemed to cause the mem usage to rise quickly.

Retrieving a patron opens it in another tab, leaving the search tab still open.
When closing both the patron tab and the search tab back to the portal page, mem usage decreases, but never goes down to the level it was at login.

On our 1.73 gig workstation, opening five tabs in patron search and performing searches in all tabs pushed the mem usage up to about 1400m, rendering the client essentially unusable. Granted this is a highly unusual situation in a test environment, but on a busy circ desk, staff may be constantly retrieving patrons by searching for them, and may potentially have multiple tabs open.

We have also seen the mem use rise steadily on an idle workstation that has the search screen open populated with a list of hits. On this workstation we have used the up/down arrows to navigate the list of hits.

So maybe the issue is related less to performing transactions, and more to retrieving user information. Can others confirm similar climbing memory usage when searching for patrons?

I'm attaching a screencast of our 1.73 gig workstation slowed to a crawl with 5 patron search tabs open.

Michele Morgan (mmorgan) wrote :

Another screencast of mem use climbing on an idle workstation displaying the patron search screen.


I am working with the C/W MARS production server, and I followed Michele's testing of retrieving patrons using the patron search function, closing them, and retrieving more. I have not found that my memory usage has moved much above 300,000k on my workstation. However, my workstation has not gone down below 300,000k even after several hours. Here is my report following Dan Scott's form located here: http://evergreen-ils.org/dokuwiki/doku.php?id=dev:xul_memory_consumption

I do not normally experience memory problems. Evergreen’s mem usage can run up to around 600,000k but it does not seem to slow Evergreen down.

We are running Evergreen v. 2.3.1 and we use a custom staff client that Auto-checks for upgrades.

My staff client has been running for about 2 hours now and I have been running reports and looking at patron records but this is not the workload that normal circulation desks see. Here is the rest of the information Dan's form asks for:

SYSTEM INFO

 System Info
 OS Name Microsoft Windows 7 Professional
 Version 6.1.7601 Service Pack 1 Build 7601
 Other OS Description Not Available
 OS Manufacturer Microsoft Corporation
 System Name COWBOYS-L6
 System Manufacturer Hewlett-Packard
 System Model HP Compaq dc5800 Microtower
 System Type X86-based PC
 Processor Intel(R) Pentium(R) Dual CPU E2220 @ 2.40GHz, 2400 Mhz, 2 Core(s), 2 Logical Processor(s)
 BIOS Version/Date Hewlett-Packard 786F2 v01.04, 1/31/2008
 SMBIOS Version 2.5
 Windows Directory C:\Windows
 System Directory C:\Windows\system32
 Boot Device \Device\HarddiskVolume1
 Locale United States
 Hardware Abstraction Layer Version = "6.1.7601.17514"
 User Name CWCENTRAL\ckolasinski
 Time Zone Eastern Standard Time
 Installed Physical Memory (RAM) 4.00 GB
 Total Physical Memory 3.48 GB
 Available Physical Memory 1.25 GB
 Total Virtual Memory 6.96 GB
 Available Virtual Memory 2.55 GB
 Page File Space 3.48 GB
 Page File C:\pagefile.sys

SERVICES RUNNING

 Adobe Acrobat Update Service AdobeARMservice Running Auto Own Process "c:\program files\common files\adobe\arm\1.0\armsvc.exe" Ignore LocalSystem 0
 apache2.4 apache2.4 Running Auto Own Process "c:\xampp\apache\bin\httpd.exe" -k runservice Normal LocalSystem 0
 Application Information Appinfo Running Manual Share Process c:\windows\system32\svchost.exe -k netsvcs Normal LocalSystem 0
 Background Intelligent Transfer Service BITS Running Manual Share Process c:\windows\system32\svchost.exe -k netsvcs Normal LocalSystem 0
 Base Filtering Engine BFE Running Auto Share Process c:\windows\system32\svchost.exe -k localservicenonetwork Normal NT AUTHORITY\LocalService 0
 COM+ Event System EventSystem Running Auto Share Process c:\windows\system32\svchost.exe -k localservice Normal NT AUTHORITY\LocalService 0
 Computer Browser Browser Running Manual Share Process c:\windows\system32\svchost.exe -k netsvcs Normal LocalSystem 0
 Cryptographic Services CryptSvc Running Auto Share Process c:\windows\system32\svchost.exe -k networkservice Normal NT Authority\NetworkService 0
 DCOM Server Process Launcher DcomLaunch Running Auto Share Process c:\windows\system32\svchost.exe -k dcomlaunch Normal LocalSystem 0
 Desktop Window Manager Se...

I did some other testing today, and our problems seem to relate strongly to the patron screens. Each time I opened a patron record it added 20-25 MB of memory. For a busy circ desk doing 30 transactions an hour, a straight arithmetic projection puts that at roughly 600-750 MB per hour, so they would reach 1.5 GB within about two hours. This is fairly consistent with what we see in our busy libraries: after about 2 hours of use at busy times they slow down or crash and have to restart the client.
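As a back-of-the-envelope check, that projection can be sketched as follows (the per-record and transactions-per-hour figures come from the observations above; the baseline footprint is an assumption based on the roughly 100,000k idle figure reported earlier in this thread):

```javascript
// Rough leak-rate projection. All figures are from this thread except
// the baseline, which is an assumption (~100 MB idle footprint).
const mbPerPatronOpen = 22.5; // midpoint of the observed 20-25 MB per record
const txPerHour = 30;         // busy circ desk rate from the comment above
const baselineMb = 100;       // assumed idle footprint after login

function projectedUsageMb(hours) {
  return baselineMb + mbPerPatronOpen * txPerHour * hours;
}

console.log(projectedUsageMb(2)); // 1450 -- i.e. ~1.5 GB after two hours
```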

Ben Shum (bshum) wrote :

Received the following crash message from a library helping us test the xulrunner 15 client:

Problem signature:
  Problem Event Name: APPCRASH
  Application Name: evergreen.exe
  Application Version: 1.9.2.4079
  Application Timestamp: 4d6f886d
  Fault Module Name: mozalloc.dll
  Fault Module Version: 15.0.0.4619
  Fault Module Timestamp: 5038072d
  Exception Code: 80000003
  Exception Offset: 00001126
  OS Version: 6.1.7601.2.1.0.256.48
  Locale ID: 1033
  Additional Information 1: 0a9e
  Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
  Additional Information 3: 0a9e
  Additional Information 4: 0a9e372d3b4ad19135b953a78882e789

The odd part of that to me is the lines Application Version: 1.9.2.4079 vs. Fault Module version: 15.0.0.4619. The 15.0.0 part I expected since we're testing xulrunner 15, but the 1.9.2 part looks like a relic from our old 1.6/2.0 days while we were still on xulrunner 1.9.2. I wonder how it's putting that information together and if there might be remnant bits from past xulrunner affecting my newer xul staff client tests. Or maybe it's just a wacky Windows thing that's totally unrelated.

Ben Shum (bshum) wrote :

And to report back...

xulrunner 15 testing so far has produced the added bonus of Evergreen crashing periodically for no apparent reason. I have to check on memory details we're collecting based on the templates Dan suggested, but so far, the crash reports have occurred more often than any reports of slowdown. So maybe instead of slowing down and dying from memory issues, it just crashes. After crashing spontaneously, starting a fresh staff client runs smoothly without having to restart the PC, seeming to indicate that the memory is immediately freed up once Evergreen crashes. So maybe it's a good sign...

Ben Shum (bshum) wrote :

We're testing dbwells' suggestion of disabling plugins that apparently were inherited from Firefox. I'm asking our libs to document which plugins they had installed and disabled. Some are going to test in parallel with machines that don't have plugins disabled to try getting a sense of any perceptible performance gain.

Another idea I encountered just now was to enable the Incremental Garbage Collector (see http://techlogon.com/2012/07/18/enable-improved-memory-management-in-firefox-14/).

In about:config, that article suggests setting "javascript.options.mem.gc_incremental" to true to turn on incremental garbage collection in Firefox 14. Some other articles seem to indicate that Firefox 16 ships with this option set to true by default and that the feature has been available since Firefox 13. Not sure if enabling it has any effect for Evergreen, but we'll try testing this too.
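For reference, the pref could also be set persistently rather than through about:config; this is a minimal sketch, assuming the standard Mozilla prefs-file syntax (e.g. in the profile's prefs.js), and whether the staff client's bundled xulrunner honors it is exactly what is being tested:

```javascript
// Assumption: standard Mozilla prefs-file syntax (prefs.js / user.js).
user_pref("javascript.options.mem.gc_incremental", true);
```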

Galen Charlton (gmc) wrote :

I have found a short sequence of actions that appears to invoke the memory leak. I've tried this so far with the stock 2.3.1 and 2.3.2 clients and an auto-updating but otherwise uncustomized 2.3.1 client on two different databases.

[1] Start the Evergreen client

[2] Open a new (second) tab.

[3] In the new tab, do a patron search. The specific search doesn't appear to matter.

[4] When the results come up, select a patron.

[5] Click the retrieve button. This opens the record in a new (third) tab.

[6] Go back to the results list and select a second patron.

[7] Click the retrieve button. This opens the record in a new (fourth) tab.

[8] Close the fourth tab (patron record #2)

[9] Close the third tab (patron record #1)

[10] Close the second tab (patron search).

At this point, a leak has occurred. The following steps have to do with gathering a cycle collector dump.

[11] In the first tab, bring up the JavaScript console. Wait a few seconds for XULRunner to complete garbage collection.

[12] In the Evaluate box, paste in the following and hit the Evaluate button. This will cause a cycle graph dump to be made.

window.QueryInterface(Components.interfaces.nsIInterfaceRequestor).getInterface(Components.interfaces.nsIDOMWindowUtils).garbageCollect(Components.classes["@mozilla.org/cycle-collector-logger;1"].createInstance(Components.interfaces.nsICycleCollectorListener))

[13] Retrieve the dump. On Windows, this should be found in:

C:\Users\USER\AppData\Local\Temp

and will be called something like cc-edges-N.XXX.log

That log can be parsed using tools described in https://wiki.mozilla.org/Performance:Leak_Tools#Cycle_collector_heap_dump, but the salient bits *appear* to be that it's hanging on to various patron display navigation buttons and a pile of nsIDOMEventListeners.
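For a quick first pass over a dump, one could scan for nsDocument entries that are still alive; this is a hypothetical helper based on the dump line format quoted later in this thread, not part of the Mozilla tooling:

```javascript
// Hypothetical helper: pull live XUL document entries out of a
// cycle-collector dump (cc-edges-N.XXX.log) so that documents belonging
// to already-closed tabs can be spotted. Line format as quoted in this bug:
//   10919000 [rc=3] nsDocument normal (XUL) oils://remote/.../summary.xul?id=7
function findLiveDocuments(dumpText) {
  const docs = [];
  for (const line of dumpText.split("\n")) {
    const m = line.match(/^([0-9A-F]+) \[rc=(\d+)\] nsDocument .*?(\S+)$/i);
    if (m) docs.push({ addr: m[1], refcount: Number(m[2]), url: m[3] });
  }
  return docs;
}
```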

Galen Charlton (gmc) wrote :

I have pushed a couple patches to the collab/gmcharlt/lp1086458_clean_up_after_some_event_listeners branch in the working/Evergreen repository that in *very* preliminary testing appear to avoid the memory leaks associated with the patron search form.

They operate by making sure that two sets of event listeners are removed when the relevant form is unloaded:

- double-click handlers for patron search results
- focus handlers for the patron display form

Of the two, the first seems to be the most important.

What led me to this was analysis of cycle collector heap dumps. After running through the process described in my previous comment, the raw dump included the following line:

10919000 [rc=3] nsDocument normal (XUL) oils://remote/xul/rel_2_3_1/server/patron/summary.xul?id=7

Since that tab had been closed and shouldn't have been hanging around, I used Andrew McCreight's heapgraph tools (https://github.com/amccreight/heapgraph/tree/master/cc) to take a look:

python heapgraph/cc/find_roots.py /home/gmc/cc-edges-2.7900.log 10919000

The attachment is the output of that. The first section appears to indicate that an event listener from the search results is hanging on to the patron summary:

1425C4C0 [nsGenericElement (XUL) treecol id='billing_aua_within_city_limits' oils://remote/xul/rel_2_3_1/server/patron/search_result.xul]
    --[[via hash] mListenerManager]-> 1425C740 [nsEventListenerManager]
    --[mListeners[i] mListener]-> 14256BC8 [nsXPCWrappedJS (nsIDOMEventListener)]
    --[mJSObj]-> 1435E4E0 [JS Object (Function)]
    --[]-> 12E52B80 [JS Object (Call)]
    --[]-> 1430EF40 [JS Object (Object)]
    --[]-> 139B21C0 [JS Object (XULElement)]
...
    --[mNodeInfo]-> 157278B0 [nsNodeInfo (XUL) grid]
    --[mOwnerManager]-> 0B7C38E0 [nsNodeInfoManager]
    --[mDocument]-> 10919000 [nsDocument normal (XUL) oils://remote/xul/rel_2_3_1/server/patron/summary.xul?id=7]

After applying the patches, this stopped happening.

The double-click handler was added in 2.2, but I speculate that something changed in XULRunner's garbage collector. In any event, if I'm on the right track, a complete fix will entail making sure that all event listeners are cleaned up with removeEventListener() when they're no longer required.

Testing of these patches would be greatly appreciated ... I know they're not elegant.

Jason Etheridge (phasefx) wrote :

Galen++

Hrmm, we should probably wrap all event listener creation in a function that does something similar to what you're doing with the patron nav bar. Something like:

function addEventListener(node,type,listener,useCapture) {
  if (typeof window.oils_listeners == 'undefined') { window.oils_listeners = []; }
  try {
    node.addEventListener(type,listener,useCapture);
    window.oils_listeners.push( { 'node' : node, 'type' : type, 'listener' : listener, 'useCapture' : useCapture } );
  } catch(E) {
    alert(E);
  }
}

function my_cleanup() {
  if (typeof window.oils_listeners != 'undefined') {
    for (var i = 0; i < window.oils_listeners.length; i++) {
      window.oils_listeners[i].node.removeEventListener(
        window.oils_listeners[i].type,
        window.oils_listeners[i].listener,
        window.oils_listeners[i].useCapture
      );
      delete window.oils_listeners;
    }
  }
}
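Jason's sketch can be made self-contained for testing outside the client; this is a hedged rewrite (with the delete moved outside the loop, per the follow-up comment), where FakeNode is a stand-in for a real XUL DOM node:

```javascript
// Self-contained sketch of the listener-registry idea: every listener
// added through the tracker is recorded, and cleanup() removes them all
// before dropping the registry itself (delete runs once, after the loop).
function trackListeners(win) {
  return {
    add(node, type, listener, useCapture) {
      if (!win.oils_listeners) win.oils_listeners = [];
      node.addEventListener(type, listener, useCapture);
      win.oils_listeners.push({ node, type, listener, useCapture });
    },
    cleanup() {
      if (!win.oils_listeners) return;
      for (const l of win.oils_listeners) {
        l.node.removeEventListener(l.type, l.listener, l.useCapture);
      }
      delete win.oils_listeners; // once, after all listeners are removed
    },
  };
}

// FakeNode is a stand-in for a XUL DOM node, used only so the pattern
// can be exercised outside the staff client; it just counts listeners.
class FakeNode {
  constructor() { this.count = 0; }
  addEventListener() { this.count++; }
  removeEventListener() { this.count--; }
}
```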

Jason Etheridge (phasefx) wrote :

With that delete command outside of the for loop :)

Jason Etheridge (phasefx) wrote :

I'm able to duplicate Galen's findings with his branch. My first attempt, a shotgun approach using what I posted earlier, did not help. There's still some sort of leak with the patron search sequence, but Galen's patch really improves things. I can iterate over a similar test (involving 17 patron tabs) and gain only about 10-20 MB each time after all tabs are closed and things settle down.

Jason Stephenson (jstephenson) wrote :

In case you missed it, like I did, Galen's branch is here:

collab/gmcharlt/lp1086458_clean_up_after_some_event_listeners

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/gmcharlt/lp1086458_clean_up_after_some_event_listeners

I'll take a look at this on our development and test servers if you want. If it works, it would make a nice bandaid until all of the underlying issues are addressed.

Galen Charlton (gmc) wrote :

I've started a new collab branch, collab/gmcharlt/lp1086458_comprehensive_experimentation:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/gmcharlt/lp1086458_comprehensive_experimentation

This branch works off an idea of Jason's and adds a class to keep track of and clean up event listeners, applying it to most of the patron and circ pages so far.

Other avenues I'm investigating:

- ccdump is showing that the oncommand event listeners for menuitems in util_overlay.xul don't seem to be cleaned up automatically
- event listeners associated with XBL bindings don't seem to be cleaned up automatically either

Galen Charlton (gmc) wrote :

I've further updated the collab/gmcharlt/lp1086458_comprehensive_experimentation with additional patches to clean up two more sources of dangling references:

- util.list members hanging on to DOM elements
- references to windows created during set up of the patron search and display form

So far, testing and CC analysis seem to show that the patron search/display form is no longer leaked by performing common circ operations and then closing the tab.

Galen Charlton (gmc) wrote :

At the moment, the current state of the collab branch breaks staff clients built from it -- my testing involved pointing stock clients at the test system. Jason Etheridge is working on a patch to unbreak that.

Jason Etheridge (phasefx) wrote :

I pushed a commit which appears to fix things.

FYI, one of our libraries runs things on Linux and has far fewer problems. The following is from Rob Heller:

        We run CentOS 5 (the GPL/Free version of RedHat Enterprise
Linux) on all but one of our machines (we have one token XP box for patrons who might need to run MS-Windows software), including *both* of the staff client machines (one on the circ desk and one in the office [for cataloging, etc.]). We run the staff client with the *native* xulrunner (10.0.12) under Linux. No crashes ("Linux: It Just Plain Works")! I am not sure about memory leaks, but I have collected detailed stats. The machines are diskless P4s (refurbished Samba small form factor boxes we got cheap from Geeks.com). The machines have 2gig of RAM (not LTSP, these are full fledged workstations with NFS mounted file systems). Our *main* problem is the slow internet connection causes a problem for the people doing cataloging, but circulation does have slowness problems.

Jason Stephenson (jstephenson) wrote :

Tim,

Knowing that typos and omissions are common in email, I think it would be good if you could get clarification on this last sentence:

Our *main* problem is the slow internet connection causes a problem for the people doing cataloging, but circulation does have slowness problems.

I am inclined to read that last phrase with a "not" between "does" and "have" because of the use of "but" as the conjunction. However, it is always best to clarify these things rather than to make assumptions.

Jason S.

Galen Charlton (gmc) wrote :

The collab/gmcharlt/lp1086458_comprehensive_experimentation branch in the working repository now contains patches that clean up after event listeners more thoroughly (particularly since util.list has a lot of closures being used as event listeners), add a destructor to the messagecatalog XBL binding, and remove the caption one.

My current favorite branch is collab/gmcharlt/lp1086458_cleanup_only, which does more cleanup without (so far) requiring that the XBL caption binding go away:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/collab/gmcharlt/lp1086458_cleanup_only

Galen Charlton (gmc) wrote :

As a side note, user/gmcharlt/remove_messagecatalog contains a start at replacing messagecatalog with stringbundle -- however, I'm not sure that doing so will necessarily lead to any improvement in the memory situation per se.

Jason Etheridge (phasefx) wrote :

Wanted to mention that printing receipts is a source of memory leaks as well.

Ben Shum (bshum) wrote :

Seeing good results using the cleanup_only branch. Cherry-picked all commits into master and rel_2_3.

My recommendation at this point is that we begin coordinating additional memory leak issues into their own separate bug tickets so that we can take further action on various issues on a case-by-case basis. Specifically from this thread, I can see some suspicion of issues in printing receipts and patron search (which already has its own ticket, see bug 1110817).

Changed in evergreen:
milestone: none → 2.4.0-beta
status: Triaged → Fix Committed
tags: added: memoryleak
removed: leak memory
Ben Shum (bshum) wrote :

Reported by Steven Chan, SITKA:

We have been receiving field reports of what appear to be memory leaks in the
2.4 staff client, causing it to slow down, emit error messages, and then crash.

One good report is from a library whose circ workstation crashed about three
hours before closing time. Their IT staff restarted the Windows computer and
kept Task Manager open in order to track memory usage.

Time Memory Usage
5:56p XULRunner crashed, forced to quit Evergreen and reboot
5:57p 62,668K
7:44p 460,520K
8:42p 648,768K

Checking the database for that day and time period, we saw checkouts for only 14
unique patrons.

In order to isolate the problem further I used a straightforward test sequence
to measure memory leaks in the staff client as it performs checkouts, as
follows.

Test Setup:

* On a Windows XP machine running the current production staff client, log into
  the test server. The test server runs 2.4 and a snapshot of the production
  database.

* Press Ctrl+T to open a 2nd tab, then press F1 to show the patron barcode
  search form.

* Use a test patron account. Ensure that no alert message was entered in the
  patron account, so that the checkout screen will show immediately after the
  patron search.

* Ensure that Autoprint is selected.
* Choose an item to check out.
* Use a 2nd workstation to check in the test item.

Test Sequence:

1. Search for test account.
2. Check out the test item.
3. Press Done and cancel the print attempt; the F1 search form is re-displayed.
4. On the 2nd workstation, check in the test item so that it may be re-used.

Repeat steps 1 to 4.

Using Task Manager, record the memory usage after step 3 and after it
stabilizes. With each checkout, the usage increases, never dropping to
the starting level.

A typical sequence of per-checkout memory leaks is (in MB) 17, 27, 25, 20, 22.
If the window is closed to leave only the login window active, memory usage
drops by only 4 MB. After 5 checkouts, we have a net increase of 107 MB,
or about 20 MB per checkout.

The leak does not seem to depend on the OS platform, the Xulrunner version, or
our fork of the software. Similar leaks are seen on Windows XP and Ubuntu
12.04; with Xulrunner versions 14, 21, 22, and 23; and with the official
version of the staff client.

Changed in evergreen:
status: Fix Committed → Triaged
milestone: 2.4.0-beta → none
Galen Charlton (gmc) wrote :

I've been poking away at this. A report on my progress and some findings can be found at

http://nox.esilibrary.com/~gmc/gpls/memleak/GPLS_Staff_Client_Circulation_Memory_Leak_Report1.html

The link will last for the life of this bug, but isn't guaranteed to be eternal. It can be treated as CC-BY-SA.

Some further experimentation shows that sending a custom event to a listener defined in the chrome menu handler that invokes set_tab() sidesteps the leak ... provided that the message is sent *outside* of the circ.checkout.print() callback, or even just from a setTimeout(). Unfortunately, some sort of delay is required to avoid closing the tab before the print window is created.

Galen Charlton (gmc) wrote :

Some additional experimentation today indicates that there are actually two, related leaks.

- one caused by calling xulG.set_tab() from inside a page-level command handler
- one caused by invoking circ.checkout.print()

Clicking the 'Done' button does both. The test where I commented out the call to circ.checkout.print() revealed the first type of leak, but hid the fact that there is also a leak from the latter.

In other words, the leak cannot be worked around by using the print receipt button, then using F1 to reset the checkout session.

Steven Chan (schan2) wrote :

Here is a follow-up on my earlier test case and on Galen's progress report.

I find I am able to remove the memory leak by reducing the number of parameters
passed when calling the set_tab() function. However, the leak goes away only
when I also set the library setting 'Do not attempt auto-print' to 'Checkout',
which causes the software not to call the print method before calling set_tab().

If I try making the code change when 'Do not attempt auto-print' is not set to
'Checkout', the leak remains, which supports Galen's observation that the print
method leaks as well.

In my tests, I built a staff client out of Evergreen 2.4 trunk and ran it on a
32-bit Ubuntu 12.04 machine. For the Xulrunner, I set the option
'javascript.options.mem.gc_incremental' to true, although I don't believe it
has an effect on the leak.

In the file server/circ/checkout.js, line 237, the call to set_tab() is as
follows.

xulG.set_tab(urls.XUL_PATRON_BARCODE_ENTRY,{},{});

The 2nd parameter is an object to override default options; the 3rd parameter
is an object that is dynamically allocated by set_tab() to contain all kinds of things for
the tab. If parameters 2 or 3 are not defined, set_tab() will initialize them
to empty objects.
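Steven's point -- that omitting the trailing arguments lets set_tab() allocate fresh empty objects itself, rather than capturing object literals created in the caller's scope -- can be sketched as follows (a simplified stand-in, not Evergreen's real set_tab() implementation):

```javascript
// Simplified stand-in for xulG.set_tab(): when the caller omits the
// trailing parameters, fresh empty objects are allocated here, inside
// set_tab(), so the tab holds no references back into the caller's
// execution context.
function set_tab(url, params, content_params) {
    if (typeof params === 'undefined') { params = {}; }
    if (typeof content_params === 'undefined') { content_params = {}; }
    // ... the real function would load `url` into the tab here ...
    return { url: url, params: params, content_params: content_params };
}

// Change B from the comment above: pass only the URL, so the option
// objects are created inside set_tab() rather than as literals here.
var tab = set_tab('XUL_PATRON_BARCODE_ENTRY');
```

Even an empty `{}` literal in the calling code is allocated in the caller's scope, which may be why Change A (removing only parameter 3) did not help while Change B did.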

I made 2 changes, the 1st to remove parameter 3, the 2nd to remove
parameters 2 and 3.

Change A
xulG.set_tab(urls.XUL_PATRON_BARCODE_ENTRY, {});

Change B
xulG.set_tab(urls.XUL_PATRON_BARCODE_ENTRY);

I made 3 test runs. Each run consists of entering the barcode of a test
account and then pressing Done, repeated three times.

First before any changes:

Before
110 MB, start
176 MB, end of three barcodes

The leak is about 22MB per barcode entry.

Then after making Change A:

After Change A
 98 MB, start
165 MB, 3 barcodes
235 MB, 3 barcodes
304 MB, 3 barcodes

The code change did not change the leak.

Then, after making Change B:

After Change B
 99 MB, start
103 MB, 3 barcodes
139 MB, 3 barcodes
103 MB, 3 barcodes

The first run of three induces a leak of only 4MB; after the 3rd run, there is no
further leak.

I next reverted the setting of 'Do not attempt auto-print' to none so that
receipts will auto-print.

After Change B
103 MB, start
169 MB, 3 barcodes
237 MB, 3 barcodes

The result is a resumption of the memory leak, to about the same level as
before.

Galen Charlton (gmc) wrote :

A set of four patches that either fix or greatly reduce the memory leak during printing is available in the working/Evergreen repository in the user/gmcharlt/lp1086458_fix_receipt_printing_leak branch:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/gmcharlt/lp1086458_fix_receipt_printing_leak

The patch series does two things:

[1] Serializes the object used for the reprint last receipt function to JSON before caching it. This avoids a problem where the global cache was hanging onto references to the context of the checkout code that initiates receipt printing.

[2] Creates a chrome event handler to look for 'refresh_checkout' events rather than calling xulG.set_tab() directly. Using an event to request the tab refresh avoids the possibility of any objects or execution contexts getting leaked.
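The idea behind patch [1] can be sketched like so (function and field names here are illustrative, not the actual Evergreen identifiers): cache the receipt data as a JSON string rather than a live object, so the global cache keeps no references into the checkout code's execution context.

```javascript
// Sketch of the serialize-before-caching technique.
const global_cache = {};

function cache_last_receipt(receipt_data) {
    // Serializing makes a detached copy; objects and closures
    // reachable from receipt_data are not retained by the cache,
    // so the checkout context can be garbage-collected.
    global_cache.last_receipt = JSON.stringify(receipt_data);
}

function reprint_last_receipt() {
    // Deserialize on demand; the result is a fresh object each time.
    return JSON.parse(global_cache.last_receipt);
}

cache_last_receipt({ patron: 'TESTPATRON', items: ['book one'] });
const receipt_copy = reprint_last_receipt();
```

The trade-off is that only JSON-serializable data survives the round trip, which is fine for receipt content but would not work for cached DOM nodes or functions.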

Steven Chan's observations are also interesting -- it appears that passing anonymous hashes to xulG.set_tab() is causing execution contexts to be leaked, possibly because the call was wrapped in an anonymous function. I've taken this into account in the fourth patch in the series.

tags: added: pullrequest
Galen Charlton (gmc) wrote :

The patch series is trivially backportable, so a rel_2_3 version is available in the user/gmcharlt/lp1086458_fix_receipt_printing_leak_rel_2_3 branch:

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/gmcharlt/lp1086458_fix_receipt_printing_leak_rel_2_3

Galen Charlton (gmc) on 2013-10-14
Changed in evergreen:
status: Triaged → Confirmed
Galen Charlton (gmc) wrote :

The second report can be found at http://nox.esilibrary.com/~gmc/gpls/memleak/GPLS_Staff_Client_Circulation_Memory_Leak_Report2.html.

As before, the link is not permanent, and the content should be considered CC-BY-SA.

Martha Driscoll (mjdriscoll) wrote :

I built a new client based on Galen's code and tested it against our training server running 2.3.6, using an XP client machine. I used Galen's test method of entering the barcode of a test account and then pressing Done, repeated three times. Auto-print is checked.

I am definitely seeing an improvement in, if not elimination of, the memory leak.

2.3.6 stock client
69,612 KB memory usage at login
147,684 KB after bringing up 3 patrons
196,792 KB after bringing up 3 more patrons
212,372 KB after bringing up 3 more patrons
268,068 KB after bringing up 3 more patrons

2.3.6 memory leak fix client
68,144 KB memory usage at login
85,908 KB after bringing up 3 patrons
95,144 KB after bringing up 3 more patrons
94,192 KB after bringing up 3 more patrons
94,260 KB after bringing up 3 more patrons

I'm going to test this on our production server (2.3.7) and have one of our libraries test the client on an actual circulation desk.

Dan Wells (dbw2) wrote :

After some light testing, I have pushed this to master for inclusion in 2.5-rc1. Thanks, Galen!

Given the time constraints, I was not able to test this to the point where I am comfortable pushing to the stable releases (2.4 and 2.3), so that is still to-do.

Changed in evergreen:
status: Confirmed → Fix Committed
milestone: none → 2.5.0-rc
Mike Rylander (mrylander) wrote :

I've backported this to 2.4 (it picked clean, so "backport" is a bit strong, I guess) and used Galen's branch for 2.3. Thanks, Galen!

Dan Wells (dbw2) on 2013-11-11
Changed in evergreen:
status: Fix Committed → Fix Released