lpnet app servers are leaking memory again

Bug #531071 reported by Steve McInerney on 2010-03-03
This bug affects 1 person
Affects: Launchpad itself
Assigned to: Gary Poster

Bug Description

A non-exhaustive analysis suggests they're gaining around 200-250 MB per day.
But we also see "events" that cause sudden memory gains. eg:

1000 20262 1 0 19 19:41:12 11.3 692776 794852 1049884 Sl 0000000000000000 6 Tue Feb 23 13:33:55 2010 2-00:21:28 /usr/bin/python2.5 bin/run -i lpnet12
1000 20262 1 0 19 19:41:41 13.3 813460 925924 1180956 Sl 0000000000000000 6 Tue Feb 23 13:33:55 2010 2-00:22:07 /usr/bin/python2.5 bin/run -i lpnet12

i.e., a view of lpnet12 on wampee: one snapshot @ 2010-02-25 13:55:23, followed ~40 seconds later @ 2010-02-25 13:56:02, where lpnet12 has grabbed ~100 MB more RSS.

The end result is that the processes eventually head into swap land and need restarting.
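The jump between the two snapshots above can be computed mechanically. A minimal sketch, assuming the ninth whitespace-separated field of those `ps` lines is RSS in KiB (the column layout is read off the snapshot lines, not documented anywhere):

```python
# Compute RSS growth between two ps snapshots of the same process.
# Assumption: field 9 (1-indexed) of these particular ps lines is RSS in KiB.
snap1 = ("1000 20262 1 0 19 19:41:12 11.3 692776 794852 1049884 Sl "
         "0000000000000000 6 Tue Feb 23 13:33:55 2010 2-00:21:28 "
         "/usr/bin/python2.5 bin/run -i lpnet12")
snap2 = ("1000 20262 1 0 19 19:41:41 13.3 813460 925924 1180956 Sl "
         "0000000000000000 6 Tue Feb 23 13:33:55 2010 2-00:22:07 "
         "/usr/bin/python2.5 bin/run -i lpnet12")

def rss_kib(ps_line):
    """Extract the assumed RSS column (KiB) from one ps line."""
    return int(ps_line.split()[8])

delta_mib = (rss_kib(snap2) - rss_kib(snap1)) / 1024.0
print("RSS grew by %.1f MiB in ~40 seconds" % delta_mib)
```

For these two snapshots the delta works out to 128 MiB, consistent with the "~100 MB more RSS" observation.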

Related branches

Gary Poster (gary) on 2010-03-03
Changed in launchpad-foundations:
importance: Undecided → High
status: New → Triaged
milestone: none → 10.03
assignee: nobody → Stuart Bishop (stub)
Gary Poster (gary) on 2010-03-03
Changed in launchpad-foundations:
assignee: Stuart Bishop (stub) → Guilherme Salgado (salgado)
Guilherme Salgado (salgado) wrote :

I checked the logs (https://devpad.canonical.com/~salgado/leak.txt) from the period when we saw the spike in memory usage and created this spreadsheet to be able to easily sort them: https://devpad.canonical.com/~salgado/leak.ods

We were hoping to find some suspicious-looking requests there, and there are a few indeed, but when I tried hitting (a freshly restarted) staging with them, we didn't see any increase in memory usage: https://pastebin.canonical.com/28709/

Tom Haddon (mthaddon) wrote :

I've just had to restart app servers on three different physical boxes because they were getting low on free swap. See https://wiki.canonical.com/InformationInfrastructure/OSA/LPIncidentLog

Changed in launchpad-foundations:
assignee: Guilherme Salgado (salgado) → nobody
Gary Poster (gary) on 2010-03-30
Changed in launchpad-foundations:
milestone: 10.03 → 10.04
Guilherme Salgado (salgado) wrote :

We now have a hook to get a memory dump out of a running app server, so if this happens again we'll be able to get that dump and debug later.
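The bug doesn't show the hook itself, but a minimal sketch of such an on-demand dump hook looks like this. Launchpad used meliae (`scanner.dump_all_objects`); the stand-in below writes one JSON record per gc-tracked object so the example runs without meliae, and the choice of SIGUSR2 and the path prefix are assumptions for illustration:

```python
# Sketch of a "dump memory on demand" hook. The real hook used meliae's
# scanner.dump_all_objects; this stdlib stand-in mimics the shape of a
# dump (one JSON record per line) so the example is self-contained.
import gc
import json
import signal

def dump_objects(path):
    """Write one JSON record per gc-tracked object (a rough stand-in
    for meliae's dump format)."""
    with open(path, "w") as f:
        for obj in gc.get_objects():
            record = {"address": id(obj), "type": type(obj).__name__}
            f.write(json.dumps(record) + "\n")

def install_dump_hook(path_prefix="/tmp/launchpad-memory"):
    """Arrange for a dump to be written when the process gets SIGUSR2
    (signal choice is an assumption, not what Launchpad actually used)."""
    def handler(signum, frame):
        dump_objects("%s.%d.dump" % (path_prefix, signum))
    signal.signal(signal.SIGUSR2, handler)
```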

Gary Poster (gary) on 2010-05-13
Changed in launchpad-foundations:
milestone: 10.04 → 10.05
Gary Poster (gary) wrote :

Thanks to spm, we have seven memory dumps so far.

I tried to analyze all of them with meliae, which failed to process every one. The following are summaries of the failures.

launchpad-memory.20566.dump: During loader.load, got "ValueError: Expecting object: line 1 column 17 (char 17)"

launchpad-memory.20610.dump: During loader.load, got "ValueError: Unterminated string starting at: line 1 column 36 (char 36)"

launchpad-memory.20654.dump: During om.compute_parents(), process died with no explanation ("Killed") and no core dump.

lpnet11.21094.dump: During om.collapse_instance_dicts(), process died with no explanation ("Killed") and no core dump.

lpnet12.21146.dump: ValueError: Expecting object: line 1 column 79 (char 79)

lpnet13.21203.dump: ValueError: Expecting , delimiter: line 1 column 186 (char 186)

lpnet14.21255.dump: ValueError: Expecting property name: line 1 column 64 (char 64)

I suppose next steps are three. (1) Contact John Meinel and see if he has advice (or wants to investigate himself). (2) Pursue the ValueErrors by hacking meliae to give me a better idea of what line failed (the "Traceback" line overwrites the information at the moment) and see if I can hack the dump to be loadable while still keeping value. (3) Investigate with Salgado if any of these sound familiar to him, and if he has a workaround.
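For step (2), a small scanner can report exactly which records in a dump fail to parse, and a dump can be made loadable by skipping them. This assumes the dump is one JSON record per line (the format meliae writes); `find_bad_records` is a hypothetical helper, not part of meliae:

```python
# Locate malformed records in a meliae-style dump (one JSON object per
# line, which is an assumption about the dump format here).
import json

def find_bad_records(dump_path):
    """Return (line_number, error_message) pairs for unparseable records."""
    bad = []
    with open(dump_path) as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except ValueError as e:
                bad.append((lineno, str(e)))
    return bad
```

Filtering out the reported lines and rewriting the file would give a dump that loads, at the cost of losing those objects from the analysis.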

Steve McInerney (spm) wrote :

We do see a similar issue on U1 as well; perhaps there's scope for cross-team collaboration to solve this for everyone and reduce duplicated effort?
I have no idea if the *root* cause is the same; but the symptoms sure are.


Gary Poster (gary) wrote :

Given the dissimilarity of the software stacks, my first guess is that it is unlikely that there is a direct overlap, though of course it is possible.

The best chance to reduce duplicated effort I see is by agreeing on the tool to analyze the problem. By using meliae, we are at least aligning with bzr, I think (since it comes from John Meinel).

I've investigated the dumps further.

The five problems that are not simply "Killed" appear to be instances of bug 428165. I have added to that report.

I'll now see if I can get any information about the killed loads.

Gary Poster (gary) wrote :

I wonder if the five instances of bug 428165 could be alleviated in Launchpad by not using dump_all_objects, but dump_gc_objects instead. I'll try to be in touch with John Meinel about this later.

The two killed loads were out of memory errors within the meliae loader on my local machine.

I could address part of the problem in meliae's code, and opened bug 581918 with a patch. The patch plus adding some RAM pushed me over the hump so I could actually analyze one of the files.

The meliae output has in fact pointed to one possible memory leak. It would be a very, very slow one. Essentially, I think every unique 404 (or every 404 of some sort?) appears to produce a value in a cache in the Zope component architecture. We never discard these. This produces the largest dict in the system.

I've attached a fictional meliae session that was "based on a true story". In particular, the identification of the file I used is accurate, the summary output shows the real results, and the analysis of the beginning of the big dict is real.

Where is this big dict, in the scheme of things? The zope.component global site manager (``sm = zope.component.getSiteManager()``) has an instance of a GlobalAdapterRegistry for its adaptations (``sm.adapters``). This registry has a cache, the majority of which is implemented in C (``sm.adapters._v_lookup``). The C implementation has several Python dicts it uses internally for the cache. In one of them, the key for zope.interface.Interface resolves to a value of another mapping. This is The Big Dict, with keys for names, as mocked up in the last result of the example. This is the one I'm suspicious of. It may not be The Leak, because it would be *really* slow, but it is *a* leak.
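The growth pattern described above can be illustrated with a toy model. This is not zope.interface's real implementation (which lives in C); it only demonstrates why a lookup cache that also caches misses, keyed by view name, grows by one entry per unique 404:

```python
# Toy model of the adapter-lookup cache shape described above: a mapping
# keyed by view name that only ever grows, because even "no view found"
# results are cached. Not the real zope.interface code.
class ToyAdapterCache:
    def __init__(self):
        self._cache = {}  # name -> lookup result, never evicted

    def lookup_view(self, name):
        if name not in self._cache:
            # Cache the miss too -- this is the by-design leak: every
            # unique path element that 404s still costs a dict entry.
            self._cache[name] = None
        return self._cache[name]

cache = ToyAdapterCache()
for i in range(1000):
    cache.lookup_view("no-such-page-%d" % i)  # 1000 unique 404s
print(len(cache._cache))  # one cache entry per unique name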

I'll think about what to do about this later.

Gary Poster (gary) wrote :

By the way, if you look at the summary values in the example, you'll see a Very Big List. After investigating, I'm 99% sure that this is the list produced by meliae in Launchpad when calling ``gc.get_objects()``, and therefore a red herring.

Gary Poster (gary) wrote :

The dict in the example analyzed for the previous two comments had 176721 members, and total memory was 541.4MiB.

I looked at the other non-broken dump. It was slightly smaller. Results are in the example2.txt attachment.

Looking more carefully at this cache (a by-design memory leak, as stub called it), each new request adds the following at end-of-traversal (when the traversal code looks for a potential view):

- if path element not seen:
  * string
  * dict
- if combination of request type and context object not seen:
  * tuple
  * security proxy

The "provides" objects appear to be reused (they have the same identity across requests). I'm assuming the "implements" objects are as well.

In the other memory dump, the dict I am suspicious of has 170106 members, and total memory of 505.8MiB. That's 6,615 fewer entries, and about 35.6 MiB less.

It has 226852 tuples and security proxies in the dict (a factor of about 1.333 of the string keys in the main dict, fwiw.)

The bigger dump, analyzed before, has 9950 more security proxies. That looks about right numerically for this to be the cause.

However, in other numbers, there are 8904 more unicode values in the larger instance; 34,883 more dicts; 42,977 more tuples; and 5067 *fewer* strings. So...we don't have anything conclusive yet. The differences in dicts and tuples in particular are not yet explained. Maybe the unicode/string differences are just noise.

BTW, for the curious, the two big unicode values are the WADL 1.0 and beta docs, stored in memory.

I'm now seeing if I can find a cause for the differences in dicts and tuples, which seem too big to be explained by my existing hypothesis.

Gary Poster (gary) wrote :

This looks *much* more promising.

Near the bottom of lib/lp/translations/browser/translationmessage.py we have the following:

def convert_translationmessage_to_submission(
    message, current_message, plural_form, pofile, legal_warning_needed,
    is_empty=False, packaged=False):
    """Turn a TranslationMessage to an object used for rendering a submission.

    :param message: A TranslationMessage.
    :param plural_form: A plural form to prepare a submission for.
    :param pofile: A containing PO file where suggestion is being rendered.
    :param legal_warning_needed: Whether a warning check is needed.
    :param is_empty: Is the submission empty or not.
    """

    class Submission:
        pass

    submission = Submission()

These Submission types, created on-the-fly, are being kept around. On the smaller instance, there are 18420 of these. On the larger, there are 22248 of these. Each has lots of associated bits, such as their own ClassProvides object. They could easily explain lots of extra dicts, as well as tuples and other objects.

Why are they kept around? Ultimately, because of the same cache data structure described for the other "leak by design," the AdapterLookup. At some point, some code somewhere apparently tries to traverse the Submission objects. This tries to adapt them to ITraversable; the result is the DefaultTraversable. This caching will produce yet more tuples.

The end result is that each one of the Submission objects results in several dicts, several tuples, a ClassProvides, an Implements, and a couple of getset_descriptors. Adding up, each one is responsible for around 5000B at my rough guess. That would make it responsible for about 18M of the 35M difference between the two memory dumps.
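The arithmetic behind that estimate, using the counts from the two dumps and the rough 5000 B per-class guess above:

```python
# Back-of-the-envelope check of the "about 18M of the 35M difference"
# claim. The 5000-byte figure is the rough per-class guess from the
# analysis above, not a measured value.
smaller, larger = 18420, 22248   # Submission classes in each dump
per_class_bytes = 5000           # rough guess at each class's footprint
extra_classes = larger - smaller
extra_bytes = extra_classes * per_class_bytes
print(extra_classes)                          # 3828 extra classes
print(round(extra_bytes / (1024.0 ** 2), 1))  # ~18.3 MiB of the ~35.6 MiB gap
```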

That's not conclusive--it's about half--but I'll take it; it looks reasonable. There may be yet more memory leaks, but I'll stop here until given evidence we need to go farther.

The first thing to do is very easy: let's move that little Submission class out of the function. That should solve the immediate problem, or at least make it much better.

The second thing to do is harder, and can probably be parceled out to another bug and scheduled, rather than done immediately: we ought to make those inner dictionaries of the zope component cache have some kind of rotating behavior, so they can be kept at a relatively stable size. Both problems I found eventually implicated that cache.
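One possible shape for that "rotating behavior" is a size-capped mapping that evicts its oldest entries. This is an illustrative stand-in, not the zope.interface implementation (which would need to be done in its C and pure-Python lookups in parallel):

```python
# Illustrative bounded cache: once max_size is exceeded, the oldest
# entries rotate out, so the cache stays at a stable size instead of
# growing without limit.
from collections import OrderedDict

class RotatingCache:
    def __init__(self, max_size=10000):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        if key in self._data:
            del self._data[key]   # re-inserting moves the key to newest
        self._data[key] = value
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # drop the oldest entry

cache = RotatingCache(max_size=3)
for k in "abcd":
    cache.set(k, k.upper())
print(sorted(cache._data))  # ['b', 'c', 'd'] -- 'a' rotated out
```

An LRU (promote on read as well as write) would be a refinement; the point is only that the inner dicts get an upper bound.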

The cache is implemented in C code, in zope.interface's _zope_interface_coptimizations.c file (as well as a pure-Python version maintained in parallel).

Gary Poster (gary) on 2010-05-20
Changed in launchpad-foundations:
status: Triaged → Fix Committed
assignee: nobody → Gary Poster (gary)
Gary Poster (gary) wrote :

I used ab to test the change, with help from danilo to discover a way to trigger it on launchpad.dev. On the original version, the kernel killed the process because of an out of memory error after about 28000 requests to a page that triggers the error. On a version with the branch that addresses the problem, it seemed to stabilize, and delivered 35000 without an OOM error.

I'm calling this Fix Committed, with the caveats I've given before. I'll check with the LOSAs as to whether they want a CP.

Gary Poster (gary) on 2010-05-28
tags: added: qa-untestable
removed: qa-needstesting
Tom Haddon (mthaddon) on 2010-05-28
tags: added: canonical-losa-lp
Curtis Hovey (sinzui) on 2010-06-02
Changed in launchpad-foundations:
status: Fix Committed → Fix Released