lpnet app servers are leaking memory again
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Launchpad itself | Fix Released | High | Gary Poster | 10.05 |
Bug Description
A non-exhaustive analysis suggests they're gaining around 200-250 MB per day.
But we also see "events" that cause sudden memory gains, e.g.:
USER PID PPID NI PRI TIME %MEM RSS SZ VSZ STAT BLOCKED NLWP STARTED ELAPSED CMD
2010-02- 1000 20262 1 0 19 19:41:12 11.3 692776 794852 1049884 Sl 0000000000000000 6 Tue Feb 23 13:33:55 2010 2-00:21:28 /usr/bin/python2.5 bin/run -i lpnet12
2010-02- 1000 20262 1 0 19 19:41:41 13.3 813460 925924 1180956 Sl 0000000000000000 6 Tue Feb 23 13:33:55 2010 2-00:22:07 /usr/bin/python2.5 bin/run -i lpnet12
i.e., a view of lpnet12 on wampee, one snapshot @ 2010-02-
The end result is that the processes eventually head into swap land and need restarting.
Related branches
- Данило Шеган (community): Approve on 2010-05-19
  Diff: 33 lines (+5/-4), 1 file modified: lib/lp/translations/browser/translationmessage.py (+5/-4)
Changed in launchpad-foundations:
  importance: Undecided → High
  status: New → Triaged
  milestone: none → 10.03
  assignee: nobody → Stuart Bishop (stub)
Changed in launchpad-foundations:
  assignee: Stuart Bishop (stub) → Guilherme Salgado (salgado)
Guilherme Salgado (salgado) wrote: #1
Tom Haddon (mthaddon) wrote: #2
I've just had to restart app servers on three different physical boxes because they were getting low on free swap. See https:/
Changed in launchpad-foundations:
  assignee: Guilherme Salgado (salgado) → nobody
Changed in launchpad-foundations:
  milestone: 10.03 → 10.04
Guilherme Salgado (salgado) wrote: #3
We now have a hook to get a memory dump out of a running app server, so if this happens again we'll be able to get that dump and debug later.
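The hook itself isn't reproduced in this report, but here is a minimal sketch of the idea, assuming a signal-based trigger and meliae's scanner.dump_all_objects; the signal choice and dump path are illustrative, not the actual Launchpad configuration.

import os
import signal

from meliae import scanner


def _dump_memory(signum, frame):
    # Write a full object dump that can be copied off the box and analyzed
    # offline with meliae's loader.
    scanner.dump_all_objects('/tmp/lp-meliae-%d.dump' % os.getpid())


# Illustrative: wire the dump to a signal that can be sent to a running
# app server without restarting it.
signal.signal(signal.SIGUSR2, _dump_memory)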
Changed in launchpad-foundations:
  milestone: 10.04 → 10.05
Gary Poster (gary) wrote: #4
Thanks to spm, we have seven memory dumps so far.
I have tried to analyze all of them with meliae, but meliae failed to process every one of them. The following are summaries of the failures.
launchpad-
launchpad-
launchpad-
lpnet11.21094.dump: During om.collapse_
lpnet12.21146.dump: ValueError: Expecting object: line 1 column 79 (char 79)
lpnet13.21203.dump: ValueError: Expecting , delimiter: line 1 column 186 (char 186)
lpnet14.21255.dump: ValueError: Expecting property name: line 1 column 64 (char 64)
I suppose the next steps are three. (1) Contact John Meinel and see if he has advice (or wants to investigate himself). (2) Pursue the ValueErrors by hacking meliae to give me a better idea of which line failed (the "Traceback" line overwrites that information at the moment) and see if I can hack the dump into being loadable while still keeping its value. (3) Check with Salgado whether any of these sound familiar to him, and whether he has a workaround.
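For reference, a typical meliae analysis session (based on the tool's documented usage at the time; exact method names may differ between meliae versions) looks roughly like this, using one of the dump files listed above:

from meliae import loader

om = loader.load('lpnet11.21094.dump')  # the ValueErrors above are raised here for some dumps
om.compute_referrers()                  # so large objects can be traced back to their owners
om.collapse_instance_dicts()            # fold each instance __dict__ into its owning object
print(om.summarize())                   # per-type object counts and byte totals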
Steve McInerney (spm) wrote: #5
We do see a similar issue on U1 as well; there may be possibilities for cross-team collaboration to solve this for everyone and reduce duplicated effort?
I have no idea if the *root* cause is the same, but the symptoms sure are.
??
Gary Poster (gary) wrote: #6
Given the dissimilarity of the software stacks, my first guess is that it is unlikely that there is a direct overlap, though of course it is possible.
The best chance I see to reduce duplicated effort is agreeing on the tool used to analyze the problem. By using meliae, we are at least aligning with bzr, I think (since it comes from John Meinel).
I've investigated the dumps further.
The five problems that are not simply "Killed" appear to be instances of bug 428165. I have added to that report.
I'll now see if I can get any information about the killed loads.
Gary Poster (gary) wrote: #7
I wonder if the five instances of bug 428165 could be alleviated in Launchpad by not using dump_all_objects, but dump_gc_objects instead. I'll try to be in touch with John Meinel about this later.
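For context, both entry points live in meliae.scanner; the swap described above would look roughly like this (the dump path is illustrative):

from meliae import scanner

# scanner.dump_all_objects('/tmp/lpnet12.dump')  # what the current hook calls
scanner.dump_gc_objects('/tmp/lpnet12.dump')     # the alternative suggested above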
The two killed loads were out of memory errors within the meliae loader on my local machine.
I could address part of the problem in meliae's code, and opened bug 581918 with a patch. The patch plus adding some RAM pushed me over the hump so I could actually analyze one of the files.
The meliae output has in fact pointed to one possible memory leak. It would be very, very slow. Essentially, I think every unique 404 (or every 404 of some sort?) produces a value in a cache in the Zope component architecture. We never discard these values, and they produce the largest dict in the system.
I've attached a fictional meliae session that was "based on a true story". In particular, the identification of the file I used is accurate, the output of the summary shows the real results, and the analysis of the beginning of the big dict is real.
Where is this big dict, in the scheme of things? The zope.component global site manager (``sm = zope.component.getGlobalSiteManager()``).
I'll think about what to do about this later.
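A hedged, exploratory sketch of peeking at that cache from a Python prompt: getGlobalSiteManager() is public zope.component API, but the _v_lookup and _cache attributes are zope.interface implementation details that may be absent or different in any given version.

from zope.component import getGlobalSiteManager

sm = getGlobalSiteManager()
lookup = getattr(sm.adapters, '_v_lookup', None)   # the registry's AdapterLookup
cache = getattr(lookup, '_cache', {}) if lookup is not None else {}
# Rough view of how many lookup results are being cached per provided interface.
for provided, entries in cache.items():
    print(provided, len(entries))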
Gary Poster (gary) wrote: #8
By the way, if you look at the summary values in the example, you'll see a Very Big List. After investigating, I'm 99% sure that this is the list produced by meliae in Launchpad when calling ``gc.get_objects()``.
Gary Poster (gary) wrote: #9
The dict in the example analyzed for the previous two comments had 176721 members, and total memory was 541.4MiB.
I looked at the other non-broken dump. It was slightly smaller. Results are in the example2.txt attachment.
Looking more carefully at this cache (a by-design memory leak, as stub called it), each new end-of-traversal request (when the traversal code looks for a potential view) adds the following:
- if path element not seen:
* string
* dict
- if combination of request type and context object not seen:
* tuple
* security proxy
The "provides" objects appear to be reused (they have the same identity across requests). I'm assuming the "implements" objects are as well.
In the other memory dump, the dict I am suspicious of has 170106 members, and total memory of 505.8MiB. That's 6,615 fewer entries, and about 35.6 MiB less.
It has 226852 tuples and security proxies in the dict (about 1.333 times the number of string keys in the main dict, fwiw).
The bigger dump, analyzed before, has 9950 more security proxies. That looks about right numerically for this to be the cause.
However, in other numbers, there are 8904 more unicode values in the larger instance; 34,883 more dicts; 42,977 more tuples; and 5067 *fewer* strings. So...we don't have anything conclusive yet. The differences in dicts and tuples in particular are not yet explained. Maybe the unicode/string differences are just noise.
BTW, for the curious, the two big unicode values are the WADL 1.0 and beta docs, stored in memory.
I'm now seeing if I can find a cause for the differences in dicts and tuples, which seem too big to be explained by my existing hypothesis.
Gary Poster (gary) wrote: #10
This looks *much* more promising.
Near the bottom of lib/lp/translations/browser/translationmessage.py, we have:
def convert_
    message, current_message, plural_form, pofile, legal_warning_
    is_empty=False, packaged=False):
    """Turn a TranslationMessage to an object used for rendering a submission.

    :param message: A TranslationMessage.
    :param plural_form: A plural form to prepare a submission for.
    :param pofile: A containing PO file where suggestion is being rendered.
    :param legal_warning_
    :param is_empty: Is the submission empty or not.
    """
    class Submission:
        pass

    submission = Submission()
    ...
These Submission types, created on-the-fly, are being kept around. On the smaller instance, there are 18420 of these. On the larger, there are 22248 of these. Each has lots of associated bits, such as their own ClassProvides object. They could easily explain lots of extra dicts, as well as tuples and other objects.
Why are they kept around? Ultimately, because of the same cache data structure described for the other "leak by design", the AdapterLookup. At some point, some code somewhere apparently tries to traverse the Submission objects. This tries to adapt them to ITraversable. The result is the DefaultTraversable. This caching will produce yet more tuples.
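A minimal, self-contained sketch (not Launchpad code) of the mechanism described above: adapting instances of a freshly created class leaves an entry in the global adapter registry's lookup cache keyed by that class's interface declaration, so classes defined inside a function are never released. IExample and ExampleAdapter are illustrative stand-ins for ITraversable and DefaultTraversable.

import gc

from zope.component import getGlobalSiteManager, queryAdapter
from zope.interface import Interface, implementer


class IExample(Interface):
    """Stand-in for ITraversable in the scenario described above."""


@implementer(IExample)
class ExampleAdapter(object):
    def __init__(self, context):
        self.context = context


gsm = getGlobalSiteManager()
gsm.registerAdapter(ExampleAdapter, required=(Interface,), provided=IExample)

gc.collect()
before = len(gc.get_objects())
for i in range(1000):
    class Submission(object):             # a brand-new class per iteration, as in the view code
        pass
    queryAdapter(Submission(), IExample)  # the lookup result is cached, keyed by this class's spec
gc.collect()
after = len(gc.get_objects())
print('objects kept alive across 1000 adaptations: %d' % (after - before))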
The end result is that each one of the Submission objects results in several dicts, several tuples, a ClassProvides, an Implements, and a couple of getset_descriptors. Adding up, each one is responsible for around 5000B at my rough guess. That would make it responsible for about 18M of the 35M difference between the two memory dumps.
That's not conclusive--it's about half--but I'll take it. It looks reasonable. There may be yet more memory leaks, but I'll stop here until given evidence that we need to go further.
The first thing to do is very easy: let's move that little Submission class out of the function. That should solve the immediate problem, or at least make it much better.
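A sketch of the shape of that fix (illustrative names, not the actual Launchpad diff): defining the class once at module level means the component architecture's caches only ever see a single class object instead of one new class per call.

class Submission:
    """Plain holder for the data needed to render one submission."""


def make_submission(message):
    # Hypothetical helper standing in for the real view code: reuse the
    # module-level class rather than defining a new one on every call.
    submission = Submission()
    submission.message = message
    return submission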
The second thing to do is harder, and can probably be parceled out to another bug and scheduled, rather than done immediately: we ought to make those inner dictionaries of the zope component cache have some kind of rotating behavior, so they can be kept at a relatively stable size. Both problems I found eventually implicated that cache.
The cache is implemented in C code, in zope.interface's _zope_interface_coptimizations module.
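A hedged sketch of the "rotating" idea in pure Python (illustrative only; the real cache lives in zope.interface's C extension and would need the equivalent change there): a dict-like cache that evicts its oldest entries once it passes a fixed size, so lookups still get cached but the structure stays roughly constant in size.

from collections import OrderedDict


class RotatingCache(OrderedDict):
    """A dict that drops its oldest entries once it exceeds max_size."""

    def __init__(self, max_size=1000):
        super(RotatingCache, self).__init__()
        self.max_size = max_size

    def __setitem__(self, key, value):
        super(RotatingCache, self).__setitem__(key, value)
        while len(self) > self.max_size:
            self.popitem(last=False)  # evict the oldest entry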
Changed in launchpad-foundations:
  status: Triaged → Fix Committed
  assignee: nobody → Gary Poster (gary)
Gary Poster (gary) wrote: #11
I used ab to test the change, with help from danilo to find a way to trigger the problem on launchpad.dev. On the original version, the kernel killed the process because of an out-of-memory error after about 28000 requests to a page that triggers the error. On a version with the branch that addresses the problem, memory usage seemed to stabilize, and it delivered 35000 requests without an OOM error.
I'm calling this Fix Committed, with the caveats I've given before. I'll check with the LOSAs as to whether they want a CP.
Fixed in stable r10897 <http://
tags: added: qa-needstesting
tags: added: qa-untestable; removed: qa-needstesting
tags: added: canonical-losa-lp
Changed in launchpad-foundations:
  status: Fix Committed → Fix Released
I checked the logs (https://devpad.canonical.com/~salgado/leak.txt) from the period when we saw the spike in memory usage and created this spreadsheet to be able to easily sort them: https://devpad.canonical.com/~salgado/leak.ods
We were hoping to find some suspicious-looking requests there, and there are a few indeed, but when I tried hitting (a freshly restarted) staging with them, we didn't see any increase in memory usage: https://pastebin.canonical.com/28709/