sendbranchmail with lp:~vcs-imports/linux/trunk is eating memory
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Bazaar | Confirmed | High | Unassigned |
Launchpad itself | Triaged | Critical | Unassigned |
Bug Description
Found an oops for this: OOPS-1612BM1
branch: ~vcs-imports/
branch_job_id: 1658563
We've had to kill off a sendbranchmail run, as it was driving the server into swap/eating all of it.
bzrsyncd 1629 0.0 0.0 3944 468 ? Ss 20:52 0:00 /bin/sh -c $LP_PY /srv/bzrsyncd.
bzrsyncd 1630 53.0 61.4 6871152 5033008 ? Dl 20:52 6:43 \_ /usr/bin/python2.5 /srv/bzrsyncd.
5Gb RSS is rather uncool.
Related branches
- Aaron Bentley (community): Approve on 2011-07-11
Diff: 388 lines (+117/-50), 8 files modified:
- cronscripts/sendbranchmail.py (+12/-15)
- lib/lp/code/model/branchjob.py (+60/-11)
- lib/lp/code/scripts/tests/test_sendbranchmail.py (+10/-10)
- lib/lp/codehosting/vfs/transport.py (+10/-1)
- lib/lp/registry/model/packagingjob.py (+2/-11)
- lib/lp/services/utils.py (+20/-2)
- lib/lp/translations/model/translationpackagingjob.py (+1/-0)
- lib/lp/translations/tests/test_translationtemplatesbuildjob.py (+2/-0)
Changed in launchpad-code:
importance: Undecided → Critical
Tim Penhey (thumper) wrote: #1
Changed in launchpad-code:
status: New → Triaged
assignee: nobody → Tim Penhey (thumper)
Martin Pool (mbp) wrote: #2
Perhaps we should run this with a ulimit set to something the machine can tolerate.
To get a view of where it's using memory, try following http://
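A rough in-process equivalent of that suggestion, assuming the cron script is free to cap its own address space with Python's resource module at startup; the limit value is only an example, of the same order as the ulimit figure mentioned further down:

    import resource

    # Cap the address space so runaway allocations raise MemoryError
    # instead of pushing the box into swap. Example value (~1.8 GB).
    limit_bytes = 1843200 * 1024
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))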
Tim Penhey (thumper) wrote: #3
Hmm... as well as adding '-v' to the mail job, we need the following cowboy. We will then get information about the jobs as they are being run in the log file. With that we can debug more.
=== modified file 'lib/lp/
--- lib/lp/
+++ lib/lp/
@@ -183,6 +183,13 @@
     def __init__(self, branch_job):
+    def __repr__(self):
+        branch = self.branch
+        return '<%(job_type)s job for %(branch)s>' % {
+            'job_type': self.context.
+            'branch': branch.unique_name,
+            }
+
     # XXX: henninge 2009-02-20 bug=331919: These two standard operators
     # should be implemented by delegates().
     def __eq__(self, other):
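For reference, a sketch of what the completed method presumably looks like; the attribute access after self.context. is cut off in the paste above, so job_type.name is an assumption:

    def __repr__(self):
        branch = self.branch
        return '<%(job_type)s job for %(branch)s>' % {
            # Assumption: the truncated line reads self.context.job_type.name.
            'job_type': self.context.job_type.name,
            'branch': branch.unique_name,
            }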
Tim Penhey (thumper) wrote: #4
Spectacular formatting fail there.
Tim Penhey (thumper) wrote: #5
Changed in launchpad-code:
importance: Critical → High
Steve McInerney (spm) wrote: #6
Hrm. Didn't realise you could set a hard memory limit thru ulimit. have done so via a funky wrapper script:
ulimit -v 1843200
fwiw.
Have added the -v in the script as well; but need approval for the cowboy to go ahead, then we can roll that.
Please be aware: we have had at least two repeat instances where this has nearly caused loganberry to faceplant.
On 27 May 2010 06:24, Steve McInerney <email address hidden> wrote:
> Hrm. Didn't realise you could set a hard memory limit thru ulimit. have done so via a funky wrapper script:
> ulimit -v 1843200
> fwiw.
>
> Have added the -v in the script as well; but need approval for the
> cowboy to go ahead, then we can roll that.
>
> Please be aware: we have had at least two repeat instances where this
> has nearly caused loganberry to faceplant.
'this' meaning this bug, or using ulimit?
--
Martin
"'this' meaning this bug, or using ulimit?"
Oops. This meaning the bug.
Tim, you mentioned that the same job runs on the next iteration of sendbranchmail?
If so, we aren't seeing a repeat of this memory gobble until a day-ish later, so far. I have no idea if that's sheer coincidence or expected. AFAIA, we've had three instances of this gobbling incident.
But I will do a scan through our ps(1) history looking for a more complete analysis. Details to follow.
Steve McInerney (spm) wrote: #9
The memory jump is generally fairly sudden and quite rapid, as seen from ~ minute to minute.
A given process will be happily doing its thing for a while (== minutes), then within 120 seconds has gone from ~500Mb RSS to 5+Gb RSS.
A longer history look (ifs buts maybes here) over ~ 9300 records of the sendbranchmail script tells us:
~ 97.5% of the time, it's using < 300Mb RSS.
~ 2% of the time, it's using > 1Gb RSS, most of which is > 2Gb.
Be aware: I'd suggest those higher numbers are slanted by the bigger gobbles running for longer periods; the lower ones are likewise slanted in their favour by multiple attempts to start up while a gobbling is in progress.
Not sure if this'll help in *solving* but perhaps in defining the impact. :-)
On Thu, 27 May 2010 10:26:40 you wrote:
> The memory jump is generally fairly sudden and quite rapid, as seen from ~
> minute to minute. A given process will be happily doing its thing for a
> while (== minutes) then within 120 seconds has gone from ~ 500Mb RSS, to
> 5+ Gb RSS.
>
> A longer history look (ifs buts maybes here) over ~ 9300 records of the
> sendbranchmail script tells us: ~ 97.5% of the time, it's using < 300Mb
> RSS.
> ~ 2% of the time, it's using > 1Gb RSS, most of which is > 2Gb.
>
> Be aware: I'd suggest those higher numbers are slanted by the bigger
> gobbles running for longer periods; the lower ones are likewise slanted
> in their favour by multiple attempts to start up while a gobbling is
> in progress.
>
> Not sure if this'll help in *solving* but perhaps in defining the
> impact. :-)
The additional logging will at least tell us which branches were being
processed when the memory jump occurs.
There are some known big memory requirements for diff, like when the underlying
changed file is very large. There may also be leaks. Until we can get some
sample branches that are causing problems, it is hard to diagnose.
tags: added: canonical-losa-lp
Let's get the details of branch job 1658563:
select json_data from branchjob where id = 1658563
This will give us the revisions it was trying to generate diffs for.
With this info, we can pass this on to the bazaar team to investigate.
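A minimal sketch of pulling that row out programmatically, assuming direct psycopg2 access to the database; the DSN is a placeholder and the json_data keys are whatever the job stored:

    import psycopg2
    import simplejson  # the stdlib json module is not in the Python 2.5 used here

    conn = psycopg2.connect('dbname=launchpad_prod')  # placeholder DSN
    cur = conn.cursor()
    cur.execute('SELECT json_data FROM branchjob WHERE id = %s', (1658563,))
    (json_data,) = cur.fetchone()
    # The payload records which revisions the job was generating diffs for.
    print simplejson.loads(json_data)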
tags: added: oops
description: updated
description: updated
Changed in bzr:
status: New → Incomplete
Changed in bzr:
status: Incomplete → Confirmed
importance: Undecided → Medium
importance: Medium → High
Robert Collins (lifeless) wrote: #13
So, what branch has revisions {"last_
(1 row) ?
I'm guessing linux.
Tim Penhey (thumper) wrote: #14
Rob, the description was changed when I found the branch.
+ Found an oops for this: OOPS-1612BM1
+ branch: ~vcs-imports/
+ branch_job_id: 1658563
summary: sendbranchmail is eating memory → sendbranchmail with lp:~vcs-imports/linux/trunk is eating memory
Changed in launchpad:
importance: High → Critical
Changed in launchpad:
assignee: Tim Penhey (thumper) → nobody
Aaron Bentley (abentley) wrote: #15
A quick note: "Tim, you mentioned that the same job runs on the next iteration of sendbranchmail?"
Actually, no. That job will be in the RUNNING state. (Or, if killed nicely with SIGINT, in the FAILED state.) Only WAITING jobs will be run.
Changed in launchpad:
assignee: nobody → Aaron Bentley (abentley)
Launchpad QA Bot (lpqabot) wrote: #16
Fixed in stable r13292 <http://
tags: added: qa-needstesting
Changed in launchpad:
status: Triaged → Fix Committed
tags: added: qa-untestable; removed: qa-needstesting
Changed in launchpad:
status: Fix Committed → Fix Released
tags: added: qa-ok; removed: qa-untestable
Aaron Bentley (abentley) wrote: #17
The branch that landed a fix for Launchpad was rolled back.
Changed in launchpad:
status: Fix Released → Triaged
Changed in launchpad:
assignee: Aaron Bentley (abentley) → nobody
tags: added: bzr
Changed in bzr:
status: Confirmed → Fix Committed
Changed in bzr:
status: Fix Committed → Confirmed
tags: added: check-for-breezy
tags: removed: check-for-breezy
OK, first step is to identify the branch culprit. This is almost certainly a bzr issue too, but let's work out the branch first.
Given the way jobs work, the same job would have been run shortly after, so we can't tell exactly which job it was, or whether there may have been multiple jobs. We need to add a line to the job script that prints out the job it is running (see the sketch below)... although I have a gut feeling that I added this info before and we may just need an extra '-v' parameter.
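A sketch of the kind of log line meant here, assuming the script iterates over ready jobs and has a logger handy; the names below are illustrative rather than the actual runner code:

    for job in job_source.iterReady():
        # Relies on the jobs having a useful __repr__ (the cowboy patch above).
        logger.info('Running %r', job)
        job.run()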