sendbranchmail with lp:~vcs-imports/linux/trunk is eating memory

Reported by Steve McInerney on 2010-05-24
This bug affects 3 people
Affects            Status   Importance   Assigned to   Milestone
Bazaar                      High         Unassigned
Launchpad itself            Critical     Unassigned

Bug Description

Found an oops for this: OOPS-1612BM1
    branch: ~vcs-imports/linux/trunk
    branch_job_id: 1658563

We've had to kill off a sendbranchmail run, as it was driving the server into swap/eating all of it.

bzrsyncd 1629 0.0 0.0 3944 468 ? Ss 20:52 0:00 /bin/sh -c $LP_PY /srv/bzrsyncd.launchpad.net/production/launchpad/cronscripts/sendbranchmail.py >> /srv/bzrsyncd.launchpad.net/production-logs/sendbranchmail.log 2>&1
bzrsyncd 1630 53.0 61.4 6871152 5033008 ? Dl 20:52 6:43 \_ /usr/bin/python2.5 /srv/bzrsyncd.launchpad.net/production/launchpad/cronscripts/sendbranchmail.py

5 GB RSS is rather uncool.

Steve McInerney (spm) on 2010-05-24
Changed in launchpad-code:
importance: Undecided → Critical
Tim Penhey (thumper) wrote :

OK, first step is to identify the branch culprit. This is almost certainly a bzr issue too, but let's work out the branch first.

Given the way jobs work, the same job would have been run again shortly after, so we can't tell exactly which job it was, or whether there were multiple jobs. We need to add a line to the job script that prints out the job it is running... although I have a gut feeling that I added this info before and we may just need an extra '-v' parameter.

Tim Penhey (thumper) on 2010-05-25
Changed in launchpad-code:
status: New → Triaged
assignee: nobody → Tim Penhey (thumper)
Martin Pool (mbp) wrote :

Perhaps we should run this with a ulimit set to something the machine can tolerate.

To get a view of where it's using memory, try following http://jam-bazaar.blogspot.com/2009/11/memory-debugging-with-meliae.html - though this may be a bit hard if it's run non-interactively.
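For a non-interactive cron run, one option is to have the process write out a memory summary when it receives a signal. The sketch below uses the standard-library tracemalloc module (Python 3.4+) as a stand-in for Meliae, so it is not the technique from the linked post and would not run on the python2.5 seen in the ps output. The function names, the log path argument and the choice of SIGUSR1 are assumptions for illustration only.

```python
import signal
import tracemalloc


def top_allocations(limit=10):
    """Return the `limit` largest allocation sites as printable strings.

    tracemalloc must already be tracing, otherwise take_snapshot() raises.
    """
    snapshot = tracemalloc.take_snapshot()
    return [str(stat) for stat in snapshot.statistics("lineno")[:limit]]


def install_dump_handler(log_path, signum=signal.SIGUSR1):
    """Append an allocation summary to log_path whenever signum arrives.

    Must be called from the main thread; SIGUSR1 is an arbitrary choice.
    """
    if not tracemalloc.is_tracing():
        tracemalloc.start()

    def _dump(_signum, _frame):
        with open(log_path, "a") as log:
            log.write("\n".join(top_allocations()) + "\n")

    signal.signal(signum, _dump)
```

With a handler like this installed at startup, an operator could send the signal to the running process (kill -USR1 <pid>) once RSS starts climbing and read the summary from the log.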

Tim Penhey (thumper) wrote :

Hmm... as well as adding '-v' to the mail job, we need the following cowboy. We will then get information about the jobs as they are being run in the log file. With that we can debug more.

=== modified file 'lib/lp/code/model/branchjob.py'
--- lib/lp/code/model/branchjob.py 2010-04-23 05:29:30 +0000
+++ lib/lp/code/model/branchjob.py 2010-05-25 03:56:09 +0000
@@ -183,6 +183,13 @@
     def __init__(self, branch_job):
         self.context = branch_job

+    def __repr__(self):
+        branch = self.branch
+        return '<%(job_type)s job for %(branch)s>' % {
+            'job_type': self.context.job_type.name,
+            'branch': branch.unique_name,
+            }
+
     # XXX: henninge 2009-02-20 bug=331919: These two standard operators
     # should be implemented by delegates().
     def __eq__(self, other):
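
To show what the log line from the __repr__ in the diff above would look like, here is a minimal standalone sketch. The namedtuple stand-ins, the ExampleJob class and the EXAMPLE_TYPE job type name are hypothetical; only the attribute names (context.job_type.name, branch.unique_name) come from the diff.

```python
from collections import namedtuple

# Hypothetical stand-ins for the Launchpad objects; only the attributes
# that __repr__ touches are modelled here.
JobType = namedtuple("JobType", "name")
Branch = namedtuple("Branch", "unique_name")
BranchJob = namedtuple("BranchJob", "job_type branch")


class ExampleJob(object):
    """Hypothetical wrapper mirroring the shape used in the diff."""

    def __init__(self, branch_job):
        self.context = branch_job

    @property
    def branch(self):
        return self.context.branch

    def __repr__(self):
        branch = self.branch
        return '<%(job_type)s job for %(branch)s>' % {
            'job_type': self.context.job_type.name,
            'branch': branch.unique_name,
            }


job = ExampleJob(
    BranchJob(JobType("EXAMPLE_TYPE"), Branch("~vcs-imports/linux/trunk")))
print(repr(job))
```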

Tim Penhey (thumper) wrote :

Spectacular formatting fail there.

Tim Penhey (thumper) wrote :
Tim Penhey (thumper) on 2010-05-26
Changed in launchpad-code:
importance: Critical → High
Steve McInerney (spm) wrote :

Hrm. Didn't realise you could set a hard memory limit thru ulimit. Have done so via a funky wrapper script:
ulimit -v 1843200
fwiw.

Have added the -v in the script as well; but need approval for the cowboy to go ahead, then we can roll that.

Please do beware: we have had at least 2 repeat instances where this has nearly caused loganberry to faceplant.
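
The same cap as the ulimit wrapper could, in principle, be applied from Python with the standard resource module. This is a rough sketch and not the wrapper script actually deployed; the function name is made up, it assumes a POSIX system and Python 3.7+, and it assumes `ulimit -v` counts in KiB (as bash does) while setrlimit takes bytes.

```python
import resource
import subprocess
import sys


def run_with_address_space_limit(argv, limit_kib):
    """Run argv with its virtual memory capped at limit_kib KiB (POSIX only)."""
    # ulimit -v is expressed in KiB; RLIMIT_AS is expressed in bytes.
    limit_bytes = limit_kib * 1024

    def _apply_limit():
        # Runs in the child just before exec, so the parent is unaffected.
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

    return subprocess.run(argv, preexec_fn=_apply_limit,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # The child only reports the soft limit it ended up with.
    child = [sys.executable, "-c",
             "import resource; "
             "print(resource.getrlimit(resource.RLIMIT_AS)[0])"]
    result = run_with_address_space_limit(child, 1843200)
    print(result.stdout.strip())
```

A process that hits the cap sees its allocations fail (a MemoryError in Python) instead of pushing the machine into swap.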

On 27 May 2010 06:24, Steve McInerney <email address hidden> wrote:
> Hrm. Didn't realise you could set a hard memory limit thru ulimit. have done so via a funky wrapper script:
> ulimit -v 1843200
> fwiw.
>
> Have added the -v in the script as well; but need approval for the
> cowboy to go ahead, then we can roll that.
>
> Pls do be ware; we have had at least 2 repeat instances where this has
> nearly caused loganberry to faceplant.

'this' meaning this bug, or using ulimit?

--
Martin <http://launchpad.net/~mbp/>

"'this' meaning this bug, or using ulimit?"

Oops. This meaning the bug.

Tim, you mentioned that the same job runs on the next iteration of sendbranchmail?
If so, we aren't seeing a repeat of this memory gobble until a day-ish later, so far. I have no idea if that's sheer coincidence or expected. AFAIA, we've had 3 instances of this gobbling incident.

But will do a scan thru our ps(1) history looking for a more complete analysis. Details to follow.

Steve McInerney (spm) wrote :

The memory jump is generally fairly sudden and quite rapid, as seen from ~ minute to minute.
A given process will be happily doing its thing for a while (== minutes), then within 120 seconds has gone from ~500 MB RSS to 5+ GB RSS.

A longer history look (ifs buts maybes here) over ~9300 records of the sendbranchmail script tells us:
~97.5% of the time, it's using < 300 MB RSS.
~2% of the time, it's using > 1 GB RSS, most of which is > 2 GB.

Beware: I'd suggest that those higher numbers are skewed by the bigger gobbles running for longer periods; the lower ones are also skewed in their favour by multiple attempts to start up while a gobbling is in progress.

Not sure if this'll help in *solving* but perhaps in defining the impact. :-)
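
A tally like the one above could be produced from saved ps(1) lines roughly as follows. This is an illustrative sketch, not the script that was actually used; it assumes `ps aux`-style records in which RSS is the sixth field, in KiB, and the helper names are made up. The sample line is the one from the bug description with the script path shortened.

```python
def rss_kib(ps_line):
    """Return the RSS column (sixth field of `ps aux` output) in KiB."""
    return int(ps_line.split()[5])


def share_in_range(rss_values_kib, low_kib=None, high_kib=None):
    """Fraction of samples with low_kib < rss < high_kib (bounds optional)."""
    if not rss_values_kib:
        return 0.0
    hits = [v for v in rss_values_kib
            if (low_kib is None or v > low_kib)
            and (high_kib is None or v < high_kib)]
    return len(hits) / len(rss_values_kib)


MIB = 1024          # KiB per MiB
GIB = 1024 * 1024   # KiB per GiB

sample = ("bzrsyncd 1630 53.0 61.4 6871152 5033008 ? Dl 20:52 6:43 "
          "/usr/bin/python2.5 sendbranchmail.py")
records = [rss_kib(sample)]
print("under 300 MiB:", share_in_range(records, high_kib=300 * MIB))
print("over 1 GiB:", share_in_range(records, low_kib=1 * GIB))
```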

On Thu, 27 May 2010 10:26:40 you wrote:
> The memory jump is generally fairly sudden and quite rapid, as seen from ~
> minute to minute. a given process will be happily doing it's thing for a
> while (== minutes) then within 120 seconds has gone from ~ 500Mb RSS, to
> 5+ Gb RSS.
>
> A longer history look (ifs buts maybes here) over ~ 9300 records of the
> sendbranchmail script tells us: ~ 97.5% of the time, it's using < 300Mb
> RSS.
> ~ 2% of the time, it's using > 1Gb RSS, most of which is > 2Gb.
>
> be ware as I'd suggest that those higher numbers are slanted by the
> bigger gobbles running for longer periods; also the lower ones slanted
> in their favour by multiple attempts to startup when a gobbling is in
> progress.
>
> Not sure if this'll help in *solving* but perhaps in defining the
> impact. :-)

The additional logging will at least tell us which branches were being
processed when the memory jump occurs.

There are some known big memory requirements for diff, like when the underlying
changed file is very large. There may also be leaks. Until we can get some
sample branches that are causing problems, it is hard to diagnose.

Tom Haddon (mthaddon) on 2010-05-28
tags: added: canonical-losa-lp

Let's get the details of branch job 1658563:

select json_data from branchjob where id = 1658563

This will give us the revisions it was trying to generate diffs for.

With this info, we can pass this on to the bazaar team to investigate.

tags: added: oops
description: updated
description: updated
Changed in bzr:
status: New → Incomplete
Martin Pool (mbp) on 2010-06-24
Changed in bzr:
status: Incomplete → Confirmed
importance: Undecided → Medium
importance: Medium → High
Robert Collins (lifeless) wrote :

So, what branch has revisions {"last_revision_id": "git-v1:67a3e12b05e055c0415c556a315a3d3eb637e29e", "last_scanned_id": "git-v1:b3f2f6cd1ff935ecac9a5346904b899d7af689fe", "from_address": "<email address hidden>"}
(1 row) ?

I'm guessing linux.
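
The json_data row quoted above can be decoded to pull out the two revision ids to hand to the bazaar team. A small sketch; the helper name is made up, and reading the pair as "diff from last_scanned_id to last_revision_id" is an assumption based on the key names.

```python
import json

# The row as quoted in this thread (address already redacted).
json_data = (
    '{"last_revision_id": "git-v1:67a3e12b05e055c0415c556a315a3d3eb637e29e", '
    '"last_scanned_id": "git-v1:b3f2f6cd1ff935ecac9a5346904b899d7af689fe", '
    '"from_address": "<email address hidden>"}'
)


def revision_range(raw):
    """Return (last_scanned_id, last_revision_id) from a json_data string."""
    data = json.loads(raw)
    return data["last_scanned_id"], data["last_revision_id"]


old, new = revision_range(json_data)
print("from %s\nto   %s" % (old, new))
```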

Tim Penhey (thumper) wrote :

Rob, the description was changed when I found the branch.

+ Found an oops for this: OOPS-1612BM1
+ branch: ~vcs-imports/linux/trunk
+ branch_job_id: 1658563

summary: - sendbranchmail is eating memory
+ sendbranchmail with lp:~vcs-imports/linux/trunk is eating memory
Changed in launchpad:
importance: High → Critical
Tim Penhey (thumper) on 2011-02-24
Changed in launchpad:
assignee: Tim Penhey (thumper) → nobody
Aaron Bentley (abentley) wrote :

A quick note: "Tim, you mentioned that the same job runs on the next iteration of sendbranchmail?"

Actually, no. That job will be in the RUNNING state. (Or, if killed nicely with SIGINT, in the FAILED state.) Only WAITING jobs will be run.
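
The selection rule described here can be sketched as a simple filter. This is a toy model, not Launchpad's job runner: the enum lists only the three states named in this comment, and the job ids other than 1658563 are made up.

```python
from enum import Enum


class JobStatus(Enum):
    """Only the states mentioned in this comment; the real set may be larger."""
    WAITING = "WAITING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


def runnable(jobs):
    """Return the ids of jobs the next run would pick up (WAITING only)."""
    return [job_id for job_id, status in jobs if status is JobStatus.WAITING]


jobs = [
    (1, JobStatus.WAITING),        # hypothetical job id
    (1658563, JobStatus.RUNNING),  # a killed job left in RUNNING
    (3, JobStatus.FAILED),         # hypothetical job id
]
print(runnable(jobs))
```

Under this model a job whose process was killed is never retried on the next iteration, which would be consistent with the memory gobble not repeating immediately.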

Aaron Bentley (abentley) on 2011-06-23
Changed in launchpad:
assignee: nobody → Aaron Bentley (abentley)
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: Triaged → Fix Committed
Aaron Bentley (abentley) on 2011-06-24
tags: added: qa-untestable
removed: qa-needstesting
William Grant (wgrant) on 2011-06-27
Changed in launchpad:
status: Fix Committed → Fix Released
Aaron Bentley (abentley) on 2011-07-18
tags: added: qa-ok
removed: qa-untestable
Aaron Bentley (abentley) wrote :

The branch that landed a fix for Launchpad was rolled back.

Changed in launchpad:
status: Fix Released → Triaged
Curtis Hovey (sinzui) on 2012-09-11
Changed in launchpad:
assignee: Aaron Bentley (abentley) → nobody
William Grant (wgrant) on 2012-11-20
tags: added: bzr