scan_branches terminated for excessive memory abuse

Bug #690021 reported by Steve McInerney
This bug affects 1 person
Affects: Launchpad itself
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: 11.05

Bug Description

bzrsyncd 20 0 5681m 4.5g 2432 D 1 57.0 4:19.95 /usr/bin/python /srv/bzrsyncd.launchpad.net/production/launchpad/cronscripts/scan_branches.py

4.5Gb RSS, 5Gb Virt.
terminated, as that was driving us into swap.

https://pastebin.canonical.com/40916/ for a log extract.

This was while scanning ~vcs-imports/linux/btrfs


Steve McInerney (spm)
tags: added: canonical-losa-lp
Revision history for this message
Martin Pool (mbp) wrote : Re: [Bug 690021] [NEW] scan_branches terminated for excessive memory abuse

On 14 December 2010 16:07, Steve McInerney
<email address hidden> wrote:
> Public bug reported:
>
> bzrsyncd  20   0 5681m 4.5g 2432 D    1 57.0   4:19.95 /usr/bin/python
> /srv/bzrsyncd.launchpad.net/production/launchpad/cronscripts/scan_branches.py
>
> 4.5Gb RSS, 5Gb Virt.
> terminated, as that was driving us into swap.

Perhaps, rather than relying on it being manually terminated, we
should set a ulimit on it so that it's consistent and doesn't harm
anything else?

If we wanted to make such a change, by what technical means could we
do so (is there a branch that controls how it's run?) and who ought to
be involved in authorizing it?

--
Martin
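
If we went the ulimit route, a minimal sketch of imposing the cap from outside the script might look like the following; the 2 GiB value, and launching it via Python's subprocess module rather than the real cron wrapper, are assumptions for illustration only:

    import resource
    import subprocess

    LIMIT_BYTES = 2 * 1024 * 1024 * 1024  # hypothetical 2 GiB address-space cap

    def _cap_address_space():
        # Runs in the child between fork and exec, so only scan_branches.py
        # is constrained, not the process that launches it.
        resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))

    subprocess.check_call(
        ["/usr/bin/python",
         "/srv/bzrsyncd.launchpad.net/production/launchpad/cronscripts/scan_branches.py"],
        preexec_fn=_cap_address_space)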

Revision history for this message
Martin Pool (mbp) wrote :

After some discussion with spm and mwh:

There are really (at least) two bugs here: whatever was using the memory, and that it wasn't automatically stopped.

Other scripts set an rlimit on themselves at startup; scan_branches should too.
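
As a rough sketch of that pattern, an in-process limit set at startup could look like this; the 2 GiB value and the function name are illustrative, not the actual scan_branches.py change:

    import resource

    MEMORY_LIMIT = 2 * 1024 * 1024 * 1024  # illustrative 2 GiB cap

    def set_memory_limit(limit=MEMORY_LIMIT):
        # Cap the address space so a runaway scan raises MemoryError inside
        # Python instead of growing until the kernel OOM killer steps in.
        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS, (limit, hard))

    if __name__ == '__main__':
        set_memory_limit()
        # ... carry on with the normal branch-scanning work ...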

Changed in launchpad-code:
importance: Undecided → Medium
status: New → Confirmed
Martin Pool (mbp)
description: updated
Revision history for this message
Martin Pool (mbp) wrote :

https://code.launchpad.net/~mbp/launchpad/690021-rlimit/+merge/43733 to impose an rlimit, and bug 690512 for the real underlying fix.

Changed in launchpad-code:
assignee: nobody → Martin Pool (mbp)
status: Confirmed → In Progress
Revision history for this message
Robert Collins (lifeless) wrote :

Just happened to catch a nagios alert this afternoon: 0% swap free. AAAAAAA
Subsequent investigation revealed:
This has probably been happening since at least Sat Apr 9 16:26:13 UTC 2011
It's happening around once or twice a day.
The process is getting to around 7Gb RSS before exhausting system memory and being killed.
It seems to be spinning somewhere to get there; it's only after 7-10 minutes of run time that it reaches this point.

Recent examples:
[19641951.662123] Out of memory: kill process 5894 (sh) score 2130172 or a child
[19641951.682454] Killed process 5903 (python2.6)

This one is from Thu May 5 04:37:19 UTC 2011:
https://pastebin.canonical.com/47138/

----
[19622312.833309] Out of memory: kill process 18559 (sh) score 2333130 or a child
[19622312.873941] Killed process 18560 (python2.6)

This one is from Wed May 4 23:10:01 UTC 2011:
https://pastebin.canonical.com/47137/

----
logs from the ps_dumper show:
https://pastebin.canonical.com/47140/
Roughly a week of processes with > 5Gb RSS.

fields are:
USER PID PPID NI PRI TIME %MEM RSS SZ VSZ STAT BLOCKED NLWP STARTED ELAPSED CMD

----
dmesg history shows:
https://pastebin.canonical.com/47139/

Changed in launchpad:
importance: Medium → Critical
Revision history for this message
Robert Collins (lifeless) wrote :

This is effectively an OOPS (if it's not killed by hand it will OOM eventually).

Revision history for this message
Launchpad QA Bot (lpqabot) wrote :

Fixed in stable r12991 (http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/12991) by a commit, but not testable.

Changed in launchpad:
milestone: none → 11.05
tags: added: qa-untestable
Changed in launchpad:
status: In Progress → Fix Committed
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → Triaged
William Grant (wgrant)
tags: removed: qa-untestable
Martin Pool (mbp)
Changed in launchpad:
assignee: Martin Pool (mbp) → nobody
Revision history for this message
Martin Pool (mbp) wrote :

I'm unassigning myself because I don't intend to do any more work on this right now, but I'm happy to help.

My landing should make this fail cleanly/safely without hitting the kernel OOM killer. We may even get a traceback/OOPS out of it, which would make it easier to understand just what is going wrong. If we don't get that, it would at least be nice to identify a particular branch that blows up.
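
A hedged illustration of that failure mode (not the actual Launchpad code): with RLIMIT_AS in place, a runaway allocation surfaces as a MemoryError the script can catch and report, instead of the kernel OOM killer killing the process or the machine being pushed into swap:

    import resource

    # Illustrative 1 GiB cap, small enough to trip quickly in a demonstration.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (1024 * 1024 * 1024, hard))

    try:
        hog = []
        while True:
            hog.append('x' * (10 * 1024 * 1024))  # grow until the cap is hit
    except MemoryError:
        # The real script would record a traceback/OOPS here, ideally naming
        # the branch being scanned when the limit was exceeded.
        print("scan aborted cleanly: memory limit reached")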

Launchpad is currently running bzr 2.2.3dev, and there have been a lot of memory-usage improvements since then. It is possible this problem will be fixed by upgrading to bzr 2.4b, for which other work is underway.

Revision history for this message
Aaron Bentley (abentley) wrote :

Branch scan jobs now have a memory limit of 2GB. Is anything else needed to mark this fixed?

Changed in launchpad:
status: Triaged → Incomplete
Revision history for this message
Robert Collins (lifeless) wrote :

Aaron says we will now get an OOPS when the job hits the memory limit and fails, and we'll get separate bugs for specific branches that trigger the excessive memory use. So this bug (failures taking the machine down) is fixed.

Changed in launchpad:
status: Incomplete → Fix Released