scan_branches terminated for excessive memory abuse

Bug #690021 reported by Steve McInerney
This bug affects 1 person
Affects: Launchpad itself
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: 11.05

Bug Description

bzrsyncd 20 0 5681m 4.5g 2432 D 1 57.0 4:19.95 /usr/bin/python /srv/bzrsyncd.launchpad.net/production/launchpad/cronscripts/scan_branches.py

4.5Gb RSS, 5Gb Virt.
terminated, as that was driving us into swap.

https://pastebin.canonical.com/40916/ for a log extract.

This was while scanning ~vcs-imports/linux/btrfs


Steve McInerney (spm)
tags: added: canonical-losa-lp
Revision history for this message
Martin Pool (mbp) wrote : Re: [Bug 690021] [NEW] scan_branches terminated for excessive memory abuse

On 14 December 2010 16:07, Steve McInerney
<email address hidden> wrote:
> Public bug reported:
>
> bzrsyncd  20   0 5681m 4.5g 2432 D    1 57.0   4:19.95 /usr/bin/python
> /srv/bzrsyncd.launchpad.net/production/launchpad/cronscripts/scan_branches.py
>
> 4.5Gb RSS, 5Gb Virt.
> terminated, as that was driving us into swap.

Perhaps, rather than relying on it being manually terminated, we
should set a ulimit on it so that it's consistent and doesn't harm
anything else?

If we wanted to make such a change, by what technical means could we
do so (is there a branch that controls how it's run?) and who ought to
be involved in authorizing it?

--
Martin
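
If we went the ulimit route, a minimal sketch of imposing the cap from outside the script might look like the following; the 2 GiB value, and launching it via Python's subprocess module rather than the real cron wrapper, are assumptions for illustration only:

    import resource
    import subprocess

    LIMIT_BYTES = 2 * 1024 * 1024 * 1024  # hypothetical 2 GiB address-space cap

    def _cap_address_space():
        # Runs in the child between fork and exec, so only scan_branches.py
        # is constrained, not the process that launches it.
        resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, LIMIT_BYTES))

    subprocess.check_call(
        ["/usr/bin/python",
         "/srv/bzrsyncd.launchpad.net/production/launchpad/cronscripts/scan_branches.py"],
        preexec_fn=_cap_address_space)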

Revision history for this message
Martin Pool (mbp) wrote :

After some discussion with spm and mwh:

There are really (at least) two bugs here: whatever was using the memory, and that it wasn't automatically stopped.

Other scripts set an rlimit on themselves at startup; scan_branches should too.
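
As a rough sketch of that pattern, an in-process limit set at startup could look like this; the 2 GiB value and the function name are illustrative, not the actual scan_branches.py change:

    import resource

    MEMORY_LIMIT = 2 * 1024 * 1024 * 1024  # illustrative 2 GiB cap

    def set_memory_limit(limit=MEMORY_LIMIT):
        # Cap the address space so a runaway scan raises MemoryError inside
        # Python instead of growing until the kernel OOM killer steps in.
        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS, (limit, hard))

    if __name__ == '__main__':
        set_memory_limit()
        # ... carry on with the normal branch-scanning work ...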

Changed in launchpad-code:
importance: Undecided → Medium
status: New → Confirmed
Martin Pool (mbp)
description: updated
Revision history for this message
Martin Pool (mbp) wrote :

https://code.launchpad.net/~mbp/launchpad/690021-rlimit/+merge/43733 to impose an rlimit, and bug 690512 for the real underlying fix.

Changed in launchpad-code:
assignee: nobody → Martin Pool (mbp)
status: Confirmed → In Progress
Revision history for this message
Robert Collins (lifeless) wrote :

Just happened to catch a nagios alert this afternoon: 0% swap free. AAAAAAA
Subsequent investigation revealed:
This has probably been happening since at least Sat Apr 9 16:26:13 UTC 2011
It's happening around once or twice a day.
The process is getting to around 7Gb RSS before exhausting system memory and being killed.
It seems to be spinning somewhere to get there; it's only after 7-10 minutes of run time that it reaches this point.

Recent examples:
[19641951.662123] Out of memory: kill process 5894 (sh) score 2130172 or a child
[19641951.682454] Killed process 5903 (python2.6)

This one is from Thu May 5 04:37:19 UTC 2011:
https://pastebin.canonical.com/47138/

----
[19622312.833309] Out of memory: kill process 18559 (sh) score 2333130 or a child
[19622312.873941] Killed process 18560 (python2.6)

This one is from Wed May 4 23:10:01 UTC 2011:
https://pastebin.canonical.com/47137/

----
logs from the ps_dumper show:
https://pastebin.canonical.com/47140/
Roughly a week of processes with > 5Gb RSS.

fields are:
USER PID PPID NI PRI TIME %MEM RSS SZ VSZ STAT BLOCKED NLWP STARTED ELAPSED CMD

----
dmesg history shows:
https://pastebin.canonical.com/47139/

Changed in launchpad:
importance: Medium → Critical
Revision history for this message
Robert Collins (lifeless) wrote :

This is effectively an OOPS (if it's not killed by hand it will OOM eventually).

Revision history for this message
Launchpad QA Bot (lpqabot) wrote :

Fixed in stable r12991 (http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/12991) by a commit, but not testable.

Changed in launchpad:
milestone: none → 11.05
tags: added: qa-untestable
Changed in launchpad:
status: In Progress → Fix Committed
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → Triaged
William Grant (wgrant)
tags: removed: qa-untestable
Martin Pool (mbp)
Changed in launchpad:
assignee: Martin Pool (mbp) → nobody
Revision history for this message
Martin Pool (mbp) wrote :

I'm unassigning myself because I don't intend to do any more work on this right now, but I'm happy to help.

My landing should make this fail cleanly/safely without hitting the kernel OOM killer. We may even get a traceback/OOPS out of it, which would make it easier to understand just what is going wrong. If we don't get that, it would at least be nice to identify a particular branch that blows up.
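
A hedged illustration of that failure mode (not the actual Launchpad code): with RLIMIT_AS in place, a runaway allocation surfaces as a MemoryError the script can catch and report, instead of the kernel OOM killer killing the process or the machine being pushed into swap:

    import resource

    # Illustrative 1 GiB cap, small enough to trip quickly in a demonstration.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (1024 * 1024 * 1024, hard))

    try:
        hog = []
        while True:
            hog.append('x' * (10 * 1024 * 1024))  # grow until the cap is hit
    except MemoryError:
        # The real script would record a traceback/OOPS here, ideally naming
        # the branch being scanned when the limit was exceeded.
        print("scan aborted cleanly: memory limit reached")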

Launchpad is currently running bzr 2.2.3dev, and there have been a lot of memory-usage improvements since then. It is possible this problem will be fixed by upgrading to bzr 2.4b, for which other work is underway.

Revision history for this message
Aaron Bentley (abentley) wrote :

Branch scan jobs now have a memory limit of 2GB. Is anything else needed to mark this fixed?

Changed in launchpad:
status: Triaged → Incomplete
Revision history for this message
Robert Collins (lifeless) wrote :

Aaron says we will now get an OOPS when the job hits the memory limit and fails, and we'll get separate bugs for specific branches that trigger the excessive memory use. So this bug (failures taking the machine down) is fixed.

Changed in launchpad:
status: Incomplete → Fix Released