Memory usage of codehosting processes is excessive

Reported by Tom Haddon on 2009-08-21
This bug affects 2 people
Affects: Bazaar. Importance: High. Assigned to: Unassigned.
Affects: Launchpad itself. Importance: High. Assigned to: Unassigned.

Bug Description

Each process for codehosting is using approx 60MB RES memory and 250MB VIRT memory. As an example:

codehost 7249 84.0 0.7 260616 61028 ? S 16:11 0:01 \_ /usr/bin/python2.4 /srv/bazaar.launchpad.net/production/launchpad-rev-8330/eggs/bzr-1.17-py2.4-linux-x86_64.egg/EGG-INFO/scripts/bzr lp-serve --inet 14302

At the moment, for instance, we have 52 such processes, so we're looking at a lot of memory usage overall, and this doesn't seem scalable. We're frequently forcing crowberry into swap usage even though it has 8GB of RAM.

Tom Haddon (mthaddon) wrote :

Currently seeing one of the bzr lp-serve processes using 1.3G RES, 2.5G VIRT:

https://pastebin.canonical.com/21400/

Michael Hudson-Doyle (mwhudson) wrote :

The problem in the description is really a bzr issue. We just run the smart server.

If the memory usage per process doesn't go down (and I don't think it will significantly in the near term) and usage increases (as we surely hope it will), then we'll need to start using multiple front-end boxes (something I've been banging on about for months). I don't think we have much choice really.

We know bzr can use a lot of memory, but telling us that won't help us
fix it - we already know.

To fix it we need to know what *that specific process was doing*, so
that we can reproduce, test, analyse and fix.

Getting:
 - the branch being operated on
 - how many packs and indices the branch had, its format
 - what smart server verbs were called

would be great.

The first is essential to have any hope of debugging. The last is also
extremely important. The middle detail would be gravy.

-Rob

 tags memory
 importance high
 status confirmed
 done

Changed in bzr:
importance: Undecided → High
status: New → Confirmed
Jonathan Lange (jml) on 2009-08-26
Changed in launchpad-code:
status: Confirmed → Triaged
Tom Haddon (mthaddon) wrote :

On Wed, 2009-08-26 at 04:49 +0000, Robert Collins wrote:
> We know bzr can use a lot of memory, but telling us that won't help us
> fix it - we already know.
>
> To fix it we need to know what *that specific process was doing*, so
> that we can reproduce, test, analyse and fix.
>
> Getting:
> - the branch being operated on
> - how many packs and indices the branch had, its format

Here's an example (taking the first process listing in that top output
as an example):

https://pastebin.canonical.com/21515/

> - what smart server verbs were called

I asked Aaron about this and he said he doesn't think we log what verbs
were used. Let me know if that's wrong and how to find it.


Tom Haddon wrote:
> On Wed, 2009-08-26 at 04:49 +0000, Robert Collins wrote:
>> We know bzr can use a lot of memory, but telling us that won't help us
>> fix it - we already know.
>>
>> To fix it we need to know what *that specific process was doing*, so
>> that we can reproduce, test, analyse and fix.
>>
>> Getting:
>> - the branch being operated on
>> - how many packs and indices the branch had, its format
>
> Here's an example (taking the first process listing in that top output
> as an example):
>
> https://pastebin.canonical.com/21515/
>
>> - what smart server verbs were called
>
> I asked Aaron about this and he said he doesn't think we log what verbs
> were used. Let me know if that's wrong and how to find it.
>

Well, you could run with "-Dhpss" which would log them. I'm not sure how
to inject that via the setup. Perhaps setting the config entry:

debug_flags = hpss

in ~/.bazaar/bazaar.conf

However, when you encounter one of these processes, if you can get a
terminal attached, you could use SIGQUIT to break into the debugger and
figure out what is currently happening.

I wonder if we would want to hook up SIGHUP or something along those
lines to just dump the current backtrace into .bzr.log, as a method for
debugging service processes...
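Something along those lines could be a very small plugin; a minimal sketch of the idea (the choice of SIGUSR1 and the log path are illustrative assumptions, not what lp-serve actually wires up):

```python
import signal
import traceback

def install_backtrace_handler(log_path, signum=signal.SIGUSR1):
    """Append the current Python stack trace to log_path on signum.

    Sketch only: a real lp-serve plugin would write to .bzr.log and
    pick whichever signal the service isn't already using.
    """
    def dump_backtrace(sig, frame):
        with open(log_path, "a") as log:
            log.write("---- backtrace on signal %d ----\n" % sig)
            traceback.print_stack(frame, file=log)

    signal.signal(signum, dump_backtrace)
```

An operator could then `kill -USR1 <pid>` a bloated process and read off what it was doing, without killing it.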

John
=:->

John A Meinel (jameinel) wrote :


...

>> Getting:
>> - the branch being operated on
>> - how many packs and indices the branch had, its format
>
> Here's an example (taking the first process listing in that top output
> as an example):
>
> https://pastebin.canonical.com/21515/

So that tells us:

1) It is a mysql maria branch, presumably: ~maria-captains/maria/1.5
2) It is a 1.9 format branch (so not --2a)
3) It currently has open one of the very large text index files:

python2.4 13024 codehost 4r REG 254,0 14913650 97550603
   /srv.../indices/05ff4b81be41c0ce966dc98a45e87208.tix
python2.4 13024 codehost 12r FIFO 0,5 325267728 pipe

But that seems to be the only file open as part of the actual branch.
All the rest appear to just be python eggs or source, or extensions, etc.

4) You are running on python2.4

I know of at least one fairly serious memory related fix in python 2.5:
http://evanjones.ca/python-memory-part3.html

Whether that is at play here or not, I don't know. But long-lived
processes and python2.4 don't play nicely with memory.

5) Individually no process seems to be more than 700-800MB, but since you
have several running at once, it adds up.

6) The process has been running for ~6min. My guess is that it is
someone doing a full checkout (from scratch) of the code.

7) 28046 codehost 20 0 410m 96m 2300 R 44 1.2 293:03.42
/usr/bin/python2.4
/srv/bazaar.launchpad.net/production/launchpad/bin/twistd --pidfile=/srv/baz

The service process that is spawning all of this has a surprising (to
me) amount of memory consumed. I honestly don't know what it is doing,
but it has allocated 410MB of ram. (If I understand VIRT correctly.)
96MB resident is reasonable, though.

8) 542MB resident == 6.8% ram, so you have approx 8GB of ram on the
system. So you can run <20 of these processes before having very serious
issues. (And probably more like <10 before you are no longer caching
disk buffers, and starting to have problems.) (I guess that was also in
the meminfo at the beginning.)

9) I assume 'bzr lp-serve' is used to send/receive data from the outside
world, not for the mirroring process?

I'll try a little bit of experimenting here, to see if I can pinpoint
the memory consumption a little bit better. It certainly would help if
we knew what actual action is being performed...

John
=:->

On Wed, 2009-08-26 at 20:16 +0000, John A Meinel wrote:
> >> Getting:
> >> - the branch being operated on
> >> - how many packs and indices the branch had, its format
> >
> > Here's an example (taking the first process listing in that top output
> > as an example):
> >
> > https://pastebin.canonical.com/21515/
>
>
> So that tells us:
>
> 1) It is a mysql maria branch, presumably: ~maria-captains/maria/1.5
> 2) It is a 1.9 format branch (so not --2a)
> 3) It currently has open one of the very large text index files:
>
> python2.4 13024 codehost 4r REG 254,0 14913650 97550603
> /srv.../indices/05ff4b81be41c0ce966dc98a45e87208.tix
> python2.4 13024 codehost 12r FIFO 0,5 325267728 pipe
>
>
> But that seems to be the only file open as part of the actual branch.
> All the rest appear to just be python eggs or source, or extensions, etc.
>
> 4) You are running on python2.4
>
> I know of at least one fairly serious memory related fix in python 2.5:
> http://evanjones.ca/python-memory-part3.html
>
> Whether that is at play here or not, I don't know. But long-lived
> processes and python2.4 don't play nicely with memory.

Ouch. All of LP runs on 2.4...

> 5) Individually no process seems to be more that 7-800MB, but since you
> have several running at once, it adds up.

I'm not sure 700-800MB is reasonable for even a single process, but fwiw,
we've seen individual processes with up to 1.4GB before.

> 6) The process has been running for ~6min. My guess is that it is
> someone doing a full checkout (from scratch) of the code.
>
>
> 7) 28046 codehost 20 0 410m 96m 2300 R 44 1.2 293:03.42
> /usr/bin/python2.4
> /srv/bazaar.launchpad.net/production/launchpad/bin/twistd --pidfile=/srv/baz
>
> The service process that is spawning all of this has a surprising (to
> me) amount of memory consumed. I honestly don't know what it is doing,
> but it has allocated 410MB of ram. (If I understand VIRT correctly.)
> 96MB resident is reasonable, though.
>
> 8) 542MB resident == 6.8% ram, so you have approx 8GB of ram on the
> system. So you can run <20 of these processes before having very serious
> issues. (And probably more like <10 before you are no longer caching
> disk buffers, and starting to have problems.) (I guess that was also in
> the meminfo at the beginning.)

Yep, the server has 8GB RAM.

> 9) I assume 'bzr lp-serve' is used to send/receive data from the outside
> world, not for the mirroring process?
>
>
> I'll try a little bit of experimenting here, to see if I can pinpoint
> the memory consumption a little bit better. It certainly would help if
> we knew what actual action is being performed...
>
> John
> =:->


John A Meinel wrote:
...

> 5) Individually no process seems to be more that 7-800MB, but since you
> have several running at once, it adds up.

So in doing a local bzr:// branch of a mysql-5.1 repo, I see the server
hit a peak of 299MB, and the client hit a peak of 178MB (using
python2.5 and bzr.dev 4595).

Some other things come to mind, like they are using an older bzr client
which is causing more VFS style operations than expected.

299 is still a bit higher than "strictly necessary", though I wonder by
how much...

John
=:->


Jonathan Lange (jml) wrote :

John, 'bzr lp-serve' is almost exactly the same thing as 'bzr serve'. The difference is that we set up the smart server to have our virtual file system.

The process that spawns these bzr processes is also an SFTP server. This might explain the high memory usage.

I'd be more than happy to talk about ways to get better logging.

On Wed, 2009-08-26 at 20:44 +0000, John A Meinel wrote:
>
> John A Meinel wrote:
> ...
>
> > 5) Individually no process seems to be more that 7-800MB, but since you
> > have several running at once, it adds up.
>
> So in doing a local bzr:// branch of a mysql-5.1 repo, I see the server
> hit a peak of 299MB, and the client hit a peak of 178MB. (using
> python2.5 and bzr.dev 4595.

Can you do the same with the versions we're using for LP (of python and
bzr) to confirm there's a difference and not just that it's a difference
of environment?

> Some other things come to mind, like they are using an older bzr client
> which is causing more VFS style operations than expected.
>
> 299 is still a bit higher than "strictly necessary", though I wonder by
> how much...
>
> John
> =:->
>
>

Robert Collins (lifeless) wrote :

Also, do you have the C extensions built for bzr?

Tom Haddon (mthaddon) wrote :

We do have the C extensions built:

https://pastebin.canonical.com/21534/


Tom Haddon wrote:
> On Wed, 2009-08-26 at 20:44 +0000, John A Meinel wrote:
> John A Meinel wrote:
> ...
>
>>>> 5) Individually no process seems to be more that 7-800MB, but since you
>>>> have several running at once, it adds up.
> So in doing a local bzr:// branch of a mysql-5.1 repo, I see the server
> hit a peak of 299MB, and the client hit a peak of 178MB. (using
> python2.5 and bzr.dev 4595.
>
>> Can you do the same with the versions we're using for LP (of python and
>> bzr) to confirm there's a difference and not just that it's a difference
>> of environment?

I cannot trivially do so.

Assuming you are using bzr 1.17 on LP, we don't really have any idea
what version the client is, without a log. But we could also assume 1.17
for now.

Going further, I'm also not using the exact same branch. I'm using a
mysql-5.1 not mysql-maria. Nor will it have exactly the same
packs-on-disk layout, etc.

But the real reason is that for whatever reason python2.4 doesn't like
my bzr. In fact my python 2.4 has some real problems:

$ python2.4 -c 'import locale'
Traceback (most recent call last):
  File "<string>", line 1, in ?
ImportError: No module named locale

Maybe I'll look closer at it tomorrow, but I'd rather just look at the
current situation.

I just realized... if you have 8GB of ram you almost certainly are
running a 64-bit platform. And memory consumption on 64-bit python is
pretty much always 2x. Not exactly, but it gives a really good ballpark.
(Almost every object is a pointer or a 'long' both of which go from
32-bit => 64-bit...)

So probably my 299MB fits the 540MB resident that you were seeing.
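For reference, a quick way to confirm which build a process is running and see the word-size effect (sys.getsizeof only exists in Python 2.6 and newer, so this is illustrative rather than something crowberry's 2.4/2.5 interpreters could run as-is):

```python
import struct
import sys

# Size of a C pointer in this interpreter: 4 bytes on a 32-bit build,
# 8 bytes on a 64-bit build.
pointer_size = struct.calcsize("P")

# Every CPython object starts with at least a refcount and a type
# pointer, both machine words, so the fixed per-object overhead
# (and hence much of bzr's index data) roughly doubles on 64-bit.
print("pointer size: %d bytes" % pointer_size)
print("empty tuple:  %d bytes" % sys.getsizeof(()))
```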

John
=:->

On Thu, 2009-08-27 at 02:49 +0000, John A Meinel wrote:
>
> Tom Haddon wrote:
> > On Wed, 2009-08-26 at 20:44 +0000, John A Meinel wrote:
> > John A Meinel wrote:
> > ...
> >
> >>>> 5) Individually no process seems to be more that 7-800MB, but since you
> >>>> have several running at once, it adds up.
> > So in doing a local bzr:// branch of a mysql-5.1 repo, I see the server
> > hit a peak of 299MB, and the client hit a peak of 178MB. (using
> > python2.5 and bzr.dev 4595.
> >
> >> Can you do the same with the versions we're using for LP (of python and
> >> bzr) to confirm there's a difference and not just that it's a difference
> >> of environment?
>
> I cannot trivially do so.
>
> Assuming you are using bzr 1.17 on LP, we don't really have any idea
> what version the client is, without a log. But we could also assume 1.17
> for now.
>
> Going further, I'm also not using the exact same branch. I'm using a
> mysql-5.1 not mysql-maria. Nor will it have exactly the same
> packs-on-disk layout, etc.
>
> But the real reason is that for whatever reason python2.4 doesn't like
> my bzr. In fact my python 2.4 has some real problems:
>
> $ python2.4 -c 'import locale'
> Traceback (most recent call last):
> File "<string>", line 1, in ?
> ImportError: No module named locale
>
> Maybe I'll look closer at it tomorrow, but I'd rather just look at the
> current situation.
>
> I just realized... if you have 8GB of ram you almost certainly are
> running a 64-bit platform.

Yes, pretty much all our servers are 64-bit.

> And memory consumption on 64-bit python is
> pretty much always 2x. Not exactly, but it gives a really good ballpark.
> (Almost every object is a pointer or a 'long' both of which go from
> 32-bit => 64-bit...)
>
> So probably my 299MB fits the 540MB resident that you were seeing.

In that case, that suggests that there's no improvement from moving from
python2.4 -> python2.5, right?

>
> John
> =:->
>

Andrew Bennetts (spiv) wrote :

John's just landed some work that cuts down memory use a fair bit for large branches. IIRC we expect savings of roughly 20MB for a branch the size of Launchpad, depending on factors like exactly how much of the repository indices are loaded. So if we're lucky it might cut loggerhead's memory consumption by 10%.

So this bug certainly isn't fixed yet, but hopefully when Launchpad rolls out 2.1 the problem will be eased a little bit. It would be interesting to know just how much the new version helps, I guess if we monitor the lp graphs we'll be able to see what difference (if any) it makes?


Andrew Bennetts wrote:
> John's just landed some work that cuts down memory use a fair bit for
> large branches. IIRC we expect savings of roughly 20MB for a branch the
> size of Launchpad, depending on factors like exactly how much of the
> repository indices are loaded. So if we're lucky it might cut
> loggerhead's memory consumption by 10%.
>
> So this bug certainly isn't fixed yet, but hopefully when Launchpad
> rolls out 2.1 the problem will be eased a little bit. It would be
> interesting to know just how much the new version helps, I guess if we
> monitor the lp graphs we'll be able to see what difference (if any) it
> makes?
>

Just to mention, the fixes won't be in the 2.1.0b1 release, but will be
in 2.1.0b2. There has been a fair amount of fallout between different
platforms, etc. So I'm going to be conservative and land it at the
beginning of 2.1.0b2, rather than at the end of 2.1.0b1.

John
=:->

We've landed the change to use bzr 2.1b3 so this should improve a bunch after the next rollout.

We can also probably put a cap in place on the consumption of each process like we do for the puller. The user experience won't be great when their pull which will presumably have been running for a long time suddenly croaks because the process at the far end got aborted though.

I don't see the issue that each process takes a noticeable hunk of ram and we run one process per connection going away.

The long term solution still has to be running the bzr lp-serve process on multiple hosts.

Tom Haddon (mthaddon) wrote :

This is becoming more of a problem - see https://pastebin.canonical.com/25430/. We've had numerous swap alerts for crowberry over the last week or so.

Chris Jones (cmsj) wrote :

We have actually seen a few instances now of the OOM Killer stepping in:

Dec 4 22:08:49 crowberry kernel: [1335795.418553] Out of memory: kill process 16826 (python2.5) score 661096 or a child
Dec 6 14:17:47 crowberry kernel: [1480269.669812] Out of memory: kill process 25020 (python2.5) score 483323 or a child
Dec 7 09:48:23 crowberry kernel: [1550427.592971] Out of memory: kill process 28630 (python2.5) score 382509 or a child
Dec 7 09:50:54 crowberry kernel: [1550630.331084] Out of memory: kill process 1500 (python2.5) score 270699 or a child
Dec 7 09:54:51 crowberry kernel: [1550855.178426] Out of memory: kill process 26136 (twistd) score 270708 or a child

Ugh, I had hoped that the situation wasn't this bad yet :(

I had also hoped that the last rollout would help.

Can we add more RAM to the box? I realize this is Not A Solution, but it might give us some breathing room.

I sense a reprioritization in my future...

John A Meinel (jameinel) wrote :

Obviously the bzr version looks like it includes the latest memory improvements. Are we sure that all of the extensions got compiled properly? I wouldn't be surprised if memory consumption was >= before if _static_tuple_c.c didn't get compiled properly. There were more fixes than just that, which should have reduced memory consumption.

As for OOM... isn't this python in 64bit mode? If it was 32-bit python then I could certainly see us hitting the limit. If this is python2.4, then I'll also not guarantee too much, since in python2.4 allocations of "integer" objects are never released back to the system. And I would offer that there are probably some other memory allocation improvements in python 2.5 and newer.

Also, is there any way to probe what actual command is running on a process that hits 1GB of RAM? (What branch is it accessing? What action is it performing, etc.)

I wonder if it would be useful to hook up a SIGHUP handler that would dump the current stack trace into .bzr.log. Do you think it would help debugging this? (It seems like something that we could do in a small plugin if it would make it easier.)

John A Meinel wrote:
> Obviously the bzr version looks like it includes the latest memory
> improvements. Are we sure that all of the extensions got compiled
> properly? I wouldn't be surprised if memory consumption was >= before if
> _static_tuple_c.c didn't get compiled properly. There were more fixes
> than just that, which should have reduced memory consumption.

It looks like everything got built correctly:
https://pastebin.canonical.com/25447/

> As for OOM... isn't this python in 64bit mode? If it was 32-bit python
> then I could certainly see us hitting the limit.

The machine ran out of swap.

> If this is python2.4,
> then I'll also not guarantee too much.

You don't have to look very hard at the log of the OOM killer Chris
provided to see that it's clearly python 2.5 being killed.

> SInce in python2.4 allocations of
> "integer" objects are never released back to the system. And I would
> offer that there are probably some other memory allocation improvements
> in python 2.5 and newer.
>
> Also, is there any way to probe what actual command is running on a
> process that hits 1GB of RAM? (What branch is it accessing? What action
> is it performing, etc.)

No :( Unless bzr serve does some of this. Is there a server side
equivalent of -Dhpss?

> I wonder if it would be useful to hook up a SIGHUP handler that would
> dump the current stack trace into .bzr.log. Do you think it would help
> debugging this? (It seems like something that we could do in a small
> plugin if it would make it easier.)

It would be nice if we could have each lp-serve process not dump
everything into the same .bzr.log file I guess, but otherwise that
sounds helpful.

Cheers,
mwh



Michael Hudson wrote:
> John A Meinel wrote:
>> Obviously the bzr version looks like it includes the latest memory
>> improvements. Are we sure that all of the extensions got compiled
>> properly? I wouldn't be surprised if memory consumption was >= before if
>> _static_tuple_c.c didn't get compiled properly. There were more fixes
>> than just that, which should have reduced memory consumption.
>
> It looks like everything got built correctly:
> https://pastebin.canonical.com/25447/

This doesn't look right to me: vv
-rwxr-xr-x 1 ... 285 ... _simple_set_pyx.py*
-rw-r--r-- 1 ... 699 ... _simple_set_pyx.pyc
-rw-r--r-- 1 ... 699 ... _simple_set_pyx.pyo
-rwxr-xr-x 1 ... 99534 ... _simple_set_pyx.so*

On my system I see:
$ ls bzrlib/_simple_set_pyx*

bzrlib/_simple_set_pyx.c
bzrlib/_simple_set_pyx.pxd
bzrlib/_simple_set_pyx.pyx
bzrlib/_simple_set_pyx.h
bzrlib/_simple_set_pyx.pyd
bzrlib/_simple_set_pyx_api.h

Similarly here:
-rwxr-xr-x 1 ... 285 ... _static_tuple_c.py*
-rw-r--r-- 1 ... 699 ... _static_tuple_c.pyc
-rw-r--r-- 1 ... 699 ... _static_tuple_c.pyo
-rwxr-xr-x 1 ... 56062 ... _static_tuple_c.so*
-rwxr-xr-x 1 ... 1557 ... static_tuple.py*
-rw-r--r-- 1 ... 1301 ... static_tuple.pyc
-rw-r--r-- 1 ... 1301 ... static_tuple.pyo
-rwxr-xr-x 1 ... 2760 ... _static_tuple_py.py*
-rw-r--r-- 1 ... 3529 ... _static_tuple_py.pyc
-rw-r--r-- 1 ... 3529 ... _static_tuple_py.pyo

I have 0 idea what is creating a 285 byte .py file for a compiled
extension. However, it could very easily cause problems using the .so...

>
>> As for OOM... isn't this python in 64bit mode? If it was 32-bit python
>> then I could certainly see us hitting the limit.
>
> The machine ran out of swap.
>
>> If this is python2.4,
>> then I'll also not guarantee too much.
>
> You don't have to look very hard at the log of the OOM killer Chris
> provided to see that it's clearly python 2.5 being killed.
>
>> SInce in python2.4 allocations of
>> "integer" objects are never released back to the system. And I would
>> offer that there are probably some other memory allocation improvements
>> in python 2.5 and newer.
>>
>> Also, is there any way to probe what actual command is running on a
>> process that hits 1GB of RAM? (What branch is it accessing? What action
>> is it performing, etc.)
>
> No :( Unless bzr serve does some of this. Is there a server side
> equivalent of -Dhpss?

No, and it has been something we wanted to get for the lp guys. I think
jml was originally wanting more than that. (It is one thing to be able
to tell that 'get_stream' was requested, it is another to know that this
is caused by 'bzr branch', versus 'bzr pull', versus ...)

>
>> I wonder if it would be useful to hook up a SIGHUP handler that would
>> dump the current stack trace into .bzr.log. Do you think it would help
>> debugging this? (It seems like something that we could do in a small
>> plugin if it would make it easier.)
>
> It would be nice if we could have each lp-serve process not dump
> everything into the same .bzr.log file I guess, but otherwise that
> sounds helpful.
...


John A Meinel (jameinel) wrote :

How hard is it to run a 32-bit python on that machine? (BTW). In my testing, memory consumption for 32-bit python was close to half that of 64-bit. (Branching launchpad went from 1.3GB pre-fixes to 954MB post-fixes on 64-bit, and from 901MB down to 548MB on 32-bit. Which is 1.4:1 pre, and 1.7:1 post fixes.)

I don't know the sysadmin overhead. But

1) No process specifically should need 64-bit. If we hit >3GB (or 2GB even), then killing that process is probably a kindness :)
2) 1.7:1 is certainly a nice improvement versus having to buy 2x the memory


John A Meinel wrote:
> Michael Hudson wrote:
>> John A Meinel wrote:
>>> Obviously the bzr version looks like it includes the latest memory
>>> improvements. Are we sure that all of the extensions got compiled
>>> properly? I wouldn't be surprised if memory consumption was >= before if
>>> _static_tuple_c.c didn't get compiled properly. There were more fixes
>>> than just that, which should have reduced memory consumption.
>> It looks like everything got built correctly:
>> https://pastebin.canonical.com/25447/
>
> This doesn't look right to me: vv
> -rwxr-xr-x 1 ... 285 ... _simple_set_pyx.py*
> -rw-r--r-- 1 ... 699 ... _simple_set_pyx.pyc
> -rw-r--r-- 1 ... 699 ... _simple_set_pyx.pyo
> -rwxr-xr-x 1 ... 99534 ... _simple_set_pyx.so*
>
>
> On my system I see:
> $ ls bzrlib/_simple_set_pyx*
>
> bzrlib/_simple_set_pyx.c
> bzrlib/_simple_set_pyx.pxd
> bzrlib/_simple_set_pyx.pyx
> bzrlib/_simple_set_pyx.h
> bzrlib/_simple_set_pyx.pyd
> bzrlib/_simple_set_pyx_api.h
>
>
> Similarly here:
> -rwxr-xr-x 1 ... 285 ... _static_tuple_c.py*
> -rw-r--r-- 1 ... 699 ... _static_tuple_c.pyc
> -rw-r--r-- 1 ... 699 ... _static_tuple_c.pyo
> -rwxr-xr-x 1 ... 56062 ... _static_tuple_c.so*
> -rwxr-xr-x 1 ... 1557 ... static_tuple.py*
> -rw-r--r-- 1 ... 1301 ... static_tuple.pyc
> -rw-r--r-- 1 ... 1301 ... static_tuple.pyo
> -rwxr-xr-x 1 ... 2760 ... _static_tuple_py.py*
> -rw-r--r-- 1 ... 3529 ... _static_tuple_py.pyc
> -rw-r--r-- 1 ... 3529 ... _static_tuple_py.pyo
>
>
> I have 0 idea what is creating a 285 byte .py file for a compiled
> extension.

This is what it contains: https://pastebin.canonical.com/25455/ Which
is a bit strange.

> However, it could very easily cause problems using the .so...

I got mbarnett to run lsof -p on an lp-serve process, it looks from
https://pastebin.canonical.com/25454/ like the .so's are loaded.

>>> SInce in python2.4 allocations of
>>> "integer" objects are never released back to the system. And I would
>>> offer that there are probably some other memory allocation improvements
>>> in python 2.5 and newer.
>>>
>>> Also, is there any way to probe what actual command is running on a
>>> process that hits 1GB of RAM? (What branch is it accessing? What action
>>> is it performing, etc.)
>> No :( Unless bzr serve does some of this. Is there a server side
>> equivalent of -Dhpss?
>
> No, and it has been something we wanted to get for the lp guys. I think
> jml was originally wanting more than that. (It is one thing to be able
> to tell that 'get_stream' was requested, it is another to know that this
> is caused by 'bzr branch', versus 'bzr pull', versus ...)

Well, let's not let the perfect get in the way of the good. I'd just be
happy to know which branch was being accessed.

>>> I wonder if it would be useful to hook up a SIGHUP handler that would
>>> dump the current stack trace into .bzr.log. Do you think it would help
>>> debugging this? (It seems like something that we could do in a small
>>> plugin if it would make it easier.)
>> It would be nice if we could have each lp-serve process not dump
>> everything into the same .bzr.log file I guess, but otherwise that
>> s...


Michael Hudson wrote:
>
> Well, let's not let the perfect get in the way of the good. I'd just be
> happy to know which branch was being accessed.

Install a Branch.open hook, or maybe BzrDir.pre_open (so you can catch
repo-only accesses)? That won't catch VFS accesses, but then those
shouldn't be causing memory pressure.
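A minimal sketch of such a hook (the logger name is made up, and this assumes bzrlib's `install_named_hook` API as it existed in the bzr versions discussed here; the deferred import keeps the sketch self-contained off the codehost):

```python
import logging

def log_branch_open(branch):
    # branch.base is the branch's URL; logging it ties a runaway
    # lp-serve process back to the branch it was serving.
    logging.getLogger("lp-serve").info("opened branch: %s", branch.base)

def install_open_hook():
    # Deferred import: bzrlib is only available on the codehost itself.
    from bzrlib.branch import Branch
    Branch.hooks.install_named_hook(
        "open", log_branch_open, "log branch opens")
```

With that installed in a small plugin, each process would leave a trail of which branches it touched, which is exactly the "which branch was being accessed" information asked for above.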

Steve McInerney (spm) wrote :

We managed to catch 2 processes in near memory run-away mode earlier today on codehost.

https://pastebin.canonical.com/29225/

is a ps(1) view over time of one of the two. Note the peak RSS usage of ~18.5GB.
When killed, both were around 13GB of RSS and causing swap alerts.

This would suggest that an alternate wording for the title of this bug may be necessary. 'excessive' doesn't seem to quite capture the full horror. ;-)

I believe mwhudson has started working on the bug asking for a memory
limit on bzr processes. When that lands we should start seeing
failures if things do get into a bloated state, and hopefully the
failure will include some meaningful traceback.

Martin Pool (mbp) wrote :

That's bug 532213

Martin Pool wrote:
> I believe mwhudson has started working on the bug asking for a memory
> limit on bzr processes. When that lands we should start seeing
> failures if things do get into a bloated state, and hopefully the
> failure will include some meaningful traceback.

I doubt the approach I've taken -- setrlimit -- will result in
tracebacks, I think it will be termination with rather more prejudice
than that. But maybe, we'll see.

Cheers,
mwh
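For reference, the setrlimit approach looks roughly like this (a sketch, not the actual Launchpad change):

```python
import resource

def cap_address_space(max_bytes):
    """Cap this process's total virtual address space.

    Once the cap is hit, further allocations fail outright: malloc
    returns NULL and Python raises MemoryError, which is why the
    resulting termination tends to have "more prejudice" than a
    clean traceback.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))
```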


Michael Hudson wrote:
> Martin Pool wrote:
>> I believe mwhudson has started working on the bug asking for a memory
>> limit on bzr processes. When that lands we should start seeing
>> failures if things do get into a bloated state, and hopefully the
>> failure will include some meaningful traceback.
>
> I doubt the approach I've taken -- setrlimit -- will result in
> tracebacks, I think it will be termination with rather more prejudice
> than that. But maybe, we'll see.
>
> Cheers,
> mwh
>

Could you do something silly like spawn a thread at startup that polls
vmstat and issues SIGINT to the running process if it exceeds a soft
threshold, and then have a hard threshold above that?
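The polling idea could look something like this (Linux-only, reading VmRSS from /proc/self/status; the thresholds, poll interval, and choice of SIGINT are illustrative assumptions):

```python
import os
import signal
import threading
import time

def read_rss_kb():
    # Resident set size of this process, in kB, from /proc (Linux-only).
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

def start_memory_watchdog(soft_limit_kb, interval=1.0):
    # Poll our own RSS and send SIGINT once past the soft threshold,
    # so the process dies with a KeyboardInterrupt traceback; a second,
    # harder threshold could send SIGKILL the same way.
    def watch():
        while read_rss_kb() <= soft_limit_kb:
            time.sleep(interval)
        os.kill(os.getpid(), signal.SIGINT)

    watcher = threading.Thread(target=watch)
    watcher.daemon = True
    watcher.start()
    return watcher
```

Unlike setrlimit, the SIGINT arrives while the process can still allocate, so there is at least a chance of a usable traceback in .bzr.log.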

John
=:->


Tom Haddon (mthaddon) on 2010-05-28
tags: added: canonical-losa-lp

FWIW, I see this problem too. In my case, I'm working with a repo that was converted from Subversion. It has some rather large binary data in it too (a ~400MB file that holds a custom toolchain), and several versions of it were checked in over time. I'd expect that we shouldn't need to do much server-side. I mean, we need the graph, but shouldn't revisions be transferred in a more streamy fashion?

Oh, and just to provide some numbers, the repository is ~3.5GB in size, with the 400MB file being the largest committed, and it has changed several times.
